N. Abe,
A. W. Biermann,
and P. M. Long.
Reinforcement learning with immediate rewards and linear hypotheses.
Algorithmica, 37(4): 263-293, 2003.
(Available in Postscript and PDF formats.)
Abstract
We consider the design and analysis
of algorithms that learn from the consequences of their actions
with the goal of maximizing their cumulative reward, when
- the consequence of a given action is felt immediately, and
- a linear function, which is unknown a priori, (approximately) relates a feature vector for each
action/state pair to the (expected) associated reward.
We focus on two cases,
one in which a continuous-valued reward is (approximately) given by
applying the unknown
linear function, and another in which the probability of receiving the larger
of binary-valued rewards is
obtained. For these cases, we provide bounds on the per-trial regret for
our algorithms that go
to zero as the number of trials approaches infinity. We also provide lower
bounds that show that the rate of convergence is nearly optimal.