Upper Confidence Bound (UCB) is a family of algorithms in machine learning and statistics for solving the multi-armed bandit problem and addressing the explorationâÂÂexploitation trade-off. UCB methods select actions by computing an upper confidence estimate of each actionâÂÂs potential reward, thus balancing exploration of uncertain options with exploitation of those known to perform well. Confidence-bound methods for stochastic bandits date back to work of Tze Leung Lai and Herbert Robbins in 1985, and the first UCB algorithm is due to Lai (1987). UCB and its variants have become standard techniques in reinforcement learning, online advertising, recommender systems, clinical trials, and Monte Carlo tree search.
The multi-armed bandit problem models a scenario where an agent chooses repeatedly among options ("arms"), each yielding stochastic rewards, with the goal of maximizing the sum of collected rewards over time. The main challenge is the explorationâÂÂexploitation trade-off: the agent must explore lesser-tried arms to learn their rewards, yet exploit the best-known arm to maximize payoff. Traditional -greedy or softmax strategies use randomness to force exploration; UCB algorithms instead use statistical confidence bounds to guide exploration more efficiently.
UCB1 is a widely used bounded-reward variant of UCB introduced by Auer, Cesa-Bianchi and Fischer (2002). It maintains for each arm :
At round , it selects the arm maximizing:
Arms with are initially played once. The bonus term shrinks as grows, ensuring exploration of less-tried arms and exploitation of high-mean arms.
<pre> for each arm i: n[i] â 0; Q[i] â 0 for t from 1 to T do: for each arm i do if n[i] = 0 then select arm i else index[i] â Q[i] + sqrt((2 * ln t) / n[i]) select arm a with highest index[a] observe reward r n[a] â n[a] + 1 Q[a] â Q[a] + (r - Q[a]) / n[a] </pre>
Auer et al. proved that UCB1 achieves logarithmic regret: after rounds, the expected regret satisfies
where is the gap between the optimal armâÂÂs mean and arm âÂÂs mean. Thus, average regret per round tend to as , and UCB1 is near-optimal against the Lai-Robbins lower bound.
Several extensions improve or adapt UCB to different settings:
Introduced in the same paper as UCB1, UCB2 divides plays into epochs controlled by a parameter , reducing the constant in the regret bound at the cost of more complex scheduling.
Incorporates empirical variance to tighten the bonus:
This often outperforms UCB1 in practice but lacks a simple regret proof.
Replaces HoeffdingâÂÂs bound with a KullbackâÂÂLeibler divergence condition, yielding asymptotically optimal regret (constant = ) for Bernoulli rewards.
Computes the -quantile of a Bayesian posterior (e.g. Beta for Bernoulli) as the index. Proven asymptotically optimal under certain priors.
Extends UCB to contextual bandits by estimating a linear reward model and confidence ellipsoids in parameter space. Widely used in news recommendation.
UCB algorithmsâ simplicity and strong guarantees make them popular in: