Fundamentals of Reinforcement Learning: Estimating the Action-Value Function

In this article, we introduce fundamental concepts of reinforcement learning—including the k-armed bandit problem, estimating the action-value function, and the exploration vs. exploitation dilemma.

3 years ago   •   9 min read

By Peter Foy

In this article, we're going to introduce the fundamental concepts of reinforcement learning including the k-armed bandit problem, estimating the action-value function, and the exploration vs. exploitation dilemma.

Before we get into the fundamentals concepts of RL, let's first review the differences between supervised, unsupervised, and reinforcement learning at a high-level:

  • Supervised learning makes use of labeled examples, which provides the learner the correct answer to improve its accuracy over time
  • Unsupervised learning does not have any labeled examples, but instead is focused on extracting the underlying structure of data
  • Reinforcement learning provides the agent with an idea of how good or bad its actions were based on a system of rewards

Reinforcement learning also differs for other learning algorithms as the agents are often dealing with an ever-changing environment, meaning it needs to constant adapt its actions and learn new behaviours.

This article is based on notes from Week 1 of this course on the Fundamentals of Reinforcement Learning and is organized as follows:

  • Sequential Decision Making with Evaluative Feedback
  • Action-Value Function
  • Estimating Action-Values with the Sample Average Method
  • Estimating Action Values Incrementally
  • Exploration vs. Exploration Tradeoff

Stay up to date with AI

We're an independent group of machine learning engineers, quantitative analysts, and quantum computing enthusiasts. Subscribe to our newsletter and never miss our articles, latest news, etc.

Great! Check your inbox and click the link.
Sorry, something went wrong. Please try again.

Sequential Decision Making with Evaluative Feedback

A unique feature of reinforcement learning is that it creates its own training data by interacting with the environment. The agent then improves its decision making through trial and error.

Let's start by reviewing this evaluative aspect of reinforcement learning in a simplified setting called bandits.

In particular, in a k-armed bandit problem we have an agent, or decision maker, who chooses between $k$ different actions. The agent then receives a reward based on the action it chose.

In order to decide which action to choose at each timestep, we must define the value of each action, which is referred to as the action-value function.

Action-Value Function

We can define the action-value function more formally as the value of the expected reward of taking that action.

Mathematically we can describe this as:

$$q*(a) \doteq \mathbb{E}[R_t|A_t = a] \:\forall a \in {1,...,k}$$

This can be read as $q*$ of $a$ is defined as the expectation of $R_t$ given a selected action $A_t$ for each possible action 1 through $k$.

This conditional expectation is defined as a sum over all possible rewards:

$$\sum\limits_{r} p(r|a) r$$

Inside the sum we have multiplied the possible reward by the probability of observing that reward. This can be extended to a continuous reward situation by switching the summation to an integral.

The goal of an agent is to maximize the expected reward.

To do so, the agent selects the action that has the highest value at each timestep. This procedure is referred to as the "argmax", or the argument that maximizes the function $q*$:


The challenge of maximizing rewards and estimating values are important subsets of both bandits and reinforcement learning.

To summarize, decision making under uncertainty can be formalized by the k-armed bandit problem. In bandits, we see the core concepts of reinforcement learning—including actions, rewards, and value functions.

Estimating Action-Values with the Sample Average Method

There are many ways to estimate the action-value function, although in this section we'll look at the sample-average method.

We'll also define key RL concepts such as the the greedy-action selection and the exploration-exploitation dilemma.

Recall that the value of $q*(a)$ is not known by the agent, so it must be estimated at each timestep.

The sample average method estimates the action-value by recording the total reward for each action and divide it by the number of times that action has been selected:

$$Q_t(a) \doteq \frac{\text{sum of rewards when action $a$ taken prior to time $t$}}{\text{number of times $a$ taken prior to time $t$}}$$

$$= \frac{\sum_{i=1}^{t-1}R_i}{t - 1}$$

After the value of $q*(a)$ has been estimated, the agent will then choose the action with the highest estimated value, which is referred to as greedy action.

Selecting the greedy action means the agent is exploiting its current knowledge of the environment.

We can compute the greedy action by taking the $\text{argmax}$ of the estimated values:

$$a_g = \text{argmax}\:Q(a)$$

The agent can also choose a non-greedy action and continue to explore the environment, which sacrifices immediate reward but may lead to larger rewards in the future.

The agent can't choose to explore and exploit knowledge at the same time, which is referred to as the exploration-exploitation dilemma—more on this concept later.

Estimating Action Values Incrementally

In this section we'll look at how action-values can be estimated incrementally using the sample-average method.

We'll then derive the incremental update rule, which is an instance of a more general update rule. We'll also identify when the general update rule can be used to solve a non-stationary bandit problem.

Incremental Update Rule

The sample-average method can be written recursively in order to avoid storing all the previous data.

$$Q_{n+1} = \frac{1}{n}\sum\limits_{i=1}^n R_i$$

To do so, we pull the current reward out of the sum so that now the sum only goes until the previous reward:

$$= \frac{1}{n}(R_n + \sum\limits_{i=1}^{n-1}R_i)$$

We write the next value estimate in terms of our previous value estimate by multiplying and dividing the current sum by $n-1$:

$$= \frac{1}{n}(R_n + (n - 1)\frac{1}{n-1}\sum\limits_{i=1}^{n-1}R_i)$$

This equation can be simplified with our current value estimate $Q_n$:

$$= \frac{1}{n}(R_n + (n - 1) Q_n$$

We can simplify this equation further by distributing $Q_n$:

$$= \frac{1}{n}(R_n + nQ_n - Q_n$$

Finally, we pull $nQ_n$ out of the sum and multiply by $\frac{1}{n}$. We now have an incremental update rule for estimating values:

$$= Q_n + \frac{1}{n}(R_n - Q_n)$$

We can define the terms of this equation as follows:

$$NewEstimate \leftarrow OldEstimate + StepSize(Target - OldEstimate)$$

  • The error and the estimate is the difference between old estimate and the new target $(R_n - Q_n)$
  • The new reward $R_n$ is our target
  • The size of the step $\frac{1}{n}$ is determined by the step size parameter and the error of our old estimate

We now have a general rule for incrementally updating the estimate:

$$Q_{n+1} = Q_n + a_n(R_n - Q_n)$$

The step size can be a function of $n$ that produces a number from 0 to 1:

$$a_n \rightarrow [0, 1]$$

In the case of the sample-average method, the step size is equal to $\frac{1}{n}$.

Non-Stationary Bandit Problem

Non-stationary bandit problems are like the bandit problem discussed earlier, although the the distribution of reward changes over time.

An option to adapt to these changes is to use a fixed step size. If $a_n$ is constant, for example 0.1, then the most recent rewards $R_n$ affect the estimate more than older ones.

Decaying Past Rewards

In order to see how more recent rewards affect the value estimate more than older ones, let's look at the incremental update equation again.

$$Q_{n+1} = Q_n + a_n(R_n - Q_n)$$

We can rearrange the equation by distributing $a$ as follows:

$$= aR_n + Q_n - aQ_n$$

We can write the next value $Q_n - aQ_n$ as the weighted sum of the reward and the last value:

$$= aR_n + (1 - a)Q_n$$

We can then substitute the definition of $Q_n$ into the equation:

$$= aR_n + (1 - a)[aR_{n-1} + (1-a)Q_{n-1}]$$

We can then distribute $1-a$ in order to see the relationship between $R_n$ and $R_{n-1}$:

$$= aR_n = (1-a)aR_{n-1} + (1-a)^2 Q_{n-1}$$

We can look at this recursive relationship further by continually replacing $Q_n$ all the way back until we get to $Q_1$:

$$= (1-a)^n Q_1 + \sum\limits_{i=1}^n a(1 - a)^{n-1}R_i$$

This can be understood as our initial value of $Q_1$ plus the weighted value of the sum of rewards over time.

This equation relates our current estimate of the value of $Q_{n+1}$ to $Q_1$ and all the observed rewards.

The first term tells us that the contribution of $Q$ decreases exponentially over time and the second term tells us that older rewards contribute exponentially less to the sum.

This tells us that our initialization of $Q_1$ goes to 0 with more and more data.

Exploration vs. Exploration Tradeoff

Now that we've discussed a method to estimate the action-value function, let's discuss the exploration vs. exploitation tradeoff in more detail.

Exploration allows an agent to improve its knowledge of the environment for a higher potential long-term benefit.

Exploitation, on the other hand, exploits the agent's current knowledge for a short-term benefit. Since the action-value is estimated, however, it may not get the highest possible reward.

This is the heart of the exploration-exploitation dilemma—exploitation can lead to suboptimal behavior, whereas exploration can lead to lower short-term rewards.

So how do we choose when to explore and when to exploit?

One simple way is to simply choose randomly between exploration and exploitation.

This method is referred to as the epsilon-greedy action selection, where epsilon refers to the probability of choosing to explore.

Epsilon-Greedy Action Selection

The action we select at timestep $t$, $A_t$ is the greedy action $\underset{a}{\text{argmax}}\:Q_t(a)$ with probability $1 - \varepsilon$ or is a random action $a ~ Uniform({a_1,...,a_k})$ with probability $\varepsilon$:

$$A_t \leftarrow \begin{cases} \underset{a}{\text{argmax}}Q_t(a)\:\text{with probability 1 - $\epsilon$}\\a \sim Uniform(\{a_1...a_k\})\:\text{with probability $\epsilon$}\end{cases} $$

In summary, epsilon-greed is a simple method for balancing exploration vs. exploitation.

Optimistic Initial Values

Another common strategy for balancing exploration vs. exploitation is to use the concept of optimistic initial values.

The principle of optimistic initial values encourages exploration early in the learning phase and then exploitation later on.

Using optimistic initial values, however, is not necessarily the optimal way to balance exploration and exploitation. A few of the limitations of this strategy include:

  • It only encourages exploration early in learning, meaning the agent will not continue exploring after some time. This lack of exploration later on means the strategy is not well-suited for non-stationary problems.
  • Another limitation is that we may not always know what the optimistic initial values should be.

Despite these limitations, this approach can be a very useful heuristic in reinforcement learning.

Upper-Confidence Bounce (UCB) Action Selection

Another method to balance exploration and exploitation is referred to as Upper-Confidence Bound (UCB) action-selection.

Due to the fact that we're estimating the action-value from sample rewards, there's inherent uncertainty in the accuracy of the estimate. UCB action-selection uses this inherent uncertainty in its estimates to drive exploration.

Recall the epsilon-greedy action-selection method, which chooses exploratory actions epsilon percentage of time—this means that exploratory actions are chosen uniformly.

UCB incorporate uncertainty in the action-value estimate by using an upper bound and lower bound around $Q(a)$. The region in between these two bounds is called the confidence interval and represents the inherent uncertainty.

If the confidence interval is large, meaning the estimate is uncertain, the agent should optimistically choose the action with the highest upper bound. This makes sense as either the action does have the highest reward, or the gent learns about an action that it knows the least about.

The Upper-Confidence Bound action selection can be described mathematically as:

$$A_t \doteq \text{argmax}\left[Q_t(a) + c \sqrt{\frac{ln\:t}{N_t(a)}}\right]$$

In other words, UCB selects the action with the highest estimated value plus the upper-confidence bound exploration term $c \sqrt{\frac{ln\:t}{N_t(a)}}\right]$, where the $c$ parameter is a user-specified parameter that controls the amount of exploration.

We can see that he first term of the equation $Q_t(a)$ represents the exploitation part and the second part of the equation represents exploration.

Summary: Estimating the RL Action-Value Function

In this article we introduced a subset of the reinforcement learning problem called the k-armed bandit problem.

We first defined key reinforcement learning concepts such as how an agent takes actions and receives rewards from the environment based on the action selected.

We then introduced how to estimate the value of each action as the expected reward from taking a particular action.

The value function $q*(a)$ is unknown to the agent and one way to estimate it is with the sample-average method. This method uses the sum of rewards received when taking action $a$ prior to time $t$ divided by the number of times $a$ was taken prior to $t$.

We then defined an incremental update rule by finding a recursive definition of the sample-average method. This allows us to only store the number of times each action has been taken and the previous estimate.

Finally, we discussed the exploration vs. exploitation dilemma in reinforcement learning. Exploration gives the agent a chance to improve its knowledge for a long-term benefit, whereas exploitation means taking advantage of its current knowledge for short-term gain.

Since the agent cannot explore and exploit knowledge simultaneously there are several methods for balancing these including the epsilon-greedy action selection, optimistic initial values, and the upper-confidence bound action selection method.


Spread the word

Keep reading