An Overview of Deep Reinforcement Learning for Trading

In this article we provide an overview of deep reinforcement learning for trading. Reinforcement learning is the computational science of decision making.

5 years ago • 9 min read

By Peter Foy

One of the most exciting areas of applied AI research is in the field of deep reinforcement learning for trading.

Because trading and investing are iterative processes of trial and error, deep reinforcement learning likely has significant potential in finance.

In particular, trading and investing is an iterative process of testing new ideas, receiving feedback from the market in the form of profit/loss, and optimizing the strategy over time.

This trial-and-error approach to decision making is exactly the kind of problem reinforcement learning is designed for, and the field has also been referred to as "the computational science of decision making".

In this article, we'll review the available research, papers, and open-source code repositories to get a better understanding of deep reinforcement learning and trading, including:

What is Reinforcement Learning?
Introduction to Reinforcement Learning for Trading
Introduction to Q-Learning
J.P. Morgan's Guide to Reinforcement Learning


What is Reinforcement Learning?

If you want to read a more complete guide to reinforcement learning, check out our article called What is Reinforcement Learning? A Complete Guide for Beginners.

In this article, we'll just summarize the RL framework.

Reinforcement Learning is a framework for an agent to learn to operate in an uncertain environment through interaction.

Let's break reinforcement learning down step-by-step:

We have an agent, who is our decision-maker/learner
The agent operates in an environment
As the agent takes actions, the environment provides feedback in the form of rewards
Along with each reward, the agent receives a new observation and must then select another action at the next time step.
The observation is called a state
Since the problem needs to be solved now, but the rewards come in the future, we need to define a decision policy, which is our strategy for maximizing the long-term expected reward.

To summarize, reinforcement learning is a framework for a feedback loop of state -> action -> reward provided by the environment.
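The feedback loop above can be sketched in a few lines of Python. This is only an illustration: the `CoinFlipEnv` environment, the `repeat_last` policy, and the `run_loop` helper are hypothetical names invented for this sketch and are not part of any library or of the repositories discussed later.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: the agent guesses the next coin flip."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = self.rng.choice([0, 1])  # the last observed flip

    def step(self, action):
        # Reveal the next flip; reward is 1 if the guess matched it, else 0.
        outcome = self.rng.choice([0, 1])
        reward = 1 if action == outcome else 0
        self.state = outcome
        return self.state, reward


def repeat_last(state):
    """A trivial policy: guess that the next flip equals the last one."""
    return state


def run_loop(env, policy, n_steps):
    """Run the state -> action -> reward loop and return the cumulative reward."""
    state = env.state
    total_reward = 0
    for _ in range(n_steps):
        action = policy(state)            # the agent picks an action from the state
        state, reward = env.step(action)  # the environment returns feedback
        total_reward += reward
    return total_reward


total = run_loop(CoinFlipEnv(seed=0), repeat_last, n_steps=100)
print(total)
```

A learning agent would replace the fixed `repeat_last` policy with one that is adjusted based on the rewards it receives.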

The presence of a feedback loop from an environment is unique to the RL framework, as these loops are typically not found in supervised or unsupervised learning.

The goal of the agent is thus to maximize the cumulative expected reward.

The agent is looking for a sequence of actions whose expected cumulative reward is high.

Specifically, we want our agent to learn a policy, which the agent can use to perform actions and maximize its rewards given certain circumstances.

Since we are dealing with time-series data, we also have a discount factor, γ, which determines the importance of future rewards.

A discount factor of 0 tells the agent to consider only immediate rewards, while a discount factor of 1 tells the agent to weight future rewards as heavily as immediate ones.
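To make the effect of the discount factor concrete, here is a small sketch that computes a discounted sum of rewards. The function name `discounted_return` and the reward values are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward k steps ahead is weighted by gamma**k."""
    total = 0.0
    # Work backwards so each earlier step applies one more factor of gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1.0, 1.0, 1.0]  # made-up rewards for three consecutive time steps
print(discounted_return(rewards, gamma=0.0))  # 1.0: only the immediate reward counts
print(discounted_return(rewards, gamma=0.5))  # 1.75: 1 + 0.5 + 0.25
print(discounted_return(rewards, gamma=1.0))  # 3.0: all rewards count equally
```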

As discussed in our Guide to Reinforcement Learning:

It is the powerful combination of pattern-recognition networks and real-time environment based learning frameworks called deep reinforcement learning that makes this such an exciting area of research.

The deep part of deep reinforcement learning is a more advanced implementation of RL that uses a deep neural network to approximate the best possible action given the current state.

Introduction to Reinforcement Learning for Trading

There are two types of tasks that an agent can attempt to solve in reinforcement learning:

Episodic Tasks are tasks that end at some time step T
Continuing Tasks are tasks where the interaction continues without an end-point

Since the markets never really have an end-point, trading is a continuing task.

Also, we are dealing with other agents (traders) in the market whose details, such as account size and open orders, we can't observe.

This essentially makes trading a partially observable Markov Decision Process.

A partially observable MDP is where we don't know what the true state looks like, but we can observe part of it (i.e. our P&L, bid-ask volume, and so on).

Since it is partially observable and we don't know the full state, we also don't know what the reward function and transition probabilities look like.

If we knew both of these, we could use dynamic programming to compute the optimal policy.

Since we don't in the case of trading, we can instead use a model-free reinforcement learning algorithm like Q-Learning.

Q-Learning

Q-Learning allows us to compute a policy without needing to build a full model of our environment.

In Q-Learning, the possible states and actions are represented by a Q-table, and the equation for how these values are updated is shown below from this article:
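In its standard textbook form, which is assumed here to match the cited article, the tabular Q-learning update can be written as:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$

Here $\alpha$ is the learning rate, $r_{t+1}$ is the reward received after taking action $a_t$ in state $s_t$, and $\gamma$ is the discount factor introduced earlier.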

A Q-table is simply a table where the states are rows, and actions are columns. The purpose of using a Q-table is to try and determine the best action to take for each given state.

$Q(s_t, a_t)$ represents the maximum discounted future reward when we perform action $a_t$ in state $s_t$ and continue optimally from then on.

We can think of this function as the maximum possible account balance we can achieve at the end of a training episode after we perform action $a$ in state $s$.

For the purpose of simplification, we can assume the three possible actions for trading include:

Buy
Sell
Hold

The Q function will rate each of the possible actions and will pick the one that has the highest Q value.
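As a rough sketch of this selection step, the snippet below picks the action with the highest Q-value for a single state. The action ordering and the Q-values are hypothetical and chosen only for illustration.

```python
ACTIONS = ["buy", "sell", "hold"]  # assumed ordering of the three actions


def greedy_action(q_values):
    """Return the name of the action with the highest Q-value."""
    best_index = max(range(len(q_values)), key=lambda i: q_values[i])
    return ACTIONS[best_index]


# Hypothetical Q-values for one state, in the order buy, sell, hold.
example_q = [0.2, -0.1, 0.5]
print(greedy_action(example_q))  # hold
```

In practice this greedy choice is usually combined with some exploration, as the epsilon-based `act` method in the Q-Trader agent further down does.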

Q-Learning is the process of learning what the Q-table is, without needing to learn the reward function or the transition probability.
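The sketch below shows one way a Q-table could be learned from experience using the standard Q-learning update. The table size, learning rate `ALPHA`, discount factor `GAMMA`, and the `q_update` helper are all assumptions made for this example, not values taken from the repositories discussed in this article.

```python
N_STATES, N_ACTIONS = 4, 3  # hypothetical sizes; columns: buy, sell, hold
ALPHA, GAMMA = 0.1, 0.95    # assumed learning rate and discount factor


def q_update(q_table, state, action, reward, next_state, done):
    """Apply one Q-learning update in place and return the new Q(state, action)."""
    # Bootstrap from the best next-state value unless the episode has ended.
    best_next = 0.0 if done else max(q_table[next_state])
    td_target = reward + GAMMA * best_next
    q_table[state][action] += ALPHA * (td_target - q_table[state][action])
    return q_table[state][action]


# Rows are states and columns are actions, all initialised to zero.
q_table = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

new_value = q_update(q_table, state=0, action=0, reward=1.0, next_state=1, done=False)
print(new_value)
```

Note that the update only needs the observed reward and next state, so no explicit model of the reward function or transition probabilities is required.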

Let's now look at two GitHub repos on this topic:

Q-Trader

Let's look at an example of using deep reinforcement learning for trading from this Q-Trader GitHub repository.

The model is:

An implementation of Q-learning applied to (short-term) stock trading. The model uses n-day windows of closing prices to determine if the best action to take at a given time is to buy, sell or sit.

As a result of the short-term state representation, the model is not very good at making decisions over long-term trends, but is quite good at predicting peaks and troughs.
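To illustrate what an n-day window state might look like, here is a minimal sketch. It assumes the state is built from day-over-day price changes squashed through a sigmoid, with the start of the series padded by the first price. The helper name `window_state` and the prices are hypothetical, and the repository's own state function may differ in its details.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def window_state(prices, t, n):
    """State at time t: sigmoid of the n - 1 day-over-day changes ending at t."""
    start = t - n + 1
    if start >= 0:
        block = prices[start:t + 1]
    else:
        # Not enough history yet: pad the front with the first price.
        block = [prices[0]] * (-start) + prices[:t + 1]
    return [sigmoid(block[i + 1] - block[i]) for i in range(n - 1)]


closes = [10.0, 10.5, 10.2, 10.2, 11.0]  # made-up closing prices
state = window_state(closes, t=4, n=4)
print(state)
```

Because the state only contains recent price changes, it carries no information about longer-term trends, which is consistent with the limitation described above.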

Let's take a look at the agent.py file:

import keras
from keras.models import Sequential
from keras.models import load_model
from keras.layers import Dense
from keras.optimizers import Adam

import numpy as np
import random
from collections import deque

class Agent:
	def __init__(self, state_size, is_eval=False, model_name=""):
		self.state_size = state_size # normalized previous days
		self.action_size = 3 # sit, buy, sell
		self.memory = deque(maxlen=1000)
		self.inventory = []
		self.model_name = model_name
		self.is_eval = is_eval

		self.gamma = 0.95
		self.epsilon = 1.0
		self.epsilon_min = 0.01
		self.epsilon_decay = 0.995

		self.model = load_model("models/" + model_name) if is_eval else self._model()

	def _model(self):
		model = Sequential()
		model.add(Dense(units=64, input_dim=self.state_size, activation="relu"))
		model.add(Dense(units=32, activation="relu"))
		model.add(Dense(units=8, activation="relu"))
		model.add(Dense(self.action_size, activation="linear"))
		model.compile(loss="mse", optimizer=Adam(lr=0.001))

		return model

	def act(self, state):
		if not self.is_eval and np.random.rand() <= self.epsilon:
			return random.randrange(self.action_size)

		options = self.model.predict(state)
		return np.argmax(options[0])

	def expReplay(self, batch_size):
		mini_batch = []
		l = len(self.memory)
		for i in xrange(l - batch_size + 1, l):
			mini_batch.append(self.memory[i])

		for state, action, reward, next_state, done in mini_batch:
			target = reward
			if not done:
				target = reward + self.gamma * np.amax(self.model.predict(next_state)[0])

			target_f = self.model.predict(state)
			target_f[0][action] = target
			self.model.fit(state, target_f, epochs=1, verbose=0)

		if self.epsilon > self.epsilon_min:
			self.epsilon *= self.epsilon_decay

We can then train our agent using this script:

from agent.agent import Agent
from functions import *
import sys

if len(sys.argv) != 4:
	print "Usage: python train.py [stock] [window] [episodes]"
	exit()

stock_name, window_size, episode_count = sys.argv[1], int(sys.argv[2]), int(sys.argv[3])

agent = Agent(window_size)
data = getStockDataVec(stock_name)
l = len(data) - 1
batch_size = 32

for e in xrange(episode_count + 1):
	print "Episode " + str(e) + "/" + str(episode_count)
	state = getState(data, 0, window_size + 1)

	total_profit = 0
	agent.inventory = []

	for t in xrange(l):
		action = agent.act(state)

		# sit
		next_state = getState(data, t + 1, window_size + 1)
		reward = 0

		if action == 1: # buy
			agent.inventory.append(data[t])
			print "Buy: " + formatPrice(data[t])

		elif action == 2 and len(agent.inventory) > 0: # sell
			bought_price = agent.inventory.pop(0)
			reward = max(data[t] - bought_price, 0)
			total_profit += data[t] - bought_price
			print "Sell: " + formatPrice(data[t]) + " | Profit: " + formatPrice(data[t] - bought_price)

		done = True if t == l - 1 else False
		agent.memory.append((state, action, reward, next_state, done))
		state = next_state

		if done:
			print "--------------------------------"
			print "Total Profit: " + formatPrice(total_profit)
			print "--------------------------------"

		if len(agent.memory) > batch_size:
			agent.expReplay(batch_size)

	if e % 10 == 0:
		agent.model.save("models/model_ep" + str(e))

In order to test this agent, we download training and test CSV files from Yahoo Finance into data/.

We then train the agent on Facebook (FB): the training period spans 4 years and the testing period is 1 year.

Since the GitHub repo uses Python 2, we need to convert the print statements to the Python 3 print() function. We also need to change xrange() to range(), since it was renamed in Python 3.
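As a sketch of what those two changes look like, the snippet below ports two small pieces of the code above to Python 3. The helper names `recent_batch` and `episode_banner` are invented for this example; in the actual scripts the same changes would be made inline.

```python
from collections import deque


def recent_batch(memory, batch_size):
    """Python 3 version of the slice loop in expReplay: xrange() becomes range().

    The loop bounds are kept as in the original code, so this returns the
    last batch_size - 1 items of memory.
    """
    size = len(memory)
    return [memory[i] for i in range(size - batch_size + 1, size)]


def episode_banner(episode, episode_count):
    """Python 3 version of the progress line: print is now a function."""
    message = "Episode " + str(episode) + "/" + str(episode_count)
    print(message)
    return message


memory = deque(range(10), maxlen=1000)  # stand-in for the agent's replay memory
print(recent_batch(memory, batch_size=4))  # [7, 8, 9]
episode_banner(1, 200)
```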

To train the model we will use the following commands for FB, training it on a window size of 10, and 200 episodes:

mkdir models
python train.py FB_train 10 200

After the training phase we evaluate the model with:

python evaluate.py FB_test model_ep200

This test ended with a small loss, but it is still a good starting point for understanding deep reinforcement learning in trading.

Q-Learning for Trading

Let's now look at code from ShuaiW's GitHub.

To summarize this repo, here is how the author formulated the problem:

State

At any given point, the state is represented as an array of [number of shares owned, current stock prices, cash in hand].
For example, if we have 50 shares of FB at $165, 40 shares of Amazon at $1700, and $10,000 cash on hand, the state array would be [50, 40, 165, 1700, 10000].
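A minimal sketch of how such a state array could be assembled is shown below. The `build_state` helper is a hypothetical name used for illustration and is not taken from the repository.

```python
def build_state(shares, prices, cash):
    """Concatenate share counts, prices, and cash into one flat state list."""
    if len(shares) != len(prices):
        raise ValueError("shares and prices must have the same length")
    return list(shares) + list(prices) + [cash]


# 50 shares of FB at $165, 40 shares of Amazon at $1700, $10,000 in cash.
state = build_state(shares=[50, 40], prices=[165, 1700], cash=10000)
print(state)  # [50, 40, 165, 1700, 10000]
```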

Action

We have three possible actions: BUY, SELL, or HOLD

Reward

There are several ways this could be formulated, but the one chosen is the change in portfolio value, in dollars, compared with the previous step.
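One possible reading of this reward is sketched below: the portfolio is valued before and after a step, and the difference is the reward. The helper names and the price move are made up for illustration, and the repository's own implementation may differ.

```python
def portfolio_value(shares, prices, cash):
    """Total value of the holdings at the given prices, plus cash."""
    return sum(n * p for n, p in zip(shares, prices)) + cash


def step_reward(prev_value, shares, prices, cash):
    """Reward is the dollar change in portfolio value since the previous step."""
    return portfolio_value(shares, prices, cash) - prev_value


shares, cash = [50, 40], 10000
value_before = portfolio_value(shares, [165, 1700], cash)    # 86250
reward = step_reward(value_before, shares, [170, 1690], cash)
print(value_before, reward)  # 86250 -150
```

In this made-up step the first stock rises by $5 and the second falls by $10, so the portfolio loses $150 overall and the reward is negative.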

To test this agent the author uses three stocks: MSFT, IBM, and QCOM.

4629 days of data are used for training while the last 1000 days are used for testing, and the Deep Q Network is trained for 2000 epochs.

From the author's results below, we can see the portfolio values are incredibly volatile:

Of course, this variance is far too high and cannot be ignored, but this provides another solid base to build off in order to continue researching this topic.

In order to improve our own system, we could also combine the RL algorithm with more feature engineering, such as sentiment analysis, predictive equity ranking, and ML-based estimates, since both examples used stock price as the only feature.

J.P. Morgan's Guide to Reinforcement Learning

If you want to read more about practical applications of reinforcement learning in finance check out J.P. Morgan's new paper: Idiosyncrasies and challenges of data driven learning in electronic trading.

Here's the outline of the paper:

We outline the idiosyncrasies of neural information processing and machine learning in quantitative finance. We also present some of the approaches we take towards solving the fundamental challenges we face.

In addition to discussing supervised and unsupervised learning in finance, this paper:

shows the interplay between the agent’s constraints and rewards in one practical application of reinforcement learning.

The paper also discusses inverse reinforcement learning (IRL), which is the field of study that focuses on learning an agent’s objectives, values, or rewards by observing its behavior.

We also believe that inverse reinforcement learning is very promising: leveraging the massive history of rollouts of human and algo policies on financial markets in order to build local rewards is an active field of research.

The paper also mentions several open source reinforcement learning frameworks that you can make use of, including OpenAI baselines, dopamine, deepmind/trfl, and Ray RLlib.

Summary: Deep Reinforcement Learning for Trading

In this guide, we introduced how you can apply the deep Q-learning algorithm to the continuing reinforcement learning task of trading.

