**q:fin** (quantitative finance) classification. Moreover, an early survey of RL-based portfolios is compiled in Sato (2019) (see also Zhang, Zohren, and Roberts (2020)) and general financial applications are discussed in Kolm and Ritter (2019b), Meng and Khushi (2019), Charpentier, Elie, and Remlinger (2020) and Mosavi et al. (2020). This shows that RL has recently gained traction among the quantitative finance community.

**Markov Decision Process** (MDP, see Chapter 3 in Sutton and Barto (2018)).

**agent** (e.g., a trader or portfolio manager) and an **environment** (e.g., a financial market). The agent performs **actions** that may alter the state of the environment and receives a reward (possibly negative) for each action. This short sequence can be repeated an arbitrary number of times, as is shown in Figure 16.1.

*finite*, the MDP is logically called finite. In a financial framework, this is somewhat unrealistic, and we discuss this issue later on. Nevertheless, it is not hard to think of simplified and discretized financial problems. For instance, the reward can be binary: win money versus lose money. In the case of only one asset, the action can also be dual: investing versus not investing. When the number of assets is sufficiently small, it is possible to set fixed proportions that lead to a reasonable number of combinations of portfolio choices, etc.
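To make this discretization concrete, such a grid of portfolio mixes can be enumerated directly (a small illustrative sketch of our own, unrelated to the chapter's dataset):

```python
from itertools import product

# All long-only portfolios over 3 assets with weights on a 25% grid
step = 0.25
grid = [i * step for i in range(5)]          # 0.0, 0.25, ..., 1.0
actions = [w for w in product(grid, repeat=3)
           if abs(sum(w) - 1.0) < 1e-9]      # keep weight vectors summing to 1
print(len(actions))                          # a small, finite action set
```

Even with only three assets and a coarse 25% grid, the action set remains tractable (15 portfolios), which is what makes tabular methods feasible in such stylized settings.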

**transition probability**:

\begin{equation}
\tag{16.1}
p(s',r|s,a)=\mathbb{P}\left[S_t=s',R_t=r | S_{t-1}=s,A_{t-1}=a \right],
\end{equation}

$\mathcal{S}$ and $\mathcal{A}$ henceforth. Sometimes, this probability is averaged over the set of rewards, which gives the following decomposition:

\begin{equation}
\tag{16.2}
p(s'|s,a)=\sum_{r}p(s',r|s,a).
\end{equation}

The agent seeks to maximize the **return**, that is, the discounted sum of future rewards (with discount factor $\gamma \in [0,1]$):

\begin{align}
G_t&=\sum_{k=0}^T\gamma^kR_{t+k+1} \nonumber \\ \tag{16.3}
&=R_{t+1} +\gamma G_{t+1},
\end{align}

*episodic* and, otherwise, it is said to be *continuous*.
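The recursion in Equation (16.3) is easily verified numerically; the toy rewards below are arbitrary:

```python
gamma = 0.9
rewards = [1.0, 0.5, 2.0, -1.0]              # R_{t+1}, ..., R_{t+T+1}

# Direct discounted sum: G_t = sum_k gamma^k * R_{t+k+1}
G_direct = sum(gamma**k * r for k, r in enumerate(rewards))

# Backward recursion: G_t = R_{t+1} + gamma * G_{t+1}
G_rec = 0.0
for r in reversed(rewards):
    G_rec = r + gamma * G_rec

print(G_direct, G_rec)                       # identical values
```

The backward pass is how returns are computed in practice in episodic settings: one sweep from the terminal date suffices.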

**policy** $\pi$, which drives the actions of the agent. More precisely, $\pi(a,s)=\mathbb{P}[A_t=a|S_t=s]$, that is, $\pi$ equals the probability of taking action $a$ if the state of the environment is $s$. This means that actions are subject to randomness, just like for mixed strategies in game theory. While this may seem disappointing because an investor would want to be sure to take the best action, it is also a good reminder that *the* best way to face random outcomes may well be to randomize actions as well.

*best* policy, one key indicator is the so-called value function:

\begin{equation}
\tag{16.4}
v_\pi(s)=\mathbb{E}_\pi\left[ G_t | S_t=s \right],
\end{equation}

\begin{equation}
\tag{16.5}
q_\pi(s,a)=\mathbb{E}_\pi\left[ G_t | S_t=s, \ A_t=a \right].
\end{equation}

The optimal $v_\pi$ and $q_\pi$ are straightforwardly defined as $v_*(s)=\underset{\pi}{\max} \, v_\pi(s)$ and $q_*(s,a)=\underset{\pi}{\max} \, q_\pi(s,a)$, for all states $s$ and actions $a$. Moreover, the recursion (16.3) yields the expansion

\begin{align*}
q_\pi(s,a) &= \mathbb{E}_\pi[G_t|S_t=s,A_t=a] \\
&= \mathbb{E}_\pi[R_{t+1}+ \gamma G_{t+1}|S_t=s,A_t=a],
\end{align*}

\begin{align}
q_\pi(s,a) &=\sum_{a',r, s'}\pi(a',s')p(s',r|s,a) \left[ r+\gamma \mathbb{E}_\pi[ G_{t+1}|S_t=s',A_t=a']\right] \nonumber \\ \tag{16.6}
&=\sum_{a',r,s'}\pi(a',s')p(s',r|s,a) \left[ r+\gamma q_\pi(s',a')\right].
\end{align}

\begin{align}
q_*(s,a) &= \sum_{r,s'}p(s',r|s,a) \left[ r+\gamma \, \underset{a'}{\max} \, q_*(s',a')\right] \nonumber \\
&= \mathbb{E}_{\pi^*}[r|s,a]+ \gamma \, \sum_{r,s'}p(s',r|s,a) \left( \underset{a'}{\max} \, q_*(s',a') \right) \tag{16.7}
\end{align}

Initialize values $Q(s,a)$ for all states $s$ and actions $a$. For each episode, initialize a state $s_0$; then, at each step $i$, pick an action $a_i$, observe the reward $r_{i+1}$ and the new state $s_{i+1}$, and update the $Q$ function via

\begin{equation}
\tag{16.8}
Q_{i+1}(s_i,a_i) \longleftarrow Q_i(s_i,a_i) + \eta \left(\underbrace{r_{i+1}+\gamma \, \underset{a}{\max} \, Q_i(s_{i+1},a)}_{\text{echo of (16.7)}}-Q_i(s_i,a_i) \right)
\end{equation}

\begin{equation}
\tag{16.9}
\text{New estimate} \leftarrow \text{Old estimate} + \text{Step size (i.e., learning rate)} \times (\text{Target} - \text{Old estimate}),
\end{equation}

*temporal difference* learning because it is driven by the improvement yielded by the estimates that are known at time $t+1$ (the target) versus those known at time $t$.

**QL**) is the second one where the action $a_i$ is picked. In RL, the best algorithms combine two features: **exploitation** and **exploration**. Exploitation is when the machine uses the current information at its disposal to choose the next action. In this case, for a given state $s_i$, it chooses the action $a_i$ that maximizes the expected reward $Q_i(s_i,a_i)$. While natural, this choice is not optimal if the current function $Q_i$ is relatively far from the true $Q$. Repeating the locally optimal strategy is likely to favor a limited number of actions, so the accuracy of the $Q$ function will improve only on a narrow subset of state-action pairs.
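A standard way to combine the two features is the $\epsilon$-greedy rule, which is the mechanism behind the $\epsilon$ parameter used in the implementations below: with probability $\epsilon$, pick an action at random; otherwise, pick the action with the highest current $Q$ value. A minimal sketch (the function name is ours):

```python
import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    """Pick an action index given one row of the Q matrix."""
    if rng.uniform() < epsilon:
        return int(rng.integers(len(q_row)))   # explore: random action
    return int(np.argmax(q_row))               # exploit: best known action

rng = np.random.default_rng(0)
q_row = np.array([0.1, 0.5, 0.2])
picks = [epsilon_greedy(q_row, 0.1, rng) for _ in range(1000)]
```

With $\epsilon=0.1$, roughly nine picks out of ten are the greedy action, while the remaining tenth keeps sampling the alternatives so that their $Q$ estimates can still be corrected.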

*off-policy*. *On-policy* algorithms seek to improve the estimation of the action-value function $q_\pi$ by continuously acting according to the policy $\pi$. One canonical example of on-policy learning is the SARSA method, whose name comes from the quintuple it processes: **S**tate, **A**ction, **R**eward, new **S**tate, new **A**ction. The way this quintuple $(S_t,A_t,R_{t+1}, S_{t+1}, A_{t+1})$ is processed is presented below.

The main difference between $Q$ learning and SARSA is the update rule. In SARSA, it is given by

\begin{equation}
\tag{16.11}
Q_{i+1}(s_i,a_i) \longleftarrow Q_i(s_i,a_i) + \eta \left(r_{i+1}+\gamma \, Q_i(s_{i+1},a_{i+1})-Q_i(s_i,a_i) \right)
\end{equation}

**local** point $Q_i(s_{i+1},a_{i+1})$ that is based on the new state and action ($s_{i+1},a_{i+1}$), whereas in $Q$-learning, it comes from all possible actions, of which only the best is retained: $\underset{a}{\max} \, Q_i(s_{i+1},a)$.

*expected* SARSA in which the target $Q$ function is averaged over all actions:

\begin{equation}
\tag{16.12}
Q_{i+1}(s_i,a_i) \longleftarrow Q_i(s_i,a_i) + \eta \left(r_{i+1}+\gamma \, \sum_a \pi(a,s_{i+1}) Q_i(s_{i+1},a) -Q_i(s_i,a_i) \right)
\end{equation}
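The three update rules (16.8), (16.11) and (16.12) differ only in the bootstrap target; the sketch below (our illustration, with made-up values) computes the three targets side by side for a given row of the $Q$ matrix:

```python
import numpy as np

gamma = 0.7
r = 0.05                                  # reward r_{i+1}
q_next = np.array([0.2, 0.8, 0.4])        # Q_i(s_{i+1}, .) over 3 actions
a_next = 2                                # action actually taken (SARSA)
pi_next = np.array([0.2, 0.5, 0.3])       # policy pi(., s_{i+1})

target_q_learning = r + gamma * q_next.max()      # (16.8): best action
target_sarsa = r + gamma * q_next[a_next]         # (16.11): action taken
target_exp_sarsa = r + gamma * pi_next @ q_next   # (16.12): policy average
print(target_q_learning, target_sarsa, target_exp_sarsa)
```

Expected SARSA always lies between the greedy ($Q$-learning) target and the realization-dependent SARSA target, which reduces the variance of the updates.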

\begin{equation}
\tag{16.13}
\mathbb{S}_N=\left\{ \mathbf{x} \in \mathbb{R}^N\left|\sum_{n=1}^Nx_n=1, \ x_n\ge 0, \ \forall n=1,\dots,N \right.\right\}
\end{equation}
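In the portfolio context, $\mathbb{S}_N$ is the set of admissible long-only weight vectors. A quick numerical sketch (ours) generates a valid element by normalizing positive draws:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(size=5)       # positive draws
w = x / x.sum()               # a point on the simplex S_5
print(w.sum(), w.min())       # weights sum to 1 and are non-negative
```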

**parametrization**. When $a$ and $s$ can take discrete values, action-value functions must be computed for all pairs $(a,s)$, which can be prohibitively cumbersome. An elegant way to circumvent this problem is to assume that the policy is driven by a relatively modest number of parameters. The learning process is then focused on optimizing this set of parameters $\theta$. We then write $\pi_{\boldsymbol{\theta}}(a,s)$ for the probability of choosing action $a$ in state $s$. One intuitive way to define $\pi_{\boldsymbol{\theta}}(a,s)$ is to resort to a soft-max form:

\begin{equation}
\tag{16.14}
\pi_{\boldsymbol{\theta}}(a,s) = \frac{e^{\boldsymbol{\theta}'\textbf{h}(a,s)}}{\sum_{b}e^{\boldsymbol{\theta}'\textbf{h}(b,s)}},
\end{equation}
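Equation (16.14) translates directly into code; the feature values below are made up for illustration:

```python
import numpy as np

def softmax_policy(theta, h):
    """theta: parameter vector; h: (n_actions, n_features) features h(a, s)."""
    z = h @ theta
    z = z - z.max()              # shift logits for numerical stability
    p = np.exp(z)
    return p / p.sum()

theta = np.array([1.0, -0.5])
h = np.array([[0.2, 0.1],        # features of action 1 in state s
              [0.8, 0.3],        # features of action 2
              [0.1, 0.9]])       # features of action 3
p = softmax_policy(theta, h)
print(p)                         # valid probabilities summing to 1
```

Subtracting the maximum logit before exponentiating leaves the probabilities unchanged but avoids overflow when $\boldsymbol{\theta}'\textbf{h}(a,s)$ takes large values.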

\begin{equation}
\tag{16.15}
\nabla \mathbb{E}_{\boldsymbol{\theta}}[G_t]=\mathbb{E}_{\boldsymbol{\theta}} \left[G_t\frac{\nabla \pi_{\boldsymbol{\theta}}}{\pi_{\boldsymbol{\theta}}} \right].
\end{equation}

**gradient ascent**: when seeking to maximize a quantity, the parameter change must go in the upward direction:

\begin{equation}
\tag{16.16}
\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \eta \nabla \mathbb{E}_{\boldsymbol{\theta}}[G_t].
\end{equation}

**REINFORCE** algorithm. One improvement of this simple idea is to add a baseline, and we refer to section 13.4 of Sutton and Barto (2018) for a detailed account on this topic.
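To make the recipe (16.15)-(16.16) concrete, the toy sketch below (entirely ours: a two-action bandit with Gaussian rewards) performs REINFORCE updates without a baseline; for a softmax policy over logits, $\nabla \log \pi_{\boldsymbol{\theta}}(a) = \textbf{e}_a - \textbf{p}$, where $\textbf{e}_a$ is the indicator of the sampled action:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                          # one logit per action
true_mean = np.array([0.0, 1.0])             # action 1 pays more on average
eta = 0.1                                    # learning rate

for _ in range(2000):
    p = np.exp(theta - theta.max())
    p = p / p.sum()                          # softmax policy
    a = rng.choice(2, p=p)                   # sample an action
    G = true_mean[a] + rng.normal(scale=0.1) # one-step return
    grad_log_pi = -p
    grad_log_pi[a] += 1.0                    # gradient of log softmax
    theta += eta * G * grad_log_pi           # gradient ascent (16.16)

print(theta)                                 # the policy now favors action 1
```

Positive returns reinforce the sampled action's logit and depress the others, so the probability mass drifts toward the profitable action.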

**actor-critic** (AC) method which combines policy gradient with $Q$- or $v$-learning. The AC algorithm can be viewed as some kind of mix between policy gradient and SARSA. A central requirement is that the state-value function $v(\cdot)$ be a differentiable function of some parameter vector $\textbf{w}$ (it is often taken to be a neural network). The update rule is then

\begin{equation}
\tag{16.17}
\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \eta \left(R_{t+1}+\gamma v(S_{t+1},\textbf{w})-v(S_t,\textbf{w}) \right)\frac{\nabla \pi_{\boldsymbol{\theta}}}{\pi_{\boldsymbol{\theta}}},
\end{equation}
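To illustrate the update (16.17), the sketch below (entirely ours, with made-up values) performs one actor-critic step with a linear critic $v(s,\textbf{w})=\textbf{w}'s$, whose weights are adjusted by semi-gradient TD(0); the ratio $\nabla \pi_{\boldsymbol{\theta}}/\pi_{\boldsymbol{\theta}}$ is passed as a placeholder vector:

```python
import numpy as np

def actor_critic_step(theta, w, s, s_new, r, grad_log_pi,
                      eta_actor=0.05, eta_critic=0.1, gamma=0.9):
    """One update of actor parameters theta and linear critic weights w."""
    delta = r + gamma * (w @ s_new) - (w @ s)        # TD error
    theta = theta + eta_actor * delta * grad_log_pi  # actor: rule (16.17)
    w = w + eta_critic * delta * s                   # critic: semi-gradient TD(0)
    return theta, w

theta = np.zeros(2)
w = np.zeros(3)
s, s_new = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
grad_log_pi = np.array([0.5, -0.5])                  # placeholder for grad(pi)/pi
theta, w = actor_critic_step(theta, w, s, s_new, r=1.0, grad_log_pi=grad_log_pi)
print(theta, w)
```

The same TD error drives both updates: the critic refines its value estimate while the actor shifts probability toward actions with positive surprises.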

\begin{equation}
\tag{16.18}
\pi_{\boldsymbol{\theta}} = f_{\boldsymbol{\Omega}(s,\boldsymbol{\theta})}(a),
\end{equation}

\begin{equation}
f_{\boldsymbol{\alpha}}(w_1,\dots,w_N)=\frac{1}{B(\boldsymbol{\alpha})}\prod_{n=1}^Nw_n^{\alpha_n-1},
\end{equation}

where $B(\boldsymbol{\alpha})$ is the multinomial beta function:

\begin{equation}
B(\boldsymbol{\alpha})=\frac{\prod_{n=1}^N\Gamma(\alpha_n)}{\Gamma\left(\sum_{n=1}^N\alpha_n \right)}.
\end{equation}
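Assuming scipy is available, the density and its normalizing constant can be checked numerically:

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import dirichlet

alpha = np.array([2.0, 3.0, 5.0])
w = np.array([0.2, 0.3, 0.5])                  # a point on the simplex

# Multinomial beta function B(alpha), computed in log space
log_B = gammaln(alpha).sum() - gammaln(alpha.sum())

# Density 1/B(alpha) * prod_n w_n^(alpha_n - 1)
pdf_manual = np.exp(np.sum((alpha - 1) * np.log(w)) - log_B)
print(pdf_manual, dirichlet.pdf(w, alpha))      # the two values match
```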

\begin{equation}
(\textbf{F1}) \quad \alpha_{n,t}=\theta_{0,t} + \sum_{k=1}^K \theta_{t}^{(k)}x_{t,n}^{(k)},
\end{equation}

\begin{equation}
\boldsymbol{\theta}= \underset{\textbf{z} \in \Theta(\textbf{x}_t)}{\text{argmin}} \, ||\boldsymbol{\theta}^*-\textbf{z}||^2,
\end{equation}

\begin{equation}
(\textbf{F2}) \quad \alpha_{n,t}=\exp \left(\theta_{0,t} + \sum_{k=1}^K \theta_{t}^{(k)}x_{t,n}^{(k)}\right),
\end{equation}

\begin{align*}
\frac{\nabla_{\boldsymbol{\theta}_t} \pi^1_{\boldsymbol{\theta}_t}}{\pi^1_{\boldsymbol{\theta}_t}}&= \sum_{n=1}^N \left( \digamma \left( \textbf{1}'\textbf{X}_t\boldsymbol{\theta}_t \right) - \digamma(\textbf{x}_{t,n}\boldsymbol{\theta}_t) + \ln w_n \right) \textbf{x}_{t,n}' \\
\frac{\nabla_{\boldsymbol{\theta}_t} \pi^2_{\boldsymbol{\theta}_t}}{\pi^2_{\boldsymbol{\theta}_t}}&= \sum_{n=1}^N \left( \digamma \left( \textbf{1}'e^{\textbf{X}_{t}\boldsymbol{\theta}_t} \right) - \digamma(e^{\textbf{x}_{t,n}\boldsymbol{\theta}_t}) + \ln w_n \right) e^{\textbf{x}_{t,n}\boldsymbol{\theta}_t} \textbf{x}_{t,n}'
\end{align*}

where $e^{\textbf{X}}$ is the element-wise exponential of a matrix $\textbf{X}$.

In [2]:

```
from statsmodels.tsa.arima_process import ArmaProcess # Sub-library for generating AR(1)
import numpy as np
import pandas as pd

n_sample = 10**5         # Number of samples to be generated
rho = 0.8                # Autoregressive parameter
sd = 0.4                 # Std. dev. of noise
a = 0.06 * (1 - rho)     # Scaled mean of returns
#
ar1 = np.array([1, -rho])                     # AR polynomial; note that the sign of rho must be inverted
AR_object1 = ArmaProcess(ar1)                 # Creating the AR object
simulated_data_AR1 = AR_object1.generate_sample(nsample=n_sample, scale=sd)  # Generate sample from AR object
#
returns = a / rho + simulated_data_AR1                       # Returns via AR(1) simulation
action = np.round(np.random.uniform(size=n_sample) * 4) / 4  # Random action (portfolio weight)
state = np.where(returns < 0, "neg", "pos")                  # Coding of state
reward = returns * action                                    # Reward = portfolio return
#
data_RL = pd.DataFrame({'returns': returns, 'action': action,
                        'state': state, 'reward': reward})   # Building the dataset
data_RL['new_state'] = data_RL['state'].shift(-1)            # Next state (lead by one period)
data_RL = data_RL.dropna(axis=0).reset_index(drop=True)      # Remove the one missing new state (last row)
data_RL.head()                                               # Show first lines
```

Out[2]:

| | returns | action | state | reward | new_state |
|---|---|---|---|---|---|
| 0 | 0.063438 | 0.75 | pos | 0.047579 | neg |
| 1 | -0.602351 | 0.25 | neg | -0.150588 | pos |
| 2 | 0.112012 | 0.5 | pos | 0.056006 | pos |
| 3 | 0.220316 | 0.5 | pos | 0.110158 | pos |
| 4 | 0.437028 | 0.75 | pos | 0.327771 | pos |

There are three parameters in the implementation of the $Q$-learning algorithm:

- $\eta$ (denoted alpha in the code below), the learning rate in the update rule (16.8);
- $\gamma$, the discount factor for rewards (also shown in Equation (16.8));
- $\epsilon$, which controls the rate of exploration versus exploitation (see Equation (16.10)).

In [5]:

```
alpha = 0.1      # Learning rate (eta in Equation (16.8))
gamma = 0.7      # Discount factor for rewards
epsilon = 0.5    # Exploration rate

def looping_w_counters(obj_array):                   # Util function: map values to integer indices
    return {z: i for i, z in enumerate(obj_array)}   # Dictionary comprehension

s = looping_w_counters(data_RL['state'].unique())    # Dict for states
a = looping_w_counters(data_RL['action'].unique())   # Dict for actions
fit_RL = np.zeros(shape=(len(s), len(a)))            # Placeholder for Q matrix
r_final = 0
for z, row in data_RL.iterrows():                    # Loop for Q-learning
    act = a[row.action]
    r = row.reward
    s_current = s[row.state]
    s_new = s[row.new_state]
    if np.random.uniform() < epsilon:
        best_new = a[np.random.choice(list(a.keys()))]   # Explore the action space
    else:
        best_new = np.argmax(fit_RL[s_new, :])           # Exploit learned values
    r_final += r
    fit_RL[s_current, act] += alpha * (r + gamma * fit_RL[s_new, best_new] - fit_RL[s_current, act])
fit_RL = pd.DataFrame(fit_RL, index=s.keys(), columns=a.keys()).sort_index(axis=1)
print(fit_RL)
print(f'Total reward: {r_final}')
```

*resp.* negative) returns are more likely to follow positive (*resp.* negative) returns. While this is somewhat reassuring, it is by no means impressive, and much simpler tools would yield similar conclusions and guidance.

The second application is based on the financial dataset. To reduce the dimensionality of the problem, we will assume that:

- only one feature (the price-to-book ratio) captures the state of the environment. This feature is processed so that it has only a limited number of possible values;
- actions take values in a discrete set of three positions: +1 (buy the market), -1 (sell the market) and 0 (hold no risky position);
- only two assets are traded: those with stock_id equal to 3 and 4; both have 245 days of trading data.

The construction of the dataset is inelegantly coded below.

In [12]:

```
return_3 = pd.Series(data_ml.loc[data_ml['stock_id']==3, 'R1M_Usd'].values)  # Return of asset 3
return_4 = pd.Series(data_ml.loc[data_ml['stock_id']==4, 'R1M_Usd'].values)  # Return of asset 4
pb_3 = pd.Series(data_ml.loc[data_ml['stock_id']==3, 'Pb'].values)           # P/B ratio of asset 3
pb_4 = pd.Series(data_ml.loc[data_ml['stock_id']==4, 'Pb'].values)           # P/B ratio of asset 4
action_3 = pd.Series(np.floor(np.random.uniform(size=len(pb_3)) * 3) - 1)    # Random action for asset 3 (-1, 0 or 1)
action_4 = pd.Series(np.floor(np.random.uniform(size=len(pb_4)) * 3) - 1)    # Random action for asset 4 (-1, 0 or 1)
RL_data = pd.concat([return_3, return_4, pb_3, pb_4, action_3, action_4], axis=1)   # Building the dataset
RL_data.columns = ['return_3', 'return_4', 'Pb_3', 'Pb_4', 'action_3', 'action_4']  # Adding column names
RL_data['action'] = RL_data.action_3.astype(int).apply(str) + " " + RL_data.action_4.astype(int).apply(str)  # Uniting actions
RL_data['Pb_3'] = np.round(5 * RL_data['Pb_3'])       # Simplifying states (P/B)
RL_data['Pb_4'] = np.round(5 * RL_data['Pb_4'])       # Simplifying states (P/B)
RL_data['state'] = RL_data.Pb_3.astype(int).apply(str) + " " + RL_data.Pb_4.astype(int).apply(str)  # Uniting states
RL_data['new_state'] = RL_data['state'].shift(-1)     # Next state (lead by one period)
RL_data['reward'] = RL_data.action_3 * RL_data.return_3 + RL_data.action_4 * RL_data.return_4  # Computing rewards
RL_data = RL_data[['action', 'state', 'reward', 'new_state']].dropna(axis=0).reset_index(drop=True)  # Remove last row (missing new state)
RL_data.head()                                        # Show first lines
```

Out[12]:

| | action | state | reward | new_state |
|---|---|---|---|---|
| 0 | -1 0 | 1 1 | -0.077 | 1 1 |
| 1 | 0 -1 | 1 1 | -0.000 | 1 1 |
| 2 | -1 0 | 1 1 | -0.018 | 1 1 |
| 3 | 1 1 | 1 1 | 0.016 | 1 1 |
| 4 | 0 1 | 1 1 | 0.014 | 1 1 |

In [13]:

```
alpha = 0.1      # Learning rate
gamma = 0.7      # Discount factor for rewards
epsilon = 0.1    # Exploration rate

s = looping_w_counters(RL_data['state'].unique())    # Dict for states
a = looping_w_counters(RL_data['action'].unique())   # Dict for actions
fit_RL2 = np.zeros(shape=(len(s), len(a)))           # Placeholder for Q matrix
r_final = 0
for z, row in RL_data.iterrows():                    # Loop for Q-learning
    act = a[row.action]
    r = row.reward
    s_current = s[row.state]
    s_new = s[row.new_state]
    if np.random.uniform() < epsilon:
        best_new = a[np.random.choice(list(a.keys()))]   # Explore the action space
    else:
        best_new = np.argmax(fit_RL2[s_new, :])          # Exploit learned values
    r_final += r
    fit_RL2[s_current, act] += alpha * (r + gamma * fit_RL2[s_new, best_new] - fit_RL2[s_current, act])
fit_RL2 = pd.DataFrame(fit_RL2, index=s.keys(), columns=a.keys()).sort_index(axis=1)
print(fit_RL2)
print(f'Total reward: {r_final}')
```

**no impact on the environment** (unless the agent is able to perform massive trades, which is rare and ill-advised because it pushes prices in the wrong direction). This lack of impact of actions may limit the effectiveness of traditional RL approaches.

- Keeping the same two assets as in Section 16.4.2, increase the size of RL_data by testing **all possible action combinations** for each original data point. Re-run the $Q$-learning function and see what happens.

Aboussalah, Amine Mohamed, and Chi-Guhn Lee. 2020. “Continuous Control with Stacked Deep Dynamic Recurrent Reinforcement Learning for Portfolio Optimization.” Expert Systems with Applications 140: 112891.

Almahdi, Saud, and Steve Y Yang. 2019. “A Constrained Portfolio Trading System Using Particle Swarm Algorithm and Recurrent Reinforcement Learning.” Expert Systems with Applications 130: 145–56.

Bertoluzzo, Francesco, and Marco Corazza. 2012. “Testing Different Reinforcement Learning Configurations for Financial Trading: Introduction and Applications.” Procedia Economics and Finance 3: 68–77.

Bertsekas, Dimitri P. 2017. Dynamic Programming and Optimal Control, Vol. II, 4th Edition. Athena Scientific.

Chaouki, Ayman, Stephen Hardiman, Christian Schmidt, Joachim de Lataillade, and others. 2020. “Deep Deterministic Portfolio Optimization.” arXiv Preprint, no. 2003.06497.

Charpentier, Arthur, Romuald Elie, and Carl Remlinger. 2020. “Reinforcement Learning in Economics and Finance.” arXiv Preprint, no. 2003.10014.

García-Galicia, Mauricio, Alin A Carsteanu, and Julio B Clempner. 2019. “Continuous-Time Reinforcement Learning Approach for Portfolio Management with Time Penalization.” Expert Systems with Applications 129: 27–36.

Halperin, Igor, and Ilya Feldshteyn. 2018. “Market Self-Learning of Signals, Impact and Optimal Trading: Invisible Hand Inference with Free Energy.” arXiv Preprint, no. 1805.06126.

Kolm, Petter N, and Gordon Ritter. 2019b. “Modern Perspectives on Reinforcement Learning in Finance.” Journal of Machine Learning in Finance 1 (1).

Kong, Weiwei, Christopher Liaw, Aranyak Mehta, and D Sivakumar. 2019. “A New Dog Learns Old Tricks: RL Finds Classic Optimization Algorithms.” Proceedings of the ICLR Conference, 1–25.

Meng, Terry Lingze, and Matloob Khushi. 2019. “Reinforcement Learning in Financial Markets.” Data 4 (3): 110.

Moody, John, and Lizhong Wu. 1997. “Optimization of Trading Systems and Portfolios.” In Proceedings of the IEEE/IAFE 1997 Computational Intelligence for Financial Engineering (CIFEr), 300–307. IEEE.

Moody, John, Lizhong Wu, Yuansong Liao, and Matthew Saffell. 1998. “Performance Functions and Reinforcement Learning for Trading Systems and Portfolios.” Journal of Forecasting 17 (5-6): 441–70.

Mosavi, Amir, Pedram Ghamisi, Yaser Faghan, Puhong Duan, and Shahab Shamshirband. 2020. “Comprehensive Review of Deep Reinforcement Learning Methods and Applications in Economics.” arXiv Preprint, no. 2004.01509.

Neuneier, Ralph. 1996. “Optimal Asset Allocation Using Adaptive Dynamic Programming.” In Advances in Neural Information Processing Systems, 952–58.

Neuneier, Ralph. 1998. “Enhancing Q-Learning for Optimal Asset Allocation.” In Advances in Neural Information Processing Systems, 936–42.

Pendharkar, Parag C, and Patrick Cusatis. 2018. “Trading Financial Indices with Reinforcement Learning Agents.” Expert Systems with Applications 103: 1–13.

Powell, Warren B, and Jun Ma. 2011. “A Review of Stochastic Algorithms with Continuous Value Function Approximation and Some New Approximate Policy Iteration Algorithms for Multidimensional Continuous Applications.” Journal of Control Theory and Applications 9 (3): 336–52.

Sato, Yoshiharu. 2019. “Model-Free Reinforcement Learning for Financial Portfolios: A Brief Survey.” arXiv Preprint, no. 1904.04973.

Silver, David, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, and Marc Lanctot. 2016. “Mastering the Game of Go with Deep Neural Networks and Tree Search.” Nature 529: 484–89.

Sutton, Richard S, and Andrew G Barto. 2018. Reinforcement Learning: An Introduction (2nd Edition). MIT Press.

Wang, Haoran, and Xun Yu Zhou. 2019. “Continuous-Time Mean-Variance Portfolio Selection: A Reinforcement Learning Framework.” SSRN Working Paper 3382932.

Watkins, Christopher JCH, and Peter Dayan. 1992. “Q-Learning.” Machine Learning 8 (3-4): 279–92.

Xiong, Zhuoran, Xiao-Yang Liu, Shan Zhong, Hongyang Yang, and Anwar Walid. 2018. “Practical Deep Reinforcement Learning Approach for Stock Trading.” arXiv Preprint, no. 1811.07522.

Yang, Steve Y, Yangyang Yu, and Saud Almahdi. 2018. “An Investor Sentiment Reward-Based Trading System Using Gaussian Inverse Reinforcement Learning Algorithm.” Expert Systems with Applications 114: 388–401.

Yu, Pengqian, Joon Sern Lee, Ilya Kulyatin, Zekun Shi, and Sakyasingha Dasgupta. 2019. “Model-Based Deep Reinforcement Learning for Dynamic Portfolio Optimization.” arXiv Preprint, no. 1901.08740.

Zhang, Zihao, Stefan Zohren, and Stephen Roberts. 2020. “Deep Reinforcement Learning for Trading.” Journal of Financial Data Science 2 (2): 25–40.
