Preface

This book is intended to cover some advanced modelling techniques applied to equity investment strategies that are built on firm characteristics. The content is threefold. First, we try to explain in simple terms the ideas behind most mainstream machine learning algorithms that are used in equity asset allocation. Second, we mention a wide range of academic references for the readers who wish to push a little further. Finally, we provide hands-on R code samples that show how to apply the concepts and tools to a realistic dataset which we share to encourage reproducibility.

What this book is not about

This book deals with machine learning (ML) tools and their applications in factor investing. Factor investing is a subfield of a large discipline that encompasses asset allocation, quantitative trading and wealth management. Its premise is that differences in the returns of firms can be explained by the characteristics of these firms. Thus, it departs from traditional analyses which rely on price and volume data only, like classical portfolio theory à la Markowitz (1952), or high frequency trading. For a general and broad treatment of Machine Learning in Finance, we refer to Matthew F. Dixon, Halperin, and Bilokon (2020).

The topics we discuss are related to other themes that will not be covered in the monograph. These themes include:

Applications of ML in other financial fields, such as fraud detection or credit scoring. We refer to Ngai et al. (2011) and Baesens, Van Vlasselaer, and Verbeke (2015) for general purpose fraud detection, to Bhattacharyya et al. (2011) for a focus on credit cards and to Ravisankar et al. (2011) and Abbasi et al. (2012) for studies on fraudulent financial reporting. On the topic of credit scoring, G. Wang et al. (2011) and Brown and Mues (2012) provide overviews of methods and some empirical results. Also, we do not cover ML algorithms for data sampled at higher (daily or intraday) frequencies (microstructure models, limit order book). The chapter from Kearns and Nevmyvaka (2013) and the recent paper by Sirignano and Cont (2019) are good introductions on this topic.
Use cases of alternative datasets that show how to leverage textual data from social media, satellite imagery, or credit card logs to predict sales, earning reports, and, ultimately, future returns. The literature on this topic is still emerging (see, e.g., Blank, Davis, and Greene (2019), Jha (2019) and Z. T. Ke, Kelly, and Xiu (2019)) but will likely blossom in the near future.
Technical details of machine learning tools. While we do provide some insights on specificities of some approaches (those we believe are important), the purpose of the book is not to serve as a reference manual on statistical learning. We refer to Hastie, Tibshirani, and Friedman (2009), Cornuejols, Miclet, and Barra (2018) (written in French), James et al. (2013) (coded in R!) and Mohri, Rostamizadeh, and Talwalkar (2018) for a general treatment of the subject.¹ Moreover, K.-L. Du and Swamy (2013) and Goodfellow et al. (2016) are solid monographs on neural networks in particular, and Sutton and Barto (2018) provide a self-contained and comprehensive tour of reinforcement learning.
Finally, the book does not cover methods of natural language processing (NLP) that can be used to evaluate sentiment which can in turn be translated into investment decisions. This topic has nonetheless been trending lately and we refer to Loughran and McDonald (2016), Cong, Liang, and Zhang (2019a), Cong, Liang, and Zhang (2019b) and Gentzkow, Kelly, and Taddy (2019) for recent advances on the matter.

The targeted audience

Who should read this book? This book is intended for two types of audiences. The first consists of postgraduate students who wish to pursue their studies in quantitative finance with a view towards investment and asset management. The second consists of professionals from the money management industry who either seek to pivot towards allocation methods that are based on machine learning or are simply interested in these new tools and want to upgrade their set of competences. To a lesser extent, the book can serve scholars or researchers who need a manual with a broad spectrum of references both on recent asset pricing issues and on machine learning algorithms applied to money management. While the book covers mostly common methods, it also shows how to implement more exotic models, like causal graphs (Chapter 14), Bayesian additive trees (Chapter 9), and hybrid autoencoders (Chapter 7).

The book assumes basic knowledge in algebra (matrix manipulation), analysis (function differentiation, gradients), optimization (first and second order conditions, dual forms), and statistics (distributions, moments, tests, simple estimation methods like maximum likelihood). A minimal financial culture is also required: simple notions like stocks or accounting quantities (e.g., book value) will not be defined in this book. Lastly, all examples and illustrations are coded in R. A basic familiarity with the language is sufficient to understand the code snippets, which rely heavily on the most common functions of the tidyverse (Wickham et al. (2019), www.tidyverse.org) and on piping (Bache and Wickham (2014), Mailund (2019)).

How this book is structured

The book is divided into four parts.

Part I gathers preparatory material and starts with notations and data presentation (Chapter 1), followed by introductory remarks (Chapter 2). Chapter 3 outlines the economic foundations (theoretical and empirical) of factor investing and briefly sums up the dedicated recent literature. Chapter 4 deals with data preparation. It rapidly recalls the basic tips and warns about some major issues.

Part II of the book is dedicated to predictive algorithms in supervised learning. Those are the most common tools that are used to forecast financial quantities (returns, volatilities, Sharpe ratios, etc.). They range from penalized regressions (Chapter 5), to tree methods (Chapter 6), encompassing neural networks (Chapter 7), support vector machines (Chapter 8) and Bayesian approaches (Chapter 9).

The next portion of the book bridges the gap between these tools and their applications in finance. Chapter 10 details how to assess and improve the ML engines defined beforehand. Chapter 11 explains how models can be combined and why that may often not be a good idea. Finally, one of the most important chapters (Chapter 12) reviews the critical steps of portfolio backtesting and mentions the mistakes that are frequently encountered at this stage.

The end of the book covers a range of advanced topics connected to machine learning more specifically. The first one is interpretability. ML models are often considered to be black boxes and this raises trust issues: how and why should one trust ML-based predictions? Chapter 13 is intended to present methods that help understand what is happening under the hood. Chapter 14 is focused on causality, which is both a much more powerful concept than correlation and also at the heart of many recent discussions in Artificial Intelligence (AI). Most ML tools rely on correlation-like patterns and it is important to underline the benefits of techniques related to causality. Finally, Chapters 15 and 16 are dedicated to non-supervised methods. The latter can be useful, but their financial applications should be wisely and cautiously motivated.

Companion website

This book is entirely available at http://www.mlfactor.com. It is important that not only the content of the book be accessible, but also the data and code that are used throughout the chapters. They can be found at https://github.com/shokru/mlfactor.github.io/tree/master/material. The online version of the book will be updated beyond the publication of the printed version.

Why R?

The supremacy of Python as the dominant ML programming language is a widespread belief. This is because almost all applications of deep learning (which is as of 2020 one of the most fashionable branches of ML) are coded in Python via TensorFlow or PyTorch. The fact is that R has a lot to offer as well. First of all, let us not forget that one of the most influential textbooks in ML (Hastie, Tibshirani, and Friedman (2009)) is written by statisticians who code in R. Moreover, many statistics-orientated algorithms (e.g., BARTs in Section 9.5) are primarily coded in R and not always in Python. The R offering in Bayesian packages in general (https://cran.r-project.org/web/views/Bayesian.html) and in Bayesian learning in particular is probably unmatched.

There are currently several ML frameworks available in R.

caret: https://topepo.github.io/caret/index.html, a compilation of more than 200 ML models;
tidymodels: https://github.com/tidymodels, a recent collection of packages for ML workflow (developed by Max Kuhn at RStudio, which is a token of high quality material!);
rtemis: https://rtemis.netlify.com, a general purpose package for ML and visualization;
mlr3: https://mlr3.mlr-org.com/index.html, also a simple framework for ML models;
h2o: https://github.com/h2oai/h2o-3/tree/master/h2o-r, a large set of tools provided by h2o (coded in Java);
Open ML: https://github.com/openml/openml-r, the R version of the OpenML (www.openml.org) community.

Moreover, via the reticulate package, it is possible (but not always easy) to benefit from Python tools as well. The most prominent example is the adaptation of the tensorflow and keras libraries to R. Thus, some very advanced Python material is readily available to R users. This is also true for other resources, like Stanford’s CoreNLP library (in Java) which was adapted to R in the package coreNLP (which we will not use in this book).

Coding instructions

One of the purposes of the book is to propose a large-scale tutorial of ML applications in financial predictions and portfolio selection. Thus, one keyword is REPRODUCIBILITY! In order to duplicate our results (up to possible randomness in some learning algorithms), you will need running versions of R and RStudio on your computer. The best books to learn R are also often freely available online. A short list can be found at https://rstudio.com/resources/books/. The monograph R for Data Science is probably the most crucial.
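On the topic of randomness, the following is a minimal sketch, in base R only, of how fixing the seed of the random number generator makes a stochastic computation repeatable. The seed value and the variable names are arbitrary choices made for this illustration.

```r
set.seed(42)                 # Fix the seed (42 is an arbitrary choice)
draw_1 <- rnorm(5)           # First batch of random draws
set.seed(42)                 # Reset the generator to the same state
draw_2 <- rnorm(5)           # Second batch, generated from the same state
identical(draw_1, draw_2)    # Checks that both batches coincide
```

Learning algorithms that rely on random initialization or random sampling can usually be made repeatable in the same way, provided the seed is set before each run.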

In terms of coding requirements, we rely heavily on the tidyverse, which is a collection of packages (or libraries). The three packages we use most are dplyr, which implements simple data manipulations (filter, select, arrange), tidyr, which formats data in a tidy fashion, and ggplot2, for graphical outputs.
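As an illustration of these data manipulation verbs combined with piping, here is a small sketch on a toy table. The data frame toy and its columns are hypothetical and are not part of the book's dataset.

```r
library(dplyr)   # Data manipulation verbs and the pipe operator %>%

# Hypothetical toy data: two stocks observed on two dates
toy <- data.frame(
  stock_id = c(1, 1, 2, 2),
  date     = as.Date(c("2020-01-31", "2020-02-29", "2020-01-31", "2020-02-29")),
  mkt_cap  = c(100, 110, 50, 45),
  ret      = c(0.02, -0.01, 0.05, 0.03)
)

toy_sorted <- toy %>%
  filter(ret > 0) %>%               # Keep rows with positive returns
  select(stock_id, date, ret) %>%   # Keep three columns only
  arrange(desc(ret))                # Sort by decreasing return
toy_sorted                          # Show the result
```

Each verb takes a data frame as first argument and returns a data frame, which is what makes the chaining with the pipe possible.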

A list of the packages we use can be found in Table 0.1 below. Packages with a star \(*\) need to be installed via bioconductor.² Packages with a plus \(^+\) need to be installed manually.³

TABLE 0.1: List of all packages used in the book.
Package	Purpose	Chapter(s)
BART	Bayesian additive trees	9
broom	Tidy regression output	4
CAM\(^+\)	Causal Additive Models	14
caTools	AUC curves	10
CausalImpact	Causal inference with structural time series	14
cowplot	Stacking plots	3 & 12
breakDown	Breakdown interpretability	13
dummies	One-hot encoding	7
e1071	Support Vector Machines	8
factoextra	PCA visualization	15
fastAdaboost	Boosted trees	6
forecast	Autocorrelation function	3
FNN	Nearest Neighbors detection	15
ggpubr	Combining plots	10
glmnet	Penalized regressions	5
iml	Interpretability tools	13
keras	Neural networks	7
lime	Interpretability	13
lmtest	Granger causality	14
lubridate	Handling dates	All (or many)
naivebayes	Naive Bayes classifier	9
pcalg	Causal graphs	14
quadprog	Quadratic programming	11
quantmod	Data extraction	3, 11
randomForest	Random forests	6
rBayesianOptimization	Bayesian hyperparameter tuning	10
ReinforcementLearning	Reinforcement Learning	16
Rgraphviz\(^*\)	Causal graphs	14
rpart and rpart.plot	Simple decision trees	6
spBayes	Bayesian linear regression	9
tidyverse	Environment for data science, data wrangling	All
xgboost	Boosted trees	6
xtable	Table formatting	3

Of all of these packages (or collections thereof), the tidyverse and lubridate are compulsory in almost all sections of the book. To install a new package in R, just type

install.packages("name_of_the_package")

in the console. Sometimes, because of function name conflicts (especially with the select() function), we use the syntax package::function() to make sure the function call is from the right source. The exact version of the packages used to compile the book is listed in the “renv.lock” file available on the book’s GitHub web page https://github.com/shokru/mlfactor.github.io. One minor comment is the following: while the functions gather() and spread() from the tidyr package have been superseded by pivot_longer() and pivot_wider(), we still use them because of their much more compact syntax.
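The two points above can be sketched on a small example. The table wide and its column names are hypothetical; the functions are those exported by the tidyr and dplyr packages.

```r
library(tidyr)   # Provides gather(), spread(), pivot_longer(), pivot_wider()

# Hypothetical wide table: one row per stock, one column per characteristic
wide <- data.frame(stock_id = c(1, 2), mkt_cap = c(100, 50), pb = c(1.5, 0.8))

# Compact (superseded) syntax
long_1 <- gather(wide, key = "feature", value = "value", -stock_id)

# Current equivalent (rows are ordered differently, content is the same)
long_2 <- pivot_longer(wide, cols = -stock_id,
                       names_to = "feature", values_to = "value")

# Explicit namespace: unambiguous even if another loaded package defines select()
dplyr::select(long_1, stock_id, value)
```

Both calls stack the characteristic columns into a feature/value pair of columns. The package::function() form in the last line works whether or not the package has been attached with library().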

As much as we could, we created short code chunks and commented each line whenever we felt it was useful. Comments are displayed at the end of a row and preceded with a single hashtag #.

The book is constructed as a very big notebook, thus results are often presented below code chunks. They can be graphs or tables. Sometimes, they are simple numbers and are preceded with two hashtags ##. The example below illustrates this formatting.

1+2 # Example

## [1] 3

The book can be viewed as a very big tutorial. Therefore, most of the chunks depend on previously defined variables. When replicating parts of the code (via online code), please make sure that the environment includes all relevant variables. One best practice is to always start by running all code chunks from Chapter 1. For the exercises, we often resort to variables created in the corresponding chapters.
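One possible way to check that the environment holds the required objects before running a chunk is sketched below in base R. The object names are placeholders chosen for this illustration and should be replaced by the variables that the chunk actually needs.

```r
my_dataset <- data.frame(x = 1)   # Placeholder object standing in for a real variable

# Names the chunk is assumed to need (both are hypothetical)
required_vars <- c("my_dataset", "hypothetical_missing_object")

env <- environment()              # Environment in which the check is performed
is_defined <- vapply(required_vars, exists, logical(1), envir = env)
missing_vars <- required_vars[!is_defined]   # Names that are not defined

if (length(missing_vars) > 0) {   # Report what still needs to be created
  message("Missing from the environment: ", paste(missing_vars, collapse = ", "))
}
```

If the message lists any name, the chunks that create those objects (typically in earlier chapters) need to be run first.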

Acknowledgments

The core of the book was prepared for a series of lectures given by one of the authors to students of master’s degrees in finance at EMLYON Business School and at the Imperial College Business School in the Spring of 2019. We are grateful to those students who asked fruitful questions and thereby contributed to improving the content of the book.

We are grateful to Bertrand Tavin and Gautier Marti for their thorough screening of the book. We also thank Eric André, Aurélie Brossard, Alban Cousin, Frédérique Girod, Philippe Huber, Jean-Michel Maeso, Javier Nogales and for friendly reviews; Christophe Dervieux for his help with bookdown; Mislav Sagovac and Vu Tran for their early feedback; John Kimmel for making this happen and Jonathan Regenstein for his availability, no matter the topic. Lastly, we are grateful for the anonymous reviews collected by John Kimmel, our original editor.

Future developments

Machine learning and factor investing are two immense research domains and the overlap between the two is also quite substantial and developing at a fast pace. The content of this book will always constitute a solid background, but it is naturally destined for obsolescence. Moreover, by construction, some subtopics and many references will have escaped our scrutiny. Our intent is to progressively improve the content of the book and update it with the latest ongoing research. We will be grateful for any comment that helps correct or update the monograph. Thank you for sending your feedback directly (via pull requests) on the book’s website which is hosted at https://github.com/shokru/mlfactor.github.io.