# Machine Learning for Factor Investing

*2020-04-05*

# Chapter 1 Preface

This book is intended to cover some advanced modelling techniques applied to equity **investment strategies** that are built on **firm characteristics**. The content is threefold. First, we try to simply explain the ideas behind most mainstream machine learning algorithms that are used in equity asset allocation. Second, we mention a wide range of academic references for the readers who wish to push a little further. Finally, we provide hands-on **R** code samples that show how to apply the concepts and tools on a realistic dataset which we share to encourage **reproducibility**.

## 1.1 What this book is not about

This book deals with machine learning (ML) tools and their applications in factor investing. Factor investing is a subfield of a large discipline that encompasses asset allocation, quantitative trading and wealth management. Its premise is that differences in the returns of firms can be explained by the characteristics of these firms. Thus, it departs from traditional analyses which rely on price and volume data only, like classical portfolio theory à la Markowitz (1952), or high frequency trading. For a general and broad treatment of Machine Learning in Finance, we refer to Dixon, Halperin, and Bilokon (2020).

The topics we discuss are related to other themes that will not be covered in the monograph. These themes include:

- Applications of ML in **other financial fields**, such as **fraud detection** or **credit scoring**. We refer to Ngai et al. (2011) and Baesens, Van Vlasselaer, and Verbeke (2015) for general purpose fraud detection, to Bhattacharyya et al. (2011) for a focus on credit cards and to Ravisankar et al. (2011) and Abbasi et al. (2012) for studies on fraudulent financial reporting. On the topic of credit scoring, Wang et al. (2011) and Brown and Mues (2012) provide overviews of methods and some empirical results. Also, we do not cover ML algorithms for data sampled at higher (daily or intraday) frequencies (microstructure models, limit order book). The chapter from Kearns and Nevmyvaka (2013) and the recent paper by Sirignano and Cont (2019) are good introductions on this topic.

- **Use cases of alternative datasets** that show how to leverage textual data from social media, satellite imagery, or credit card logs to predict sales, earnings reports, and, ultimately, future returns. The literature on this topic is still emerging (see, e.g., Blank, Davis, and Greene (2019), Jha (2019) and Ke, Kelly, and Xiu (2019)) but will likely blossom in the near future.

- **Technical details** of machine learning tools. While we do provide some insights on specificities of some approaches (those we believe are important), the purpose of the book is not to serve as a reference manual on statistical learning. We refer to Hastie, Tibshirani, and Friedman (2009), Cornuejols, Miclet, and Barra (2018) (written in French), James et al. (2013) (coded in R!) and Mohri, Rostamizadeh, and Talwalkar (2018) for general treatments of the subject.^{1} Moreover, Du and Swamy (2013) and Goodfellow et al. (2016) are solid monographs on neural networks in particular, and Sutton and Barto (2018) provides a self-contained and comprehensive tour of reinforcement learning.

- Finally, the book does not cover methods of **natural language processing** (NLP) that can be used to evaluate sentiment which can in turn be translated into investment decisions. This topic has nonetheless been trending lately and we refer to Loughran and McDonald (2016), Cong, Liang, and Zhang (2019a), Cong, Liang, and Zhang (2019b) and Gentzkow, Kelly, and Taddy (2019) for recent advances on the matter.

## 1.2 The targeted audience

Who should read this book? This book is intended for two types of audiences. First, **postgraduate students** who wish to pursue their studies in quantitative finance with a view towards investment and asset management. The second target group consists of **professionals from the money management industry** who either seek to pivot towards allocation methods that are based on machine learning or are simply interested in these new tools and want to upgrade their set of competences. To a lesser extent, the book can serve **scholars or researchers** who need a manual with a broad spectrum of references both on recent asset pricing issues and on machine learning algorithms applied to money management. While the book covers mostly common methods, it also shows how to implement more exotic models, like causal graphs (Chapter 15), Bayesian additive trees (Chapter 10), and hybrid auto-encoders (Chapter 8).

The book assumes basic knowledge in **algebra** (matrix manipulation), **analysis** (function differentiation, gradients), **optimization** (first and second order conditions, dual forms), and **statistics** (distributions, moments, tests, simple estimation methods like maximum likelihood). A minimal **financial culture** is also required: simple notions like stocks or accounting quantities (e.g., book value) will not be defined in this book. Lastly, all examples and illustrations are coded in R. A basic familiarity with the language is sufficient to understand the code snippets, which rely heavily on the most common functions of the tidyverse (Wickham et al. (2019), www.tidyverse.org) and on piping (Bache and Wickham (2014), Mailund (2019)).

## 1.3 How this book is structured

The book is divided into four parts.

The first part gathers preparatory material and starts with notations and data presentation (Chapter 2), followed by introductory remarks (Chapter 3). Chapter 4 outlines the economic foundations (theoretical and empirical) of factor investing and briefly sums up the dedicated recent literature. Chapter 5 deals with data preparation. It rapidly recalls the basic tips and warns about some major issues.

The second part of the book is dedicated to predictive algorithms in supervised learning. Those are the most common tools that are used to forecast financial quantities (returns, volatilities, Sharpe ratios, etc.). They range from penalized regressions (Chapter 6), to tree methods (Chapter 7), encompassing neural networks (Chapter 8), support vector machines (Chapter 9) and Bayesian approaches (Chapter 10).

The next portion of the book bridges the gap between these tools and their applications in finance. Chapter 11 details how to assess and improve the ML engines defined beforehand. Chapter 12 explains how models can be combined and why that may often not be a good idea. Finally, one of the most important chapters (Chapter 13) reviews the critical steps of portfolio backtesting and mentions the mistakes that are frequently encountered at this stage.

The end of the book covers a range of advanced topics connected to machine learning more specifically. The first one is **interpretability**. ML models are often considered to be black boxes and this raises trust issues: how and why should one trust ML-based predictions? Chapter 14 is intended to present methods that help understand what is happening under the hood. Chapter 15 is focused on **causality**, which is both a much more powerful concept than correlation and also at the heart of many recent discussions in Artificial Intelligence (AI). Most ML tools rely on correlation-like patterns and it is important to underline the benefits of techniques related to causality. Finally, Chapters 16 and 17 are dedicated to non-supervised methods. The latter can be useful, but their financial applications should be wisely and cautiously motivated.

## 1.4 Companion website

This book is entirely available at http://www.mlfactor.com. It is important that not only the content of the book be accessible, but also the data and code that are used throughout the chapters. They can be found at https://github.com/shokru/mlfactor.github.io/tree/master/material. The online version of the book will be updated beyond the publication of the printed version.

## 1.5 Why R?

The supremacy of Python as *the* dominant ML programming language is a widespread belief. This is because almost all applications of deep learning (which is as of 2020 one of the most fashionable branches of ML) are coded in Python via TensorFlow or PyTorch. The fact is that **R** has a **lot** to offer as well. First of all, let us not forget that one of the most influential textbooks in ML (Hastie, Tibshirani, and Friedman (2009)) is written by statisticians who code in R. Moreover, many statistics-orientated algorithms (e.g., BARTs in Section 10.5) are primarily coded in R and not always in Python. The R offering in Bayesian packages in general (https://cran.r-project.org/web/views/Bayesian.html) and in Bayesian learning in particular is probably unmatched.

There are currently several ML frameworks available in R.

- **caret**: https://topepo.github.io/caret/index.html, a compilation of more than 200 ML models;

- **tidymodels**: https://github.com/tidymodels, a recent collection of packages for ML workflow (developed by Max Kuhn at RStudio, which is a token of high quality material!);

- **rtemis**: https://rtemis.netlify.com, a general purpose package for ML and visualization;

- **mlr3**: https://mlr3.mlr-org.com/index.html, also a simple framework for ML models;

- **h2o**: https://github.com/h2oai/h2o-3/tree/master/h2o-r, a large set of tools provided by h2o (coded in Java);

- **Open ML**: https://github.com/openml/openml-r, the R version of the OpenML (www.openml.org) community.

Moreover, via the *reticulate* package, it is possible (but not always easy) to benefit from Python tools as well. The most prominent example is the adaptation of the *tensorflow* and *keras* libraries to R. Thus, some very advanced Python material is readily available to R users. This is also true for other resources, like Stanford’s CoreNLP library (in Java) which was adapted to R in the package *coreNLP* (which we will not use in this book).

## 1.6 Coding instructions

One of the purposes of the book is to propose a large scale tutorial of ML applications in financial predictions and portfolio selection. Thus, one keyword is **REPRODUCIBILITY**! In order to duplicate our results (up to possible randomness in some learning algorithms), you will need running versions of R and RStudio on your computer. The best books to learn R are also often freely available online. A short list can be found here https://rstudio.com/resources/books/. The monograph *R for Data Science* is probably the most crucial.

In terms of coding requirements, we rely heavily on the **tidyverse**, which is a collection of **packages** (or libraries). The three packages we use most are **dplyr**, which implements simple data manipulations (filter, select, arrange), **tidyr**, which formats data in a tidy fashion, and **ggplot2**, for graphical outputs.
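As a minimal sketch of the data manipulations mentioned above, the chunk below chains three **dplyr** verbs with the pipe operator. The data frame `toy` and its column names are hypothetical and only serve the illustration; they are not part of the book's dataset.

```r
library(dplyr)                              # Data manipulation verbs and the pipe %>%

# Hypothetical toy data: three stocks with made-up characteristics
toy <- data.frame(
  stock_id = c(1, 2, 3),
  mkt_cap  = c(120, 45, 300),               # Fictitious market capitalizations
  pb_ratio = c(1.5, 0.8, 3.2)               # Fictitious price-to-book ratios
)

res <- toy %>%
  filter(mkt_cap > 50) %>%                  # Keep rows with mkt_cap above 50
  select(stock_id, mkt_cap) %>%             # Keep two columns only
  arrange(desc(mkt_cap))                    # Sort by decreasing mkt_cap
res                                         # Show the result
```

Each verb takes a data frame as first argument and returns a data frame, which is what makes the chaining with `%>%` possible.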

A list of the packages we use can be found in Table 1.1 below. Packages with a star need to be installed via *bioconductor*.^{2} Packages with a plus \(^+\) need to be installed **manually**.^{3}

| Package | Purpose | Chapter(s) |
|---|---|---|
| adabag | Boosted trees | 7 |
| BART | Bayesian additive trees | 10 |
| broom | Tidy regression output | 5 |
| CAM\(^+\) | Causal Additive Models | 15 |
| caTools | AUC curves | 11 |
| CausalImpact | Causal inference with structural time-series | 15 |
| cowplot | Stacking plots | 4 & 13 |
| breakDown | Breakdown interpretability | 14 |
| dummies | One-hot encoding | 8 |
| e1071 | Support Vector Machines | 9 |
| factoextra | PCA visualization | 16 |
| forecast | Autocorrelation function | 4 |
| FNN | Nearest Neighbors detection | 16 |
| ggpubr | Combining plots | 11 |
| glmnet | Penalized regressions | 6 |
| iml | Interpretability tools | 14 |
| keras | Neural networks | 8 |
| lime | Interpretability | 14 |
| lmtest | Granger causality | 15 |
| lubridate | Handling dates | All (or many) |
| naivebayes | Naive Bayes classifier | 10 |
| pcalg | Causal graphs | 15 |
| quadprog | Quadratic programming | 12 |
| quantmod | Data extraction | 4, 12 |
| randomForest | Random forests | 7 |
| rBayesianOptimization | Bayesian hyperparameter tuning | 11 |
| ReinforcementLearning | Reinforcement Learning | 17 |
| Rgraphviz\(^*\) | Causal graphs | 15 |
| rpart and rpart.plot | Simple decision trees | 7 |
| spBayes | Bayesian linear regression | 10 |
| tidyverse | Environment for data science, data wrangling | All |
| xgboost | Boosted trees | 7 |
| xtable | Table formatting | 4 |

Of all of these packages (or collections thereof), the **tidyverse** and **lubridate** are compulsory in almost all sections of the book. To install a new package in R, just type `install.packages("name_of_the_package")` in the console. Sometimes, because of function name conflicts (especially with the select() function), we use the syntax package::function() to make sure the function call is from the right source. The exact versions of the packages used to compile the book are listed in the “*renv.lock*” file available on the book’s GitHub web page https://github.com/shokru/mlfactor.github.io. One minor comment: while the functions *gather()* and *spread()* from the *tidyr* package have been superseded by *pivot_longer()* and *pivot_wider()*, we still use them because of their much more compact syntax.
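The short sketch below illustrates the two points above: reshaping with *gather()* and *spread()* next to *pivot_longer()*, and the explicit package::function() syntax. The data frame `wide`, its columns and its values are made up for the example.

```r
# install.packages("tidyr")                 # Run once if the package is missing
library(tidyr)                              # gather(), spread(), pivot_longer()

# Hypothetical wide table: one row per date, one column per stock
wide <- data.frame(
  date = as.Date(c("2020-01-31", "2020-02-29")),
  AAA  = c(0.01, -0.02),                    # Made-up returns
  BBB  = c(0.03, 0.00)
)

long_old <- gather(wide, key = "stock", value = "ret", -date)  # Compact syntax
long_new <- pivot_longer(wide, cols = -date,                   # Newer equivalent
                         names_to = "stock", values_to = "ret")
back <- spread(long_old, key = "stock", value = "ret")         # Back to wide format

# Explicit namespace: the call is guaranteed to use the dplyr version of select()
dplyr::select(long_old, date, ret)
```

Both long tables hold the same four observations, although the row order differs: *gather()* stacks the columns one after the other, while *pivot_longer()* proceeds row by row.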

As much as we could, we created short **code chunks** and commented each line whenever we felt it was useful. Comments are displayed at the end of a row and preceded with a single hashtag #.

The book is constructed as a very big notebook, thus results are often presented below code chunks. They can be graphs or tables. Sometimes, they are simple numbers and are preceded with two hashtags ##. The example below illustrates this formatting.

`## [1] 3`
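For instance, a chunk such as the following would produce an output of that form. The expression is purely illustrative: the comment sits at the end of the row after a single #, and the result is echoed after two hashtags ##.

```r
1 + 2    # Simple addition, commented at the end of the row
## [1] 3
```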

The book can be viewed as a very big tutorial. Therefore, most of the chunks depend on previously defined variables. When replicating parts of the code (via online code), please make sure that **the environment includes all relevant variables**. One best practice is to always start by running all code chunks from Chapter 2. For the exercises, we often resort to variables created in the corresponding chapters.
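One possible way to verify that the environment is complete before running a chunk is sketched below. The helper `check_env()` and the variable names passed to it are hypothetical; they should be replaced by the objects that the chunk at hand actually requires.

```r
# Hypothetical helper: returns the names in 'needed' that are absent from 'envir'
check_env <- function(needed, envir = globalenv()) {
  present <- vapply(needed, exists, logical(1),
                    envir = envir, inherits = FALSE)   # Look in 'envir' only
  needed[!present]                                     # Keep the missing names
}

demo_env <- new.env()                                  # Stand-in for a working session
assign("data_ml", data.frame(x = 1), envir = demo_env) # Pretend this was created earlier
missing_vars <- check_env(c("data_ml", "features"), demo_env)
missing_vars                                           # Names that still need to be created
```

In an interactive session, calling `check_env()` with its default `envir` inspects the global environment, which is where the variables created by earlier chunks would normally live.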

## 1.7 Acknowledgements

The core of the book was prepared for a series of lectures given by one of the authors to students of Masters Degrees in Finance at EMLYON Business School and at the Imperial College Business School in the Spring of 2019. We are grateful to those students who asked fruitful questions and thereby contributed to improving the content of the book.

We also thank Eric André, Aurélie Brossard, Alban Cousin, Jean-Michel Maeso, Javier Nogales and Bertrand Tavin for friendly reviews; Christophe Dervieux for his help with bookdown; John Kimmel for making this happen and Jonathan Regenstein for his availability, no matter the topic. Lastly, we are grateful for the anonymous reviews collected by John.

## 1.8 Future developments

Machine learning and factor investing are two immense research domains and the overlap between the two is also quite substantial and developing at a fast pace. The content of this book will always constitute a solid background but it is naturally destined to obsolescence. Moreover, by construction, some subtopics and many references will have escaped our scrutiny. Our intent is to progressively improve the content of the book and update it with the latest ongoing research. We will be grateful for any comment that helps correct or update the monograph. Thank you for sending your feedback directly (via pull requests) on the book’s website which is hosted at https://github.com/shokru/mlfactor.github.io.

### References

Abbasi, Ahmed, Conan Albrecht, Anthony Vance, and James Hansen. 2012. “Metafraud: A Meta-Learning Framework for Detecting Financial Fraud.” *MIS Quarterly*, 1293–1327.

Bache, Stefan Milton, and Hadley Wickham. 2014. “Magrittr: A Forward-Pipe Operator for R.” *R Package Version* 1 (1).

Baesens, Bart, Veronique Van Vlasselaer, and Wouter Verbeke. 2015. *Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques: A Guide to Data Science for Fraud Detection*. John Wiley & Sons.

Bhattacharyya, Siddhartha, Sanjeev Jha, Kurian Tharakunnel, and J Christopher Westland. 2011. “Data Mining for Credit Card Fraud: A Comparative Study.” *Decision Support Systems* 50 (3): 602–13.

Blank, Herbert, Richard Davis, and Shannon Greene. 2019. “Using Alternative Research Data in Real-World Portfolios.” *Journal of Investing* 28 (4): 95–103.

Brown, Iain, and Christophe Mues. 2012. “An Experimental Comparison of Classification Algorithms for Imbalanced Credit Scoring Data Sets.” *Expert Systems with Applications* 39 (3): 3446–53.

Cong, Lin William, Tengyuan Liang, and Xiao Zhang. 2019a. “Analyzing Textual Information at Scale.” *SSRN Working Paper* 3449822.

Cong, Lin William, Tengyuan Liang, and Xiao Zhang. 2019b. “Textual Factors: A Scalable, Interpretable, and Data-Driven Approach to Analyzing Unstructured Information.” *SSRN Working Paper* 3307057.

Cornuejols, Antoine, Laurent Miclet, and Vincent Barra. 2018. *Apprentissage Artificiel: Deep Learning, Concepts et Algorithmes*. Eyrolles.

Dixon, Matthew F., Igor Halperin, and Paul Bilokon. 2020. *Machine Learning in Finance: From Theory to Practice*. Springer.

Du, Ke-Lin, and Madisetti NS Swamy. 2013. *Neural Networks and Statistical Learning*. Springer Science & Business Media.

Gentzkow, Matthew, Bryan Kelly, and Matt Taddy. 2019. “Text as Data.” *Journal of Economic Literature* 57 (3): 535–74.

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. *Deep Learning*. MIT Press.

Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. *The Elements of Statistical Learning*. Springer.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. *An Introduction to Statistical Learning*. Vol. 112. Springer.

Jha, Vinesh. 2019. “Implementing Alternative Data in an Investment Process.” In *Big Data and Machine Learning in Quantitative Investment*, 51–74. Wiley.

Ke, Zheng Tracy, Bryan T Kelly, and Dacheng Xiu. 2019. “Predicting Returns with Text Data.” *SSRN Working Paper* 3388293.

Kearns, Michael, and Yuriy Nevmyvaka. 2013. “Machine Learning for Market Microstructure and High Frequency Trading.” *High Frequency Trading: New Realities for Traders, Markets, and Regulators*.

Loughran, Tim, and Bill McDonald. 2016. “Textual Analysis in Accounting and Finance: A Survey.” *Journal of Accounting Research* 54 (4): 1187–1230.

Mailund, Thomas. 2019. “Pipelines: Magrittr.” In *R Data Science Quick Reference*, 71–81. Springer.

Markowitz, Harry. 1952. “Portfolio Selection.” *Journal of Finance* 7 (1): 77–91.

Mohri, Mehryar, Afshin Rostamizadeh, and Ameet Talwalkar. 2018. *Foundations of Machine Learning*. MIT Press.

Ngai, Eric WT, Yong Hu, YH Wong, Yijun Chen, and Xin Sun. 2011. “The Application of Data Mining Techniques in Financial Fraud Detection: A Classification Framework and an Academic Review of Literature.” *Decision Support Systems* 50 (3): 559–69.

Ravisankar, Pediredla, Vadlamani Ravi, G Raghava Rao, and Indranil Bose. 2011. “Detection of Financial Statement Fraud and Feature Selection Using Data Mining Techniques.” *Decision Support Systems* 50 (2): 491–500.

Sirignano, Justin, and Rama Cont. 2019. “Universal Features of Price Formation in Financial Markets: Perspectives from Deep Learning.” *Quantitative Finance* 19 (9): 1449–59.

Sutton, Richard S, and Andrew G Barto. 2018. *Reinforcement Learning: An Introduction (2nd Edition)*. MIT Press.

Wang, Gang, Jinxing Hao, Jian Ma, and Hongbing Jiang. 2011. “A Comparative Assessment of Ensemble Learning for Credit Scoring.” *Expert Systems with Applications* 38 (1): 223–30.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, L McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” *Journal of Open Source Software* 4 (43): 1686.

^{1} For a list of online resources, we recommend the curated page https://github.com/josephmisiti/awesome-machine-learning/blob/master/books.md.

^{2} One example: https://www.bioconductor.org/packages/release/bioc/html/Rgraphviz.html

^{3} By copy-pasting the content of the package in the library folder. To get the address of the folder, execute the command *.libPaths()* in the R console.