**causality** relationships between features and labels because they are mostly designed to capture correlations. Correlation is a much weaker notion than causality because it characterizes a symmetric, two-way relationship ($\textbf{X}\leftrightarrow \textbf{y}$), while causality specifies a direction, $\textbf{X}\rightarrow \textbf{y}$ or $\textbf{X}\leftarrow \textbf{y}$. One fashionable example is sentiment. Many academic articles seem to find that sentiment (irrespective of its definition) is a significant driver of future returns. High sentiment for a particular stock may increase the demand for this stock and push its price up (though contrarian reasoning may also apply: if sentiment is high, it may be a sign that mean reversion is about to happen). The reverse causation is also plausible: returns may well cause sentiment. If a stock experiences a long period of market growth, people become bullish about this stock and sentiment increases (this notably comes from extrapolation; see Barberis et al. (2015) for a theoretical model). In Coqueret (2020), it is found (in opposition to most findings in this field) that the latter relationship (returns $\rightarrow$ sentiment) is more likely. This result is backed by causality-driven tests (see Section 14.1.1).

**invariance**.

**changing environments** are likely to stem from (and signal) causality. One counter-example is the following (related in Beery, Van Horn, and Perona (2018)): training a computer vision algorithm to discriminate between cows and camels can lead the algorithm to focus on grass versus sand! This is because most camels are pictured in the desert, while cows are shown in green fields of grass. Thus, a picture of a camel on grass will be classified as a cow, while a cow on sand will be labelled a camel. It is only with pictures of these two animals in different contexts (environments) that the learner will end up truly finding what makes a cow and a camel. A camel remains a camel no matter where it is pictured: it should be recognized as such by the learner. If so, the representation of the camel becomes invariant across all datasets and the learner has discovered causality, i.e., the true attributes that make the camel a camel (overall silhouette, shape of the back, face, color (possibly misleading!), etc.).

**non-stationarity** (see Section 1.1 for a definition of stationarity). In Chapter 12, we advocate doing so by updating models as frequently as possible with rolling training sets: this allows the predictions to be based on the most recent trends. In Section 14.2 below, we introduce other theoretical and practical options.

*directions* for these relationships. One typical example is linear regression. If we write $y=a+bx+\epsilon$, then it is equally true that $x=b^{-1}(y-a-\epsilon)$, which is also a linear relationship (this time with respect to $y$). These equations alone do not establish causation: they cannot tell whether $x$ determines $y$ ($x \rightarrow y$) or the other way around.
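Numerically, both regressions produce perfectly valid fits, which shows that the fitted slopes alone say nothing about direction (a toy simulation; the coefficients are ours):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=10_000)
y = 1.0 + 2.0 * x + rng.normal(size=10_000)   # data generated as y = a + b*x + eps

b_yx = np.polyfit(x, y, 1)[0]   # regress y on x: recovers b, close to 2
b_xy = np.polyfit(y, x, 1)[0]   # regress x on y: also a "significant" slope
print(b_yx, b_xy)               # both fits look fine; neither proves a direction
```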

\begin{align}
(Y_{t+1},\dots,Y_{t+k})|(\mathcal{F}_{Y,t}\cup \mathcal{F}_{X,t}) \quad \overset{d}{\neq} \quad (Y_{t+1},\dots,Y_{t+k})|\mathcal{F}_{Y,t},
\end{align}

\begin{align*}
X_t&=\sum_{j=1}^ma_jX_{t-j}+\sum_{j=1}^mb_jY_{t-j} + \epsilon_t \\
Y_t&=\sum_{j=1}^mc_jX_{t-j}+\sum_{j=1}^md_jY_{t-j} + \nu_t
\end{align*}

In [7]:

```
from statsmodels.tsa.stattools import grangercausalitytests
granger = training_sample.loc[
    training_sample["stock_id"] == 1,   # Data of stock nb 1
    ["R1M_Usd",                         # Y variable = 1M return
     "Mkt_Cap_6M_Usd"]]                 # X variable = market cap (candidate cause)
fit_granger = grangercausalitytests(granger, maxlag=[6], verbose=True)  # Maximum lag
```

**do-calculus** developed by Pearl. Whereas the traditional conditional probability $P[Y|X]$ gives the odds of $Y$ conditionally on **observing** that $X$ takes some value $x$, the do($\cdot$) operator **forces $X$** to take the value $x$. This is a *looking* versus *doing* dichotomy. One classical example is the following. Observing a barometer gives a clue what the weather will be because high pressures are more often associated with sunny days:

but if you hack the barometer (force it to display some value), the weather will not change: intervening is not the same as observing.

\begin{align*}
X&=\epsilon_X \\
Y&=f(X,\epsilon_Y),
\end{align*}

\begin{align*}
X&=\epsilon_X \\
Y&=f(X,\epsilon_Y),\\
Z&=g(Y,\epsilon_Z)
\end{align*}

\begin{array}{ccccc}
X & & & & \\
&\searrow & & & \\
&&Y&\rightarrow&Z. \\
&\nearrow &&\nearrow& \\
\epsilon_Y & & &\epsilon_Z &
\end{array}
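To make this concrete, a draw from such a structural model propagates the noises along the arrows (so-called ancestral sampling). Below is a toy simulation in which we pick linear forms for $f$ and $g$ (our choice, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
eps_X, eps_Y, eps_Z = rng.normal(size=(3, n))  # independent noises
X = eps_X               # X = eps_X
Y = 0.8 * X + eps_Y     # Y = f(X, eps_Y), here a linear f
Z = -0.5 * Y + eps_Z    # Z = g(Y, eps_Z), here a linear g
# Z depends on X only through Y: the correlations reflect the chain X -> Y -> Z
print(np.corrcoef([X, Y, Z]).round(2))
```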

*vertices* (or *nodes*) and the arrows as *edges*. Because arrows have a direction, they are called directed edges. When two vertices are connected via an edge, they are called *adjacent*. A sequence of adjacent vertices is called a *path*, and it is directed if all edges are arrows. Within a directed path, a vertex that comes first is a parent node and the one just after is a child node.

\begin{align}
\textbf{A}=\begin{bmatrix}
0 & 1 & 0 \\
0 & 0 & 1 \\
0& 0&0
\end{bmatrix}.
\end{align}

**cycle** is a particular type of path that creates a loop, i.e., one in which the first vertex is also the last. The sequence $X \rightarrow Y \rightarrow Z \rightarrow X$ is a cycle. Technically, cycles pose problems. To illustrate this, consider the simple sequence $X \rightarrow Y \rightarrow X$. It would imply that a realization of $X$ causes $Y$, which in turn causes a new realization of $X$. While Granger causality can be viewed as allowing this kind of connection, general causal models usually avoid cycles and work with **directed acyclic graphs** (DAGs). Formal graph manipulations (possibly linked to do-calculus) can be computed via the *causaleffect* package (Tikka and Karvanen (2017)). Directed acyclic graphs can also be created and manipulated with the *dagitty* (Textor et al. (2016)) and *ggdag* packages.
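The adjacency matrix gives a quick acyclicity check: in a graph on $n$ vertices, $(\textbf{A}^k)_{i,j}>0$ exactly when a path of length $k$ runs from $i$ to $j$, and a DAG admits no path of length $n$ or more, so $\textbf{A}^n=\textbf{0}$. A short numpy sketch:

```python
import numpy as np

# Adjacency matrix of the chain X -> Y -> Z
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [0, 0, 0]])

n = A.shape[0]
print(np.linalg.matrix_power(A, n))        # all zeros: the graph is acyclic

# Adding the edge Z -> X creates the cycle X -> Y -> Z -> X:
A_cycle = A.copy()
A_cycle[2, 0] = 1
print(np.linalg.matrix_power(A_cycle, n))  # nonzero diagonal: a cycle exists
```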

Equipped with these tools, we can make explicit a very general form of model:

\begin{equation}
\tag{14.1}
X_j=f_j\left(\textbf{X}_{\text{pa}_D(j)},\epsilon_j \right),
\end{equation}

\begin{equation}
\tag{14.2}
X_j=\sum_{k\in \text{pa}_D(j)}f_{j,k}\left(\textbf{X}_{k} \right)+\epsilon_j,
\end{equation}

*‘additive’*. Note that there is no time index here: in contrast to Granger causality, there is no natural (temporal) ordering. Such models are very complex and hard to estimate. The details can be found in Bühlmann et al. (2014). Fortunately, the authors have developed an R package (CAM) that determines the DAG $D$.

In [8]:

```
import numpy as np
import icpy
B = training_sample[['Mkt_Cap_12M_Usd', 'Vol1Y_Usd']].values         # Nodes B1 and B2
C = training_sample['R1M_Usd'].values                                # Node C
ExpInd = np.round(np.random.uniform(size=training_sample.shape[0]))  # Random "environment"
icpy.invariant_causal_prediction(X=B, y=C, z=ExpInd, alpha=0.1)      # Test if B1 or B2 are parents of C
```

Out[8]:

ICP(S_hat=array([0, 1], dtype=int64), q_values=array([1.34146064e-215, 1.34146064e-215]), p_value=1.341460638715696e-215)

*pcalg* package (Kalisch et al. (2012)). Below, an estimation via the so-called PC algorithm (named after its authors, **P**eter Spirtes and **C**lark Glymour) is performed. The details of the algorithm are out of the scope of the book; the interested reader can have a look at Section 5.4 of Spirtes et al. (2000) or Section 2 of Kalisch et al. (2012) for more information on this subject.

In [14]:

```
import cdt
import networkx as nx
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data_caus = training_sample[features_short + ["R1M_Usd"]]
dm = np.array(data_caus)
cm = np.corrcoef(dm.T)                      # Compute correlations
df = pd.DataFrame(cm)
glasso = cdt.independence.graph.Glasso()    # Initialize graphical lasso
skeleton = glasso.predict(df)               # Apply graphical lasso to dataset
model_pc = cdt.causality.graph.PC()         # PC algo. from the pcalg R library
graph_pc = model_pc.predict(df, skeleton)   # Estimate the model
fig = plt.figure(figsize=[10, 6])
nx.draw_networkx(graph_pc)                  # Plot the estimated graph
```

FIGURE 14.1: Representation of a directed graph.

**structural time series**. Because we illustrate their relevance for a particular kind of causal inference, we closely follow the notations of Brodersen et al. (2015). The model is driven by two equations:

\begin{align*}
y_t&=\textbf{Z}_t'\boldsymbol{\alpha}_t+\epsilon_t \\
\boldsymbol{\alpha}_{t+1}& =\textbf{T}_t\boldsymbol{\alpha}_{t}+\textbf{R}_t\boldsymbol{\eta}_t.
\end{align*}

*CausalImpact* module, Python version.

The time series associated with the model are shown in Figure 14.2.

In [9]:

```
from causalimpact import CausalImpact
stock1_data = data_ml.loc[data_ml["stock_id"]==1, :] # Data of first stock
struct_data = stock1_data[["Advt_3M_Usd"]+features_short] # Combine label and features
struct_data.index = pd.RangeIndex(start=0, stop=228, step=1) # Setting index as int
pre_period = [0, 99] # Pre-break period (pre-2008)
post_period = [100, 199] # Post-break period
impact = CausalImpact(struct_data, pre_period, post_period) # Causal model created
impact.run() # run!
print(impact.summary()) # Summary analysis
impact.plot() # Plot!
```

FIGURE 14.2: Output of the causal impact study.

There are several ways to define changes in environments. If we denote with $\mathbb{P}_{XY}$ the multivariate distribution of all variables (features and label), with $\mathbb{P}_{XY}=\mathbb{P}_{X}\mathbb{P}_{Y|X}$, then two simple changes are possible:

- **covariate shift**: $\mathbb{P}_{X}$ changes but $\mathbb{P}_{Y|X}$ does not: the features have a fluctuating distribution, but their relationship with $Y$ holds still;
- **concept drift**: $\mathbb{P}_{Y|X}$ changes but $\mathbb{P}_{X}$ does not: feature distributions are stable, but their relation to $Y$ is altered.

FIGURE 14.3: Different flavors of concept change.

**stability-plasticity dilemma**. This dilemma is a trade-off between model **reactiveness** (new instances have an important impact on updates) versus **stability** (these instances may not be representative of a slower trend and they may thus shift the model in a suboptimal direction).

**weighted least squares** wherein errors are weighted inside the loss:

\begin{align}
L=\sum_{i=1}^Iw_i(y_i-\textbf{x}_i\textbf{b})^2.
\end{align}

In [10]:

```
data_ml["year"] = pd.to_datetime(data_ml['date']).dt.year            # Add a year column for the groupby below
data_ml.groupby("year")["R1M_Usd"].mean().plot.bar(figsize=[16, 6])  # Aggregate and plot
```

Out[10]:

<AxesSubplot:xlabel='year'>

FIGURE 14.4: Average monthly return on a yearly basis.

In [11]:

```
import statsmodels.api as sm
def regress(df):                             # To avoid a loop and keep the groupby structure,
    model = sm.OLS(df['R6M_Usd'],            # ... we run the regressions...
                   exog=sm.add_constant(df[['Mkt_Cap_6M_Usd']]))
    return model.fit().params[1]             # ... inside a function applied group by group
beta_cap = data_ml.groupby('year').apply(regress)                      # Perform regressions
beta_cap = pd.DataFrame(beta_cap, columns=['beta_cap']).reset_index()  # Format into df
beta_cap.groupby("year").mean().plot.bar(figsize=[16, 6])              # Plot
```

Out[11]:

<AxesSubplot:xlabel='year'>

FIGURE 14.5: Variations in betas with respect to 6-month market capitalization.

**size effect** again). Sometimes it is markedly negative, sometimes not so much. The ability of capitalization to explain returns is time-varying and models must adapt accordingly.

\begin{align}
\textbf{b}_{t+1} \longleftarrow \textbf{b}_t-\eta (\textbf{x}_t\textbf{b}_t-y_t)\textbf{x}_t',
\end{align}
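This update rule (the classical one of Widrow and Hoff (1960)) takes one line of code per observation; a toy numpy sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(42)
b_true = np.array([0.5, -0.3])            # coefficients used to simulate the stream
b = np.zeros(2)                           # initial guess
eta = 0.05                                # learning rate

for t in range(2000):                     # stream of observations (x_t, y_t)
    x = rng.normal(size=2)
    y = x @ b_true + 0.01 * rng.normal()
    b = b - eta * (x @ b - y) * x         # gradient step on the squared error
print(b)                                  # close to b_true
```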

\begin{equation}
\tag{14.3}
R_T=\sum_{t=1}^TL_t(\boldsymbol{\theta}_t)-\underset{\boldsymbol{\theta}^*\in \boldsymbol{\Theta}}{\inf} \ \sum_{t=1}^TL_t(\boldsymbol{\theta}^*).
\end{equation}

\begin{equation}
\tag{14.4}
\textbf{z}_{t+1}=\boldsymbol{\theta}_t-\eta_t\nabla L_t(\boldsymbol{\theta}_t),
\end{equation}

\begin{equation}
\tag{14.5}
\boldsymbol{\theta}_{t+1}=\Pi_S(\textbf{z}_{t+1}), \quad \text{with} \quad \Pi_S(\textbf{u}) = \underset{\boldsymbol{\theta}\in S}{\text{argmin}} \ ||\boldsymbol{\theta}-\textbf{u}||_2.
\end{equation}
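For simple sets $S$, the projection $\Pi_S$ has a closed form. For the Euclidean ball of radius $r$, for instance, it is a mere rescaling; a minimal sketch (the radius and the point are arbitrary):

```python
import numpy as np

def project_ball(u, r=1.0):
    """Project u on the L2 ball of radius r: argmin_{||v|| <= r} ||v - u||_2."""
    norm = np.linalg.norm(u)
    return u if norm <= r else u * (r / norm)

z = np.array([3.0, 4.0])        # candidate point z_{t+1}, norm 5
theta = project_ball(z, r=1.0)  # rescaled onto the unit sphere
print(theta)                    # -> [0.6 0.8]
```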

\begin{align}
R_T \le \frac{G^2}{2H}(1+\log(T)),
\end{align}

where $H$ is a scaling factor for the learning rate (also called step sizes): $\eta_t=(Ht)^{-1}$.

\begin{align}
L_\epsilon(\boldsymbol{\theta})=\left\{ \begin{array}{ll}
0 & \text{if } \ |\boldsymbol{\theta}'\textbf{x}-y|\le \epsilon \quad (\text{close enough prediction}) \\
|\boldsymbol{\theta}'\textbf{x}-y|- \epsilon & \text{if } \ |\boldsymbol{\theta}'\textbf{x}-y| > \epsilon \quad (\text{prediction too far})
\end{array}\right.,
\end{align}

\begin{align}
\boldsymbol{\theta}_{t+1}= \underset{\boldsymbol{\theta}}{\text{argmin}} ||\boldsymbol{\theta}-\boldsymbol{\theta}_t||_2^2, \quad \text{subject to} \quad L_\epsilon(\boldsymbol{\theta})=0,
\end{align}

hence the new parameter values are chosen such that two conditions are satisfied:

- the loss is zero (by the definition of the loss, this means that the model is close enough to the true value);
- and, the parameter is as close as possible to the previous parameter values.
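This constrained problem has a closed-form solution (following Crammer et al. (2006) in the regression case); a minimal numpy sketch of one such passive-aggressive step, with toy numbers of our choosing:

```python
import numpy as np

def pa_update(theta, x, y, eps=0.1):
    """One passive-aggressive step for regression (Crammer et al. 2006)."""
    loss = max(0.0, abs(theta @ x - y) - eps)
    if loss == 0.0:
        return theta                       # passive: prediction already close enough
    tau = loss / (x @ x)                   # aggressive: smallest correction...
    return theta + np.sign(y - theta @ x) * tau * x  # ...that zeroes the loss

theta = np.zeros(2)
x, y = np.array([1.0, 2.0]), 1.0
theta = pa_update(theta, x, y)
print(abs(theta @ x - y))                  # now within the eps margin
```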

**universal portfolios** originally coined by Cover (1991) in particular. The setting is the following. The function $f$ is assumed to be linear, $f(\textbf{x}_t)=\boldsymbol{\theta}_t'\textbf{x}_t$, and the data $\mathbf{x}_t$ consists of asset returns; thus, the values $f(\textbf{x}_t)$ are portfolio returns as long as $\boldsymbol{\theta}_t'\textbf{1}_N=1$ (the budget constraint). The loss functions $L_t$ correspond to a concave utility function (e.g., logarithmic) and the regret is reversed:

\begin{align}
R_T=\underset{\boldsymbol{\theta}^*\in \boldsymbol{\Theta}}{\sup} \ \sum_{t=1}^TL_t(\textbf{r}_t'\boldsymbol{\theta}^*)-\sum_{t=1}^TL_t(\textbf{r}_t'\boldsymbol{\theta}_t),
\end{align}

*side information*’). In the latter article, it is proven that constantly rebalanced portfolios distributed according to two random distributions achieve growth rates that are close to the unattainable optimal rates. The two distributions are the uniform law (equally weighting, once again) and the Dirichlet distribution with constant parameters equal to 1/2. Under this universal distribution, Cover and Ordentlich (1996) show that the wealth obtained is bounded below by:

\begin{align}
\text{wealth universal} \ge \frac{\text{wealth from optimal strategy}}{2(n+1)^{(m-1)/2}},
\end{align}

where $m$ is the number of assets and $n$ is the number of periods.
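The flavor of this result can be checked with a crude simulation: draw many constantly rebalanced portfolios with weights sampled from the Dirichlet(1/2, …, 1/2) distribution, average their terminal wealth (a proxy for the universal portfolio), and compare with the best sampled portfolio (a hindsight proxy, from below, for the optimal strategy). Returns and sample sizes below are made up:

```python
import numpy as np

rng = np.random.default_rng(7)
m, n = 3, 100                                   # m assets, n periods
r = 1 + rng.normal(0.001, 0.02, size=(n, m))    # synthetic gross returns
W = rng.dirichlet(np.full(m, 0.5), size=5000)   # random constant-mix weights

wealth = np.prod(W @ r.T, axis=1)               # terminal wealth of each portfolio
universal = wealth.mean()                       # Dirichlet-mixture (universal) wealth
best = wealth.max()                             # best sampled constant mix (hindsight)
bound = best / (2 * (n + 1) ** ((m - 1) / 2))   # lower bound from the inequality
print(universal, best, bound)                   # universal wealth exceeds the bound
```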

\begin{align}
\epsilon_S(f,h)=\mathbb{E}_S[|f(\textbf{x})-h(\textbf{x})|].
\end{align}

Then,

\begin{equation}
\small
\epsilon_T(f_T,h)\le \epsilon_S(f_S,h)+\underbrace{2 \sup_B|P_S(B)-P_T(B)|}_{\text{ difference between domains }} + \underbrace{ \min\left(\mathbb{E}_S[|f_S(\textbf{x})-f_T(\textbf{x})|],\mathbb{E}_T[|f_S(\textbf{x})-f_T(\textbf{x})|]\right)}_{\text{difference between the two learning tasks}}, \nonumber
\end{equation}

where $P_S$ and $P_T$ denote the distribution of the two domains. The above inequality is a bound on the generalization performance of $h$. If we take $f_S$ to be the best possible classifier for $S$ and $f_T$ the best for $T$, then the error generated by $h$ in $T$ is smaller than the sum of three components:

- the error in the $S$ space;
- the distance between the two domains (by how much the data space has shifted);
- the distance between the two best models (generators).

\begin{align*}
\epsilon_T(f)=\mathbb{E}_T\left[L(\text{y},f(\textbf{X})) \right],
\end{align*}

\begin{align*}
\epsilon_T(f)&=\mathbb{E}_T \left[\frac{P_S(\textbf{y},\textbf{X})}{P_S(\textbf{y},\textbf{X})} L(\text{y},f(\textbf{X})) \right] \\
&=\sum_{\textbf{y},\textbf{X}}P_T(\textbf{y},\textbf{X})\frac{P_S(\textbf{y},\textbf{X})}{P_S(\textbf{y},\textbf{X})} L(\text{y},f(\textbf{X})) \\
&=\mathbb{E}_S \left[\frac{P_T(\textbf{y},\textbf{X})}{P_S(\textbf{y},\textbf{X})} L(\text{y},f(\textbf{X})) \right]
\end{align*}
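The last equality is the importance-weighting trick: losses observed in the source domain are reweighted by the density ratio $P_T/P_S$. A one-dimensional sketch with Gaussian domains of our choosing (for simplicity, the loss depends only on $\textbf{x}$):

```python
import numpy as np

rng = np.random.default_rng(1)
loss = lambda x: x**2                     # some loss evaluated at feature x

xs = rng.normal(0.0, 1.0, size=100_000)   # source domain S: x ~ N(0, 1)
xt = rng.normal(1.0, 1.0, size=100_000)   # target domain T: x ~ N(1, 1)

ratio = np.exp(xs - 0.5)                  # closed-form density ratio P_T/P_S here
direct = loss(xt).mean()                  # eps_T estimated on target data
weighted = (ratio * loss(xs)).mean()      # eps_T estimated from source data only
print(direct, weighted)                   # both estimates are close to E_T[x^2] = 2
```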

Agarwal, Amit, Elad Hazan, Satyen Kale, and Robert E Schapire. 2006. “Algorithms for Portfolio Management Based on the Newton Method.” In Proceedings of the 23rd International Conference on Machine Learning, 9–16. ACM.

Arjovsky, Martin, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. 2019. “Invariant Risk Minimization.” arXiv Preprint, no. 1907.02893.

Aronow, Peter M., and Fredrik Sävje. 2019. “Book Review. The Book of Why: The New Science of Cause and Effect.” Journal of the American Statistical Association 115 (529): 482–85.

Barberis, Nicholas, Robin Greenwood, Lawrence Jin, and Andrei Shleifer. 2015. “X-CAPM: An Extrapolative Capital Asset Pricing Model.” Journal of Financial Economics 115 (1): 1–24.

Basak, Jayanta. 2004. “Online Adaptive Decision Trees.” Neural Computation 16 (9): 1959–81.

Beery, Sara, Grant Van Horn, and Pietro Perona. 2018. “Recognition in Terra Incognita.” In Proceedings of the European Conference on Computer Vision (Eccv), 456–73.

Ben-David, Shai, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. 2010. “A Theory of Learning from Different Domains.” Machine Learning 79 (1-2): 151–75.

Blum, Avrim, and Adam Kalai. 1999. “Universal Portfolios with and Without Transaction Costs.” Machine Learning 35 (3): 193–205.

Brodersen, Kay H, Fabian Gallusser, Jim Koehler, Nicolas Remy, Steven L Scott, and others. 2015. “Inferring Causal Impact Using Bayesian Structural Time-Series Models.” Annals of Applied Statistics 9 (1): 247–74.

Bühlmann, Peter, Jonas Peters, Jan Ernest, and others. 2014. “CAM: Causal Additive Models, High-Dimensional Order Search and Penalized Regression.” Annals of Statistics 42 (6): 2526–56.

Chow, Ying-Foon, John A Cotsomitis, and Andy CC Kwan. 2002. “Multivariate Cointegration and Causality Tests of Wagner’s Hypothesis: Evidence from the UK.” Applied Economics 34 (13): 1671–7.

Cont, Rama. 2007. “Volatility Clustering in Financial Markets: Empirical Facts and Agent-Based Models.” In Long Memory in Economics, 289–309. Springer.

Coqueret, Guillaume. 2020. “Stock Specific Sentiment and Return Predictability.” Quantitative Finance Forthcoming.

Coqueret, Guillaume, and Tony Guida. 2020. “Training Trees on Tails with Applications to Portfolio Choice.” Annals of Operations Research 288: 181–221.

Cornuejols, Antoine, Laurent Miclet, and Vincent Barra. 2018. Apprentissage Artificiel: Deep Learning, Concepts et Algorithmes. Eyrolles.

Cover, Thomas M. 1991. “Universal Portfolios.” Mathematical Finance 1 (1): 1–29.

Cover, Thomas M, and Erik Ordentlich. 1996. “Universal Portfolios with Side Information.” IEEE Transactions on Information Theory 42 (2): 348–63.

Crammer, Koby, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. 2006. “Online Passive-Aggressive Algorithms.” Journal of Machine Learning Research 7 (Mar): 551–85.

Engle, Robert F. 1982. “Autoregressive Conditional Heteroscedasticity with Estimates of the Variance of United Kingdom Inflation.” Econometrica, 987–1007.

Granger, Clive WJ. 1969. “Investigating Causal Relations by Econometric Models and Cross-Spectral Methods.” Econometrica, 424–38.

Hahn, P Richard, Jared S Murray, and Carlos Carvalho. 2019. “Bayesian Regression Tree Models for Causal Inference: Regularization, Confounding, and Heterogeneous Effects.” arXiv Preprint, no. 1706.09523.

Hazan, Elad, Amit Agarwal, and Satyen Kale. 2007. “Logarithmic Regret Algorithms for Online Convex Optimization.” Machine Learning 69 (2-3): 169–92.

Hazan, Elad, and others. 2016. “Introduction to Online Convex Optimization.” Foundations and Trends in Optimization 2 (3-4): 157–325.

Heinze-Deml, Christina, Jonas Peters, and Nicolai Meinshausen. 2018. “Invariant Causal Prediction for Nonlinear Models.” Journal of Causal Inference 6 (2).

Hiemstra, Craig, and Jonathan D Jones. 1994. “Testing for Linear and Nonlinear Granger Causality in the Stock Price-Volume Relation.” Journal of Finance 49 (5): 1639–64.

Hoi, Steven CH, Doyen Sahoo, Jing Lu, and Peilin Zhao. 2018. “Online Learning: A Comprehensive Survey.” arXiv Preprint, no. 1802.02871.

Hünermund, Paul, and Elias Bareinboim. 2019. “Causal Inference and Data-Fusion in Econometrics.” arXiv Preprint, no. 1912.09104.

Kalisch, Markus, Martin Mächler, Diego Colombo, Marloes H Maathuis, Peter Bühlmann, and others. 2012. “Causal Inference Using Graphical Models with the R Package Pcalg.” Journal of Statistical Software 47 (11): 1–26.

Khedmati, Majid, and Pejman Azin. 2020. “An Online Portfolio Selection Algorithm Using Clustering Approaches and Considering Transaction Costs.” Expert Systems with Applications Forthcoming: 113546.

Koshiyama, Adriano, Sebastian Flennerhag, Stefano B Blumberg, Nick Firoozye, and Philip Treleaven. 2020. “QuantNet: Transferring Learning Across Systematic Trading Strategies.” arXiv Preprint, no. 2004.03445.

Li, Bin, and Steven CH Hoi. 2014. “Online Portfolio Selection: A Survey.” ACM Computing Surveys (CSUR) 46 (3): 35.

Li, Bin, and Steven Chu Hong Hoi. 2018. Online Portfolio Selection: Principles and Algorithms. CRC Press.

Maathuis, Marloes, Mathias Drton, Steffen Lauritzen, and Martin Wainwright. 2018. Handbook of Graphical Models. CRC Press.

Pan, Sinno Jialin, and Qiang Yang. 2009. “A Survey on Transfer Learning.” IEEE Transactions on Knowledge and Data Engineering 22 (10): 1345–59.

Pearl, Judea. 2009. Causality: Models, Reasoning and Inference. Second Edition. Vol. 29. Cambridge University Press.

Peters, Jonas, Dominik Janzing, and Bernhard Schölkopf. 2017. Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press.

Quionero-Candela, Joaquin, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence. 2009. Dataset Shift in Machine Learning. MIT Press.

Regenstein, Jonathan K. 2018. Reproducible Finance with R: Code Flows and Shiny Apps for Portfolio Analysis. Chapman & Hall / CRC.

Spirtes, Peter, Clark N Glymour, Richard Scheines, and David Heckerman. 2000. Causation, Prediction, and Search. MIT Press.

Tikka, Santtu, and Juha Karvanen. 2017. “Identifying Causal Effects with the R Package Causaleffect.” Journal of Statistical Software 76 (1): 1–30.

Weiss, Karl, Taghi M Khoshgoftaar, and DingDing Wang. 2016. “A Survey of Transfer Learning.” Journal of Big Data 3 (1): 9.

Widrow, Bernard, and Marcian E Hoff. 1960. “Adaptive Switching Circuits.” In IRE Wescon Convention Record, 4:96–104.

Wong, Steven YK, Jennifer Chan, Lamiae Azizi, and Richard YD Xu. 2020. “Time-Varying Neural Network for Stock Return Prediction.” arXiv Preprint, no. 2003.02515.

See for instance the papers on herding in factor investing: Krkoska and Schenk-Hoppé (2019) and Santi and Zwinkels (2018).↩︎

This book is probably the most complete reference for theoretical results in machine learning, but it is in French.↩︎