\begin{equation}
\boldsymbol{\Sigma}=\textbf{X}'\textbf{X}=\begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}, \quad \boldsymbol{\Sigma}^{-1}=\frac{1}{1-\rho^2}\begin{bmatrix} 1 & -\rho \\ -\rho & 1 \end{bmatrix}.
\end{equation}
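As a quick numerical sanity check of the closed-form inverse above, the formula can be verified with NumPy (a minimal sketch; the value $\rho=0.5$ is illustrative, not taken from the text):

```python
import numpy as np

rho = 0.5  # illustrative correlation value (assumption, not from the text)
Sigma = np.array([[1.0, rho],
                  [rho, 1.0]])  # Sigma = X'X for standardized columns

# Closed-form inverse from the equation above
Sigma_inv_formula = (1.0 / (1.0 - rho**2)) * np.array([[1.0, -rho],
                                                       [-rho, 1.0]])

# Compare with the numerical inverse
print(np.allclose(np.linalg.inv(Sigma), Sigma_inv_formula))  # True
```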

In [2]:

```
import numpy as np # Numerical tools (np.abs below)
import pandas as pd # Dataframes
import statsmodels.api as sm # Package for clean regression output
stat = sm.OLS(training_sample['R1M_Usd'], sm.add_constant(training_sample[features])).fit() # Model: predict R1M_Usd
reg_thrhld = 3 # Significance threshold on t-statistics
boo_filter = np.abs(stat.tvalues) >= reg_thrhld # Boolean filter: keep significant predictors only
estimate = stat.params[boo_filter] # Coefficient estimates
std_error = stat.bse[boo_filter] # Standard errors
statistic = stat.tvalues[boo_filter] # t-statistics
p_value = stat.pvalues[boo_filter] # p-values
significant_regressors = pd.concat([estimate, std_error, statistic, p_value], axis=1) # Put output in clean format
significant_regressors.columns = ['estimate', 'std.error', 'statistic', 'p.value'] # Renaming columns
print(significant_regressors)
```

TABLE 15.1: Significant predictors in the training sample.

In [4]:

```
import seaborn as sns # Package for plots
sns.set(rc={'figure.figsize':(16,16)}) # Setting the figsize in seaborn
sns.heatmap(training_sample[features].corr()) # Correlation matrix and plot
```

Out[4]:

<AxesSubplot:>

FIGURE 15.1: Correlation matrix of predictors.

**Multicollinearity** (the presence of correlated predictors) can be much less of a problem for ML tools than it is for pure statistical inference. In statistics, one central goal is to study the properties of the $\beta$ coefficients, and collinearity perturbs this kind of analysis. In machine learning, the aim is to maximize out-of-sample accuracy: if having many predictors helps, then so be it. One simple example can clarify this point. When building a regression tree, having many predictors gives more options for the splits; if the features make sense, they can be useful. The same reasoning applies to random forests and boosted trees. What matters is that the large spectrum of features improves the generalization ability of the model; their collinearity is irrelevant.
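A minimal illustration of this point, on synthetic data with hypothetical variable names: a regression tree can be fed two almost perfectly correlated copies of the same signal without harming its out-of-sample fit.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)                  # original predictor
x2 = x1 + 0.05 * rng.normal(size=n)      # near-duplicate: corr(x1, x2) close to 1
y = np.sign(x1) + 0.1 * rng.normal(size=n)  # target driven by the sign of x1

X = np.column_stack([x1, x2])            # two highly collinear features
tree = DecisionTreeRegressor(max_depth=3).fit(X[:800], y[:800])
print(tree.score(X[800:], y[800:]))      # out-of-sample R^2 remains high
```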

- the first one aims at creating new variables that are uncorrelated with each other. Low correlation is favorable from an algorithmic point of view, but the new variables lack interpretability;
- the second one gathers predictors into homogeneous clusters, out of which only one feature is chosen. Here the rationale is reversed: interpretability is favored over statistical properties, because the resulting set of features may still include high correlations, albeit to a lesser extent than the original one.

The first method is a cornerstone in dimensionality reduction. It seeks to determine a smaller number of factors ($K'<K$) such that:

- i) the level of explanatory power remains as high as possible;
- ii) the resulting factors are linear combinations of the original variables;
- iii) the resulting factors are orthogonal.

\begin{equation}
\tag{15.1}
\textbf{X}=\textbf{U} \boldsymbol{\Delta} \textbf{V}',
\end{equation}
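The singular value decomposition (15.1) can be checked numerically; a minimal sketch with a random matrix of illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(6, 4))                           # random I x K matrix (illustrative sizes)

U, delta, Vt = np.linalg.svd(X, full_matrices=False)  # thin SVD: X = U Delta V'
X_rebuilt = U @ np.diag(delta) @ Vt                   # reassemble the factors
print(np.allclose(X, X_rebuilt))                      # True
```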

\begin{equation}
\tag{15.2}
\boldsymbol{\Sigma}_X=\textbf{Q}\textbf{D}\textbf{Q}'.
\end{equation}

This spectral decomposition is called the **diagonalization** of the matrix (see chapter 7 in Meyer (2000)) and conveniently applies to covariance matrices.
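Since a covariance matrix is symmetric, it is orthogonally diagonalizable, and the decomposition $\boldsymbol{\Sigma}=\textbf{Q}\textbf{D}\textbf{Q}'$ can be sketched numerically (random data, illustrative dimensions):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 4))
Sigma = np.cov(X, rowvar=False)          # symmetric covariance matrix

eigval, Q = np.linalg.eigh(Sigma)        # eigh is tailored to symmetric matrices
D = np.diag(eigval)
print(np.allclose(Sigma, Q @ D @ Q.T))   # Sigma = Q D Q'
print(np.allclose(Q.T @ Q, np.eye(4)))   # Q is orthogonal
```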

The goal is a **change of base**, i.e., a linear transformation of $\textbf{X}$ into $\textbf{Z}$, a matrix with identical dimensions, via

\begin{equation}
\tag{15.3}
\textbf{Z}=\textbf{XP},
\end{equation}

- the first condition imposes that the covariance matrix of $\textbf{Z}$ be diagonal;
- the second condition imposes that the diagonal elements, when ranked in decreasing magnitude, decline (sharply if possible).

The covariance matrix of $\textbf{Z}$ is

\begin{equation}
\tag{15.4}
\boldsymbol{\Sigma}_Z=\frac{1}{I-1}\textbf{Z}'\textbf{Z}=\frac{1}{I-1}\textbf{P}'\textbf{X}'\textbf{XP}=\frac{1}{I-1}\textbf{P}'\boldsymbol{\Sigma}_X\textbf{P}.
\end{equation}

In this expression, we plug the decomposition (15.2) of $\boldsymbol{\Sigma}_X$:

\begin{equation}
\boldsymbol{\Sigma}_Z=\frac{1}{I-1}\textbf{P}'\textbf{Q}\textbf{DQ}'\textbf{P},
\end{equation}

so that choosing $\textbf{P}=\textbf{Q}$ and using the orthogonality of $\textbf{Q}$ ($\textbf{Q}'\textbf{Q}=\textbf{I}$) yields $\boldsymbol{\Sigma}_Z=\frac{\textbf{D}}{I-1}$: the covariance matrix of $\textbf{Z}$ is diagonal, and its diagonal elements are the (decreasing) eigenvalues of $\boldsymbol{\Sigma}_X$.
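The derivation above suggests choosing $\textbf{P}=\textbf{Q}$; a numerical sketch (synthetic correlated data) confirms that $\textbf{Z}=\textbf{XQ}$ then has a diagonal covariance matrix with decreasing diagonal entries:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))  # columns are correlated
X = X - X.mean(axis=0)                                   # center the columns

Sigma_X = X.T @ X / (X.shape[0] - 1)
eigval, Q = np.linalg.eigh(Sigma_X)                      # eigenvalues in ascending order
Q = Q[:, ::-1]                                           # reorder: decreasing variance first

Z = X @ Q                                                # change of base with P = Q
Sigma_Z = Z.T @ Z / (Z.shape[0] - 1)
off_diag = Sigma_Z - np.diag(np.diag(Sigma_Z))
print(np.allclose(off_diag, 0))                          # covariance of Z is diagonal
print(np.all(np.diff(np.diag(Sigma_Z)) <= 0))            # variances decrease
```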

In [6]:

```
from sklearn import decomposition
import pandas as pd # Dataframes
pca = decomposition.PCA(n_components=7) # We impose the number of components
pca.fit(training_sample[features_short]) # Performs PCA on a smaller number of predictors
print(pca.explained_variance_ratio_) # Checking the variance explained per component
P = pd.DataFrame(pca.components_, columns=features_short).T # Rotation matrix (7 x 7)
P.columns = ['P' + str(col) for col in P.columns] # Tidying up column names
P
```

[0.35718238 0.1940806 0.15561321 0.10434453 0.09601422 0.07017118 0.02259388]

Out[6]:

| | P0 | P1 | P2 | P3 | P4 | P5 | P6 |
|---|---|---|---|---|---|---|---|
| Div_Yld | -0.271599 | 0.579099 | 0.045725 | -0.528956 | 0.226626 | 0.506566 | 0.032012 |
| Eps | -0.420407 | 0.150082 | -0.024767 | 0.337373 | -0.771377 | 0.301883 | 0.011965 |
| Mkt_Cap_12M_Usd | -0.523868 | -0.343239 | 0.172289 | 0.062495 | 0.252781 | 0.002987 | 0.714319 |
| Mom_11M_Usd | -0.047238 | -0.057714 | -0.897160 | 0.241015 | 0.250559 | 0.258477 | 0.043179 |
| Ocf | -0.532947 | -0.195890 | 0.185039 | 0.234371 | 0.357596 | 0.049015 | -0.676866 |
| Pb | -0.152413 | -0.580806 | -0.221048 | -0.682136 | -0.308665 | 0.038675 | -0.168799 |
| Vol1Y_Usd | 0.406890 | -0.381139 | 0.282162 | 0.155411 | 0.061575 | 0.762588 | 0.008632 |
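The rotation returned by scikit-learn is orthonormal, which is condition (iii) above ($\textbf{P}'\textbf{P}=\textbf{I}$). This can be verified on any fitted PCA; a self-contained sketch on synthetic data (the training_sample frame is not reproduced here):

```python
import numpy as np
from sklearn import decomposition

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 7))              # synthetic stand-in for the 7 features

pca = decomposition.PCA(n_components=7).fit(X)
P = pca.components_.T                      # columns = principal directions

print(np.allclose(P.T @ P, np.eye(7)))     # P'P = I: the rotation is orthonormal
```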

In [7]:

```
from pca import pca
model = pca(n_components=7) # Initialize
results = model.fit_transform(training_sample[features_short], col_labels=features_short) # Fit transform and include the column labels and row labels
model.biplot(n_feat=7, PC=[0,1],cmap=None, label=None, legend=False) # Make biplot
```