NBER WORKING PAPER SERIES
EMPIRICAL ASSET PRICING VIA MACHINE LEARNING
Shihao Gu
Bryan Kelly
Dacheng Xiu
Working Paper 25398
http://www.nber.org/papers/w25398
NATIONAL BUREAU OF ECONOMIC RESEARCH
1050 Massachusetts Avenue
Cambridge, MA 02138
December 2018
We benefitted from discussions with Joseph Babcock, Si Chen (Discussant), Rob Engle, Andrea Frazzini, Amit Goyal (Discussant), Lasse Pedersen, Lin Peng (Discussant), Guofu Zhou
(Discussant), and seminar and conference participants at Erasmus School of Economics, National University of Singapore, Tsinghua PBC School of Finance, Fannie Mae, U.S. Securities and Exchange Commission, City University of Hong Kong, Shenzhen Finance Institute at CUHK
(SZ), NBER’s 2018 Summer Institute, New Methods for the Cross Section of Returns Conference, Chicago Quantitative Alliance Fall 2018 Conference, Norwegian Financial Research Conference, 45th annual meeting of European Finance Association, the 2018 China International Conference in Finance, the 10th World Congress of the Bachelier Finance Society, the 2018 Financial Engineering and Risk Management International Symposium, Toulouse Financial Econometrics Conference, the 2018 Chicago Conference on New Aspects of Statistics, Financial Econometrics, and Data Science, Tsinghua Workshop on Big Data and Internet Economics, and the 2017 Conference on Financial Predictability and Data Science. We gratefully acknowledge the computing support from the Research Computing Center at the University of Chicago. AQR Capital Management is a global investment management firm, which may or may not apply similar investment techniques or methods of analysis as described herein. The views expressed herein are those of the authors and do not necessarily reflect the views of the National Bureau of Economic Research.
At least one co-author has disclosed a financial relationship of potential relevance for this research. Further information is available online at http://www.nber.org/papers/w25398.ack NBER working papers are circulated for discussion and comment purposes. They have not been peer-reviewed or been subject to the review by the NBER Board of Directors that accompanies official NBER publications.
© 2018 by Shihao Gu, Bryan Kelly, and Dacheng Xiu. All rights reserved. Short sections of text, not to exceed two paragraphs, may be quoted without explicit permission provided that full credit, including © notice, is given to the source.

Empirical Asset Pricing via Machine Learning
Shihao Gu, Bryan Kelly, and Dacheng Xiu
NBER Working Paper No. 25398
December 2018
JEL No. C45,C55,C58,G11,G12
ABSTRACT
We synthesize the field of machine learning with the canonical problem of empirical asset pricing: measuring asset risk premia. In the familiar empirical setting of cross section and time series stock return prediction, we perform a comparative analysis of methods in the machine learning repertoire, including generalized linear models, dimension reduction, boosted regression trees, random forests, and neural networks. At the broadest level, we find that machine learning offers an improved description of expected return behavior relative to traditional forecasting methods. Our implementation establishes a new standard for accuracy in measuring risk premia summarized by an unprecedented out-of-sample return prediction R2. We identify the best performing methods (trees and neural nets) and trace their predictive gains to allowance of nonlinear predictor interactions that are missed by other methods. Lastly, we find that all methods agree on the same small set of dominant predictive signals that includes variations on momentum, liquidity, and volatility. Improved risk premia measurement through machine learning can simplify the investigation into economic mechanisms of asset pricing and justifies its growing role in innovative financial technologies.
Shihao Gu
University of Chicago Booth
School of Business
5807 S. Woodlawn
Chicago IL 60637
********@************.***
Bryan Kelly
Yale School of Management
165 Whitney Ave.
New Haven, CT 06511
and NBER
*****.*****@****.***
Dacheng Xiu
Booth School of Business
University of Chicago
5807 South Woodlawn Avenue
Chicago, IL 60637
*******@************.***
1 Introduction
In this article, we conduct a comparative analysis of machine learning methods for finance. We do so in the context of perhaps the most widely studied problem in finance, that of measuring equity risk premia.
1.1 Primary Contributions
Our primary contributions are two-fold. First, we provide a new benchmark of accuracy in measuring risk premia of the aggregate market and individual stocks. This accuracy is summarized two ways. The first is an unprecedentedly high out-of-sample predictive R2 that is robust across a variety of machine learning specifications. Second, portfolio strategies that leverage machine learning return forecasts earn annualized Sharpe ratios in excess of 2.0.

Return prediction is economically meaningful. The fundamental goal of asset pricing is to understand the behavior of risk premia.1 If expected returns were perfectly observed, we would still need theories to explain their behavior and empirical analysis to test those theories. But risk premia are notoriously difficult to measure: market efficiency forces return variation to be dominated by unforecastable news that obscures risk premia. Our research highlights gains that can be achieved in prediction and identifies the most informative predictor variables. This helps resolve the problem of risk premium measurement, which then facilitates more reliable investigation into economic mechanisms of asset pricing.
Second, we synthesize the empirical asset pricing literature with the field of machine learning. Relative to traditional empirical methods in asset pricing, machine learning accommodates a far more expansive list of potential predictor variables and richer specifications of functional form. It is this flexibility that allows us to push the frontier of risk premium measurement. Interest in machine learning methods for finance has grown tremendously in both academia and industry. This article provides a comparative overview of machine learning methods applied to the two canonical problems of empirical asset pricing: predicting returns in the cross section and time series. Our view is that the best way for researchers to understand the usefulness of machine learning in the field of asset pricing is to apply and compare the performance of each of its methods in familiar empirical problems.

1.2 What is Machine Learning?
The definition of "machine learning" is inchoate and is often context specific. We use the term to describe (i) a diverse collection of high-dimensional models for statistical prediction, combined with (ii) so-called "regularization" methods for model selection and mitigation of overfit, and (iii) efficient algorithms for searching among a vast number of potential model specifications.

Footnote 1: Our focus is on measuring conditional expected stock returns in excess of the risk-free rate. Academic finance traditionally refers to this quantity as the "risk premium" due to its close connection with equilibrium compensation for bearing equity investment risk. We use the terms "expected return" and "risk premium" interchangeably. One may be interested in potentially distinguishing among different components of expected returns, such as those due to systematic risk compensation, idiosyncratic risk compensation, or even due to mispricing. This paper does not make a distinction among their various origins and instead attempts to directly measure expected returns as precisely as possible.
The high-dimensional nature of machine learning methods (element (i) of this definition) enhances their flexibility relative to more traditional econometric prediction techniques. This flexibility brings hope of better approximating the unknown and likely complex data generating process underlying equity risk premia. With enhanced flexibility, however, comes a higher propensity of overfitting the data. Element (ii) of our machine learning definition describes refinements in implementation that emphasize stable out-of-sample performance to explicitly guard against overfit. Finally, with many predictors it becomes infeasible to exhaustively traverse and compare all model permutations. Element (iii) describes clever machine learning tools designed to approximate an optimal specification with manageable computational cost.
1.3 Why Apply Machine Learning to Asset Pricing?
A number of aspects of empirical asset pricing make it a particularly attractive field for analysis with machine learning methods.
1) Two main research agendas have monopolized modern empirical asset pricing research. The first seeks to describe and understand differences in expected returns across assets. The second focuses on dynamics of the aggregate market equity risk premium. Measurement of an asset's risk premium is fundamentally a problem of prediction: the risk premium is the conditional expectation of a future realized excess return. Machine learning, whose methods are largely specialized for prediction tasks, is thus ideally suited to the problem of risk premium measurement.

2) The collection of candidate conditioning variables for the risk premium is large. The profession has accumulated a staggering list of predictors that various researchers have argued possess forecasting power for returns. The number of stock-level predictive characteristics reported in the literature numbers in the hundreds and macroeconomic predictors of the aggregate market number in the dozens.2 Additionally, predictors are often close cousins and highly correlated. Traditional prediction methods break down when the predictor count approaches the observation count or predictors are highly correlated. With an emphasis on variable selection and dimension reduction techniques, machine learning is well suited for such challenging prediction problems by reducing degrees of freedom and condensing redundant variation among predictors.

3) Further complicating the problem is ambiguity regarding functional forms through which the high-dimensional predictor set enter into risk premia. Should they enter linearly? If nonlinearities are needed, which form should they take? Must we consider interactions among predictors? Such questions rapidly proliferate the set of potential model specifications. The theoretical literature offers little guidance for winnowing the list of conditioning variables and functional forms. Three aspects of machine learning make it well suited for problems of ambiguous functional form. The first is its diversity. As a suite of dissimilar methods it casts a wide net in its specification search. Second, with methods ranging from generalized linear models to regression trees and neural networks, machine learning is explicitly designed to approximate complex nonlinear associations. Third, parameter penalization and conservative model selection criteria complement the breadth of functional forms spanned by these methods in order to avoid overfit biases and false discovery.

Footnote 2: Green et al. (2013) count 330 stock-level predictive signals in published or circulated drafts. Harvey et al. (2016) study 316 "factors," which include firm characteristics and common factors, for describing stock return behavior. They note that this is only a subset of those studied in the literature. Welch and Goyal (2008) analyze nearly 20 predictors for the aggregate market return. In both stock and aggregate return predictions, there presumably exists a much larger set of predictors that were tested but failed to predict returns and were thus never reported.

1.4 What Specific Machine Learning Methods Do We Study?

We select a set of candidate models that are potentially well suited to address the three empirical challenges outlined above. They constitute the canon of methods one would encounter in a graduate-level machine learning textbook.3 This includes linear regression, generalized linear models with penalization, dimension reduction via principal components regression (PCR) and partial least squares (PLS), regression trees (including boosted trees and random forests), and neural networks. This is not an exhaustive analysis of all methods. For example, it excludes methods like support vector machines that are more commonly employed in classification problems as opposed to continuous variable prediction. Nonetheless, our list is designed to be representative of predictive analytics tools from various branches of the machine learning toolkit.

Footnote 3: See, for example, Hastie et al. (2009).

1.5 Main Empirical Findings
We conduct a large-scale empirical analysis, investigating nearly 30,000 individual stocks over 60 years from 1957 to 2016. Our predictor set includes 94 characteristics for each stock, interactions of each characteristic with eight aggregate time series variables, and 74 industry sector dummy variables, totaling more than 900 baseline signals. Some of our methods expand this predictor set much further by including nonlinear transformations and interactions of the baseline signals. We establish the following empirical facts about machine learning for return prediction.

Machine learning shows great promise for empirical asset pricing. At the broadest level, our main empirical finding is that machine learning as a whole has the potential to improve our empirical understanding of expected asset returns. It digests our predictor data set, which is massive from the perspective of the existing literature, into a return forecasting model that dominates traditional approaches. The immediate implication is that machine learning aids in solving practical investment problems such as market timing, portfolio choice, and risk management, justifying its role in the business architecture of the fintech industry.
Consider as a benchmark a panel regression of individual stock returns onto three lagged stock-level characteristics: size, book-to-market, and momentum. This benchmark has a number of attractive features. It is parsimonious and simple, and comparing against this benchmark is conservative because it is highly selected (the characteristics it includes are routinely demonstrated to be among the most robust return predictors). Lewellen (2015) demonstrates that this model performs about as well as larger and more complex stock prediction models studied in the literature.

In our sample, which is longer and wider (more observations in terms of both dates and stocks) than that studied in Lewellen (2015), the out-of-sample R2 from the benchmark model is 0.16% per month for the panel of individual stock returns. When we expand the OLS panel model to include our set of 900+ predictors, predictability vanishes immediately: the R2 drops deeply into negative territory. This is not surprising. With so many parameters to estimate, efficiency of OLS regression deteriorates precipitously and therefore produces forecasts that are highly unstable out-of-sample. This failure of OLS leads us to our next empirical fact.

Vast predictor sets are viable for linear prediction when either penalization or dimension reduction is used. Our first evidence that the machine learning toolkit aids in return prediction emerges from the fact that the "elastic net," which uses parameter shrinkage and variable selection to limit the regression's degrees of freedom, solves the OLS inefficiency problem. In the 900+ predictor regression, elastic net pulls the out-of-sample R2 into positive territory at 0.09% per month. Principal component regression (PCR) and partial least squares (PLS), which reduce the dimension of the predictor set to a few linear combinations of predictors, further raise the out-of-sample R2 to 0.28% and 0.18%, respectively. This is in spite of the presence of many likely "fluke" predictors that contribute pure noise to the large model. In other words, the high-dimensional predictor set in a simple linear specification is at least competitive with the status quo low-dimensional model, as long as over-parameterization can be controlled.
Allowing for nonlinearities substantially improves predictions. Next, we expand the model to accommodate nonlinear predictive relationships via generalized linear models, regression trees, and neural networks. We find that trees and neural nets unambiguously improve return prediction, with monthly stock-level R2's between 0.27% and 0.39%. But the generalized linear model, which introduces nonlinearity via spline functions of each individual baseline predictor (but with no predictor interactions), fails to robustly outperform the linear specification. This suggests that allowing for (potentially complex) interactions among the baseline predictors is a crucial aspect of nonlinearities in the expected return function. As part of our analysis, we discuss why generalized linear models are comparatively poorly suited for capturing predictor interactions.

Shallow learning outperforms deeper learning. When we consider a range of neural networks from very shallow (a single hidden layer) to deeper networks (up to five hidden layers), we find that neural network performance peaks at three hidden layers and then declines as more layers are added. Likewise, the boosted tree and random forest algorithms tend to select trees with few "leaves" (on average fewer than six) in our analysis. This is likely an artifact of the relatively small amount of data and tiny signal-to-noise ratio for our return prediction problem, in comparison to the kinds of non-financial settings in which deep learning thrives thanks to astronomical datasets and strong signals (such as computer vision).
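The sketch below, again on simulated placeholder data, illustrates how one might compare feed-forward networks of increasing depth. The layer widths, training settings, and data are assumptions for illustration only; they are not the specific architectures estimated in the paper.

```python
# Illustrative sketch only: shallow versus deeper feed-forward networks on
# simulated data with a weak, interaction-driven signal.
import numpy as np
from sklearn.neural_network import MLPRegressor

def r2_oos(y_true, y_pred):
    """Out-of-sample R^2 measured against a zero forecast."""
    return 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum(y_true ** 2)

rng = np.random.default_rng(0)
n_train, n_test, p = 5000, 1000, 50
X_train = rng.standard_normal((n_train, p))
y_train = 0.05 * np.tanh(X_train[:, 0] * X_train[:, 1]) + rng.standard_normal(n_train)
X_test = rng.standard_normal((n_test, p))
y_test = 0.05 * np.tanh(X_test[:, 0] * X_test[:, 1]) + rng.standard_normal(n_test)

architectures = {
    "NN1 (1 hidden layer)": (32,),
    "NN2 (2 hidden layers)": (32, 16),
    "NN3 (3 hidden layers)": (32, 16, 8),
    "NN5 (5 hidden layers)": (32, 16, 8, 4, 2),
}
for name, layers in architectures.items():
    net = MLPRegressor(hidden_layer_sizes=layers, activation="relu",
                       max_iter=500, early_stopping=True, random_state=0)
    net.fit(X_train, y_train)
    print(f"{name}: OOS R2 = {r2_oos(y_test, net.predict(X_test)):.4f}")
```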
The distance between nonlinear methods and the benchmark widens when predicting portfolio returns. We build bottom-up portfolio-level return forecasts from the stock-level forecasts produced by our models. Consider, for example, bottom-up forecasts of the S&P 500 portfolio return. By aggregating stock-level forecasts from the benchmark three-characteristic OLS model, we find a monthly S&P 500 predictive R2 of 0.11%. The bottom-up S&P 500 forecast from the generalized linear model, in contrast, delivers an R2 of 0.86%. Trees and neural networks improve upon this further, generating monthly out-of-sample R2's between 1.39% and 1.80%. The same pattern emerges for forecasting a variety of characteristic factor portfolios, such as those formed on the basis of size, value, investment, profitability, and momentum. In particular, neural networks produce a positive out-of-sample predictive R2 for every factor portfolio we consider. More pronounced predictive power at the portfolio level versus the stock level is driven by the fact that individual stock returns behave erratically for some of the smallest and least liquid stocks in our sample. Aggregating into portfolios averages out much of the unpredictable stock-level noise while boosting the signal strength. This also helps detect the predictive benefits of machine learning.

The economic gains from machine learning forecasts are large. Our tests show clear statistical rejections of the panel regression benchmark and other linear models in favor of nonlinear machine learning tools. The evidence for economic gains from machine learning forecasts, in the form of portfolio Sharpe ratios, is likewise impressive. For example, an investor who times the S&P 500 based on bottom-up neural network forecasts enjoys a 21 percentage point increase in annualized out-of-sample Sharpe ratio, to 0.63, relative to the 0.42 Sharpe ratio of a buy-and-hold investor. And when we form a long-short decile spread sorted directly on stock return predictions from a neural network, the strategy earns an annualized out-of-sample Sharpe ratio of 2.35. In contrast, an analogous long-short strategy using forecasts from the benchmark panel regression delivers a Sharpe ratio of 0.89.
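For readers who want to see how this kind of strategy-level evaluation is typically computed, the sketch below forms a long-short decile spread from hypothetical one-month-ahead forecasts and annualizes the monthly Sharpe ratio by the square root of 12. The data frame, column names, and toy inputs are placeholders, not the paper's data or procedure.

```python
# Illustrative sketch only: long-short decile spread from return forecasts and
# its annualized Sharpe ratio. All inputs are hypothetical placeholders.
import numpy as np
import pandas as pd

def decile_spread_sharpe(panel: pd.DataFrame) -> float:
    """panel has columns ['month', 'forecast', 'realized'] for many stocks."""
    spreads = []
    for _, grp in panel.groupby("month"):
        deciles = pd.qcut(grp["forecast"], 10, labels=False, duplicates="drop")
        long_ret = grp.loc[deciles == deciles.max(), "realized"].mean()
        short_ret = grp.loc[deciles == 0, "realized"].mean()
        spreads.append(long_ret - short_ret)
    spreads = np.array(spreads)
    # Annualize a monthly Sharpe ratio by sqrt(12).
    return np.sqrt(12) * spreads.mean() / spreads.std(ddof=1)

# Toy usage with random forecasts and returns (so the Sharpe ratio is near zero):
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "month": np.repeat(np.arange(120), 500),
    "forecast": rng.standard_normal(120 * 500),
    "realized": rng.standard_normal(120 * 500),
})
print(f"Annualized Sharpe ratio: {decile_spread_sharpe(toy):.2f}")
```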
The most successful predictors are price trends, liquidity, and volatility. All of the methods we study produce a very similar ranking of the most informative stock-level predictors, which fall into three main categories. First, and most informative of all, are price trend variables including stock momentum, industry momentum, and short-term reversal. Next are liquidity variables including market value, dollar volume, and bid-ask spread. Finally, return volatility, idiosyncratic volatility, market beta, and beta squared are also among the leading predictors in all models we consider.

An interpretation of machine learning facts through simulation. In Appendix A we perform Monte Carlo simulations that support the above interpretations of our analysis. We apply machine learning to simulated data from two different data generating processes. Both produce data from a high-dimensional predictor set. But in one, individual predictors enter only linearly and additively, while in the other predictors can enter through nonlinear transformations and via pairwise interactions. When we apply our machine learning repertoire to the simulated datasets, we find that linear and generalized linear methods dominate in the linear and uninteracted setting, yet tree-based methods and neural networks significantly outperform in the nonlinear and interactive setting.
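As a rough illustration of the distinction between the two settings (this is not the simulation design of Appendix A), one could contrast toy data generating processes along the following lines, where only the second involves nonlinear transformations and a pairwise interaction.

```python
# Illustrative sketch only: two toy data generating processes, one linear and
# additive, one with nonlinear transformations and a pairwise interaction.
# This is not the simulation design used in Appendix A of the paper.
import numpy as np

rng = np.random.default_rng(0)
n, p = 10000, 100
Z = rng.standard_normal((n, p))           # high-dimensional predictor panel
eps = rng.standard_normal(n)              # unforecastable return noise

# DGP 1: predictors enter linearly and additively.
r_linear = 0.02 * Z[:, 0] + 0.02 * Z[:, 1] + 0.02 * Z[:, 2] + eps

# DGP 2: nonlinear transformations plus a pairwise interaction.
r_nonlinear = (0.04 * Z[:, 0] ** 2
               + 0.03 * np.sign(Z[:, 1])
               + 0.05 * Z[:, 2] * Z[:, 3]
               + eps)
```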
1.6 What Machine Learning Cannot Do

Machine learning has great potential for improving risk premium measurement, which is fundamentally a problem of prediction. It amounts to best approximating the conditional expectation $E(r_{i,t+1} \mid \mathcal{F}_t)$, where $r_{i,t+1}$ is an asset's return in excess of the risk-free rate, and $\mathcal{F}_t$ is the true and unobservable information set of market participants. This is a domain in which machine learning algorithms excel.

But, ultimately, these improved predictions are only measurements. The measurements do not tell us about economic mechanisms or equilibria. Machine learning methods on their own do not identify deep fundamental associations among asset prices and conditioning variables. When the objective is to understand economic mechanisms, machine learning may still be useful. It requires the economist to add structure, to build a hypothesized mechanism into the estimation problem, and to decide how to introduce a machine learning algorithm subject to this structure. A nascent literature has begun to make progress marrying machine learning with equilibrium asset pricing (for example, Kelly et al., 2017; Feng et al., 2017), and this remains an exciting direction for future research.
1.7 Literature
Our work extends the empirical literature on stock return prediction, which comes in two basic strands. The first strand models differences in expected returns across stocks as a function of stock-level characteristics, and is exemplified by Fama and French (2008) and Lewellen (2015). The typical approach in this literature runs cross-sectional regressions4 of future stock returns on a few lagged stock characteristics. The second strand forecasts the time series of returns and is surveyed by Koijen and Nieuwerburgh (2011) and Rapach and Zhou (2013). This literature typically conducts time series regressions of broad aggregate portfolio returns on a small number of macroeconomic predictor variables.

Footnote 4: In addition to least squares regression, the literature often sorts assets into portfolios on the basis of characteristics and studies portfolio averages, a form of non-parametric regression.
These traditional methods have potentially severe limitations that more advanced statistical tools in machine learning can help overcome. Most important is that regressions and portfolio sorts are ill-suited to handle the large numbers of predictor variables that the literature has accumulated over five decades. The challenge is how to assess the incremental predictive content of a newly proposed predictor while jointly controlling for the gamut of extant signals (or, relatedly, handling the multiple comparisons and false discovery problem). Our primary contribution is to demonstrate potent return predictability that is harnessable from the large collection of existing variables when machine learning methods are used.
Machine learning methods have appeared sporadically in the asset pricing literature. Rapach et al. (2013) apply LASSO to predict global equity market returns using lagged returns of all countries. Several papers apply neural networks to forecast derivatives prices (Hutchinson et al., 1994; Yao et al., 2000, among others). Khandani et al. (2010) and Butaru et al. (2016) use regression trees to predict consumer credit card delinquencies and defaults. Sirignano et al. (2016) estimate a deep neural network for mortgage prepayment, delinquency, and foreclosure. Heaton et al. (2016) develop a deep learning neural network routine to automate portfolio selection. Recently, variations of machine learning methods have been used to study the cross section of stock returns. Harvey and Liu (2016) study the multiple comparisons problem using a bootstrap procedure. Giglio and Xiu (2016) and Kelly et al. (2017) use dimension reduction methods to estimate and test factor pricing models. Moritz and Zimmermann (2016) apply tree-based models to portfolio sorting. Kozak et al. (2017) and Freyberger et al. (2017) use shrinkage and selection methods to, respectively, approximate a stochastic discount factor and a nonlinear function for expected returns. The focus of our paper is to simultaneously explore a wide range of machine learning methods to study the behavior of expected stock returns, with a particular emphasis on comparative analysis among methods.
2 Methodology
This section describes the collection of machine learning methods that we use in our analysis. In each subsection we introduce a new method and describe it in terms of its three fundamental elements. First is the statistical model describing a method's general functional form for risk premium predictions. The second is an objective function for estimating model parameters. All of our estimates share the basic objective of minimizing mean squared prediction error (MSE). Regularization is introduced through variations on the MSE objective, such as adding parameterization penalties and robustification against outliers. These modifications are designed to avoid problems with overfit and improve models' out-of-sample predictive performance. Finally, even with a small number of predictors, the set of model permutations expands rapidly when one considers nonlinear predictor transformations. This proliferation is compounded in our already high-dimensional predictor set. The third element in each subsection describes computational algorithms for efficiently identifying the optimal specification among the permutations encompassed by a given method.

As we present each method, we aim to provide a sufficiently in-depth description of the statistical model so that a reader having no machine learning background can understand the basic model structure without needing to consult outside sources. At the same time, when discussing the computational methods for estimating each model, we are deliberately terse. There are many variants of each algorithm, and each has its own subtle technical nuances. To avoid bogging down the reader with programming details, we describe our specific implementation choices in Appendix C and refer readers to original sources for further background.

In its most general form, we describe an asset's excess return as an additive prediction error model:
$$r_{i,t+1} = E_t(r_{i,t+1}) + \epsilon_{i,t+1}, \qquad (1)$$

where

$$E_t(r_{i,t+1}) = g^{\star}(z_{i,t}). \qquad (2)$$

Stocks are indexed as $i = 1, \dots, N_t$ and months by $t = 1, \dots, T$. For ease of presentation, we assume a balanced panel of stocks, and defer the discussion on missing data to Section 3.1. Our objective is to isolate a representation of $E_t(r_{i,t+1})$ as a function of predictor variables that maximizes the out-of-sample explanatory power for realized $r_{i,t+1}$. We denote those predictors as the $P$-dimensional vector $z_{i,t}$, and assume the conditional expected return $g^{\star}(\cdot)$ is a flexible function of these predictors.

Despite its flexibility, this framework imposes some important restrictions. The $g^{\star}$ function depends neither on $i$ nor $t$. By maintaining the same form over time and across different stocks, the model leverages information from the entire panel, which lends stability to estimates of risk premia for any individual asset. This is in contrast to standard asset pricing approaches that re-estimate a cross-sectional model each time period, or that independently estimate time series models for each stock. Also, $g^{\star}$ depends on $z$ only through $z_{i,t}$. This means our prediction does not use information from the history prior to $t$, or from individual stocks other than the $i$th.
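For concreteness, a natural way to write the baseline pooled least-squares objective implied by this framework (added here for illustration; the regularized variants discussed below modify this loss with penalties and robust alternatives to squared error) is

$$\mathcal{L}(\theta) = \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\bigl(r_{i,t+1} - g(z_{i,t};\theta)\bigr)^{2},$$

where $g(\cdot\,;\theta)$ is a candidate approximation to $g^{\star}$ with parameter vector $\theta$ and, under the balanced-panel assumption, $N_t = N$ for all $t$.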
2.1 Sample Splitting and Tuning via Validation

Important preliminary steps (prior to discussing specific models and regularization approaches) are to understand how we design disjoint sub-samples for estimation and testing and to introduce the notion of "hyperparameter tuning."
The regularization procedures discussed below, which are machine learning's primary defense against overfitting, rely on a choice of hyperparameters (or, synonymously, "tuning parameters"). These are critical to the performance of machine learning methods as they control model complexity. Hyperparameters include, for example, the penalization parameters in LASSO and elastic net, the number of iterated trees in boosting, the number of random trees in a forest, and the depth of the trees. In most cases, there is little theoretical guidance for how to "tune" hyperparameters for optimized out-of-sample performance.
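As a minimal illustration of validation-based tuning (placeholder data, windows, and candidate grid; not the paper's procedure, which is described next), one might select an elastic-net penalty as follows.

```python
# Illustrative sketch only: choose a penalty hyperparameter on a temporally
# later validation block, then evaluate once on a held-out test block.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
n_months, n_stocks, p = 240, 200, 100
months = np.repeat(np.arange(n_months), n_stocks)
X = rng.standard_normal((n_months * n_stocks, p))
y = 0.03 * X[:, 0] + rng.standard_normal(n_months * n_stocks)

# Maintain temporal ordering: earliest months for training, the next block for
# validation, and the most recent block for out-of-sample testing.
train = months < 120
valid = (months >= 120) & (months < 180)
test = months >= 180

best_alpha, best_mse = None, np.inf
for alpha in [1e-4, 1e-3, 1e-2, 1e-1]:
    model = ElasticNet(alpha=alpha, l1_ratio=0.5, max_iter=5000)
    model.fit(X[train], y[train])
    mse = np.mean((y[valid] - model.predict(X[valid])) ** 2)
    if mse < best_mse:
        best_alpha, best_mse = alpha, mse

# Refit with the selected penalty and score once on the held-out test block.
final = ElasticNet(alpha=best_alpha, l1_ratio=0.5, max_iter=5000).fit(X[train], y[train])
test_mse = np.mean((y[test] - final.predict(X[test])) ** 2)
print(f"selected alpha = {best_alpha}, test MSE = {test_mse:.4f}")
```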
We follow the most common approach in the literature and select tuning parameters adaptively from the data in a validation sample. In particular, we divide our sample into three disjoint time periods that maintain the temporal ordering of the data. The first, or