
Minimum and recommended length of training data for prediction models

Determining how much data is needed for a given statistical method is not a simple problem, and there are no universally correct answers. Some rules of thumb exist, but any such rule is likely to be an oversimplification, because the amount of data required depends on the characteristics of the data in question.

The most important factor is the signal-to-noise ratio of the data. If there is a clear relationship between the input predictors and the target that is not obscured by noise, a model may be able to identify it with relatively little data. When the data are noisy, on the other hand, you may require large amounts of data.
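This effect is easy to see in a small simulation. The sketch below fits an ordinary least squares slope to a synthetic linear relationship, once with little noise and once with heavy noise, using the same number of observations; all names and the chosen noise levels are illustrative assumptions, not part of the article.

```python
import random
import statistics

def fit_slope(x, y):
    """Ordinary least squares slope estimate for the model y = a + b*x."""
    xbar, ybar = statistics.fmean(x), statistics.fmean(y)
    num = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    den = sum((xi - xbar) ** 2 for xi in x)
    return num / den

def slope_error(n, noise_sd, true_slope=2.0, seed=0):
    """Absolute estimation error when fitting n noisy observations."""
    rng = random.Random(seed)
    x = [rng.uniform(0, 10) for _ in range(n)]
    y = [true_slope * xi + rng.gauss(0, noise_sd) for xi in x]
    return abs(fit_slope(x, y) - true_slope)

# With the same 20 observations, a low-noise series recovers the true
# slope far more accurately than a high-noise one.
low_noise_error = slope_error(n=20, noise_sd=0.1)
high_noise_error = slope_error(n=20, noise_sd=10.0)
```

With heavy noise, the only remedy is more observations: the estimation error shrinks roughly with the square root of the sample size.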

Another important factor is the model you choose, and in particular its degrees of freedom. For univariate linear regression with a single input predictor, just a handful of points can give good results. Each input predictor you add to this model increases the degrees of freedom by one, requiring more data points to avoid overfitting. For ARIMA models, most rules of thumb recommend something closer to 30 data points, and if you include seasonal components as well, we might recommend 50 or even 100 data points as a minimum.
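The rules of thumb above could be encoded as a simple helper like the sketch below. Note that the figure of roughly ten observations per estimated coefficient for linear regression is our own illustrative assumption; only the ARIMA numbers come from the discussion above, and none of these are hard requirements.

```python
def recommended_min_observations(model, n_predictors=1, seasonal=False):
    """Illustrative minimum-sample rules of thumb, not hard requirements.

    The amount of data actually needed also depends on the
    signal-to-noise ratio, so treat these as rough starting points.
    """
    if model == "linear":
        # Assumption: roughly 10 observations per estimated coefficient
        # (one per predictor, plus an intercept).
        return 10 * (n_predictors + 1)
    if model == "arima":
        # Around 30 points for a plain ARIMA model; with seasonal
        # components, closer to 100 as a conservative minimum.
        return 100 if seasonal else 30
    raise ValueError(f"unknown model type: {model!r}")
```

Adding a predictor to the linear model raises the recommendation by ten points, mirroring the one-extra-degree-of-freedom argument above.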

It also depends on the purpose of your analysis. If your aim is to quantify the relationships between the time series well enough to predict future data points, you will typically need more data. An extreme example is the task of predicting stock prices. In an efficient market, the available knowledge you could put into the model has to a large extent already been priced in, and whatever signal remains in the data is likely to be swamped in noise and difficult to extract. In such cases you may need an extraordinary amount of data.

Note also that more data is not always better. If your time series are stationary, meaning that their statistical properties do not depend on when they are observed, it never hurts to add more data. However, many time series you encounter in the real world are not stationary. If you go sufficiently far back in time, the situation may have been markedly different from what it is today, and relations that held a few years ago may no longer be relevant to understanding the current market.
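A crude check for this kind of non-stationarity is to compare the two halves of the series; the sketch below does exactly that and nothing more. In practice you would use a formal unit-root test such as the augmented Dickey-Fuller test (available as `adfuller` in statsmodels); the heuristic here is only a minimal, dependency-free illustration.

```python
import statistics

def looks_stationary(series, tolerance=0.5):
    """Crude heuristic: compare the two halves of the series.

    Flags an obvious shift in the mean between the first and second
    half, measured relative to the overall standard deviation. A real
    analysis would use a unit-root test instead.
    """
    half = len(series) // 2
    first, second = series[:half], series[half:]
    scale = statistics.stdev(series) or 1.0  # avoid dividing by zero
    shift = abs(statistics.fmean(first) - statistics.fmean(second)) / scale
    return shift < tolerance

# A series oscillating around a fixed level passes; a trending one fails.
flat = [0.0, 1.0] * 50
trending = [float(t) for t in range(100)]
```

For a trending series, the second half has a markedly different mean than the first, which is exactly the situation where very old data can mislead a model about the present.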

A stark example of this is the financial crisis of 2007–2008. Many macroeconomic indicators and other economic time series show conspicuous behaviour during the crisis, and for many data sets you cannot expect behaviour from this period to be informative about the present. Worse, the movements of a time series during such chaotic periods are often large compared to those in other periods, leading many statistical models to give them disproportionate weight. You should therefore consider leaving aberrant time periods out of the training data.
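Excluding an aberrant period can be as simple as filtering the training observations by date, as in the hypothetical sketch below. The crisis-window boundaries shown are an arbitrary modelling choice for illustration, not a fixed rule, and the data points are made up.

```python
from datetime import date

def exclude_period(observations, start, end):
    """Drop (date, value) observations falling inside [start, end]."""
    return [(d, v) for d, v in observations if not (start <= d <= end)]

# Hypothetical crisis window to exclude from training data. Where to
# draw the boundaries is itself a judgement call.
CRISIS_START = date(2007, 7, 1)
CRISIS_END = date(2009, 6, 30)

training_data = [
    (date(2006, 1, 31), 1.2),
    (date(2008, 9, 30), -8.4),  # inside the crisis window
    (date(2012, 1, 31), 0.7),
]
cleaned = exclude_period(training_data, CRISIS_START, CRISIS_END)
```

An alternative to dropping the period outright is to down-weight it during fitting, which keeps the observations but limits their influence.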

Exabel is a finance technology company based in Oslo, Norway.