Methodology and Out-of-Sample Tests

The ultimate goal of a positive science is the development of a “theory” or “hypothesis” that yields valid and meaningful (i.e., not truistic) predictions about phenomena not yet observed. -Milton Friedman, The Methodology of Positive Economics (1966)

Since Friedman’s seminal methodology article, prediction has been a rhetorical mainstay of the science of economics. Friedman was explicit about the importance of validating theories with evidence “not yet observed”. He clarified, though, that “not yet observed” does not have to mean “in the future”; rather, the observations used to develop a model have to be independent of the observations used to validate it (Gigerenzer).

Any given sample is full of noise that is unique to that sample. A curve that perfectly fits one sample is unlikely to fit other samples. This concern, ‘over-fitting’, refers to the fact that a model can always be constructed to explain every single observation in a given sample. The problem with such a model is that it will explain only that sample. The image to the right visually represents this argument. Gigerenzer and Brighton give a more rigorous explanation of this argument in terms of the bias/variance trade-off.
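The over-fitting argument can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not anything taken from the literature discussed here: the data are simulated from an assumed linear relationship with noise, and the names `memorizer`, `fit_line`, and `draw_sample` are hypothetical helpers introduced only for this illustration. The "memorizer" reproduces every training observation exactly, while the straight line does not; both are then scored on a second, independent sample.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def memorizer(xs, ys):
    """A 'perfect' in-sample model: return the y of the nearest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(predict, xs, ys):
    """Mean squared prediction error of `predict` on the sample (xs, ys)."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def draw_sample(rng, n):
    """Simulated data (an assumption): y = 1 + 2x + standard normal noise."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [1.0 + 2.0 * x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


rng = random.Random(0)                   # fixed seed so the run is repeatable
train_x, train_y = draw_sample(rng, 50)  # sample used to build the models
test_x, test_y = draw_sample(rng, 200)   # independent sample used to validate

a, b = fit_line(train_x, train_y)
models = {
    "memorizer": memorizer(train_x, train_y),
    "line": lambda x: a + b * x,
}

# For each model: (in-sample MSE, out-of-sample MSE)
results = {
    name: (mse(f, train_x, train_y), mse(f, test_x, test_y))
    for name, f in models.items()
}
for name, (in_mse, out_mse) in results.items():
    print(f"{name:10s} in-sample MSE = {in_mse:.3f}   out-of-sample MSE = {out_mse:.3f}")
```

In this setup the memorizer has zero in-sample error by construction, yet it carries the training sample's noise into every prediction, so its error on the independent sample should be noticeably larger than the line's. In-sample fit alone would rank the two models the wrong way around.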

This argument matters for this research because the econometric literature that estimates the spending multiplier has shown very little concern for over-fitting. The conclusion from this is that less weight should be placed on those econometric estimates. It is not clear how much they add to our knowledge of the multiplier.

So what is happening in this literature? Most of the studies are done on the United States between 1920 and 2005. By and large, researchers differentiate their analysis from the others by changing their model, as opposed to testing new samples. There are a few instances of foreign data sets being used with a model from a previous paper. However, the problem then is that there is no recognition of the analysis as an out-of-sample test and thus no evaluation of out-of-sample fit or comparison to other models. Smets and Wouters (2007) provide an example of out-of-sample evaluation of model performance. Similar analyses are rarely performed in the multiplier literature.

Further, the question is always one of comparing and selecting among a set of models according to their ability to explain out-of-sample data. Because there will always be some error in prediction, without such a comparison we have no way of gauging what counts as a “good” predictive outcome. The Smets and Wouters example could be improved by comparing their out-of-sample performance to that of a “naive” or very simple model. I should also note that prior to the 1980s, there was a concern for the predictive ability of the large-scale macro models; however, their results were fairly dismal. Post-1980, very little concern for predictive power has been expressed.

Again, the conclusion from this research is that we have very little evidence in favor of any one of these various models. Fit is not an adequate test of a model, especially if that model is to be used to derive policy advice. The only way to get beyond the concern for over-fitting is to test for out-of-sample predictive accuracy. This methodology is largely absent from econometric estimations of the spending multiplier.
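The comparison against a “naive” model described above can be sketched in a few lines. This is only an illustration under assumptions of my own: the series is simulated rather than real multiplier data, the candidate model is a simple AR(1) fitted by least squares, the naive benchmark is a random-walk forecast (next value equals the last value), and the helper names `fit_ar1`, `one_step_forecasts`, and `relative_rmse` are hypothetical. The model is estimated on the first part of the series only and is not re-estimated as the forecast window moves.

```python
import math
import random


def rmse(forecasts, actuals):
    """Root mean squared error between forecasts and realised values."""
    return math.sqrt(
        sum((f - a) ** 2 for f, a in zip(forecasts, actuals)) / len(actuals)
    )


def fit_ar1(series):
    """Least squares of y_t on a constant and y_{t-1}; returns (intercept, slope)."""
    x, y = series[:-1], series[1:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def one_step_forecasts(series, split, rule):
    """Forecast series[t] from series[t-1] for every t >= split."""
    return [rule(series[t - 1]) for t in range(split, len(series))]


def relative_rmse(series, split):
    """Out-of-sample RMSE of the fitted AR(1) divided by the naive RMSE."""
    intercept, slope = fit_ar1(series[:split])  # estimation sample only
    actuals = series[split:]                    # hold-out sample
    model = one_step_forecasts(series, split, lambda prev: intercept + slope * prev)
    naive = one_step_forecasts(series, split, lambda prev: prev)
    return rmse(model, actuals) / rmse(naive, actuals)


# Simulated series (an assumption): y_t = 0.3 * y_{t-1} + standard normal noise
rng = random.Random(1)
series = [0.0]
for _ in range(399):
    series.append(0.3 * series[-1] + rng.gauss(0.0, 1.0))

ratio = relative_rmse(series, split=300)
print(f"relative RMSE (model / naive): {ratio:.3f}")
```

A ratio below one means the candidate model forecasts the hold-out observations better than the naive rule, and a ratio at or above one means it adds nothing over the naive rule. The raw forecast error of a single model, reported on its own, cannot support either conclusion.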