Which covariates should I include or exclude? Should I include interaction terms? Higher-order terms? What about the “noise”: is it really white? Which diagnostics should I use? What about “multicollinearity”, aka dependence among the “independent variables”? Transformations? Bias? Convergence? Efficiency? And so on. These are the epicycles of regression, well known to every regression adept. Ptolemaic cosmology introduced ad hoc modeling fixes called epicycles to keep the Earth at the center of the universe while “saving the phenomena”, that is, predicting that the planets would be where they were observed to be. But what if we don’t know where the planets are? What if we don’t know, say, the effect of duration of breastfeeding on a child’s IQ? Which epicycles should we use to get the right answer?
A new RFF discussion paper called Vine Regression promises to dispel these epicycles (for a prose explanation of vines, see https://en.wikipedia.org/wiki/Vine_copula). Regular vines are a tool for building complex, high-dimensional densities from bivariate and conditional bivariate pieces. The univariate distributions can be taken from the data, and the vine is chosen to mimic the dependences among the variables. Once we have a joint density, we can compute everything, including the expected value of any one variable given the others. For example, we can compute the expected IQ of a child who was breastfed for 2 weeks, born in 1992 to a 29-year-old mother who completed the 12th grade and has an IQ of 100.13, in a family with a yearly income of $21,500. That child’s expected IQ is 94.6. Had that same child been breastfed for 12 weeks, its expected IQ would be 96.1, and for 22 weeks, 96.8. We can do more than compute. We can sample the density that mimics our data and create a new data set for which we are knock-down, drag-out certain which model is true, because we built it ourselves. Turn all the epicycles loose on this new data set and see which ones come closest to the ground truth. The results for breastfeeding and IQ are in the new discussion paper.
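To make the two moves above concrete (compute a conditional expectation from a joint density, then sample that density to get data with a known ground truth), here is a minimal Python sketch. It is not the paper's method: a plain multivariate Gaussian stands in for the vine, because its conditional mean has a closed form. The variable names, means, and covariances are made-up illustrative values, not numbers from the paper or its data.

```python
import numpy as np


def conditional_mean(mu, cov, x_obs):
    """E[Y | X = x_obs] for a joint Gaussian over (X_1, ..., X_k, Y), Y last."""
    mu_x, mu_y = mu[:-1], mu[-1]
    cov_xx = cov[:-1, :-1]   # covariance among the covariates
    cov_yx = cov[-1, :-1]    # covariance between Y and each covariate
    return mu_y + cov_yx @ np.linalg.solve(cov_xx, x_obs - mu_x)


# Hypothetical joint density (illustrative values only).
# Order: weeks breastfed, mother's IQ, child's IQ.
mu = np.array([10.0, 100.0, 98.0])
cov = np.array([
    [36.0, 18.0, 12.0],
    [18.0, 225.0, 90.0],
    [12.0, 90.0, 225.0],
])

# Step 1: "compute everything" from the joint density, here the expected
# child's IQ at different breastfeeding durations, mother's IQ held at 100.
for weeks in (2.0, 12.0, 22.0):
    expected_iq = conditional_mean(mu, cov, np.array([weeks, 100.0]))
    print(f"{weeks:4.0f} weeks: expected child IQ = {expected_iq:.2f}")

# Step 2: sample the density to build a data set whose truth we know.
rng = np.random.default_rng(0)
synthetic = rng.multivariate_normal(mu, cov, size=50_000)

# Step 3: turn a candidate regression loose on the synthetic data.
design = np.column_stack([np.ones(len(synthetic)), synthetic[:, :-1]])
beta, *_ = np.linalg.lstsq(design, synthetic[:, -1], rcond=None)
ols_slopes = beta[1:]

# Ground-truth slopes implied by the density we sampled from.
true_slopes = np.linalg.solve(cov[:-1, :-1], cov[-1, :-1])
print("true slopes:", true_slopes)
print("OLS slopes: ", ols_slopes)
```

In this Gaussian toy case the linear regression is the true model, so ordinary least squares recovers the ground truth almost exactly. The interesting cases are the ones where the sampled density is not Gaussian and the competing model specifications disagree with each other.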