Regression analysis

In statistics, regression analysis is used to model relationships between variables and determine the magnitude of those relationships. The models can be used to make predictions.

Introduction

Regression analysis models the relationship between one or more response variables (also called dependent variables, explained variables, predicted variables, or regressands), usually denoted $Y$, and the predictors (also called independent variables, explanatory variables, control variables, or regressors), usually denoted $X_1, \ldots, X_p$. Multivariate regression describes models that have more than one response variable.

Types of regression

Simple and multiple linear regression

Simple linear regression and multiple linear regression are related statistical methods for modeling the relationship between variables using a linear equation. Simple linear regression uses a single predictor, while multiple linear regression uses two or more predictors. Linear regression assumes the best estimate of the response is a linear function of the parameters (though not necessarily linear in the predictors).
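As a minimal sketch of simple linear regression, ordinary least squares can be solved directly with NumPy; the data here are hypothetical and chosen so the fit is exact:

```python
import numpy as np

# Hypothetical noiseless data following y = 2x + 1, so ordinary
# least squares should recover the intercept and slope exactly.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Design matrix with an intercept column; lstsq minimizes ||X b - y||^2.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = beta
```

Adding further columns to the design matrix (more predictors) turns the same computation into multiple linear regression.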

Nonlinear regression models

If the relationship between the variables being analyzed is not linear in the parameters, a number of nonlinear regression techniques may be used to obtain a more accurate fit.
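One common technique is iterative nonlinear least squares; the sketch below fits a model that is nonlinear in one of its parameters, using SciPy's `curve_fit` on hypothetical noiseless data:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical example: data generated from y = a * exp(b * x),
# which is nonlinear in the parameter b.
def model(x, a, b):
    return a * np.exp(b * x)

x = np.linspace(0.0, 2.0, 20)
y = model(x, 3.0, 0.5)  # noiseless, so the fit should recover a=3, b=0.5

# curve_fit iteratively minimizes the sum of squared residuals,
# starting from the initial guess p0.
params, _ = curve_fit(model, x, y, p0=(1.0, 0.1))
a_hat, b_hat = params
```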

Other models

Although these three types are the most common, other forms exist, such as Poisson regression and unit-weighted regression, as well as regression methods developed in supervised learning.

Linear models

Predictor variables may be defined quantitatively (i.e., continuous) or qualitatively (i.e., categorical). Categorical predictors are sometimes called factors. Although the method of estimating the model is the same for each case, different situations are sometimes known by different names for historical reasons.

The linear model usually assumes that the dependent variable is continuous. If least-squares estimation is used and the error component is assumed to be normally distributed, the model is fully parametric; if no distributional assumption is made about the errors, the model is semi-parametric. If the data are not normally distributed, there are often better approaches to fitting than least squares. In particular, if the data contain outliers, robust regression might be preferred.
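The effect of an outlier on a least-squares fit, and how a robust alternative behaves, can be sketched as follows; the data and the choice of Huber loss are illustrative, not prescribed by the text:

```python
import numpy as np
from scipy.optimize import least_squares

# Hypothetical data on the line y = 2x + 1, with one gross outlier.
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0
y[9] += 50.0  # a single contaminated observation

X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares: the outlier pulls the slope well away from 2.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Robust fit: the Huber loss caps the influence of large residuals.
def residuals(beta):
    return X @ beta - y

res = least_squares(residuals, x0=[0.0, 0.0], loss="huber", f_scale=1.0)
beta_robust = res.x
```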

If two or more independent variables are highly correlated, we say that the variables are multicollinear. Multicollinearity results in parameter estimates that remain unbiased and consistent, but are inefficient: their variances are inflated.

If the response is not normally distributed but can be modeled by a distribution from the exponential family, generalized linear models should be used. For example, if the response variable can take only binary values (such as a Boolean or Yes/No variable), logistic regression is preferred. The outcome of this type of regression is a function that describes how the probability of a given event (e.g., the probability of a "yes") varies with the predictors.
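A minimal logistic-regression sketch, fit by gradient descent on the log-loss rather than by any particular library routine; the binary data are hypothetical:

```python
import numpy as np

# Hypothetical binary outcome: "yes" (1) whenever the predictor is positive.
x = np.linspace(-3.0, 3.0, 100)
y = (x > 0).astype(float)

X = np.column_stack([np.ones_like(x), x])
beta = np.zeros(2)
for _ in range(2000):
    # Gradient descent on the logistic log-loss.
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    beta -= 0.1 * X.T @ (p - y) / len(y)

def prob_yes(v):
    """Fitted probability of a 'yes' as a function of the predictor."""
    return 1.0 / (1.0 + np.exp(-(beta[0] + beta[1] * v)))
```

The fitted `prob_yes` is exactly the kind of function described above: it maps a predictor value to the probability of the event.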

Regression and Bayesian statistics

Maximum likelihood is one method of estimating the parameters of a regression model; it behaves well for large samples, but for small amounts of data the estimates can have high variance or bias. Bayesian methods can also be used to estimate regression models. A prior is placed over the parameters, incorporating everything known about them. (For example, if one parameter is known to be non-negative, a distribution with non-negative support can be assigned to it.) A posterior distribution is then obtained for the parameter vector. Bayesian methods have the advantage of using all the available information: they are exact rather than asymptotic, and thus work well for small data sets when contextual information is available to inform the prior. Some practitioners use maximum a posteriori (MAP) estimation, a simpler method than full Bayesian analysis, in which the parameters are chosen to maximize the posterior density; the estimate is the mode of the posterior distribution. MAP methods are related to Occam's razor: there is a preference for simplicity among a family of regression models (curves) just as there is a preference for simplicity among competing theories.
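As a concrete sketch of MAP estimation under one simple choice of prior: with Gaussian errors and a zero-mean Gaussian prior on the coefficient, the posterior mode coincides with the ridge-regression solution $(\mathbf{X}^t\mathbf{X} + \lambda I)^{-1}\mathbf{X}^t y$. The data and prior strength below are hypothetical:

```python
import numpy as np

# Hypothetical data on the line y = 2x (no intercept).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x
X = x.reshape(-1, 1)

# MAP estimate under a zero-mean Gaussian prior: ridge regression.
# lam encodes the prior precision relative to the noise variance.
lam = 1.0
beta_map = np.linalg.solve(X.T @ X + lam * np.eye(1), X.T @ y)

# For comparison, the (maximum-likelihood) least-squares estimate.
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)
```

The prior shrinks the MAP estimate toward zero relative to least squares, which is one way the "preference for simplicity" mentioned above shows up in practice.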

Examples

To illustrate the various goals of regression, we will give three examples.

Prediction of future observations

The following data set gives the average heights and weights for American women aged 30-39 (source: The World Almanac and Book of Facts, 1975).

 Height (in)  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
 Weight (lb) 115 117 120 123 126 129 132 135 139 142 146 150 154 159 164

We would like to see how the weight of these women depends on their height. We are therefore looking for a function $\eta$ such that $Y=\eta(X)+\varepsilon$, where $Y$ is the weight of the women and $X$ their height. Intuitively, if the women's proportions and density are roughly constant, their weight should depend on the cube of their height. A plot of the data set is consistent with this supposition:

Image:Data plot women weight vs height.jpg

$\vec{X}$ will denote the vector containing all the measured heights ($\vec{X}=(58,59,60,\cdots)$) and $\vec{Y}=(115,117,120,\cdots)$ the vector containing all measured weights. We can suppose that the errors are independent of one another and have constant variance, so the Gauss-Markov assumptions hold. We can therefore use the least-squares estimator, i.e. we look for coefficients $\theta^0, \theta^1$ and $\theta^2$ satisfying, as well as possible in the least-squares sense, the equation:

$\vec{Y}=\theta^0 + \theta^1 \vec{X} + \theta^2 \vec{X}^3+\vec{\varepsilon}$

Geometrically, this amounts to an orthogonal projection of $\vec{Y}$ onto the subspace generated by the variables $1, X$ and $X^3$. The matrix $\mathbf{X}$ is constructed by putting a first column of 1's (the constant term in the model), a column with the original values (the $X$ in the model), and a third column with these values cubed ($X^3$). The realization of this matrix (i.e. for the data at hand) is:

 $1$  $x$  $x^3$
  1   58   195112
  1   59   205379
  1   60   216000
  1   61   226981
  1   62   238328
  1   63   250047
  1   64   262144
  1   65   274625
  1   66   287496
  1   67   300763
  1   68   314432
  1   69   328509
  1   70   343000
  1   71   357911
  1   72   373248

The matrix $(\mathbf{X}^t \mathbf{X})^{-1}$ (sometimes called the "dispersion matrix") is:

$\left[\begin{matrix} 1.9\cdot10^3&-45&3.5\cdot 10^{-3}\\ -45&1.0&-8.1\cdot 10^{-5}\\ 3.5\cdot 10^{-3}&-8.1\cdot 10^{-5}&6.4\cdot 10^{-9} \end{matrix}\right]$

The least-squares estimate $\widehat{\theta}_{LS}$ is therefore:

$\widehat{\theta}_{LS}=(X^tX)^{-1}X^{t}y= (147, -2.0, 4.3\cdot 10^{-4})$

hence $\eta(X) = 147 - 2.0 X + 4.3\cdot 10^{-4} X^3$
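These coefficients can be reproduced numerically; the sketch below uses NumPy's least-squares solver on the data from the table above:

```python
import numpy as np

# Heights (in) and weights (lb) from the table above.
height = np.arange(58, 73, dtype=float)
weight = np.array([115, 117, 120, 123, 126, 129, 132, 135,
                   139, 142, 146, 150, 154, 159, 164], dtype=float)

# Design matrix with columns 1, x, x^3, as in the model
# Y = theta0 + theta1 * X + theta2 * X^3.
X = np.column_stack([np.ones_like(height), height, height**3])
theta, *_ = np.linalg.lstsq(X, weight, rcond=None)
```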

A plot of this function shows that it fits the data quite closely:

Image:Plot regression women.jpg

The confidence intervals are computed using:

$[\widehat{\theta_j}-\widehat{\sigma}\sqrt{s_j}t_{n-p;1-\frac{\alpha}{2}};\widehat{\theta_j}+\widehat{\sigma}\sqrt{s_j}t_{n-p;1-\frac{\alpha}{2}}]$

with:

$\widehat{\sigma}=0.52$
$s_1=1.9\cdot 10^3,\; s_2=1.0,\; s_3=6.4\cdot 10^{-9}\;$
$\alpha=5\%$
$t_{n-p;1-\frac{\alpha}{2}}=2.2$

Therefore, we can say that the 95% confidence intervals are:

$\theta^0\in[112 , 181]$
$\theta^1\in[-2.8 , -1.2]$
$\theta^2\in[3.6\cdot 10^{-4} , 4.9\cdot 10^{-4}]$
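The interval computation above can be sketched numerically: $\widehat{\sigma}$ from the residuals, $s_j$ from the diagonal of $(\mathbf{X}^t\mathbf{X})^{-1}$, and the Student-$t$ quantile from SciPy (the data are again taken from the table above):

```python
import numpy as np
from scipy import stats

height = np.arange(58, 73, dtype=float)
weight = np.array([115, 117, 120, 123, 126, 129, 132, 135,
                   139, 142, 146, 150, 154, 159, 164], dtype=float)
X = np.column_stack([np.ones_like(height), height, height**3])
theta, *_ = np.linalg.lstsq(X, weight, rcond=None)

n, p = X.shape
resid = weight - X @ theta
sigma_hat = np.sqrt(resid @ resid / (n - p))     # residual std. error
s = np.diag(np.linalg.inv(X.T @ X))              # the s_j above
t_crit = stats.t.ppf(0.975, n - p)               # alpha = 5%, n - p df

half_width = sigma_hat * np.sqrt(s) * t_crit
lower, upper = theta - half_width, theta + half_width
```

Each pair `(lower[j], upper[j])` is the 95% confidence interval for the corresponding coefficient.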
