Data Science Texts

Discover what you don't know, and attack your weaknesses!

NB: We may earn a commission if you buy something via an affiliate link.

Linear Regression

Strongly Recommended Prerequisites

Recommended Prerequisites

Last Updated: 8/29/2021

Executive Summary

Regression is a technique that allows one to determine the value of one or more quantities based on the values of other quantities. Linear regression is a type of regression that assumes this determination can be made based upon a simple, linear relationship. In its simple form, linear regression models the relationship between a nonrandom, one-dimensional \(X\) that is known, and a random, one-dimensional \(Y\) as $$Y = \beta_1X + \beta_0 + \epsilon,$$ where \(\beta_1\) and \(\beta_0\) are unknown constants and \(\epsilon\) is a random variable that may represent measurement error or some other source of randomness. Simple linear regression is easily generalized to allow for multiple predictors or a multi-dimensional \(Y\).
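To make the model concrete, here is a minimal Python sketch that simulates data from the equation above. The parameter values (\(\beta_0 = 2\), \(\beta_1 = 0.5\), error standard deviation 1) are hypothetical, chosen purely for illustration:

```python
import numpy as np

# Hypothetical parameter values chosen for illustration.
rng = np.random.default_rng(0)
beta0, beta1, sigma = 2.0, 0.5, 1.0

x = np.linspace(0, 10, 50)               # nonrandom, known predictor X
eps = rng.normal(0, sigma, size=x.size)  # random error term epsilon
y = beta1 * x + beta0 + eps              # Y = beta1*X + beta0 + epsilon
```

Because the errors are additive noise around a line, a scatter plot of `y` against `x` shows points clustered around the true regression line rather than on it.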

Linear regression is significant for both didactic and practical reasons. Didactically, nearly every important concept in statistics or machine learning appears as a facet of linear regression analysis, so it is frequently used as a simple illustration of those concepts. Practically, linear regression is very widely used because the underlying models are interpretable, they don't require much data, and many real relationships are approximately linear. Given this foundational importance and practical utility, linear regression is a subject worthy of its own book (or books): despite its apparent simplicity, it has so many applications and associated pitfalls that it requires careful study.

Incomplete List of Canonical Problems

This is a sample of the problems that arise and are dealt with in the subject of linear regression.
  1. Fitting Coefficients

    Clearly one must have a way to determine \(\beta_0\) and \(\beta_1\) and their generalizations if one is to make use of linear regression. Finding the best parameters can be done in a variety of ways that balance computational complexity with underlying model assumptions and desired properties of the model.
  2. Inference on Coefficients

The "best" parameters for a given set of data are themselves random quantities, since they depend on the randomness in the data. It is often of interest to determine how certain one can be that those parameters represent some true, underlying relationship. This is (roughly speaking) known as model inference.
  3. Regression Diagnostics

    There are many assumptions that are made when using linear regression and one will usually wish to verify the validity of those assumptions. Diagnostics that can be used for this purpose are a major topic of applied linear regression analysis.
  4. Coercion

    Even if one's data does not meet the assumptions required by linear regression, there are many techniques for making it do so: transformations can be applied to make relationships linear, robust models can be used for pathological randomness, and regularization techniques can be used in cases where fitting coefficients is otherwise difficult or unstable. There are many more ways in which data can be coerced.
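The first two problems above can be sketched together in Python. This is an illustrative sketch on simulated data (all numbers are hypothetical), not the treatment given in any of the recommended books: it computes the least-squares estimates of \(\beta_0\) and \(\beta_1\) from the normal equations, then a standard error and t-statistic for the slope:

```python
import numpy as np

# Simulated data with a known true slope of 0.5 and intercept of 2.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 0.5 * x + 2.0 + rng.normal(0, 1.0, size=x.size)

# Problem 1, fitting coefficients: least-squares via the design matrix [1, x].
X = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1 = beta_hat

# Problem 2, inference: t-statistic for H0: beta1 = 0.
resid = y - X @ beta_hat
n, p = X.shape
sigma2_hat = resid @ resid / (n - p)         # unbiased estimate of error variance
cov = sigma2_hat * np.linalg.inv(X.T @ X)    # estimated covariance of beta_hat
se_b1 = np.sqrt(cov[1, 1])
t_stat = b1 / se_b1
```

A large t-statistic (compared to a t-distribution with \(n - 2\) degrees of freedom) indicates that the fitted slope is unlikely to be an artifact of the noise.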
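Items 3 and 4 can also be illustrated with a small sketch (again with hypothetical numbers). Fitting a line to exponentially growing data leaves systematic curvature in the residuals, which a diagnostic can detect; regressing \(\log y\) on \(x\) instead is a coercive transformation that makes the relationship linear:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(1, 5, 40)
# An exponential relationship: nonlinear in x, but linear after a log transform.
y = 3.0 * np.exp(0.8 * x) * rng.lognormal(0, 0.1, size=x.size)

X = np.column_stack([np.ones_like(x), x])

# Diagnostic: fit the raw data and keep the residuals to inspect for structure.
beta_raw, *_ = np.linalg.lstsq(X, y, rcond=None)
resid_raw = y - X @ beta_raw

# Coercion: regress log(y) on x; its residuals should look unstructured.
beta_log, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
resid_log = np.log(y) - X @ beta_log

def curvature(r):
    # A crude diagnostic: correlation of residuals with the centered square
    # of x. Large values signal unmodeled curvature.
    z = (x - x.mean()) ** 2
    return abs(np.corrcoef(r, z)[0, 1])
```

On this data, `curvature(resid_raw)` is large and `curvature(resid_log)` is small, so the log transform has repaired the violated linearity assumption. A residual-versus-fitted plot would show the same thing graphically.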

The Effect of an Outlier on a Regression Fit

The red line is fit to the points including the red outlier, the black line is fit to just the black points, and the gray line is the true regression line. Clearly the outlying point greatly influences the regression fit. Dealing with outliers is just one of the many important topics in applied linear regression.
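The effect described above is easy to reproduce. This sketch (with hypothetical numbers) fits a line with and without a single high-leverage outlier and compares the slopes:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 20)
y = 1.0 * x + rng.normal(0, 0.5, size=x.size)  # true slope is 1

def fit_slope(xs, ys):
    # Least-squares slope from the design matrix [1, x].
    X = np.column_stack([np.ones_like(xs), xs])
    return np.linalg.lstsq(X, ys, rcond=None)[0][1]

slope_clean = fit_slope(x, y)

# Add one outlier far below the trend at the edge of the x-range,
# where it has high leverage.
x_out = np.append(x, 10.0)
y_out = np.append(y, -20.0)
slope_out = fit_slope(x_out, y_out)
```

A single point is enough to drag the fitted slope well away from the true value of 1, which is why outlier detection and robust regression receive so much attention in practice.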

Recommended Books

  1. Introduction to Linear Regression Analysis

    Douglas C. Montgomery, Elizabeth A. Peck, G. Geoffrey Vining

    Check it out on Amazon!

    Key Features

    • In-text exercises
    • Solution manual available
    • R and SAS code examples

    Key Topics

    • Box-Cox Transformation
    • Confidence Intervals
    • Diagnostics
    • Generalized Linear Models
    • Hypothesis Testing
    • Least-Squares Estimation
    • Leverage and Influence
    • Logistic Regression
    • Maximum-Likelihood Estimation
    • Model Adequacy Checking
    • Model Validation
    • Multicollinearity
    • Multiple Linear Regression
    • Nonlinear Regression
    • Nonparametric Regression
    • Outliers
    • PRESS Statistic
    • Poisson Regression
    • Polynomial Regression
    • Prediction
    • Random Regressors
    • Residual Analysis
    • Robust Regression
    • Simple Linear Regression
    • Time Series
    • Transformations
    • Variable Selection
    • Variance-Stabilizing Transformations
    • Weighted Least-Squares


    This book gives a fairly standard introduction to simple and multiple linear regression, and then it devotes most of the text to dealing with their practical problems. Detecting and dealing with multicollinearity and outliers, as well as many diagnostics and other practical topics, occupy the majority of the book. Generalized linear models are introduced, but they really need their own treatment (we recommend some here).