This note is LLM-generated. Accessible at: https://chatgpt.com/share/67a35cf5-2f58-800d-9816-3b27a8c4e95f

Role of the Gram Matrix in Linear Regression

The Gram matrix plays a crucial role in linear regression, particularly in the normal equations and in understanding collinearity among features. Let’s break it down step by step, including derivations.


1. Linear Regression and Normal Equations

In linear regression, we model the relationship between a dependent variable y and a set of independent variables (features) collected in the matrix X using the equation:

y = X\beta + \varepsilon

where:

  • X is the n × p design matrix (with n samples and p features),
  • β is the p × 1 vector of regression coefficients,
  • ε is the n × 1 vector of errors.

The Ordinary Least Squares (OLS) method estimates β by minimizing the sum of squared residuals:

\min_{\beta} \| y - X\beta \|^2

Taking the derivative with respect to β and setting it to zero:

X^T (y - X\beta) = 0 \quad \Longrightarrow \quad X^T y = X^T X \beta

Solving for β:

\beta = (X^T X)^{-1} X^T y

Here, the matrix X^T X is the Gram matrix.
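As a quick illustration (a minimal NumPy sketch with made-up data; the coefficients and sizes below are arbitrary choices), the estimator can be computed directly from the normal equations:

```python
import numpy as np

# Synthetic data for illustration: n = 100 samples, p = 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + 0.1 * rng.normal(size=100)

# Normal equations: (X^T X) beta = X^T y
G = X.T @ X                      # the Gram matrix
beta_hat = np.linalg.solve(G, X.T @ y)

print(beta_hat)                  # should be close to [2, -1, 0.5]
```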


2. The Gram Matrix

The Gram matrix is defined as:

G = X^T X

  • It is a p × p symmetric matrix.
  • The element G_{ij} = X_i^T X_j is the dot product between feature columns X_i and X_j, which measures how strongly those columns align (for centered, unit-norm columns it equals their correlation).
  • If the columns of X are orthonormal, then G is the identity matrix (both properties are illustrated in the sketch below).
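A small sketch (NumPy, random data chosen only for illustration) that checks the properties above: symmetry, the dot-product interpretation of G_{ij}, and G = I for orthonormal columns:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                     # 5 samples, 3 features

G = X.T @ X                                     # 3 x 3 Gram matrix
print(np.allclose(G, G.T))                      # True: G is symmetric
print(np.isclose(G[0, 1], X[:, 0] @ X[:, 1]))   # True: G_ij is the dot product of columns i and j

# Orthonormal columns (e.g. the Q factor of a QR decomposition) give G = I
Q, _ = np.linalg.qr(X)
print(np.allclose(Q.T @ Q, np.eye(3)))          # True
```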

3. Why the Gram Matrix Matters in Linear Regression

a) Existence of the Solution

  • For \beta = (X^T X)^{-1} X^T y to exist, X^T X must be invertible.
  • If X^T X is singular (non-invertible), the columns of X are linearly dependent and the least squares solution is not unique; the regression problem is ill-posed (see the example below).
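For example (a NumPy sketch with made-up numbers), duplicating a column up to a scale factor makes X^T X singular, so its inverse does not exist:

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([x1, 2 * x1])      # second column is exactly twice the first

G = X.T @ X
print(np.linalg.matrix_rank(G))        # 1 < 2: the Gram matrix is rank-deficient
print(np.linalg.det(G))                # ≈ 0: singular, so (X^T X)^{-1} does not exist
```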

b) Multicollinearity

  • If two or more columns of X are highly correlated, X^T X becomes ill-conditioned (its smallest eigenvalue, and hence its determinant, is close to zero).
  • This leads to unstable estimates of β with very large variances.
  • The condition number of the Gram matrix is a standard diagnostic for multicollinearity (see the sketch below).
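A sketch of the condition-number diagnostic (NumPy; the data and the 1e-4 perturbation are arbitrary choices): a nearly duplicated column makes the Gram matrix ill-conditioned and the individual coefficient estimates unstable:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 1e-4 * rng.normal(size=n)    # nearly identical to x1: strong multicollinearity
X = np.column_stack([x1, x2])

G = X.T @ X
print(np.linalg.cond(G))               # huge condition number flags the problem

# Two independent noise realizations give very different individual coefficients,
# even though the underlying relationship (y depends only on x1) is the same.
b1 = np.linalg.solve(G, X.T @ (x1 + rng.normal(size=n)))
b2 = np.linalg.solve(G, X.T @ (x1 + rng.normal(size=n)))
print(b1, b2)                          # unstable individual coefficients; their sum stays near 1
```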

c) Computational Efficiency

  • If p (the number of features) is small compared to n (the number of samples), the normal equations involve only the p × p matrix X^T X, so forming it and solving X^T X β = X^T y is cheap; in practice one solves this small linear system rather than inverting matrices explicitly (see the sketch below).
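A rough sketch of this point (NumPy, synthetic data; the sizes are arbitrary): with n much larger than p, the normal-equation system is only p × p, and solving it agrees with a general least-squares routine:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100_000, 5                      # many samples, few features
X = rng.normal(size=(n, p))
y = X @ np.arange(1.0, p + 1) + rng.normal(size=n)

G = X.T @ X                            # only p x p: cheap to form and solve when p << n
beta_normal = np.linalg.solve(G, X.T @ y)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_normal, beta_lstsq))   # True (up to floating-point error)
```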

4. Example

Consider a simple dataset with two features:

X = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix}

The Gram matrix is:

X^T X = \begin{bmatrix} 1 & 3 & 5 \\ 2 & 4 & 6 \end{bmatrix} \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix} = \begin{bmatrix} 1+9+25 & 2+12+30 \\ 2+12+30 & 4+16+36 \end{bmatrix} = \begin{bmatrix} 35 & 44 \\ 44 & 56 \end{bmatrix}

This matrix is used to compute β. If it were singular (determinant zero), the columns would be exactly collinear; here det(X^T X) = 35 · 56 − 44² = 24, so the two features are linearly independent.
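The arithmetic above can be checked in a couple of lines (NumPy):

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
G = X.T @ X
print(G)                 # [[35. 44.] [44. 56.]]
print(np.linalg.det(G))  # ≈ 24: nonzero, so the columns are not collinear
```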


5. Alternative Solutions

If X^T X is singular or badly ill-conditioned, alternative techniques include:

  • Ridge Regression (Tikhonov Regularization): Adds a regularization term λI to stabilize the inversion (see the sketch after this list):

    \beta = (X^T X + \lambda I)^{-1} X^T y

  • Principal Component Regression (PCR): Uses the eigenvectors of X^T X (the principal components) to reduce dimensionality.
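A minimal ridge-regression sketch (NumPy; the data and the value λ = 0.01 are arbitrary choices for illustration), showing that adding λI makes the system solvable even when X^T X is exactly singular:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
X = np.column_stack([x1, 2 * x1])          # exactly collinear columns: X^T X is singular
y = 3 * x1 + 0.1 * rng.normal(size=n)

lam = 1e-2
G = X.T @ X
beta_ridge = np.linalg.solve(G + lam * np.eye(2), X.T @ y)
print(beta_ridge)                          # finite, stable estimates despite the singular Gram matrix
```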


Conclusion

The Gram matrix X^T X is central to linear regression because:

  1. It determines whether the least squares solution exists.
  2. It captures correlations between features.
  3. It affects numerical stability and multicollinearity detection.
  4. It plays a role in regularized regression techniques.

Deriving the Gradient with respect to the weight vector

To show why X^T (y - X\beta) appears in the derivative of the Ordinary Least Squares (OLS) loss function, we will go step by step.

Step 1: Define the Loss Function

In OLS regression, we minimize the sum of squared errors:

L(\beta) = \| y - X\beta \|^2 = (y - X\beta)^T (y - X\beta)

Expanding the quadratic term:

L(\beta) = y^T y - 2 y^T X \beta + \beta^T X^T X \beta
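The expansion can be sanity-checked numerically (a NumPy sketch with arbitrary values):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = rng.normal(size=10)
beta = rng.normal(size=3)

lhs = (y - X @ beta) @ (y - X @ beta)                     # ||y - X beta||^2
rhs = y @ y - 2 * y @ X @ beta + beta @ X.T @ X @ beta    # expanded form
print(np.isclose(lhs, rhs))                               # True
```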

Step 2: Compute the Gradient

We differentiate L(β) with respect to β, using matrix calculus rules:

First Term: y^T y

  • This is independent of β, so its derivative is zero.

Second Term: -2 y^T X \beta

Using the derivative rule:

\frac{d}{d\beta} (c^T \beta) = c

where c = -2 X^T y (since -2 y^T X \beta = (-2 X^T y)^T \beta), we get:

\frac{d}{d\beta} (-2 y^T X \beta) = -2 X^T y

Third Term: \beta^T X^T X \beta

Using the derivative rule:

\frac{d}{d\beta} (\beta^T A \beta) = 2 A \beta

where A = X^T X, which is symmetric (as the rule above requires), we get:

\frac{d}{d\beta} (\beta^T X^T X \beta) = 2 X^T X \beta
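Both matrix-calculus rules can be checked against finite differences (a NumPy sketch; A below is an arbitrary symmetric matrix and the step size is a conventional choice):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4
c = rng.normal(size=p)
M = rng.normal(size=(p, p))
A = M + M.T                        # symmetric, as required for the quadratic-form rule
beta = rng.normal(size=p)
eps = 1e-6

def num_grad(f, b):
    """Central finite-difference gradient of a scalar function f at b."""
    g = np.zeros_like(b)
    for i in range(len(b)):
        e = np.zeros_like(b)
        e[i] = eps
        g[i] = (f(b + e) - f(b - e)) / (2 * eps)
    return g

print(np.allclose(num_grad(lambda b: c @ b, beta), c))               # d/db (c^T b) = c
print(np.allclose(num_grad(lambda b: b @ A @ b, beta), 2 * A @ beta,
                  atol=1e-4))                                        # d/db (b^T A b) = 2 A b
```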

Step 3: Combine the Terms

Summing all terms:

\frac{dL}{d\beta} = -2 X^T y + 2 X^T X \beta

Factoring out 2:

\frac{dL}{d\beta} = 2 (X^T X \beta - X^T y)

Setting the derivative to zero for minimization:

X^T X \beta = X^T y

This is the normal equation; solving for β (when X^T X is invertible) gives:

\beta = (X^T X)^{-1} X^T y

Step 4: Relating to X^T (y - X\beta)

Looking at the gradient expression:

X^T X \beta - X^T y = X^T (X\beta - y) = - X^T (y - X\beta)

so the gradient can be written as:

\frac{dL}{d\beta} = -2 X^T (y - X\beta)

The constant factor -2 does not affect where the gradient vanishes, so the first-order condition (gradient equal to zero) corresponds to:

X^T (y - X\beta) = 0

This shows that the gradient of the squared loss is proportional to X^T (y - X\beta) (it equals -2 X^T (y - X\beta)), which is why we set this expression to zero to solve for β.
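A final numerical check (NumPy sketch, synthetic data): the analytic gradient -2 X^T (y - Xβ) matches a finite-difference gradient, and the residuals are orthogonal to the columns of X at the OLS solution:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)

def loss(b):
    r = y - X @ b
    return r @ r                        # squared-error loss L(b)

def grad(b):
    return -2 * X.T @ (y - X @ b)       # the analytic gradient derived above

beta = rng.normal(size=3)
eps = 1e-6
num = np.array([(loss(beta + eps * e) - loss(beta - eps * e)) / (2 * eps)
                for e in np.eye(3)])
print(np.allclose(num, grad(beta), atol=1e-4))               # True: analytic gradient matches

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(X.T @ (y - X @ beta_ols), 0, atol=1e-8))   # True: gradient vanishes at the OLS solution
```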


Conclusion

The expression X^T (y - X\beta) comes from differentiating the squared loss function. Setting it to zero gives the normal equations, which yield the OLS estimator.