This note is LLM-generated. Accessible at: https://chatgpt.com/share/67a35cf5-2f58-800d-9816-3b27a8c4e95f

Role of the Gram Matrix in Linear Regression

The Gram matrix plays a crucial role in linear regression, particularly in the normal equations and in understanding collinearity among features. Let’s break it down step by step, including derivations.


1. Linear Regression and Normal Equations

In linear regression, we model the relationship between a dependent variable y and a set of independent variables (features) collected in the matrix X using the equation:

y = X\beta + \varepsilon

where:

  • X is the n × p design matrix (with n samples and p features),
  • β is the p × 1 vector of regression coefficients,
  • ε is the n × 1 vector of errors.

The Ordinary Least Squares (OLS) method estimates β by minimizing the sum of squared residuals:

\min_{\beta} \| y - X\beta \|^2

Taking the derivative with respect to β and setting it to zero:

X^T (y - X\beta) = 0 \quad \Longrightarrow \quad X^T y = X^T X \beta

Solving for β:

\beta = (X^T X)^{-1} X^T y

Here, the matrix X^T X is the Gram matrix.
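As a quick illustration (a minimal NumPy sketch with made-up data; the coefficients and sizes below are arbitrary choices), the estimator can be computed directly from the normal equations:

```python
import numpy as np

# Synthetic data for illustration: n = 100 samples, p = 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + 0.1 * rng.normal(size=100)

# Normal equations: (X^T X) beta = X^T y
G = X.T @ X                      # the Gram matrix
beta_hat = np.linalg.solve(G, X.T @ y)

print(beta_hat)                  # should be close to [2, -1, 0.5]
```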


2. The Gram Matrix

The Gram matrix is defined as:

G = X^T X

  • It is a p × p symmetric matrix.
  • The element G_{ij} = X_i^T X_j is the dot product between feature columns X_i and X_j, which measures how strongly those columns align (for centered, unit-norm columns it equals their correlation).
  • If the columns of X are orthonormal, then G is the identity matrix (both properties are illustrated in the sketch below).
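A small sketch (NumPy, random data chosen only for illustration) that checks the properties above: symmetry, the dot-product interpretation of G_{ij}, and G = I for orthonormal columns:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                     # 5 samples, 3 features

G = X.T @ X                                     # 3 x 3 Gram matrix
print(np.allclose(G, G.T))                      # True: G is symmetric
print(np.isclose(G[0, 1], X[:, 0] @ X[:, 1]))   # True: G_ij is the dot product of columns i and j

# Orthonormal columns (e.g. the Q factor of a QR decomposition) give G = I
Q, _ = np.linalg.qr(X)
print(np.allclose(Q.T @ Q, np.eye(3)))          # True
```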

3. Why the Gram Matrix Matters in Linear Regression

a) Existence of the Solution

  • For \beta = (X^T X)^{-1} X^T y to exist, X^T X must be invertible.
  • If X^T X is singular (non-invertible), the columns of X are linearly dependent and the least squares solution is not unique; the regression problem is ill-posed (see the example below).
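For example (a NumPy sketch with made-up numbers), duplicating a column up to a scale factor makes X^T X singular, so its inverse does not exist:

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([x1, 2 * x1])      # second column is exactly twice the first

G = X.T @ X
print(np.linalg.matrix_rank(G))        # 1 < 2: the Gram matrix is rank-deficient
print(np.linalg.det(G))                # ≈ 0: singular, so (X^T X)^{-1} does not exist
```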

b) Multicollinearity

  • If two or more columns of X are highly correlated, X^T X becomes ill-conditioned (its smallest eigenvalue, and hence its determinant, is close to zero).
  • This leads to unstable estimates of β with very large variances.
  • The condition number of the Gram matrix is a standard diagnostic for multicollinearity (see the sketch below).
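A sketch of the condition-number diagnostic (NumPy; the data and the 1e-4 perturbation are arbitrary choices): a nearly duplicated column makes the Gram matrix ill-conditioned and the individual coefficient estimates unstable:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 1e-4 * rng.normal(size=n)    # nearly identical to x1: strong multicollinearity
X = np.column_stack([x1, x2])

G = X.T @ X
print(np.linalg.cond(G))               # huge condition number flags the problem

# Two independent noise realizations give very different individual coefficients,
# even though the underlying relationship (y depends only on x1) is the same.
b1 = np.linalg.solve(G, X.T @ (x1 + rng.normal(size=n)))
b2 = np.linalg.solve(G, X.T @ (x1 + rng.normal(size=n)))
print(b1, b2)                          # unstable individual coefficients; their sum stays near 1
```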

c) Computational Efficiency

  • If p (the number of features) is small compared to n (the number of samples), the normal equations involve only the p × p matrix X^T X, so forming it and solving X^T X β = X^T y is cheap; in practice one solves this small linear system rather than inverting matrices explicitly (see the sketch below).
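A rough sketch of this point (NumPy, synthetic data; the sizes are arbitrary): with n much larger than p, the normal-equation system is only p × p, and solving it agrees with a general least-squares routine:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100_000, 5                      # many samples, few features
X = rng.normal(size=(n, p))
y = X @ np.arange(1.0, p + 1) + rng.normal(size=n)

G = X.T @ X                            # only p x p: cheap to form and solve when p << n
beta_normal = np.linalg.solve(G, X.T @ y)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_normal, beta_lstsq))   # True (up to floating-point error)
```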

4. Example

Consider a simple dataset with two features:

X = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix}

The Gram matrix is:

X^T X = \begin{bmatrix} 1 & 3 & 5 \\ 2 & 4 & 6 \end{bmatrix} \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix} = \begin{bmatrix} 1+9+25 & 2+12+30 \\ 2+12+30 & 4+16+36 \end{bmatrix} = \begin{bmatrix} 35 & 44 \\ 44 & 56 \end{bmatrix}

This matrix is used to compute β. If it were singular (determinant zero), the columns would be exactly collinear; here det(X^T X) = 35 · 56 − 44² = 24, so the two features are linearly independent.
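The arithmetic above can be checked in a couple of lines (NumPy):

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
G = X.T @ X
print(G)                 # [[35. 44.] [44. 56.]]
print(np.linalg.det(G))  # ≈ 24: nonzero, so the columns are not collinear
```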


5. Alternative Solutions

If X^T X is singular or badly ill-conditioned, alternative techniques include:

  • Ridge Regression (Tikhonov Regularization): Adds a regularization term λI to stabilize the inversion (see the sketch after this list):

    \beta = (X^T X + \lambda I)^{-1} X^T y

  • Principal Component Regression (PCR): Uses the eigenvectors of X^T X (the principal components) to reduce dimensionality.
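A minimal ridge-regression sketch (NumPy; the data and the value λ = 0.01 are arbitrary choices for illustration), showing that adding λI makes the system solvable even when X^T X is exactly singular:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
X = np.column_stack([x1, 2 * x1])          # exactly collinear columns: X^T X is singular
y = 3 * x1 + 0.1 * rng.normal(size=n)

lam = 1e-2
G = X.T @ X
beta_ridge = np.linalg.solve(G + lam * np.eye(2), X.T @ y)
print(beta_ridge)                          # finite, stable estimates despite the singular Gram matrix
```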


Conclusion

The Gram matrix X^T X is central to linear regression because:

  1. It determines whether the least squares solution exists.
  2. It captures correlations between features.
  3. It affects numerical stability and multicollinearity detection.
  4. It plays a role in regularized regression techniques.

Deriving the Gradient with respect to the weight vector

To show why X^T (y - X\beta) appears in the derivative of the Ordinary Least Squares (OLS) loss function, we will go step by step.

Step 1: Define the Loss Function

In OLS regression, we minimize the sum of squared errors:

L(\beta) = \| y - X\beta \|^2 = (y - X\beta)^T (y - X\beta)

Expanding the quadratic term:

L(\beta) = y^T y - 2 y^T X \beta + \beta^T X^T X \beta
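The expansion can be sanity-checked numerically (a NumPy sketch with arbitrary values):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = rng.normal(size=10)
beta = rng.normal(size=3)

lhs = (y - X @ beta) @ (y - X @ beta)                     # ||y - X beta||^2
rhs = y @ y - 2 * y @ X @ beta + beta @ X.T @ X @ beta    # expanded form
print(np.isclose(lhs, rhs))                               # True
```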

Step 2: Compute the Gradient

We differentiate L(β) with respect to β, using matrix calculus rules:

First Term: y^T y

  • This is independent of β, so its derivative is zero.

Second Term: -2 y^T X \beta

Using the derivative rule:

\frac{d}{d\beta} (c^T \beta) = c

where c = -2 X^T y (since -2 y^T X \beta = (-2 X^T y)^T \beta), we get:

\frac{d}{d\beta} (-2 y^T X \beta) = -2 X^T y

Third Term: \beta^T X^T X \beta

Using the derivative rule:

\frac{d}{d\beta} (\beta^T A \beta) = 2 A \beta

where A = X^T X, which is symmetric (as the rule above requires), we get:

\frac{d}{d\beta} (\beta^T X^T X \beta) = 2 X^T X \beta
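Both matrix-calculus rules can be checked against finite differences (a NumPy sketch; A below is an arbitrary symmetric matrix and the step size is a conventional choice):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4
c = rng.normal(size=p)
M = rng.normal(size=(p, p))
A = M + M.T                        # symmetric, as required for the quadratic-form rule
beta = rng.normal(size=p)
eps = 1e-6

def num_grad(f, b):
    """Central finite-difference gradient of a scalar function f at b."""
    g = np.zeros_like(b)
    for i in range(len(b)):
        e = np.zeros_like(b)
        e[i] = eps
        g[i] = (f(b + e) - f(b - e)) / (2 * eps)
    return g

print(np.allclose(num_grad(lambda b: c @ b, beta), c))               # d/db (c^T b) = c
print(np.allclose(num_grad(lambda b: b @ A @ b, beta), 2 * A @ beta,
                  atol=1e-4))                                        # d/db (b^T A b) = 2 A b
```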

Step 3: Combine the Terms

Summing all terms:

\frac{dL}{d\beta} = -2 X^T y + 2 X^T X \beta

Factoring out 2:

\frac{dL}{d\beta} = 2 (X^T X \beta - X^T y)

Setting the derivative to zero for minimization:

X^T X \beta = X^T y

This is the normal equation; solving for β (when X^T X is invertible) gives:

\beta = (X^T X)^{-1} X^T y

Step 4: Relating to X^T (y - X\beta)

Looking at the gradient expression:

X^T X \beta - X^T y = X^T (X\beta - y) = - X^T (y - X\beta)

so the gradient can be written as:

\frac{dL}{d\beta} = -2 X^T (y - X\beta)

The constant factor -2 does not affect where the gradient vanishes, so the first-order condition (gradient equal to zero) corresponds to:

X^T (y - X\beta) = 0

This shows that the gradient of the squared loss is proportional to X^T (y - X\beta) (it equals -2 X^T (y - X\beta)), which is why we set this expression to zero to solve for β.
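A final numerical check (NumPy sketch, synthetic data): the analytic gradient -2 X^T (y - Xβ) matches a finite-difference gradient, and the residuals are orthogonal to the columns of X at the OLS solution:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)

def loss(b):
    r = y - X @ b
    return r @ r                        # squared-error loss L(b)

def grad(b):
    return -2 * X.T @ (y - X @ b)       # the analytic gradient derived above

beta = rng.normal(size=3)
eps = 1e-6
num = np.array([(loss(beta + eps * e) - loss(beta - eps * e)) / (2 * eps)
                for e in np.eye(3)])
print(np.allclose(num, grad(beta), atol=1e-4))               # True: analytic gradient matches

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(X.T @ (y - X @ beta_ols), 0, atol=1e-8))   # True: gradient vanishes at the OLS solution
```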


Conclusion

The expression X^T (y - X\beta) comes from differentiating the squared loss function. Setting it to zero gives the normal equations, which yield the OLS estimator.