Preface
Goal: Go further, from linear regression to polynomial regression. The theoretical foundations explain why the math works.
The statistical properties of polynomial regression are different, especially in the use of the Gram matrix. Beyond basic math, we need some linear algebra to do the matrix operations.
Warning: This article contains basic matrix operations, such as transpose and inverse. Please be prepared.
Cheatsheet
I combined different things into a single coherent diagram.
- Regression model
- Error and residual analysis
- Hypothesis testing
- R² and R²(adjusted)
You can see how polynomial regression differs from linear regression.
You can obtain both the SVG source here:
Note that I only cover samples, not the population. The population case differs only slightly. We will get to this later in this article.
1: Equation for Polynomial
Polynomial Regression (not the least square method)
Instead of focusing on covariance, the derivation starts with the Gram matrix.
Big Picture
Most of the equations in the left-side part have already been covered in the polynomial calculation article.
I don’t think I need to repeat those equations over and over again, but let’s describe the notation in other forms.
Observed Data
Let’s say we have some pairs of points from data observations.
We are going to fit curves to them and get the prediction results:
Let’s find the solution for this using polynomial coefficients.
Fit Model
Polynomial Prediction Model
We have the generic theoretical model in this explicit form:
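As a plain-text reference, a d-th order polynomial model can be sketched in LaTeX as (d is the polynomial degree):

```latex
\hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_d x^d
```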
Most of the time I only consider these three curve-fitting models for prediction:
- Linear Fit (first order)
- Quadratic Fit (second order)
- Cubic Fit (third order)
Here the order is the degree of the polynomial, and β (beta) denotes the coefficient for each polynomial degree.
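As a reference sketch, the three fits written out explicitly are:

```latex
\begin{aligned}
\text{Linear (first order):}     &\quad \hat{y} = \beta_0 + \beta_1 x \\
\text{Quadratic (second order):} &\quad \hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2 \\
\text{Cubic (third order):}      &\quad \hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3
\end{aligned}
```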
Note: higher-order terms help model curvature, but they also risk overfitting. This can be problematic in real-world situations.
Matrix Form
From here we have the famous Vandermonde matrix; in general it looks like this.
Where
- n: Number of observations (rows in X)
For the three curve fits, the actual matrices can be described as follows:
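For example, a sketch of the cubic design matrix (the linear and quadratic fits simply drop the higher-power columns):

```latex
X =
\begin{bmatrix}
1 & x_1 & x_1^2 & x_1^3 \\
1 & x_2 & x_2^2 & x_2^3 \\
\vdots & \vdots & \vdots & \vdots \\
1 & x_n & x_n^2 & x_n^3
\end{bmatrix}
```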
Solution
This is the solution, β = (Xᵗ.X)ˉ¹.Xᵗ.y, where X is the matrix of predictor variables and y is the vector of dependent variables.
For each curve fit, the coefficient solutions can be described as follows:
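As a computational sketch (not part of the original tabular walkthrough), the same solution can be obtained with NumPy. The data points below are made-up placeholders; swap in your own observations:

```python
import numpy as np

# Hypothetical observed pairs (x_i, y_i); replace with your own sample.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 9.2, 15.8, 26.1, 38.0])

def fit_polynomial(x, y, degree):
    """Estimate beta = (X^T X)^-1 X^T y for a polynomial of the given degree."""
    X = np.vander(x, N=degree + 1, increasing=True)  # columns: 1, x, x^2, ...
    gram = X.T @ X                                   # Gram matrix (X^T X)
    beta = np.linalg.solve(gram, X.T @ y)            # solve instead of inverting
    return beta

print(fit_polynomial(x, y, 1))  # linear:    [b0, b1]
print(fit_polynomial(x, y, 2))  # quadratic: [b0, b1, b2]
print(fit_polynomial(x, y, 3))  # cubic:     [b0, b1, b2, b3]
```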
Explicit Gram Matrix
Now we also have the generalized form of the Gram matrix (Xᵗ.X).
Where
- m: Index used in sums for (Xáµ—.X) entries
The matrix is always symmetric.
We are going to use computation. But for clarity, the actual Gram matrix for each curve fit can be described as follows:
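For instance, a sketch of the Gram matrix for the quadratic fit, written out as power sums (the linear and cubic cases shrink or extend the same pattern):

```latex
X^\top X =
\begin{bmatrix}
n          & \sum x_i   & \sum x_i^2 \\
\sum x_i   & \sum x_i^2 & \sum x_i^3 \\
\sum x_i^2 & \sum x_i^3 & \sum x_i^4
\end{bmatrix}
```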
Inverse Gram Matrix
To calculate each coefficient, we need the inverse Gram matrix (Xᵗ.X)ˉ¹ from the observed x values. There is no simple closed-form expression for the inverse Gram matrix (Xᵗ.X)ˉ¹, so we will use Excel computation instead.
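If you prefer Python over Excel, a minimal sketch of the same computation (with placeholder x values) looks like this:

```python
import numpy as np

# Hypothetical x observations; replace with your own sample.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

degree = 2                                        # quadratic fit
X = np.vander(x, N=degree + 1, increasing=True)   # design matrix: 1, x, x^2
gram = X.T @ X                                    # Gram matrix (X^T X)
gram_inv = np.linalg.inv(gram)                    # inverse Gram matrix (X^T X)^-1
print(gram_inv)
```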
Fit Prediction
The Estimated Coefficients (calculated from data) can be described as below:
Predicted Values
By Model Type
Using the estimated coefficients β, we substitute into the explicit polynomial forms. For each observed value xᵢ, we compute the prediction ŷᵢ for each model:
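A small Python sketch of the same step, using the placeholder data from before (assumed for illustration, not taken from the article):

```python
import numpy as np

# Hypothetical sample; replace with your own observations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 9.2, 15.8, 26.1, 38.0])

def predict(x, y, degree):
    """Fit a polynomial of the given degree, then predict at the observed x."""
    X = np.vander(x, N=degree + 1, increasing=True)
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    return X @ beta  # y_hat_i = b0 + b1*x_i + ... + b_d*x_i^d

for degree, label in [(1, "linear"), (2, "quadratic"), (3, "cubic")]:
    print(label, predict(x, y, degree))
```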
Direct Comparison
Predicted vs Observed Values
The actual calculations use the same observed xᵢ values, enabling direct comparison with the observed yᵢ:
The model comparison reveals how polynomial complexity affects fit quality.
The interesting part comes in the next section below.
2: Equation for Regression
Most of the equations in the top-side part have also been covered in the polynomial calculation article.
No Variance?
Where’s Waldo?
So where are the famous standard deviation, Pearson correlation, and covariance? Actually, these properties are only relevant for linear regression. For polynomial regression we will utilize other properties. No simple slope formula, no simple intercept formula. Dang! Time to unlearn.
We still use SST (the total sum of squares) to calculate SSR, with the mean inside the equation.
We still have covariance, but not in the simple form used in the previous linear regression article.
Degrees of Freedom
In polynomial regression, the degrees of freedom (df) for the error term (residuals) is calculated as:
Where the equation itself can be interpreted as:
- n = Total number of observations (data points).
- p = Number of predictors (independent variables) in the model.
- -1 = Accounts for the intercept term (β₀).
Now we can rewrite for each model as:
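As a quick sketch, plugging p = 1, 2, 3 into df = n − p − 1 gives:

```latex
\begin{aligned}
\text{Linear } (p = 1):    &\quad df = n - 2 \\
\text{Quadratic } (p = 2): &\quad df = n - 3 \\
\text{Cubic } (p = 3):     &\quad df = n - 4
\end{aligned}
```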
The use of the predictors is pretty clear.
Residual
The sum of squared errors (SSE) is calculated the same way for all models, but the residuals differ based on the model’s complexity:
Where the predicted values of y vary by model:
Don’t worry, we are not going to derive these equations. We’ll compute them directly in Excel using tabular calculations.
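For readers who want a Python counterpart to the Excel tabulation, here is a minimal sketch (again with made-up placeholder data):

```python
import numpy as np

# Hypothetical sample; replace with your own observations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 9.2, 15.8, 26.1, 38.0])

def sse(x, y, degree):
    """Sum of squared errors for a polynomial fit of the given degree."""
    X = np.vander(x, N=degree + 1, increasing=True)
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    residuals = y - X @ beta          # e_i = y_i - y_hat_i
    return np.sum(residuals ** 2)

for degree in (1, 2, 3):
    print(f"SSE (degree {degree}):", sse(x, y, degree))
```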
Coefficient of Determination
The R Square (R²)
In general, the R Square (R²) is still defined as follows.
Using tabular calculation in Excel (or Python), we can just put the residual calculation above into the equation above.
From here we can calculate the adjusted R Square with the following equation:
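For reference, a sketch of both definitions, with SST = Σ(yᵢ − ȳ)² and SSE as computed above:

```latex
R^2 = 1 - \frac{SSE}{SST},
\qquad
R^2_{adj} = 1 - \frac{(1 - R^2)\,(n - 1)}{n - p - 1}
```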
MSE
MSE stands for Mean Squared Error. The word mean refers to the average of the squared errors, adjusted for degrees of freedom.
And for further calculation, we can use this helper:
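Written out, with the residual degrees of freedom from before:

```latex
MSE = \frac{SSE}{n - p - 1} = \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{n - p - 1}
```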
Gram Matrix
To calculate the standard error of each coefficient, we need the Gram matrix (Xᵗ.X) from the observed x values.
From the matrix form:
Covariance Matrix of β
β = coefficient estimates
The theoretical covariance matrix is:
where:
- σ² is the true error variance (unknown in practice)
- (Xᵗ.X)ˉ¹ is the inverse Gram matrix
In practice, we estimate σ² using the Mean Squared Error (MSE):
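Putting the two statements together, a sketch of the theoretical and estimated covariance matrices is:

```latex
\operatorname{Cov}(\hat{\beta}) = \sigma^2 \, (X^\top X)^{-1},
\qquad
\widehat{\operatorname{Cov}}(\hat{\beta}) = MSE \cdot (X^\top X)^{-1}
```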
To find the inverse Gram matrix (Xᵗ.X)ˉ¹, we have to rely on computation instead of finding a generalized equation.
Variance of β
β = coefficient estimates
The true variance of the coefficient estimates is
Where
- σ² is the unknown error variance
In practice, we estimate σ² using the Mean Squared Error (MSE):
Diagonal Matrix
The diagonal of the inverse Gram matrix, the diags|(Xᵗ.X)ˉ¹|, represents the scaled variance of β (variance per unit of σ²):
Or in other notation, it can be written as
The same applies for diags|(Xᵗ.X)ˉ¹|ⱼ. There is no generalization either, so we have to rely on computation instead.
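A short Python sketch of that computation, with the same placeholder data as earlier (assumed for illustration only):

```python
import numpy as np

# Hypothetical sample; replace with your own observations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 9.2, 15.8, 26.1, 38.0])

degree = 2                                        # quadratic fit
X = np.vander(x, N=degree + 1, increasing=True)
beta = np.linalg.solve(X.T @ X, X.T @ y)

n, p = len(x), degree                             # p predictors plus one intercept
mse = np.sum((y - X @ beta) ** 2) / (n - p - 1)   # estimate of sigma^2

diag = np.diag(np.linalg.inv(X.T @ X))            # diags|(X^T.X)^-1|_j
var_beta = mse * diag                             # Var(beta_j) for each coefficient
print(var_beta)
```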
Standard Error of Coefficient
Standard deviation is the square root of the variance, right? Likewise, the standard error (SE) is the square root of the variance of the coefficient estimates.
For any coefficient βⱼ in polynomial regression:
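A sketch of this in LaTeX form:

```latex
SE(\hat{\beta}_j) = \sqrt{ MSE \cdot \left[ (X^\top X)^{-1} \right]_{jj} }
```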
We can break this down for each case. The standard errors by polynomial degree are:
t-value
In Polynomial Regression
The t-value for each coefficient βⱼ tests the null hypothesis H₀: βⱼ = 0. It is calculated as:
Where:
- Numerator βⱼ = the estimated coefficient
- Denominator SE(βⱼ) = the standard error of βⱼ (from the previous section)
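In other words, a sketch of the formula is:

```latex
t_j = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)}
```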
H₀: βⱼ = 0 asserts that the predictor xⱼ (e.g., x, x², etc.) has no effect on the response variable y. If H₀ is true, the term xⱼ can be dropped from the model.
p-value
No need for any gamma-function complexity in practical calculation. For beginners like us, it is better to use the Excel built-in formula instead.
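If you are in Python rather than Excel, the two-sided p-value can be sketched with SciPy's Student-t distribution (the t-value and df below are placeholder numbers):

```python
from scipy import stats

# Hypothetical values; take t_value and df from your own fit.
t_value = 2.75
df = 4

# Two-sided p-value for H0: beta_j = 0.
# The Excel counterpart is T.DIST.2T(ABS(t), df).
p_value = 2 * stats.t.sf(abs(t_value), df)
print(p_value)
```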
Confidence Intervals
We could continue to other statistical properties beyond just the coefficients, but I guess I need to stop before this article becomes too complex. Let’s keep this article as statistics for beginners.
Now is a good time to write these equations down in a tabular Excel sheet.
What Lies Ahead 🤔?
Math is fun, right?
From theoretical foundations we are going to shift to practical day-to-day implementation, using spreadsheet formulas, Python tools, and visualizations.
While the theory explains why the math works, the practice shows how to execute it with real tools. The flow goes from theory to tabular spreadsheets, and from there we are going to build visualizations using Python.
Consider diving into the next step by exploring [ Trend - Polynomial Regression - Formula ].