Preface
Goal: Goes further from linear regression to polynomial regression. The practical implementation, shows how to execute the theory with spreadsheet.
From theoretical foundations we are going shift to practical implementation in daily basis, using spreadsheet formula, python tools, and visualizations.
While the theory explains why the math works, the practice shows how to execute it with real tools. With the flow, from theory to tabular spreadsheets, then from we are going to build visualization using python.
We are going to use both linest
formula and manual calculation.
We are going to start with simple cell address in formula,
avoiding complex spreadsheet feature.
Then we are going to continue by using named range,
simplifying the formula understanding,
by using name instead of plain cell address.
1: Overview
The easiest way to get standard error,
from polynomial regression is by using linest
formula.
That’s all it takes. And you are done.
But if you want a lower level implementation using the previous theory,
to crosscheck you works, we have an unfortunate situation.
Unlike linear regression, unfortunately we don’t have built-in function , to get the covariance, variance, and so on. We are on our own.
But don’t worry, it is easy to implement, since we have covered most of the material in previous articles.
Worksheet Source
Playsheet Artefact
The Excel file is available, so you can have fun, modify as you want.
Basic Stuff
The spreadsheet in this article is based on previous article. There are a lot of formula introduced in this previous article.
After a few spreadsheet from previous articles, I don’t think that I need to guide the reader stp-by-step anymore. So all I need to do is giving brief overview of what this sheet does.
Complete Worksheet
Although this looks complex. It is easy when you have a working example. This sheet looks long because I also add regression analysis and correlation.
The hardest part is making complex things easy for beginner. And I think I’ve done this part, so people can just use it easily. All you need to do is to change the observed data series, and you will get the result immediately. I hope that this is useful.
We will break down each section in this article. This article will use manual calculation.
2: Prediction using Linest
To get polynomial regression, juts use this famous linest
formula.
No need to reinvent the wheel.
Getting Coefficients
Supposed that your observed data pair (xáµ¢, yáµ¢) tabulated in range B13:C25
.
Then you can have the estimated coefficient by using this linest formula:
Linear: Range E8:E9
(=TRANSPOSE(LINEST(C13:C25,B13:B25)))
Quadratic: Range F7:F9
{=TRANSPOSE(LINEST(C13:C25,B13:B25^{1,2}))}
Cubic: Range: G6:G9
{=TRANSPOSE(LINEST(C13:C25,B13:B25^{1,2,3}))}
Where
- Linear Fit = first order
- Quadratic Fit = second order
- Cubic Fit = third order
Getting Predictions
Now we can make tabular data of Ŷᵢ = fit1(xᵢ)
as:
Linear: E13:E25
=$E$9+$E$8*B13
Quadratic: F13:F25
=$F$9+$F$8*B13+$F$7*B13^2From theoretical foundations we are going shift
to practical implementation in daily basis,
using spreadsheet formula, python tools, and visualizations.
While the theory explains why the math works,
the practice shows how to execute it with real tools.
With the flow, from theory to tabular spreadsheets,
then from we are going to build visualization using python.
Cubic: G13:G25
=$G$9+$G$8*B13+$G$7*B13^2+$G$6*B13^3
Done, without any complexity.
3: Prediction using Gram Matrix
What good is it learning without knowing how it works? Let’s go dive into the matrix.
As opposed to linest
formula,
we can reverse engineer the calculation.
Doing the calculation manually,
then understand how the math works internally.
We can compare the result to linest result directly.
Gram Matrix (Aáµ—.A)
Linear: B42:C43
{=MMULT(
TRANSPOSE(CHOOSE({1,2}, 1, B13:B25)),
CHOOSE({1,2}, 1, B13:B25))}}
Quadratic: E42:G44
(=MMULT(
TRANSPOSE(CHOOSE({1,2,3}, 1, B13:B25, B13:B25^2)),
CHOOSE({1,2,3}, 1, B13:B25, B13:B25^2))))
Cubic: I42:L45
=MMULT(
TRANSPOSE(CHOOSE({1,2,3,4}, 1, B13:B25, B13:B25^2, B13:B25^3)),
CHOOSE({1,2,3,4}, 1, B13:B25, B13:B25^2, B13:B25^3))
Inverse Gram Matrix (Aᵗ.A)ˉ¹
Linear: B48:C49
{=MINVERSE(B42:C43)}
Quadratic: E48:G50
{=MINVERSE(E42:G44)}
Cubic: I48:L51
{=MINVERSE(I42:L45)}
Right Hand Side (Aáµ—.B)
Linear: B54:B55
{=MMULT(
TRANSPOSE(CHOOSE({1,2}, 1, B13:B25)),
C13:C25)}}
Quadratic: E54:E56
{=MMULT(
TRANSPOSE(CHOOSE({1,2,3}, 1, B13:B25, B13:B25^2)),
C13:C25)}}
Cubic: I54:I57
{=MMULT(
TRANSPOSE(CHOOSE({1,2,3,4}, 1, B13:B25, B13:B25^2, B13:B25^3)),
C13:C25)}
Estimated Coefficients (β = [a, …])
Linear: B60:B61
{=MMULT(B48:C49,B54:B55)}
Quadratic: E60:E62
{=MMULT(E48:G50,E54:E56)}
Cubic: I60:I63
{=MMULT(I48:L51,I54:I57)}
If you think reading cell address is painful. I completely agree with you. This is why we are going to use named range in the next section.
I keep the address cell, to keep basic spreadsheet, withou t named range featare that lead to incompatibility between different software.
4: Degrees of Freedom
The calculation of degree of freedom is obvious. I don’t think that It require detail descriptions.
Linear: E33
=COUNT($B$13:$B$25)-2
Quadratic: F33
=COUNT($B$13:$B$25)-3
Cubic: G33
=COUNT($B$13:$B$25)-4
5: Linear Properties
Correlation Calculation
We do not need all the basic correlaton properties here.
The variance, covariance and standard deviation here only applied to linear regression.
But however we still need this one thing: SST = ∑(yᵢ-ȳ)²
.
∑(yᵢ-ȳ)²: M28
=SUM(M13:M25)
We don’t need anything else. Feel free to delete those cells, and then see what’s happened.
6: Standard Error using Linest
Linest
is not just getting estimated polynomial coefficients.
Linest
can also geeting other properties as well,
such as standard error of the coefficients using the second row.
Getting Coefficients
I remind that we have observed data pair (xáµ¢, yáµ¢) tabulated in range B13:C25
.
Then we can have the estimated coefficient by using this linest formula:
Linear: Range Q8:R9
=TRANSPOSE(LINEST(C13:C25,B13:B25,1,1))
Quadratic: Range T7:U9
=TRANSPOSE(LINEST(C13:C25,B13:B25^{1,2},1,1))
Cubic: Range: W6:X9
=TRANSPOSE(LINEST(C13:C25,B13:B25^{1,2,3},1,1))
With transpose, the standard error of the coefficients, is available in the second column.
7: Residual Properties
As opposed to linest
formula that calculate standard error internaly,
we can also reverse engineer the standard error calculation.
By doing the calculation manually,
we can understand how the math internally works.
Tabular Residual and SSR
Let’s begin with tabulating the ϵᵢ, ϵᵢ² and also ∑ϵᵢ².
Remember the fit range?
Y Observed: C13:C25
Linear: E13:E25
Quadratic: F13:F25
Cubic: G13:G25
Now we can tabulate the residual using the following formulas:
Linear:
* ϵᵢ =$C13-E13
* ϵᵢ² =Q13^2
* ∑ϵᵢ² =SUM(R13:R25) at R28
Quadratic:
* ϵᵢ =$C13-F13
* ϵᵢ² =T13^2
* ∑ϵᵢ² =SUM(U13:U25) at U28
Cubic:
* ϵᵢ =$C13-G13
* ϵᵢ² =W13^2
* ∑ϵᵢ² =SUM(X13:X25) at X28
Coefficient of Determination
Then we can continue with R² and R²adjusted:
∑(yᵢ-ȳ)² at M28
df linear at E32
df quadratic at F32
df cubic at G32
Then we can substitute
Linear:
* R² =1-($R$28/$M$28) at Q32
* R²adjusted =1-((1-Q32)*($B$32-1)/$E32)
Quadratic:
* R² =1-($U$28/$M$28) at T32
* R²adjusted =1-((1-T32)*($B$32-1)/$E32)
Cubic:
* R² =1-($X$28/$M$28) at W32
* R²adjusted =1-((1-W32)*($B$32-1)/$E32)|----------------|---------------------------|
Standard Error of the Estimate
With the same reference we can substitute MSE and SSE:
Linear:
* MSE =R28/$E$32 at Q36
* SEE =SQRT(Q36)
Quadratic:
* MSE =U28/$F$32 at T36
* SEE =SQRT(T36)
Cubic:
* MSE =X28/$G$32 at W36
* SEE =SQRT(W36)
We are done.
8: Standard Error using Diagonal Matrix
Let’s have look again at the matrix solution. This time I give highlight to diagnal matrix.
The diagonal of inverse of gram matrix is actualy, the variance of the estimated coefficients.
The easiest thing to do is by pointing the cell,
for example Var(β₀) for linear fit is =B47
.
But if you want to be looks like a real intellectual doing practical solution,
you can make yourself suffer using index formula.
For example this tabulation of variance before scaled shown below. This is using alphabet notation (instead of beta):
Linear:
* a =INDEX(B47:C48,1,1)
* b =INDEX(B47:C48,2,2)
Quadratic:
* a =INDEX(E47:G49,1,1)
* b =INDEX(E47:G49,2,2)
* c =INDEX(E47:G49,3,3)
Cubic:
* a =INDEX(L47:O50,1,1)
* b =INDEX(L47:O50,2,2)
* c =INDEX(L47:O50,3,3)
* d =INDEX(L47:O50,4,4)
Now we can calculate the standard error by square root all of them.
Linear:
* SE(a) =SQRT($Q$36*R47)
* SE(b) =SQRT($Q$36*R48)
Quadratic:
* SE(a) =SQRT($T$36*U47)
* SE(b) =SQRT($T$36*U48)
* SE(c) =SQRT($T$36*U49)`
Cubic:
* SE(a) =SQRT($W$36*X47)
* SE(b) =SQRT($W$36*X48)
* SE(c) =SQRT($W$36*X49)
* SE(d) =SQRT($W$36*X50)
That’s it. We’re al most finished.
9: t-value and p-value
The rest is just continue the sheet by substituting the result of each SE(β).
For example t-value
in this cells:
Linear:
* t-value(a) =B59/R53
* t-value(b) =B60/R54
Quadratic:
* t-value(a) =B60/R54
* t-value(b) =E60/U54
* t-value(c) =E61/U55
Cubic:
* t-value(a) =L59/X53
* t-value(b) =L60/X54
* t-value(c) =L61/X55
* t-value(d) =L62/X56
And finally p-value
for two tailed using T.DIST.2T
formula.
Since we are using two tail, we are need abs
formula to avoid negative value.
Linear:
* p-value(a) =T.DIST.2T(ABS(R59),$E$32)
* p-value(b) =T.DIST.2T(ABS(R60),$E$32)
Quadratic:
* p-value(a) =T.DIST.2T(ABS(U59),$F$32)
* p-value(b) =T.DIST.2T(ABS(U60),$F$32)`
* p-value(c) =T.DIST.2T(ABS(U61),$F$32)
Cubic:
* p-value(a) =T.DIST.2T(ABS(X59),$G$32)`
* p-value(b) =T.DIST.2T(ABS(X60),$G$32)
* p-value(c) =T.DIST.2T(ABS(X61),$G$32)
* p-value(d) =T.DIST.2T(ABS(X62),$G$32)
Just note that libreoffice and excel might calculate inifinte number differently. So the final result in Excel, for examaple when overfitting comes. In Excel would be divided by zero, while libreoffice have p-value of zero.
Is that all? Hell no. We need a more practical sheet. something shorter, and useful for daily basis.
10: Practical Sheet
In real world daily basis, we should avoid the complexity.
Data Entry Consideration
One of my concern is to make the data entry easy. While our statistical properties size in cell is fixed, the observed data is definitely vary, from just a dozen of data to hundreds. So we need to separate the data rows and the statistic. Since the observed data might grow indefinietly, we should put the entry data at the bottom. This is why I have layout as below:
Worksheet Source
Playsheet Artefact
I’m using range name as reference. Due to incompatibility between Microsoft Excel and LibreOffice Calc in named range, I’d better use ods format instead of xlsx.
This opendocument format is available here, so you can have fun, modify as you want. |—————-|—————————|
You can open this document in LibreOffice.
Named Range
This is how the named range looks like in LibreOffice Calc. You can do the same thing with Microsoft Excel. But there are incompatibilies. Since I use LibreOffice for daily basis, This is what I’ve got.
The steps are:
-
First I defined the observed range. From here I get the coefficient value (and standard error).
-
With coefficient value, I can calculate predicted values.
-
With predicted range I can calculate statistial properties.
Let say our sheet name is `regression,’ the name address range can described as below:
Named Range | Range Address |
---|---|
x_observed | $regression.$B$42:$B$54 |
y_observed | $regression.$C$42:$C$54 |
coeff_1 | $regression.$I$8:$I$9 |
coeff_2 | $regression.$L$7:$L$9 |
coeff_3 | $regression.$O$6:$O$9 |
y1_predicted | $regression.$E$42:$E$54 |
y2_predicted | $regression.$F$42:$F$54 |
y3_predicted | $regression.$G$42:$G$54 |
Observed Value
Let’s start with observed value. The value might vary from samples to other samples, but let’s just use this example below:
Let’s set the named range with these cell address.
x_observed at B42:B54
y_observed at C42:C54
Basic Properties
From this observed value we can calculate basic statistical properties.
n =COUNT(x_observed) at B30
x̄ (mean) =AVERAGE(x_observed) at B34
ȳ (mean) =AVERAGE(y_observed) at C34
We just ȳ ti calculate ∑(yᵢ-ȳ)². And we can completely dispose x̄ from the sheet.
Linest Result
Coefficient and Standard Error
Since we have already know how it works,
no need to manually calculate coefficient and standard error anymore.
We can do directly get the result from linest
formula
| coeff_1 | $regression.$I$8:$I$9 | | coeff_2 | $regression.$L$7:$L$9 | | coeff_3 | $regression.$O$6:$O$9 |
Let’s choose that range, calculate, then set the named range with these cell address.
coeff_1 at I8:I9
=TRANSPOSE(LINEST(y_observed,x_observed,1,1))
coeff_2 at L7:L9
=TRANSPOSE(LINEST(y_observed,x_observed^{1,2},1,1))
coeff_3 at O8:O9
=TRANSPOSE(LINEST(y_observed,x_observed^{1,2,3},1,1))
It is more clear this way than using cell address right?
Predicted Values
From this coeffcient, we can calculate the predicted values. But this time I’m going to use different formula, other than previously mentioned.
y1_predicted at E42:E54
=SUMPRODUCT(TRANSPOSE(coeff_1),B43^{1,0})
y2_predicted at F42:F54
=SUMPRODUCT(TRANSPOSE(coeff_2),B43^{2,1,0})
y3_predicted at G42:G54
=SUMPRODUCT(TRANSPOSE(coeff_3),B43^{3,2,1,0})
Looks weird? It is even weirder when we understand that each coeff named range has two columns. Well.. Let’s get used to it.
t-value
With only linest
result we can get the coeff also (second row/column).
From here we can get the t-value.
t-value β₀ at K14
=J8/K8
We can just copy this cell to the rest of the t-value for each β.
Degrees of Freedom
In order to get the p-value, we need to calculate the degree of freedom.
df1 at J25
=COUNT(x_observed)-2
df2 at M25
=COUNT(x_observed)-3
df3 at P25
=COUNT(x_observed)-4
Let’s compared with previous form.
Linear df: E33
=COUNT($B$13:$B$25)-2
Quadratic df: F33
=COUNT($B$13:$B$25)-3
Cubic df: G33
=COUNT($B$13:$B$25)-4
Now the formula would make sense right?
p-value
The same applied for p-value.
t-value β₀ at K14
=J8/K8
p-value β₀
=T.DIST.2T(ABS(K14),$J$25)
We can just copy this cell to the rest of the p-value for each β.
Statistic Properties
We have different statistic properties, Let’s cover-up the formula one by one.
SSR = ∑ϵᵢ²
Linear ∑ϵᵢ²: at K25
=SUMSQ(y_observed-y1_predicted)
Quadratic ∑ϵᵢ²: at M25
=SUMSQ(y_observed-y2_predicted)
Cubic ∑ϵᵢ²: at M25
=SUMSQ(y_observed-y3_predicted)
SST = ∑(yᵢ-ȳ)²
General:
ȳ (mean) at C34
=AVERAGE(y_observed)
SST = ∑(yᵢ-ȳ)² at H30
=SUMSQ(y_observed-$C$34)
R²
No improvement here. We go back to use address directly.
Linear R²: at J30
=1-($K$25/$H$30)
Quadratic R²: at M30
=1-($N$25/$H$30)
Cubic R²: at Q30
=1-($Q$25/$H$30)
MSE and RMSE
Linear:
MSE =K25/$J$25 at J34
RMSE =SQRT(J34)
Quadratic:
MSE =N25/$M$25 at M34
RMSE =SQRT(M34)
Cubic:
MSE =Q25/$P$25 at Q34
RMSE =SQRT(P34)
And that’s all. We have already implement all the equation into spreadsheet.
What Lies Ahead 🤔?
From theoretical foundations that explains why the math works, we are going shift to practical implementation in daily basis with real tools. After spreadsheet, we still have to cover, python tools, and visualizations.
While a ready to use spreadsheet template is, enough for admin to infer the statistical properties, without even doing analysis at all. We need to a ready to use code, that we can creatively modified, for any different cases.
Consider diving into the next step by exploring [ Trend - Polynomial Regression - Python ].