Where to Discuss?

Local Group

Preface

Goal: Goes further from linear regression to polynomial regression. The practical implementation, shows how to execute the theory with spreadsheet.

From theoretical foundations we are going shift to practical implementation in daily basis, using spreadsheet formula, python tools, and visualizations.

While the theory explains why the math works, the practice shows how to execute it with real tools. With the flow, from theory to tabular spreadsheets, then from we are going to build visualization using python.

We are going to use both linest formula and manual calculation. We are going to start with simple cell address in formula, avoiding complex spreadsheet feature. Then we are going to continue by using named range, simplifying the formula understanding, by using name instead of plain cell address.


1: Overview

The easiest way to get standard error, from polynomial regression is by using linest formula. That’s all it takes. And you are done. But if you want a lower level implementation using the previous theory, to crosscheck you works, we have an unfortunate situation.

Unlike linear regression, unfortunately we don’t have built-in function , to get the covariance, variance, and so on. We are on our own.

But don’t worry, it is easy to implement, since we have covered most of the material in previous articles.

Worksheet Source

Playsheet Artefact

The Excel file is available, so you can have fun, modify as you want.

Basic Stuff

The spreadsheet in this article is based on previous article. There are a lot of formula introduced in this previous article.

After a few spreadsheet from previous articles, I don’t think that I need to guide the reader stp-by-step anymore. So all I need to do is giving brief overview of what this sheet does.

Complete Worksheet

Although this looks complex. It is easy when you have a working example. This sheet looks long because I also add regression analysis and correlation.

Trend: Worksheet: Polynomial Regression: Complete Worksheet

The hardest part is making complex things easy for beginner. And I think I’ve done this part, so people can just use it easily. All you need to do is to change the observed data series, and you will get the result immediately. I hope that this is useful.

We will break down each section in this article. This article will use manual calculation.


2: Prediction using Linest

To get polynomial regression, juts use this famous linest formula. No need to reinvent the wheel.

Trend: Worksheet: Polynomial Regression: Linest Predictions

Getting Coefficients

Supposed that your observed data pair (xáµ¢, yáµ¢) tabulated in range B13:C25. Then you can have the estimated coefficient by using this linest formula:

Linear: Range E8:E9
(=TRANSPOSE(LINEST(C13:C25,B13:B25)))

Quadratic: Range F7:F9
{=TRANSPOSE(LINEST(C13:C25,B13:B25^{1,2}))}

Cubic: Range: G6:G9
{=TRANSPOSE(LINEST(C13:C25,B13:B25^{1,2,3}))}

Where

  1. Linear Fit = first order
  2. Quadratic Fit = second order
  3. Cubic Fit = third order

Getting Predictions

Now we can make tabular data of Ŷᵢ = fit1(xᵢ) as:

Linear: E13:E25
=$E$9+$E$8*B13

Quadratic: F13:F25
=$F$9+$F$8*B13+$F$7*B13^2From theoretical foundations we are going shift
to practical implementation in daily basis,
using spreadsheet formula, python tools, and visualizations.

While the theory explains why the math works,
the practice shows how to execute it with real tools.
With the flow, from theory to tabular spreadsheets,
then from we are going to build visualization using python.

Cubic: G13:G25
=$G$9+$G$8*B13+$G$7*B13^2+$G$6*B13^3

Done, without any complexity.


3: Prediction using Gram Matrix

What good is it learning without knowing how it works? Let’s go dive into the matrix.

As opposed to linest formula, we can reverse engineer the calculation. Doing the calculation manually, then understand how the math works internally.

Trend: Worksheet: Polynomial Regression: Estimated Coefficients

We can compare the result to linest result directly.

Gram Matrix (Aáµ—.A)

Linear: B42:C43
{=MMULT(
    TRANSPOSE(CHOOSE({1,2}, 1, B13:B25)),
    CHOOSE({1,2}, 1, B13:B25))}}

Quadratic: E42:G44
(=MMULT(
    TRANSPOSE(CHOOSE({1,2,3}, 1, B13:B25, B13:B25^2)),
    CHOOSE({1,2,3}, 1, B13:B25, B13:B25^2))))

Cubic: I42:L45
=MMULT(
    TRANSPOSE(CHOOSE({1,2,3,4}, 1, B13:B25, B13:B25^2, B13:B25^3)),
    CHOOSE({1,2,3,4}, 1, B13:B25, B13:B25^2, B13:B25^3))

Inverse Gram Matrix (Aᵗ.A)ˉ¹

Linear: B48:C49
{=MINVERSE(B42:C43)}

Quadratic: E48:G50
{=MINVERSE(E42:G44)}

Cubic: I48:L51
{=MINVERSE(I42:L45)}

Right Hand Side (Aáµ—.B)

Linear: B54:B55
{=MMULT(
    TRANSPOSE(CHOOSE({1,2}, 1, B13:B25)),
    C13:C25)}}

Quadratic: E54:E56
{=MMULT(
    TRANSPOSE(CHOOSE({1,2,3}, 1, B13:B25, B13:B25^2)),
    C13:C25)}}

Cubic: I54:I57
{=MMULT(
    TRANSPOSE(CHOOSE({1,2,3,4}, 1, B13:B25, B13:B25^2, B13:B25^3)),
    C13:C25)}

Estimated Coefficients (β = [a, …])

Linear: B60:B61
{=MMULT(B48:C49,B54:B55)}

Quadratic: E60:E62
{=MMULT(E48:G50,E54:E56)}

Cubic: I60:I63
{=MMULT(I48:L51,I54:I57)}

If you think reading cell address is painful. I completely agree with you. This is why we are going to use named range in the next section.

I keep the address cell, to keep basic spreadsheet, withou t named range featare that lead to incompatibility between different software.


4: Degrees of Freedom

The calculation of degree of freedom is obvious. I don’t think that It require detail descriptions.

Trend: Worksheet: Polynomial Regression: Degrees of Freedom

Linear: E33
=COUNT($B$13:$B$25)-2

Quadratic: F33
=COUNT($B$13:$B$25)-3

Cubic: G33
=COUNT($B$13:$B$25)-4

5: Linear Properties

Correlation Calculation

We do not need all the basic correlaton properties here.

Trend: Worksheet: Polynomial Regression: Estimated Coefficients

The variance, covariance and standard deviation here only applied to linear regression. But however we still need this one thing: SST = ∑(yᵢ-ȳ)².

∑(yᵢ-ȳ)²: M28
=SUM(M13:M25)

We don’t need anything else. Feel free to delete those cells, and then see what’s happened.


6: Standard Error using Linest

Linest is not just getting estimated polynomial coefficients. Linest can also geeting other properties as well, such as standard error of the coefficients using the second row.

Trend: Worksheet: Polynomial Regression: Linest Standard Error

Getting Coefficients

I remind that we have observed data pair (xáµ¢, yáµ¢) tabulated in range B13:C25. Then we can have the estimated coefficient by using this linest formula:

Linear: Range Q8:R9
=TRANSPOSE(LINEST(C13:C25,B13:B25,1,1))

Quadratic: Range T7:U9
=TRANSPOSE(LINEST(C13:C25,B13:B25^{1,2},1,1))

Cubic: Range: W6:X9
=TRANSPOSE(LINEST(C13:C25,B13:B25^{1,2,3},1,1))

With transpose, the standard error of the coefficients, is available in the second column.


7: Residual Properties

As opposed to linest formula that calculate standard error internaly, we can also reverse engineer the standard error calculation. By doing the calculation manually, we can understand how the math internally works.

Tabular Residual and SSR

Let’s begin with tabulating the ϵᵢ, ϵᵢ² and also ∑ϵᵢ².

Trend: Worksheet: Polynomial Regression: Sum of Residual

Remember the fit range?

Y Observed: C13:C25
Linear: E13:E25
Quadratic: F13:F25
Cubic: G13:G25

Now we can tabulate the residual using the following formulas:

Linear: 
* ϵᵢ   =$C13-E13
* ϵᵢ²  =Q13^2
* ∑ϵᵢ² =SUM(R13:R25) at R28

Quadratic: 
* ϵᵢ   =$C13-F13
* ϵᵢ²  =T13^2
* ∑ϵᵢ² =SUM(U13:U25) at U28

Cubic: 
* ϵᵢ   =$C13-G13
* ϵᵢ²  =W13^2
* ∑ϵᵢ² =SUM(X13:X25) at X28

Coefficient of Determination

Then we can continue with R² and R²adjusted:

Trend: Worksheet: Polynomial Regression: Mean Standard Error

∑(yᵢ-ȳ)² at M28
df linear at E32
df quadratic at F32
df cubic at G32

Then we can substitute

Linear: 
* R² =1-($R$28/$M$28) at Q32
* R²adjusted =1-((1-Q32)*($B$32-1)/$E32)

Quadratic: 
* R² =1-($U$28/$M$28) at T32
* R²adjusted =1-((1-T32)*($B$32-1)/$E32)

Cubic: 
* R² =1-($X$28/$M$28) at W32
* R²adjusted =1-((1-W32)*($B$32-1)/$E32)|----------------|---------------------------|

Standard Error of the Estimate

With the same reference we can substitute MSE and SSE:

Linear: 
* MSE =R28/$E$32 at Q36
* SEE =SQRT(Q36)

Quadratic: 
* MSE =U28/$F$32 at T36
* SEE =SQRT(T36)

Cubic: 
* MSE =X28/$G$32 at W36
* SEE =SQRT(W36)

We are done.


8: Standard Error using Diagonal Matrix

Let’s have look again at the matrix solution. This time I give highlight to diagnal matrix.

Trend: Worksheet: Polynomial Regression: Estimated Coefficients

The diagonal of inverse of gram matrix is actualy, the variance of the estimated coefficients.

The easiest thing to do is by pointing the cell, for example Var(β₀) for linear fit is =B47. But if you want to be looks like a real intellectual doing practical solution, you can make yourself suffer using index formula.

Trend: Worksheet: Polynomial Regression: Standard Error using Diagonal Matrix

For example this tabulation of variance before scaled shown below. This is using alphabet notation (instead of beta):

Linear: 
* a =INDEX(B47:C48,1,1)
* b =INDEX(B47:C48,2,2)

Quadratic: 
* a =INDEX(E47:G49,1,1)
* b =INDEX(E47:G49,2,2)
* c =INDEX(E47:G49,3,3)

Cubic: 
* a =INDEX(L47:O50,1,1)
* b =INDEX(L47:O50,2,2)
* c =INDEX(L47:O50,3,3)
* d =INDEX(L47:O50,4,4)

Now we can calculate the standard error by square root all of them.

Linear: 
* SE(a) =SQRT($Q$36*R47)
* SE(b) =SQRT($Q$36*R48)

Quadratic: 
* SE(a) =SQRT($T$36*U47)
* SE(b) =SQRT($T$36*U48)
* SE(c) =SQRT($T$36*U49)`

Cubic: 
* SE(a) =SQRT($W$36*X47)
* SE(b) =SQRT($W$36*X48)
* SE(c) =SQRT($W$36*X49)
* SE(d) =SQRT($W$36*X50)

That’s it. We’re al most finished.


9: t-value and p-value

The rest is just continue the sheet by substituting the result of each SE(β).

Trend: Worksheet: Polynomial Regression: t-value and p-value

For example t-value in this cells:

Linear: 
* t-value(a) =B59/R53
* t-value(b) =B60/R54

Quadratic: 
* t-value(a) =B60/R54
* t-value(b) =E60/U54
* t-value(c) =E61/U55

Cubic: 
* t-value(a) =L59/X53
* t-value(b) =L60/X54
* t-value(c) =L61/X55
* t-value(d) =L62/X56

And finally p-value for two tailed using T.DIST.2T formula. Since we are using two tail, we are need abs formula to avoid negative value.

Linear: 
* p-value(a) =T.DIST.2T(ABS(R59),$E$32)
* p-value(b) =T.DIST.2T(ABS(R60),$E$32)

Quadratic: 
* p-value(a) =T.DIST.2T(ABS(U59),$F$32)
* p-value(b) =T.DIST.2T(ABS(U60),$F$32)`
* p-value(c) =T.DIST.2T(ABS(U61),$F$32)

Cubic: 
* p-value(a) =T.DIST.2T(ABS(X59),$G$32)`
* p-value(b) =T.DIST.2T(ABS(X60),$G$32)
* p-value(c) =T.DIST.2T(ABS(X61),$G$32)
* p-value(d) =T.DIST.2T(ABS(X62),$G$32)

Just note that libreoffice and excel might calculate inifinte number differently. So the final result in Excel, for examaple when overfitting comes. In Excel would be divided by zero, while libreoffice have p-value of zero.

Is that all? Hell no. We need a more practical sheet. something shorter, and useful for daily basis.


10: Practical Sheet

In real world daily basis, we should avoid the complexity.

Data Entry Consideration

One of my concern is to make the data entry easy. While our statistical properties size in cell is fixed, the observed data is definitely vary, from just a dozen of data to hundreds. So we need to separate the data rows and the statistic. Since the observed data might grow indefinietly, we should put the entry data at the bottom. This is why I have layout as below:

Trend: Worksheet: Polynomial Regression: Complete Worksheet

Worksheet Source

Playsheet Artefact

I’m using range name as reference. Due to incompatibility between Microsoft Excel and LibreOffice Calc in named range, I’d better use ods format instead of xlsx.

This opendocument format is available here, so you can have fun, modify as you want. |—————-|—————————|

You can open this document in LibreOffice.

Named Range

This is how the named range looks like in LibreOffice Calc. You can do the same thing with Microsoft Excel. But there are incompatibilies. Since I use LibreOffice for daily basis, This is what I’ve got.

Trend: Worksheet: Polynomial Regression: Named Range

The steps are:

  1. First I defined the observed range. From here I get the coefficient value (and standard error).

  2. With coefficient value, I can calculate predicted values.

  3. With predicted range I can calculate statistial properties.

Let say our sheet name is `regression,’ the name address range can described as below:

Named Range Range Address
x_observed $regression.$B$42:$B$54
y_observed $regression.$C$42:$C$54
coeff_1 $regression.$I$8:$I$9
coeff_2 $regression.$L$7:$L$9
coeff_3 $regression.$O$6:$O$9
y1_predicted $regression.$E$42:$E$54
y2_predicted $regression.$F$42:$F$54
y3_predicted $regression.$G$42:$G$54

Observed Value

Let’s start with observed value. The value might vary from samples to other samples, but let’s just use this example below:

Trend: Worksheet: Polynomial Regression: Observed Value

Let’s set the named range with these cell address.

x_observed at B42:B54
y_observed at C42:C54

Basic Properties

From this observed value we can calculate basic statistical properties.

Trend: Worksheet: Polynomial Regression: Basic Properties

n =COUNT(x_observed) at B30
x̄ (mean) =AVERAGE(x_observed) at B34 
ȳ (mean) =AVERAGE(y_observed) at C34

We just ȳ ti calculate ∑(yᵢ-ȳ)². And we can completely dispose x̄ from the sheet.

Linest Result

Coefficient and Standard Error

Since we have already know how it works, no need to manually calculate coefficient and standard error anymore. We can do directly get the result from linest formula

| coeff_1 | $regression.$I$8:$I$9 | | coeff_2 | $regression.$L$7:$L$9 | | coeff_3 | $regression.$O$6:$O$9 |

Trend: Worksheet: Polynomial Regression: Linest Result

Let’s choose that range, calculate, then set the named range with these cell address.

coeff_1 at I8:I9
=TRANSPOSE(LINEST(y_observed,x_observed,1,1))

coeff_2 at L7:L9
=TRANSPOSE(LINEST(y_observed,x_observed^{1,2},1,1))

coeff_3 at O8:O9
=TRANSPOSE(LINEST(y_observed,x_observed^{1,2,3},1,1))

It is more clear this way than using cell address right?

Predicted Values

From this coeffcient, we can calculate the predicted values. But this time I’m going to use different formula, other than previously mentioned.

Trend: Worksheet: Polynomial Regression: Predicted Value

y1_predicted at E42:E54
=SUMPRODUCT(TRANSPOSE(coeff_1),B43^{1,0})

y2_predicted at F42:F54
=SUMPRODUCT(TRANSPOSE(coeff_2),B43^{2,1,0})

y3_predicted at G42:G54
=SUMPRODUCT(TRANSPOSE(coeff_3),B43^{3,2,1,0})

Looks weird? It is even weirder when we understand that each coeff named range has two columns. Well.. Let’s get used to it.

t-value

With only linest result we can get the coeff also (second row/column). From here we can get the t-value.

Trend: Worksheet: Polynomial Regression: t-value, p-value

t-value β₀ at K14
=J8/K8

We can just copy this cell to the rest of the t-value for each β.

Degrees of Freedom

In order to get the p-value, we need to calculate the degree of freedom.

Trend: Worksheet: Polynomial Regression: Statistic Properties

df1 at J25
=COUNT(x_observed)-2

df2 at M25
=COUNT(x_observed)-3

df3 at P25
=COUNT(x_observed)-4

Let’s compared with previous form.

Linear df: E33
=COUNT($B$13:$B$25)-2

Quadratic df: F33
=COUNT($B$13:$B$25)-3

Cubic df: G33
=COUNT($B$13:$B$25)-4

Now the formula would make sense right?

p-value

The same applied for p-value.

t-value β₀ at K14
=J8/K8

p-value β₀
=T.DIST.2T(ABS(K14),$J$25)

We can just copy this cell to the rest of the p-value for each β.

Statistic Properties

We have different statistic properties, Let’s cover-up the formula one by one.

Trend: Worksheet: Polynomial Regression: Statistic Properties

SSR = ∑ϵᵢ²

Linear ∑ϵᵢ²: at K25
=SUMSQ(y_observed-y1_predicted)

Quadratic ∑ϵᵢ²: at M25
=SUMSQ(y_observed-y2_predicted)

Cubic ∑ϵᵢ²: at M25
=SUMSQ(y_observed-y3_predicted)

SST = ∑(yᵢ-ȳ)²

General:
ȳ (mean) at C34
=AVERAGE(y_observed)

SST = ∑(yᵢ-ȳ)² at H30
=SUMSQ(y_observed-$C$34)

R²

No improvement here. We go back to use address directly.

Linear R²: at J30
=1-($K$25/$H$30)

Quadratic R²: at M30
=1-($N$25/$H$30)

Cubic R²: at Q30
=1-($Q$25/$H$30)

MSE and RMSE

Linear:
MSE  =K25/$J$25 at J34
RMSE =SQRT(J34)

Quadratic:
MSE  =N25/$M$25 at M34
RMSE =SQRT(M34)

Cubic:
MSE  =Q25/$P$25 at Q34
RMSE =SQRT(P34)

And that’s all. We have already implement all the equation into spreadsheet.


What Lies Ahead 🤔?

From theoretical foundations that explains why the math works, we are going shift to practical implementation in daily basis with real tools. After spreadsheet, we still have to cover, python tools, and visualizations.

While a ready to use spreadsheet template is, enough for admin to infer the statistical properties, without even doing analysis at all. We need to a ready to use code, that we can creatively modified, for any different cases.

Consider diving into the next step by exploring [ Trend - Polynomial Regression - Python ].