Preface

Goal: Get to know statistical properties and their equations, step by step, with a cheatsheet.

Basic math is required when the case you face becomes complex, or when you have to check a calculation manually before relying on a built-in formula or a Python library. We need to know how the manual calculation works.

Even with an online calculator, how am I supposed to understand how the manual calculation works? How can we be sure that the result is correct if it doesn't come with the math? Of course there are books for this. But even when I know the concept and the math, how do I automate my job with scripting?

Here I provide the equation cheatsheet as fundamental knowledge before proceeding further. If you think my explanation or calculation is wrong, I welcome any better opinion.

Cheatsheet

For beginners, I provide an equation cheatsheet for linear regression on samples.

Trend: TeX: Cheatsheet: Equation Flow

You can obtain both the SVG source and the TeX draft here:

Note that populations and samples use different notation.


Equation for Linear Regression

Least Square Method

We have introduced the basic properties in the previous least squares article. Beyond that, other statistical properties are required for regression analysis and correlation.

Trend: TeX: Cheatsheet: Population and Sample

You can get the source of the cheatsheet figure above here:

Data Series

Given a sample of data points:

Total

The count and sum can be represented as:

Mean

Just the average
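In the sample notation used in this article, the data series, totals, and means above can be sketched as follows. This is the standard textbook form, written out as an assumption about the notation rather than a copy of the cheatsheet.

```latex
\[ (x_i, y_i), \quad i = 1, \dots, n \]

\[ \sum x = \sum_{i=1}^{n} x_i, \qquad \sum y = \sum_{i=1}^{n} y_i \]

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad
   \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i \]
```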

Variance

The spread of each data series.

Covariance

Just like variance, but across both axes.

Standard Deviation

Square root of variance.
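For samples, the three properties above are commonly written with n-1 in the denominator. The sketch below uses that convention; the label s_{xy} for the covariance is my own shorthand and may differ from the cheatsheet.

```latex
\[ s_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}, \qquad
   s_y^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} \]

\[ s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} \]

\[ s_x = \sqrt{s_x^2}, \qquad s_y = \sqrt{s_y^2} \]
```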

Slope and Intercept

We can calculate the slope (m) using this equation:

You can also represent the equation for the slope (m) as:

This way you can get the intercept (b) as follows:
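As a sketch in standard least squares notation, the slope can be written from the raw deviations or from the covariance and variance, and the intercept follows from the means. Which two slope forms the cheatsheet shows is an assumption on my part; s_{xy} denotes the sample covariance.

```latex
\[ m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
            {\sum_{i=1}^{n} (x_i - \bar{x})^2} \]

\[ m = \frac{s_{xy}}{s_x^2} \]

\[ b = \bar{y} - m \, \bar{x} \]
```

The two slope forms agree because the factor n-1 cancels between the covariance and the variance.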

Correlation Coefficient

Pearson

Similar to the slope, but using both sₓ and sy.

The Pearson correlation coefficient can also be expressed as follows:
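A sketch of the usual forms of the Pearson coefficient, assuming the sample covariance is denoted s_{xy} (my shorthand):

```latex
\[ r = \frac{s_{xy}}{s_x \, s_y} \]

\[ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
            {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}} \]
```

Compared with the slope, the only difference is the denominator: the slope divides the covariance by sₓ², while r divides it by sₓ times sy.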

Coefficient of Determination

The R Square (R²)

In general, the R Square (R²) can be defined as follows.

It is interesting that this is just a comparison of two dispersions: the observed yᵢ against the predicted ŷᵢ, and the observed yᵢ against the mean ȳ.

The abbreviations are as follows:

  • SSE: sum of squared errors, the sum of squared differences between the observed values of the dependent variable (yᵢ) and the predicted values (ŷᵢ).
  • TSS: total sum of squares, which measures the total variability of the dependent variable around its mean (ȳ).
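With these two abbreviations, the general definition can be sketched in its standard form:

```latex
\[ SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad
   TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2 \]

\[ R^2 = 1 - \frac{SSE}{TSS} \]
```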

R² for simple linear regression

For simple linear regression, the R Square (R²) is the same as r², the square of the Pearson correlation coefficient. This can be expressed as follows.

This equality does not hold in general. When the relationship between the variables is non-linear, the coefficient of determination (R²) can still be calculated from the general definition, R² = 1 - SSE/TSS.

Residual

SSE can be defined as

While the TSS measures deviation from the mean, the SSE measures deviation from the prediction, that is, the error. This similarity helps me remember the concept.

MSE

MSE can be defined as

While the variance divides by the total degrees of freedom (n-1), the MSE divides by the residual degrees of freedom (n-k-1), where k is the number of predictors. This similarity helps me remember the concept.
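Written side by side, the parallel looks like this. It is a sketch in standard notation, with k = 1 for simple linear regression, so the residual degrees of freedom become n-2.

```latex
\[ s_y^2 = \frac{TSS}{n - 1} = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} \]

\[ MSE = \frac{SSE}{n - k - 1} = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}
   \quad \text{for } k = 1 \]
```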

Standard Error of Slope

This is also similar to the standard deviation, except that we divide the error along the y axis by the variation in x to get the standard error of the slope.

This can be written as:

This similarity helps me remember the concept.
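A sketch of the standard form for simple linear regression. Note that the denominator is the sum of squared deviations of x, not the standard deviation sₓ.

```latex
\[ SE(\beta_1) = \sqrt{\frac{MSE}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}
               = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 / (n - 2)}
                            {\sum_{i=1}^{n} (x_i - \bar{x})^2}} \]
```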

t-value

The equation for calculating the t-value in the context of linear regression analysis is as follows, where β̅₁ is the slope m.
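As a sketch, for testing the null hypothesis that the true slope is zero, the t-value is the estimated slope divided by its standard error, with n-2 degrees of freedom in simple linear regression:

```latex
\[ t = \frac{\bar{\beta}_1}{SE(\beta_1)}, \qquad df = n - 2 \]
```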

However, for non-linear regression models, the concept of a t-value is not directly applicable in the same way as it is in linear regression. The interpretation and calculation of significance for parameters in non-linear models can differ significantly based on the estimation method used and the specific context of the model.


Example

For example, with this (x, y) sample series, we can calculate all the statistics as follows.

x, y
0, 5
1, 12
2, 25
3, 44
4, 69
5, 100
6, 137
7, 180
8, 229
9, 284
10, 345
11, 412
12, 485

Regression provides an equation to predict one variable from another.

Regression analysis involves predicting the value of one variable based on the value of another variable. In this case, we are predicting the values of y based on the values of x.

n = 13

∑x (total) = 78
∑y (total) = 2327

x̄ (mean) = 6
ȳ (mean) =  179

∑(xᵢ-x̄) = 0
∑(yᵢ-ȳ) = 0

∑(xᵢ-x̄)² = 182
∑(xᵢ-x̄)(yᵢ-ȳ) = 7280

m (slope) = 40
b (intercept) = -61

With this we get the regression equation:

y = -61 + 40*x
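The regression figures above can be reproduced with a few lines of plain Python. This is a minimal sketch using only the standard library; the variable names (`sxx`, `sxy`, and so on) are my own and not taken from the cheatsheet.

```python
# Sample series from the example above.
xs = list(range(13))
ys = [5, 12, 25, 44, 69, 100, 137, 180, 229, 284, 345, 412, 485]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Sum of squared deviations of x, and sum of cross-deviations.
sxx = sum((x - x_mean) ** 2 for x in xs)
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))

m = sxy / sxx            # slope
b = y_mean - m * x_mean  # intercept

print(f"n = {n}, x mean = {x_mean}, y mean = {y_mean}")
print(f"sum (x - x mean)^2 = {sxx}, sum (x - x mean)(y - y mean) = {sxy}")
print(f"y = {b} + {m}*x")
```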

Correlation measures the strength and direction of the linear relationship between two variables.

It does not imply causation, only association.

The correlation coefficient (r) ranges from -1 to 1. A value closer to 1 indicates a strong positive correlation, while a value closer to -1 indicates a strong negative correlation. A value of 0 indicates no linear correlation.

∑(yᵢ-ȳ)²      = 309218
∑(xᵢ-x̄)(yᵢ-ȳ) = 7280

sₓ² (variance) = 15.17
sy² (variance) = 25768.17
covariance     = 606.67

sₓ (std dev)   = 3.89
sy (std dev)   = 160.52
r (pearson)    = 0.9704

R² = 0.9417

In this case, the correlation coefficient (r) is 0.97, indicating a very strong positive correlation between x and y.
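The correlation figures can be checked the same way. The sketch below uses the sample (n-1) convention and computes R² from its general definition, 1 - SSE/TSS, so that it can be compared with r². The variable names are my own.

```python
import math

xs = list(range(13))
ys = [5, 12, 25, 44, 69, 100, 137, 180, 229, 284, 345, 412, 485]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

sxx = sum((x - x_mean) ** 2 for x in xs)
syy = sum((y - y_mean) ** 2 for y in ys)  # this is also the TSS
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))

# Sample variance, covariance, and standard deviation (divide by n - 1).
var_x = sxx / (n - 1)
var_y = syy / (n - 1)
cov_xy = sxy / (n - 1)
std_x = math.sqrt(var_x)
std_y = math.sqrt(var_y)

# Pearson correlation coefficient.
r = cov_xy / (std_x * std_y)

# R² from the general definition, using the fitted line.
m = sxy / sxx
b = y_mean - m * x_mean
sse = sum((y - (b + m * x)) ** 2 for x, y in zip(xs, ys))
r_squared = 1 - sse / syy

print(f"variance x = {var_x:.2f}, variance y = {var_y:.2f}")
print(f"covariance = {cov_xy:.2f}")
print(f"r = {r:.4f}, R^2 = {r_squared:.4f}, r^2 = {r * r:.4f}")
```

For this simple linear fit, `r_squared` and `r * r` agree up to floating point rounding.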

The standard error of the slope (SE(β₁)), the t-value, and the p-value are related to the significance of the slope coefficient in a regression model. A low p-value indicates that the slope coefficient is statistically significant.

SSE = ∑ϵ² = 18018
MSE = ∑ϵ²/(n-2) = 1638
SE(β₁)  = √(MSE/∑(xᵢ-x̄)²) = 3.00
t-value = β̅₁/SE(β₁) = 13.33
p-value = 0.0000000391

We can calculate the above example using a worksheet, or a scripting language such as Python.
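As a minimal sketch of the scripting route, the residual statistics can be computed with the standard library alone. The names `sse`, `mse`, `se_slope`, and `t_value` are my own.

```python
import math

xs = list(range(13))
ys = [5, 12, 25, 44, 69, 100, 137, 180, 229, 284, 345, 412, 485]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

sxx = sum((x - x_mean) ** 2 for x in xs)
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))

m = sxy / sxx            # slope
b = y_mean - m * x_mean  # intercept

# Sum of squared residuals between observed and predicted y.
sse = sum((y - (b + m * x)) ** 2 for x, y in zip(xs, ys))

# Mean squared error with n - 2 residual degrees of freedom.
mse = sse / (n - 2)

# Standard error of the slope, then the t-value for a zero-slope null.
se_slope = math.sqrt(mse / sxx)
t_value = m / se_slope

print(f"SSE = {sse:.0f}, MSE = {mse:.0f}")
print(f"SE(slope) = {se_slope:.2f}, t-value = {t_value:.2f}")
```

The p-value additionally needs the Student's t distribution with n-2 degrees of freedom, which the standard library does not provide. Assuming SciPy is installed, `scipy.stats.t.sf(abs(t_value), n - 2) * 2` is one common way to get the two-tailed value.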


What Lies Ahead 🤔?

It is fun, right?

We can describe mathematical equations in a practical way.

It is nice to know the manual calculation, step by step, using these fundamental equations. But this might not be suitable for daily work, so we need to explore different calculation methods. The approach above is a complex way to get the result. Of course there is a simpler way, such as a built-in worksheet formula, or even a one-liner from a Python library.

Consider diving into the next step by exploring [ Trend - Properties - Formula ].