Preface
Goal: Getting to know statistical properties step by step, equation by equation, with a cheatsheet.
Basic math is required when the cases you face become complex, and when you have to test your calculations manually before using a built-in formula or a Python library. We need to know how the manual calculation works.
Even with an online calculator, how am I supposed to understand how the manual calculation works? How can we be sure that the result is correct if it doesn't come with the math? Of course there are books for this. But even when I know the concept and the math, how do I automate my job with scripting?
Here I provide the equation cheatsheet as fundamental knowledge before proceeding further. If you think my explanation or calculation is wrong, I welcome any better opinion.
Cheatsheet
For beginners, I provide an equation cheatsheet for linear regression on samples.
You can obtain both the SVG source and Tex draft here:
Note that populations and samples have different notations.
Equation for Linear Regression
Least Square Method
We introduced basic properties in the previous least squares article. Beyond those, other statistical properties are required for regression analysis and correlation.
You can get the source of the cheatsheet figure above here:
Data Series
With a sample of data points
Total
The count
and sum
can be represented as:
Mean
Just the average
Variance
The spread of each data series.
Covariance
Just like variance, but involving both axes.
Standard Deviation
Square root of variance.
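As a sketch of how these definitions (mean, variance, covariance, standard deviation) translate into plain Python, here is a minimal example using the (x, y) sample series from the worked example later in this article:

```python
# Manual calculation of mean, sample variance, sample covariance,
# and standard deviation for the article's example (x, y) series.
x = list(range(13))
y = [5, 12, 25, 44, 69, 100, 137, 180, 229, 284, 345, 412, 485]

n = len(x)
x_mean = sum(x) / n                                   # x̄ = 6
y_mean = sum(y) / n                                   # ȳ = 179

# Sample variance: squared deviations divided by (n - 1).
var_x = sum((xi - x_mean) ** 2 for xi in x) / (n - 1)
var_y = sum((yi - y_mean) ** 2 for yi in y) / (n - 1)

# Sample covariance: cross deviations divided by (n - 1).
cov_xy = sum((xi - x_mean) * (yi - y_mean)
             for xi, yi in zip(x, y)) / (n - 1)

# Standard deviation is the square root of the variance.
sd_x = var_x ** 0.5
sd_y = var_y ** 0.5

print(round(var_x, 2), round(var_y, 2))   # 15.17 25768.17
print(round(cov_xy, 2))                   # 606.67
print(round(sd_x, 2), round(sd_y, 2))     # 3.89 160.52
```

The results match the statistics listed in the worked example below.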
Slope and Intercept
We can calculate the slope (m) using this equation:
You can also represent the equation for the slope m as
This way you can get the intercept (b) as follows:
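The slope and intercept formulas can be sketched directly in plain Python, again using the article's sample data:

```python
# Manual slope (m) and intercept (b) from the least squares formulas.
x = list(range(13))
y = [5, 12, 25, 44, 69, 100, 137, 180, 229, 284, 345, 412, 485]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# m = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²
sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
sxx = sum((xi - x_mean) ** 2 for xi in x)
m = sxy / sxx

# b = ȳ - m·x̄
b = y_mean - m * x_mean

print(m, b)  # 40.0 -61.0
```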
Correlation Coefficient
Pearson
Similar to the slope, but using both sₓ and sy.
The Pearson correlation coefficient can also be expressed as follows:
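A minimal sketch of both expressions for r in plain Python, which also shows that they agree:

```python
# Pearson correlation coefficient r, computed two equivalent ways.
import math

x = list(range(13))
y = [5, 12, 25, 44, 69, 100, 137, 180, 229, 284, 345, 412, 485]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
sxx = sum((xi - x_mean) ** 2 for xi in x)
syy = sum((yi - y_mean) ** 2 for yi in y)

# r = Σ(xᵢ - x̄)(yᵢ - ȳ) / √(Σ(xᵢ - x̄)² · Σ(yᵢ - ȳ)²)
r = sxy / math.sqrt(sxx * syy)

# Equivalent: covariance divided by the product of std deviations.
cov = sxy / (n - 1)
r_alt = cov / (math.sqrt(sxx / (n - 1)) * math.sqrt(syy / (n - 1)))

print(round(r, 4))  # 0.9704
```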
Coefficient of Determination
The R Square (R²)
In general, the R Square (R²) can be defined as follows.
It is interesting that this is just a comparison of (the dispersion of observed yᵢ against predicted ŷᵢ) and (the dispersion of observed yᵢ against the mean ȳ).
The abbreviations are as follows:
- SSE: sum of squared errors, represents the sum of squared differences between the observed values of the dependent variable (yᵢ) and the predicted values (ŷᵢ)
- TSS: total sum of squares, which measures the total variability of the dependent variable around its mean (ȳ).
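The general definition R² = 1 − SSE/TSS can be sketched in plain Python, using the article's sample data and its fitted line ŷ = −61 + 40x:

```python
# R² from its general definition: R² = 1 - SSE/TSS.
x = list(range(13))
y = [5, 12, 25, 44, 69, 100, 137, 180, 229, 284, 345, 412, 485]

# Fitted line from the worked example: ŷ = -61 + 40x
m, b = 40, -61
y_hat = [b + m * xi for xi in x]

y_mean = sum(y) / len(y)

# SSE: observed against predicted; TSS: observed against the mean.
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
tss = sum((yi - y_mean) ** 2 for yi in y)

r_squared = 1 - sse / tss
print(sse, int(tss), round(r_squared, 4))  # 18018 309218 0.9417
```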
R² for simple linear regression
The R Square (R²) for simple linear regression is the same as r². This can be expressed as follows.
The equation above does not hold in general. In cases where the relationship between the variables is non-linear, the coefficient of determination (R²) may still be calculated using the general definition above (1 − SSE/TSS).
Residual
SSE can be defined as
While TSS measures deviation against the mean, SSE measures deviation against the prediction (the error). This similarity helps me remember the concept.
MSE
MSE can be defined as
While variance divides by the total degrees of freedom (n-1), MSE divides by the residual degrees of freedom (n-k-1). This similarity helps me remember the concept.
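A minimal sketch of the MSE calculation in plain Python, with k = 1 predictor for simple linear regression:

```python
# MSE: SSE divided by the residual degrees of freedom (n - k - 1).
x = list(range(13))
y = [5, 12, 25, 44, 69, 100, 137, 180, 229, 284, 345, 412, 485]

m, b = 40, -61          # fitted line from the worked example
n, k = len(x), 1

sse = sum((yi - (b + m * xi)) ** 2 for xi, yi in zip(x, y))
mse = sse / (n - k - 1)
print(mse)  # 1638.0
```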
Standard Error of Slope
This is also similar to the standard deviation, except that we need to divide (the error in the y axis) by (the variation in x) to get the slope's error.
This can be written as:
This similarity helps me remember the concept.
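In plain Python, the standard error of the slope for the article's sample data works out as:

```python
# Standard error of the slope: SE(β₁) = √(MSE / Σ(xᵢ - x̄)²)
import math

x = list(range(13))
y = [5, 12, 25, 44, 69, 100, 137, 180, 229, 284, 345, 412, 485]

m, b = 40, -61          # fitted line from the worked example
n = len(x)
x_mean = sum(x) / n

sse = sum((yi - (b + m * xi)) ** 2 for xi, yi in zip(x, y))
mse = sse / (n - 2)
sxx = sum((xi - x_mean) ** 2 for xi in x)

se_slope = math.sqrt(mse / sxx)
print(se_slope)  # 3.0
```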
t-value
The equation for calculating the t-value, in the context of linear regression analysis, is as follows, where β̂₁ is the slope m.
However, for non-linear regression models, the concept of a t-value is not directly applicable in the same way as it is in linear regression. The interpretation and calculation of significance for parameters in non-linear models can differ significantly based on the estimation method used and the specific context of the model.
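The t-value for the slope follows directly from the quantities above. A minimal sketch in plain Python (the p-value would additionally require the t-distribution CDF, e.g. from SciPy, so it is omitted here):

```python
# t-value for the slope: t = β̂₁ / SE(β₁)
import math

x = list(range(13))
y = [5, 12, 25, 44, 69, 100, 137, 180, 229, 284, 345, 412, 485]

m, b = 40, -61          # fitted line from the worked example
n = len(x)
x_mean = sum(x) / n

sse = sum((yi - (b + m * xi)) ** 2 for xi, yi in zip(x, y))
mse = sse / (n - 2)
sxx = sum((xi - x_mean) ** 2 for xi in x)
se_slope = math.sqrt(mse / sxx)

t_value = m / se_slope
print(round(t_value, 2))  # 13.33
```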
Example
For example, with this (x, y) sample series, we can calculate all of the statistics as follows.
x, y
0, 5
1, 12
2, 25
3, 44
4, 69
5, 100
6, 137
7, 180
8, 229
9, 284
10, 345
11, 412
12, 485
Regression analysis provides an equation to predict one variable from another. In this case, we are predicting the values of y based on the values of x.
n = 13
∑x (total) = 78
∑y (total) = 2327
x̄ (mean) = 6
ȳ (mean) = 179
∑(xᵢ-x̄) = 0
∑(yᵢ-ȳ) = 0
∑(xᵢ-x̄)² = 182
∑(xᵢ-x̄)(yᵢ-ȳ) = 7280
m (slope) = 40
b (intercept) = -61
With this we get the formula equation as:
y = -61 + 40*x
Correlation measures the strength and direction of the linear relationship between two variables. It does not imply causation, only association.
The correlation coefficient (r) ranges from -1 to 1. A value closer to 1 indicates a strong positive correlation, while a value closer to -1 indicates a strong negative correlation. A value of 0 indicates no linear correlation.
∑(yᵢ-ȳ)² = 309218
∑(xᵢ-x̄)(yᵢ-ȳ) = 7280
sₓ² (variance) = 15.17
sy² (variance) = 25,768.17
covariance = 606.67
sₓ (std dev) = 3.89
sy (std dev) = 160.52
r (Pearson) = 0.9704
R² = 0.9417
In this case, the correlation coefficient (r) is 0.97, indicating a very strong positive correlation between x and y.
The standard error of the slope (SE(β₁)), T-value, and p-value are related to the significance of the slope coefficient in a regression model. A low p-value indicates that the slope coefficient is significant.
SSE = ∑ϵ² = 18,018
MSE = ∑ϵ²/(n-2) = 1,638
SE(β₁) = √(MSE/∑(xᵢ-x̄)²) = 3.00
t-value = β̂₁/SE(β₁) = 13.33
p-value = 0.0000000391
We can calculate the above example using a worksheet, or a scripting language such as Python.
What Lies Ahead 🤔?
It is fun, right?
We can describe mathematical equations in a practical way.
It is nice to know the manual calculation, step by step, using these fundamental equations. But this might not be suitable on a daily basis, so we need to explore different calculation methods. The above is a complex way to get the result. Of course there are simpler ways to get the result, such as a built-in worksheet formula, or even a one-liner from a Python library.
Consider diving into the next step by exploring [ Trend - Properties - Formula ].