Preface
Goal: Getting to know statistical properties, using Python libraries step by step.
You should understand the statistical concepts before building a statistical model in a worksheet, so that you know how to debug when something goes wrong in your calculation.
From the equations and the statistical model in the worksheet, we can step further and implement the model in a script. This way we can go further with automation, nice visualization, and combination with other technologies.
Manual Calculation
We can summarize all the manual calculations of statistical properties from the previous worksheet with Python:
Python Source
The source can be obtained here:
Data Source
Instead of hardcoding the data, we can set up the source data in a CSV file.
The CSV looks like the file below:
x,y
0,5
1,12
2,25
3,44
4,69
5,100
6,137
7,180
8,229
9,284
10,345
11,412
12,485
Loading Data Series
First we need to load the data:
import numpy as np
# Getting Matrix Values
pairCSV = np.genfromtxt("50-samples.csv",
skip_header=1, delimiter=",", dtype=float)
# Extract x and y values from CSV data
x_observed = pairCSV[:, 0]
y_observed = pairCSV[:, 1]
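As a quick sanity check (an optional addition, not part of the original worksheet), we can verify that the file was parsed as expected:
# Verify the shape and the first pair of the parsed data
print(f'rows loaded: {pairCSV.shape[0]}')
print(f'first pair : {x_observed[0]:.0f}, {y_observed[0]:.0f}')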
Basic Properties
Then we calculate basic statistical properties.
# Number of data points
n = len(x_observed)
# Calculate sums
x_sum = np.sum(x_observed)
y_sum = np.sum(y_observed)
# Calculate means
x_mean = np.mean(x_observed)
y_mean = np.mean(y_observed)
# Output of basic properties
print(f'{f"n":10s} = {n:4d}')
print(f'∑x (total) = {x_sum:7.2f}')
print(f'∑y (total) = {y_sum:7.2f}')
print(f'x̄ (mean) = {x_mean:7.2f}')
print(f'ȳ (mean) = {y_mean:7.2f}')
print()
Deviations
Then we can find the least squares coefficients:
# Calculate deviations
x_deviation = x_observed - x_mean
y_deviation = y_observed - y_mean
# Calculate squared deviations
x_sq_deviation = np.sum(x_deviation ** 2)
y_sq_deviation = np.sum(y_deviation ** 2)
# Calculate cross-deviation
xy_cross_deviation = np.sum(x_deviation * y_deviation)
# Calculate slope (m) and intercept (b)
m_slope = xy_cross_deviation / x_sq_deviation
b_intercept = y_mean - m_slope * x_mean
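As a cross-check (an addition, not in the original worksheet), the same slope and intercept can be obtained from numpy's built-in least squares fit:
# np.polyfit of degree one returns [slope, intercept]
m_check, b_check = np.polyfit(x_observed, y_observed, 1)
print(f'polyfit: m = {m_check:.2f}, b = {b_check:.2f}')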
Let’s display it in the terminal.
# Output of least square calculation
print(f'∑(xᵢ-x̄) = {np.sum(x_deviation):9.2f}')
print(f'∑(yᵢ-ȳ) = {np.sum(y_deviation):9.2f}')
print(f'∑(xᵢ-x̄)² = {x_sq_deviation:9.2f}')
print(f'∑(yᵢ-ȳ)² = {y_sq_deviation:9.2f}')
print(f'∑(xᵢ-x̄)(yᵢ-ȳ) = {xy_cross_deviation:9.2f}')
print(f'm (slope) = {m_slope:9.2f}')
print(f'b (intercept) = {b_intercept:9.2f}')
print()
print(f'Equation y = ' \
+ f'{b_intercept:5.2f} + {m_slope:5.2f}.x')
print()
Variance and Covariance
Then we go further with variance and covariance, until we get the R-squared:
# Calculate variance
x_variance = x_sq_deviation / n
y_variance = y_sq_deviation / n
# Calculate covariance
xy_covariance = xy_cross_deviation / n
# Calculate standard deviations
x_std_dev = np.sqrt(x_variance)
y_std_dev = np.sqrt(y_variance)
# Calculate Pearson correlation coefficient (r)
r = xy_covariance / (x_std_dev * y_std_dev)
# Calculate R-squared (R²)
r_squared = r ** 2
# Output of correlation calculation
print(f'sₓ² (variance) = {x_variance:9.2f}')
print(f'sy² (variance) = {y_variance:9.2f}')
print(f'covariance = {xy_covariance:9.2f}')
print(f'sₓ (std dev) = {x_std_dev:9.2f}')
print(f'sy (std dev) = {y_std_dev:9.2f}')
print(f'r (pearson) = {r:9.2f}')
print(f'R² = {r_squared:9.2f}')
print()
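As another quick cross-check (again an addition, not part of the original script), the Pearson coefficient can also be read directly from numpy's correlation matrix:
# Cross-check r with numpy's correlation matrix
r_check = np.corrcoef(x_observed, y_observed)[0, 1]
print(f'corrcoef r = {r_check:9.2f}')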
Residuals
Then we examine the residuals, until we get the t-value.
# Create regression line
y_fit = m_slope * x_observed + b_intercept
y_err = y_observed - y_fit
# Calculate sum of squared residuals
ss_residuals = np.sum(y_err ** 2)
# Calculate degrees of freedom
df = n - 2
# Calculate variance of residuals (MSE)
var_residuals = ss_residuals / df
# Calculate standard error of the slope
std_err_slope = np.sqrt(var_residuals / x_sq_deviation)
# Calculate t-value
t_value = m_slope / std_err_slope
# Output the results
print(f'SSR = ∑ϵ² = {ss_residuals:9.2f}')
print(f'MSE = ∑ϵ²/(n-2) = {var_residuals:9.2f}')
print(f'SE(β₁) = √(MSE/∑(xᵢ-x̄)²) = {std_err_slope:9.2f}')
print(f't-value = β̂₁/SE(β₁) = {t_value:9.2f}')
print()
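All of these values can be cross-checked with scipy (shown here only as a verification; the library call is introduced properly later on):
from scipy.stats import linregress

# Cross-check slope and its standard error
result = linregress(x_observed, y_observed)
print(f'linregress: m = {result.slope:.2f}, '
      f'SE = {result.stderr:.2f}')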
Output Result
Now we have our result:
n = 13
∑x (total) = 78.00
∑y (total) = 2327.00
x̄ (mean) = 6.00
ȳ (mean) = 179.00
∑(xᵢ-x̄) = 0.00
∑(yᵢ-ȳ) = 0.00
∑(xᵢ-x̄)² = 182.00
∑(yᵢ-ȳ)² = 309218.00
∑(xᵢ-x̄)(yᵢ-ȳ) = 7280.00
m (slope) = 40.00
b (intercept) = -61.00
Equation y = -61.00 + 40.00.x
sₓ² (variance) = 14.00
sy² (variance) = 23786.00
covariance = 560.00
sₓ (std dev) = 3.74
sy (std dev) = 154.23
r (pearson) = 0.97
R² = 0.94
SSR = ∑ϵ² = 18018.00
MSE = ∑ϵ²/(n-2) = 1638.00
SE(β₁) = √(MSE/∑(xᵢ-x̄)²) = 3.00
t-value = β̂₁/SE(β₁) = 13.33
Interactive JupyterLab
You can obtain the interactive JupyterLab notebook from the following link:
Properties Helper
As you can see, the script above is too long for just obtaining the least squares coefficients. We can simply use a Python library to get only what we need, and shorten the script.
We also need to refactor the code for reuse, so first we need to solve this refactoring issue.
Handling Multiple Variables
The script above uses a lot of statistical properties. These local variables are scattered, and the worst part is that we do not know in advance which ones we need while doing statistical exploration. This makes the code harder to modularize.
With Python we can use the built-in locals() function. With locals(), we can bundle all local variables into a dictionary.
def get_properties(file_path) -> Dict:
    ...
    return locals()
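Here is a minimal, self-contained illustration of the idea (hypothetical names, just for demonstration):
def make_pair():
    a = 1
    b = 2
    return locals()

print(make_pair())
# {'a': 1, 'b': 2}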
Then in other Python files, we can load the properties from the `Properties.py` helper and unpack them into local variables. Note that updating locals() like this is only reliable at module scope; inside a function body, CPython ignores writes to locals().
properties = get_properties("50-samples.csv")
locals().update(properties)
Or we can use them in a function:
def display(properties: Dict) -> None:
    p = Dict2Obj(properties)
    ...
It is more comfortable to use an object than a dictionary, so we convert the dictionary keys into object attributes:
class Dict2Obj:
    def __init__(self, properties_dict):
        self.__dict__.update(properties_dict)
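For example (using a hypothetical dictionary, just to show the attribute access):
p = Dict2Obj({'n': 13, 'x_mean': 6.0})
print(p.n)       # 13
print(p.x_mean)  # 6.0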
The Properties Helper
Let’s create a Python helper containing reusable basic properties, for use with further calculations in different files.
In this helper we have two functions:
get_properties()
display()
The beginning part of the get_properties() function is very similar to the scripting code above.
import numpy as np
from scipy.stats import linregress
from typing import Dict

def get_properties(file_path) -> Dict:
    # Getting Matrix Values
    pairCSV = np.genfromtxt(file_path,
        skip_header=1, delimiter=",", dtype=float)
    # Extract x and y values from CSV data
    x_observed = pairCSV[:, 0]
    y_observed = pairCSV[:, 1]
    # Number of data points
    n = len(x_observed)
    # Calculate sums
    x_sum = np.sum(x_observed)
    y_sum = np.sum(y_observed)
    # Calculate means
    x_mean = np.mean(x_observed)
    y_mean = np.mean(y_observed)
Then we can use standard numpy methods:
    # Calculate variance
    # (np.var defaults to ddof=0: population variance,
    # matching the manual division by n)
    x_variance = np.var(x_observed)
    y_variance = np.var(y_observed)
    # Calculate covariance
    # (np.cov defaults to ddof=1: sample covariance,
    # hence a different value than the manual division by n)
    xy_covariance = np.cov(x_observed, y_observed)[0, 1]
    # Calculate standard deviations
    x_std_dev = np.std(x_observed)
    y_std_dev = np.std(y_observed)
And then we can get the regression using linregress():
    # Calculate slope (m), intercept (b),
    # and other regression parameters
    m_slope, b_intercept, r_value, p_value, \
        std_err_slope = linregress(x_observed, y_observed)
    # Create regression line
    y_fit = m_slope * x_observed + b_intercept
    y_err = y_observed - y_fit
    return locals()
Then we can display the properties in a different function, separating the model from the view.
def display(properties: Dict) -> None:
    p = Dict2Obj(properties)
    # Output basic properties
    print(f'{"n":10s} = {p.n:4d}')
    print(f'∑x (total) = {p.x_sum:7.2f}')
    print(f'∑y (total) = {p.y_sum:7.2f}')
    print(f'x̄ (mean) = {p.x_mean:7.2f}')
    print(f'ȳ (mean) = {p.y_mean:7.2f}')
    print()
    # Output statistics properties
    print(f'sₓ² (variance) = {p.x_variance:9.2f}')
    print(f'sy² (variance) = {p.y_variance:9.2f}')
    print(f'covariance = {p.xy_covariance:9.2f}')
    print(f'sₓ (std dev) = {p.x_std_dev:9.2f}')
    print(f'sy (std dev) = {p.y_std_dev:9.2f}')
    print(f'm (slope) = {p.m_slope:9.2f}')
    print(f'b (intercept) = {p.b_intercept:9.2f}')
    print()
    print(f'Equation y = '
        + f'{p.b_intercept:5.2f} + {p.m_slope:5.2f}.x')
    print()
Using The Helper
We can go further with other statistical properties.
Python Source
The source can be obtained here:
Loading The Helper
First we load the helper, then unpack all the calculated properties into local variables:
import numpy as np
# Local Library
from Properties import get_properties, display
# Load properties from Properties.py helper
# and unpack them into local variables
properties = get_properties("50-samples.csv")
display(properties)
locals().update(properties)
Proceed Further
Then we can proceed with the other calculations:
# Calculate R-squared (R²)
r_squared = r_value ** 2
# Output of correlation calculation
print(f'sₓ (std dev) = {x_std_dev:9.2f}')
print(f'sy (std dev) = {y_std_dev:9.2f}')
print(f'r (pearson) = {r_value:9.2f}')
print(f'R² = {r_squared:9.2f}')
print()
And finally the t-value, based on the residual calculation:
# Create regression line
y_fit = m_slope * x_observed + b_intercept
y_err = y_observed - y_fit
# Calculate variance of residuals (MSE)
var_residuals = np.sum(y_err ** 2) / (n - 2)
# Calculate t-value
t_value = m_slope / std_err_slope
# Output the results
print(f'MSE = ∑ϵ²/(n-2) = {var_residuals:9.2f}')
print(f'SE(β₁) = √(MSE/∑(xᵢ-x̄)²) = {std_err_slope:9.2f}')
print(f't-value = β̂₁/SE(β₁) = {t_value:9.2f}')
print()
Output Result
We can have our result as follows:
n = 13
∑x (total) = 78.00
∑y (total) = 2327.00
x̄ (mean) = 6.00
ȳ (mean) = 179.00
sₓ² (variance) = 14.00
sy² (variance) = 23786.00
covariance = 606.67
sₓ (std dev) = 3.74
sy (std dev) = 154.23
m (slope) = 40.00
b (intercept) = -61.00
Equation y = -61.00 + 40.00.x
sₓ (std dev) = 3.74
sy (std dev) = 154.23
r (pearson) = 0.97
R² = 0.94
MSE = ∑ϵ²/(n-2) = 1638.00
SE(β₁) = √(MSE/∑(xᵢ-x̄)²) = 3.00
t-value = β̂₁/SE(β₁) = 13.33
Interactive JupyterLab
You can obtain the interactive JupyterLab notebook from the following link:
Further Library
For the curious, we can go further, getting to know how a calculation works. For example, the p-value using the t-distribution.
Python Source
The source can be obtained here:
T Distribution
First we need to import the t-distribution. Note that we do not need the Properties helper here, to show how easy it is to obtain the values with plain Python.
import numpy as np
from scipy.stats import linregress
from scipy.stats import t as t_distribution
Then we get the source data as usual:
# Getting Matrix Values
pairCSV = np.genfromtxt("50-samples.csv",
skip_header=1, delimiter=",", dtype=float)
# Extract x and y values from CSV data
x_observed = pairCSV[:, 0]
y_observed = pairCSV[:, 1]
# Number of data points
n = len(x_observed)
Linear Regression
Then we calculate the linear regression using linregress():
# Calculate slope (m), intercept (b),
# and other regression parameters
m_slope, b_intercept, r_value, p_value, \
std_err_slope = linregress(x_observed, y_observed)
# Create regression line
y_fit = m_slope * x_observed + b_intercept
y_err = y_observed - y_fit
Then we calculate the p-value using cdf():
# Calculate degrees of freedom
df = n - 2
# Calculate variance of residuals (MSE)
var_residuals = np.sum(y_err ** 2) / (n - 2)
# Calculate t-value
t_value = m_slope / std_err_slope
# Calculate two-tailed p-value
p_value = 2 * (1 - t_distribution.cdf(abs(t_value), df))
# Output the results
print(f'Slope (m): {m_slope:.2f}')
print(f'Standard error of the slope: {std_err_slope:.2f}')
print(f't-value: {t_value:.2f}')
print(f'Two-tailed p-value: {p_value:.10f}')
print()
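Equivalently (a small addition, not in the original script), scipy provides the survival function sf(), which computes 1 - cdf with better numerical precision in the far tail:
# Same two-tailed p-value via the survival function
p_value_sf = 2 * t_distribution.sf(abs(t_value), df)
print(f'p-value via sf: {p_value_sf:.10f}')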
P Value and T Value
We can also calculate the critical t-value and the p-value manually, using ppf() and cdf().
# Define the significance level (alpha)
alpha = 0.05 # for a two-tailed test
# Calculate critical t-value
critical_t = t_distribution.ppf(1 - alpha / 2, df)
# Calculate the p-value manually
# (two-tailed: double the upper-tail probability)
td = t_distribution.cdf(abs(t_value), df)
p_value_manual = 2 * (1 - td)
print(f'Critical t-value: {critical_t:.10f}')
print(f'Manually calculated p-value: {p_value_manual:.10f}')
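Note that ppf() and cdf() are inverses of each other, which we can verify (a quick check, not in the original script):
# cdf(ppf(q, df), df) should give back q
q = t_distribution.cdf(critical_t, df)
print(f'{q:.3f}')  # 0.975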
Output Result
We can have our result as follows:
Slope (m): 40.00
Standard error of the slope: 3.00
t-value: 13.33
Two-tailed p-value: 0.0000000391
Critical t-value: 2.2009851601
Manually calculated p-value: 0.0000000391
That’s all. For further information, please contact your nearest statistician.
Interactive JupyterLab
You can obtain the interactive JupyterLab notebook from the following link:
What’s the Next Exciting Step 🤔?
We are ready for more usage of this Python helper, to give an interpretation of the statistical properties above through visualization.
Consider continuing your exploration by reading [ Trend - Properties - Visualization ].