Preface

Goal: Getting to know statistical properties, using Python libraries step by step.

You should understand the statistical concepts before building a statistical model in a worksheet, so that you know how to debug when something goes wrong in your calculation.

From the equations and the statistical model in the worksheet, we can step further and implement the model in a script. This way we can go further with automation and nice visualization, and combine them with other technologies.


Manual Calculation

We can summarize all the manual calculations of statistical properties from the previous worksheet with Python:

Python Source

The source can be obtained here:

Data Source

Instead of hardcoding the data, we can set up the source data in a CSV file.

The CSV looks like the file below:

x,y
0,5
1,12
2,25
3,44
4,69
5,100
6,137
7,180
8,229
9,284
10,345
11,412
12,485

Python: Manual Calculation: Statistical Properties: CSV Source
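
As a side note, this data series happens to fit the quadratic y = 3x² + 4x + 5 exactly. Below is a minimal sketch of my own to regenerate the CSV, assuming the file name used throughout this article:

import numpy as np

# Regenerate the sample data: y = 3x² + 4x + 5 for x = 0..12
x = np.arange(13)
y = 3 * x**2 + 4 * x + 5

# Write with the x,y header, matching the CSV above
np.savetxt("50-samples.csv",
  np.column_stack((x, y)),
  header="x,y", comments="",
  delimiter=",", fmt="%d")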

Loading Data Series

First we need to load the data:

import numpy as np

# Getting Matrix Values
pairCSV = np.genfromtxt("50-samples.csv",
  skip_header=1, delimiter=",", dtype=float)

# Extract x and y values from CSV data
x_observed = pairCSV[:, 0]
y_observed = pairCSV[:, 1]

Python: Manual Calculation: Statistical Properties: Series
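
As an alternative sketch, not used in the rest of this article, the same CSV could also be loaded with pandas:

import pandas as pd

# Equivalent loading with pandas
pair_df = pd.read_csv("50-samples.csv")
x_observed = pair_df["x"].to_numpy()
y_observed = pair_df["y"].to_numpy()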

Basic Properties

Then we calculate the basic statistical properties.

# Number of data points
n = len(x_observed)

# Calculate sums
x_sum = np.sum(x_observed)
y_sum = np.sum(y_observed)

# Calculate means
x_mean = np.mean(x_observed)
y_mean = np.mean(y_observed)

# Output of basic properties
print(f'{"n":10s} = {n:4d}')
print(f'∑x (total) = {x_sum:7.2f}')
print(f'∑y (total) = {y_sum:7.2f}')
print(f'x̄ (mean)   = {x_mean:7.2f}')
print(f'ȳ (mean)   = {y_mean:7.2f}')
print()

Python: Manual Calculation: Statistical Properties: Total and Mean

Deviations

Then we can find the least squares coefficients:

# Calculate deviations
x_deviation = x_observed - x_mean
y_deviation = y_observed - y_mean

# Calculate squared deviations
x_sq_deviation = np.sum(x_deviation ** 2)
y_sq_deviation = np.sum(y_deviation ** 2)

# Calculate cross-deviation
xy_cross_deviation = np.sum(x_deviation * y_deviation)
# Calculate slope (m) and intercept (b)
m_slope = xy_cross_deviation / x_sq_deviation
b_intercept = y_mean - m_slope * x_mean

Python: Manual Calculation: Statistical Properties: Deviations

Let’s display it in the terminal.

# Output of least square calculation
print(f'∑(xᵢ-x̄)    = {np.sum(x_deviation):9.2f}')
print(f'∑(yᵢ-ȳ)    = {np.sum(y_deviation):9.2f}')
print(f'∑(xᵢ-x̄)²   = {x_sq_deviation:9.2f}')
print(f'∑(yᵢ-ȳ)²   = {y_sq_deviation:9.2f}')
print(f'∑(xᵢ-x̄)(yᵢ-ȳ)  = {xy_cross_deviation:9.2f}')
print(f'm (slope)      = {m_slope:9.2f}')
print(f'b (intercept)  = {b_intercept:9.2f}')
print()

print(f'Equation     y = ' \
  + f'{b_intercept:5.2f} + {m_slope:5.2f}.x')
print()

Python: Manual Calculation: Statistical Properties: Deviations
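
As a quick sanity check of my own, numpy can fit the same line directly; the coefficients should match the manual m and b:

# Cross-check with numpy least squares:
# polyfit returns [slope, intercept] for degree 1
m_check, b_check = np.polyfit(x_observed, y_observed, 1)
print(f'm (polyfit)    = {m_check:9.2f}')
print(f'b (polyfit)    = {b_check:9.2f}')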

Variance and Covariance

Then we go further with variance and covariance, until we get the R-squared:

# Calculate variance
x_variance = x_sq_deviation / n
y_variance = y_sq_deviation / n

# Calculate covariance
xy_covariance = xy_cross_deviation / n

# Calculate standard deviations
x_std_dev = np.sqrt(x_variance)
y_std_dev = np.sqrt(y_variance)

# Calculate Pearson correlation coefficient (r)
r = xy_covariance / (x_std_dev * y_std_dev)

# Calculate R-squared (R²)
r_squared = r ** 2

Python: Manual Calculation: Statistical Properties: R Square

# Output of correlation calculation
print(f'sₓ² (variance) = {x_variance:9.2f}')
print(f'sy² (variance) = {y_variance:9.2f}')
print(f'covariance     = {xy_covariance:9.2f}')
print(f'sₓ (std dev)   = {x_std_dev:9.2f}')
print(f'sy (std dev)   = {y_std_dev:9.2f}')
print(f'r (pearson)    = {r:9.2f}')
print(f'R²             = {r_squared:9.2f}')
print()

Python: Manual Calculation: Statistical Properties: R Square
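
Again as a cross-check of my own, the Pearson r can also be obtained directly from numpy:

# The off-diagonal element of the
# correlation matrix is the Pearson r
r_check = np.corrcoef(x_observed, y_observed)[0, 1]
print(f'r (corrcoef)   = {r_check:9.2f}')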

Residuals

And also the residuals, until we get the t-value.

# Create regression line
y_fit = m_slope * x_observed + b_intercept
y_err = y_observed - y_fit

# Calculate sum of squared residuals
ss_residuals = np.sum(y_err ** 2)

# Calculate degrees of freedom
df = n - 2

# Calculate variance of residuals (MSE)
var_residuals = ss_residuals / df

# Calculate standard error of the slope
std_err_slope = np.sqrt(var_residuals / x_sq_deviation)

# Calculate t-value
t_value = m_slope / std_err_slope

Python: Manual Calculation: Statistical Properties: Residual

# Output the results
print(f'SSR = ∑ϵ²           = {ss_residuals:9.2f}')
print(f'MSE = ∑ϵ²/(n-2)     = {var_residuals:9.2f}')
print(f'SE(β₁)  = √(MSE/Sₓₓ) = {std_err_slope:9.2f}')  # Sₓₓ = ∑(xᵢ-x̄)²
print(f't-value = β₁/SE(β₁) = {t_value:9.2f}')
print()

Python: Manual Calculation: Statistical Properties: Residual

Output Result

Now we can have our result:

n          =   13
∑x (total) =   78.00
∑y (total) = 2327.00
x̄ (mean)   =    6.00
ȳ (mean)   =  179.00

∑(xᵢ-x̄)    =      0.00
∑(yᵢ-ȳ)    =      0.00
∑(xᵢ-x̄)²   =    182.00
∑(yᵢ-ȳ)²   = 309218.00
∑(xᵢ-x̄)(yᵢ-ȳ)  =   7280.00
m (slope)      =     40.00
b (intercept)  =    -61.00

Equation     y = -61.00 + 40.00.x

sₓ² (variance) =     14.00
sy² (variance) =  23786.00
covariance     =    560.00
sₓ (std dev)   =      3.74
sy (std dev)   =    154.23
r (pearson)    =      0.97
R²             =      0.94

SSR = ∑ϵ²           =  18018.00
MSE = ∑ϵ²/(n-2)     =   1638.00
SE(β₁)  = √(MSE/Sₓₓ) =      3.00
t-value = β₁/SE(β₁) =     13.33

Python: Manual Calculation: Statistical Properties: Result Output

Interactive JupyterLab

You can obtain the interactive JupyterLab notebook from the following link:


Properties Helper

As you can see, the script above is too long for just obtaining the least squares coefficients. We can simply use a Python library to get only what we need, and shorten the script.

We also need to refactor the code for reuse, so first we have to solve this refactoring issue.

Handling Multiple Variables

The script above uses a lot of statistical properties. These local variables are scattered, and the worst part is that we do not know in advance which ones we need while doing statistical exploration. This makes the code harder to modularize.

With Python we can use the built-in locals() function.

With locals(), we can bundle all local variables into a dictionary.

def get_properties(file_path) -> Dict:
  ...

  return locals()

Then, in other Python files, we can load the properties from the `Properties.py` helper and unpack them into local variables.

properties = get_properties("50-samples.csv")
locals().update(properties)

Or we can use it in a function.

def display(properties: Dict) -> None:
  p = Dict2Obj(properties)

  ...

It is more comfortable to work with an object than with a dictionary.

class Dict2Obj:
  def __init__(self, properties_dict):
    self.__dict__.update(properties_dict)

Python: Library: Statistical Properties: Helper: Skeleton
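
A tiny usage example of this Dict2Obj wrapper:

# Attribute access instead of dictionary lookup
p = Dict2Obj({"x_mean": 6.0, "y_mean": 179.0})
print(p.x_mean)  # 6.0, instead of properties["x_mean"]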

The Properties Helper

Let’s create a Python helper that provides reusable basic properties, for use with further calculations in different files.

In this helper we have two functions:

  • get_properties()
  • display()

The beginning part of the get_properties() function is very similar to the script above.

from typing import Dict

import numpy as np
from scipy.stats import linregress

def get_properties(file_path) -> Dict:
  # Getting Matrix Values
  pairCSV = np.genfromtxt(file_path,
    skip_header=1, delimiter=",", dtype=float)
  
  # Extract x and y values from CSV data
  x_observed = pairCSV[:, 0]
  y_observed = pairCSV[:, 1]

  # Number of data points
  n = len(x_observed)

  # Calculate sums
  x_sum = np.sum(x_observed)
  y_sum = np.sum(y_observed)

  # Calculate means
  x_mean = np.mean(x_observed)
  y_mean = np.mean(y_observed)

Python: Library: Statistical Properties: Helper: Total and Mean

Then we can use standard numpy methods. Note that np.var() and np.std() default to the population form (ddof=0), matching the manual calculation, while np.cov() defaults to the sample form (ddof=1); this is why the covariance in the output below differs from the manual result.

  # Calculate variance
  x_variance = np.var(x_observed)
  y_variance = np.var(y_observed)

  # Calculate covariance
  xy_covariance = np.cov(x_observed, y_observed)[0, 1]

  # Calculate standard deviations
  x_std_dev = np.std(x_observed)
  y_std_dev = np.std(y_observed)

Python: Library: Statistical Properties: Helper: Deviation

And then we can get the regression using linregress():

  # Calculate slope (m), intercept (b),
  # and other regression parameters
  m_slope, b_intercept, r_value, p_value, \
  std_err_slope = linregress(x_observed, y_observed)

  # Create regression line
  y_fit = m_slope * x_observed + b_intercept
  y_err = y_observed - y_fit

  return locals()

Python: Library: Statistical Properties: Helper: Linear Regression
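
As a side note, in recent scipy versions linregress() returns a named result object, so instead of tuple unpacking, the fields can also be read by attribute; a sketch:

from scipy.stats import linregress

# Attribute-style access to the regression result
result = linregress(x_observed, y_observed)
m_slope     = result.slope
b_intercept = result.intercept
r_value     = result.rvalue
# also: result.pvalue, result.stderr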

Then we can display the properties in a different function, separating the model from the view.

def display(properties: Dict) -> None:
  p = Dict2Obj(properties)

  # Output basic properties
print(f'{"n":10s} = {p.n:4d}')
  print(f'∑x (total) = {p.x_sum:7.2f}')
  print(f'∑y (total) = {p.y_sum:7.2f}')
  print(f'x̄ (mean)   = {p.x_mean:7.2f}')
  print(f'ȳ (mean)   = {p.y_mean:7.2f}')
  print()

  # Output statistics properties
  print(f'sₓ² (variance) = {p.x_variance:9.2f}')
  print(f'sy² (variance) = {p.y_variance:9.2f}')
  print(f'covariance     = {p.xy_covariance:9.2f}')
  print(f'sₓ (std dev)   = {p.x_std_dev:9.2f}')
  print(f'sy (std dev)   = {p.y_std_dev:9.2f}')
  print(f'm (slope)      = {p.m_slope:9.2f}')
  print(f'b (intercept)  = {p.b_intercept:9.2f}')
  print()

  print(f'Equation     y = ' \
    + f'{p.b_intercept:5.2f} + {p.m_slope:5.2f}.x')
  print()

Python: Library: Statistical Properties: Helper: Display
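
Optionally, we can append a main guard, so that Properties.py works both as an import and as a standalone script. This is a sketch of my own, assuming the CSV sits next to the helper:

if __name__ == "__main__":
  # Quick self-test when executed directly
  properties = get_properties("50-samples.csv")
  display(properties)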


Using The Helper

We can go further with other statistical properties.

Python Source

The source can be obtained here:

Loading The Helper

First, we pull all the calculated properties in as local variables:

import numpy as np

# Local Library
from Properties import get_properties, display

# Load properties from Properties.py helper
# and unpack them into local variables
properties = get_properties("50-samples.csv")
display(properties)
locals().update(properties)

Python: Library: Statistical Properties: Using Helper: Locals
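
One caveat worth knowing: locals().update() only works reliably at module level, where locals() is the same dictionary as globals(). Inside a function body, CPython ignores writes to locals(). A sketch of the safer function-scope alternative, assuming Dict2Obj is also exported by the Properties helper:

from Properties import get_properties, Dict2Obj

def analyze(file_path):
  # Inside a function, read via attributes;
  # locals().update() would silently do nothing here
  p = Dict2Obj(get_properties(file_path))
  return p.m_slope, p.b_intercept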

Proceed Further

Then we can proceed with other calculations:

# Calculate R-squared (R²)
r_squared = r_value ** 2

# Output of correlation calculation

print(f'sₓ (std dev)   = {x_std_dev:9.2f}')
print(f'sy (std dev)   = {y_std_dev:9.2f}')
print(f'r (pearson)    = {r_value:9.2f}')
print(f'R²             = {r_squared:9.2f}')
print()

Python: Library: Statistical Properties: Using Helper: R-Squared

And finally the t-value, based on the residual calculation.

# Create regression line
y_fit = m_slope * x_observed + b_intercept
y_err = y_observed - y_fit

# Calculate variance of residuals (MSE)
var_residuals = np.sum(y_err ** 2) / (n - 2)

# Calculate t-value
t_value = m_slope / std_err_slope

# Output the results
print(f'MSE = ∑ϵ²/(n-2)     = {var_residuals:9.2f}')
print(f'SE(β₁)  = √(MSE/Sₓₓ) = {std_err_slope:9.2f}')  # Sₓₓ = ∑(xᵢ-x̄)²
print(f't-value = β₁/SE(β₁) = {t_value:9.2f}')
print()

Python: Library: Statistical Properties: Using Helper: t-value

Output Result

We can have our result as follows:

n          =   13
∑x (total) =   78.00
∑y (total) = 2327.00
x̄ (mean)   =    6.00
ȳ (mean)   =  179.00

sₓ² (variance) =     14.00
sy² (variance) =  23786.00
covariance     =    606.67
sₓ (std dev)   =      3.74
sy (std dev)   =    154.23
m (slope)      =     40.00
b (intercept)  =    -61.00

Equation     y = -61.00 + 40.00.x

sₓ (std dev)   =      3.74
sy (std dev)   =    154.23
r (pearson)    =      0.97
R²             =      0.94

MSE = ∑ϵ²/(n-2)     =   1638.00
SE(β₁)  = √(MSE/Sₓₓ) =      3.00
t-value = β₁/SE(β₁) =     13.33

Python: Library: Statistical Properties: Using Helper: Output Result

Interactive JupyterLab

You can obtain the interactive JupyterLab notebook from the following link:


Further Library

For the curious, we can go further and get to know how a calculation works under the hood; for example, the p-value using the t-distribution.

Python Source

The source can be obtained here:

T Distribution

First we need to import the t-distribution. Note that we do not need the Properties helper here, to show how easy it is to obtain the value with plain Python.

import numpy as np
from scipy.stats import linregress
from scipy.stats import t as t_distribution

Then getting source data as usual.

# Getting Matrix Values
pairCSV = np.genfromtxt("50-samples.csv",
  skip_header=1, delimiter=",", dtype=float)

# Extract x and y values from CSV data
x_observed = pairCSV[:, 0]
y_observed = pairCSV[:, 1]

# Number of data points
n = len(x_observed)

Python: Further Library: Statistical Properties: Import

Linear Regression

Then calculate the linear regression using linregress().

# Calculate slope (m), intercept (b),
# and other regression parameters
m_slope, b_intercept, r_value, p_value, \
std_err_slope = linregress(x_observed, y_observed)

# Create regression line
y_fit = m_slope * x_observed + b_intercept
y_err = y_observed - y_fit

Python: Further Library: Statistical Properties: Linear Regression

Then calculate the p-value using the CDF.

# Calculate degrees of freedom
df = n - 2

# Calculate variance of residuals (MSE)
var_residuals = np.sum(y_err ** 2) / (n - 2)

# Calculate t-value
t_value = m_slope / std_err_slope

# Calculate two-tailed p-value
p_value = 2 * (1 - t_distribution.cdf(abs(t_value), df))

# Output the results
print(f'Slope (m): {m_slope:.2f}')
print(f'Standard error of the slope: {std_err_slope:.2f}')
print(f't-value: {t_value:.2f}')
print(f'Two-tailed p-value: {p_value:.10f}')
print()

Python: Further Library: Statistical Properties: t-value
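
As a side note, scipy also provides the survival function, which is numerically safer than 1 - cdf for very small p-values:

# Equivalent, with less floating point cancellation
p_value_sf = 2 * t_distribution.sf(abs(t_value), df)
print(f'Two-tailed p-value (sf): {p_value_sf:.10f}')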

P Value and T Value

We can also calculate the p-value manually using cdf(), along with the critical t-value from ppf() for comparison.

# Define the significance level (alpha)
alpha = 0.05  # for a two-tailed test

# Calculate critical t-value
critical_t = t_distribution.ppf(1 - alpha / 2, df)

# Calculate the p-value manually:
# two-tailed, so double the upper tail area
td = t_distribution.cdf(abs(t_value), df)
p_value_manual = 2 * (1 - td)

print(f'Critical t-value: {critical_t:.10f}')
print(f'Manually calculated p-value: {p_value_manual:.10f}')

Python: Further Library: Statistical Properties: Critical t-value
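
One common use of this critical t-value, as an addition of my own, is a confidence interval for the slope:

# 95% confidence interval for the slope:
# m ± t_crit · SE(β₁)
ci_lower = m_slope - critical_t * std_err_slope
ci_upper = m_slope + critical_t * std_err_slope
print(f'95% CI for slope: [{ci_lower:.2f}, {ci_upper:.2f}]')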

Output Result

We can have our result as follows:

Slope (m): 40.00
Standard error of the slope: 3.00
t-value: 13.33
Two-tailed p-value: 0.0000000391

Critical t-value: 2.2009851601
Manually calculated p-value: 0.0000000391

Python: Further Library: Statistical Properties: Output Result of p-value

That’s all. For further information, please contact your nearest statistician.

Interactive JupyterLab

You can obtain the interactive JupyterLab notebook from the following link:


What’s the Next Exciting Step 🤔?

We are ready for more usage of this Python helper, to give an interpretation of the statistical properties above through visualization.

Consider continuing your exploration by reading [ Trend - Properties - Visualization ].