### Preface

Goal: Getting to know statistical properties, step by step, using Python libraries.

You should understand the statistical concepts before building a statistical model in a worksheet. That way, you will know how to debug when something goes wrong in your calculation.

From the equations and the statistical model in the worksheet, we can take a step further and implement the model in a script. This opens the way to automation, nicer visualization, and combination with other technologies.

### Manual Calculation

We can reproduce all the manual calculations of statistical properties from the previous worksheet with Python:

#### Python Source

The source can be obtained here:

#### Data Source

Instead of hardcoding the data, we can set up the source data in a CSV file.

The CSV file looks like this:

```
x,y
0,5
1,12
2,25
3,44
4,69
5,100
6,137
7,180
8,229
9,284
10,345
11,412
12,485
```
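Incidentally, this sample data follows the polynomial y = 3x² + 4x + 5 exactly. If you want to regenerate `50-samples.csv` yourself, a minimal sketch:

```python
import csv

# The sample data follows y = 3x² + 4x + 5 exactly.
# Regenerate "50-samples.csv" from that formula.
with open("50-samples.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["x", "y"])
    for x in range(13):
        writer.writerow([x, 3 * x**2 + 4 * x + 5])
```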

#### Loading Data Series

First we need to load the data:

```
import numpy as np

# Getting Matrix Values
pairCSV = np.genfromtxt("50-samples.csv",
  skip_header=1, delimiter=",", dtype=float)

# Extract x and y values from CSV data
x_observed = pairCSV[:, 0]
y_observed = pairCSV[:, 1]
```

#### Basic Properties

Then we calculate the basic statistical properties.

```
# Number of data points
n = len(x_observed)
# Calculate sums
x_sum = np.sum(x_observed)
y_sum = np.sum(y_observed)
# Calculate means
x_mean = np.mean(x_observed)
y_mean = np.mean(y_observed)
# Output of basic properties
print(f'{"n":10s} = {n:4d}')
print(f'∑x (total) = {x_sum:7.2f}')
print(f'∑y (total) = {y_sum:7.2f}')
print(f'x̄ (mean) = {x_mean:7.2f}')
print(f'ȳ (mean) = {y_mean:7.2f}')
print()
```
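As a quick sanity check against the worksheet values, the same sums and means can be computed with the data inlined (so the snippet is self-contained):

```python
import numpy as np

# Sample data from 50-samples.csv, inlined
x_observed = np.arange(13)
y_observed = np.array([5, 12, 25, 44, 69, 100, 137,
                       180, 229, 284, 345, 412, 485])

n = len(x_observed)
x_sum = np.sum(x_observed)
y_sum = np.sum(y_observed)
x_mean = np.mean(x_observed)
y_mean = np.mean(y_observed)

print(n, x_sum, y_sum, x_mean, y_mean)
# 13 78 2327 6.0 179.0
```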

#### Deviations

Then we can find the least squares coefficients:

```
# Calculate deviations
x_deviation = x_observed - x_mean
y_deviation = y_observed - y_mean
# Calculate squared deviations
x_sq_deviation = np.sum(x_deviation ** 2)
y_sq_deviation = np.sum(y_deviation ** 2)
# Calculate cross-deviation
xy_cross_deviation = np.sum(x_deviation * y_deviation)
# Calculate slope (m) and intercept (b)
m_slope = xy_cross_deviation / x_sq_deviation
b_intercept = y_mean - m_slope * x_mean
```
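The deviation formulas can be cross-checked against `np.polyfit` with degree 1, which performs the same least squares fit (data inlined to keep the snippet self-contained):

```python
import numpy as np

x = np.arange(13)
y = np.array([5, 12, 25, 44, 69, 100, 137,
              180, 229, 284, 345, 412, 485])

# Least squares via deviations, as above
m = np.sum((x - x.mean()) * (y - y.mean())) \
    / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()

# Cross-check with numpy's built-in degree-1 fit
m_fit, b_fit = np.polyfit(x, y, 1)

print(f'{m:.2f}, {b:.2f}')          # 40.00, -61.00
print(f'{m_fit:.2f}, {b_fit:.2f}')  # 40.00, -61.00
```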

Let’s display the results in the terminal.

```
# Output of least square calculation
print(f'∑(xᵢ-x̄) = {np.sum(x_deviation):9.2f}')
print(f'∑(yᵢ-ȳ) = {np.sum(y_deviation):9.2f}')
print(f'∑(xᵢ-x̄)² = {x_sq_deviation:9.2f}')
print(f'∑(yᵢ-ȳ)² = {y_sq_deviation:9.2f}')
print(f'∑(xᵢ-x̄)(yᵢ-ȳ) = {xy_cross_deviation:9.2f}')
print(f'm (slope) = {m_slope:9.2f}')
print(f'b (intercept) = {b_intercept:9.2f}')
print()
print(f'Equation y = '
  + f'{b_intercept:5.2f} + {m_slope:5.2f}.x')
print()
```

#### Variance and Covariance

Then we go further with variance and covariance, until we get the R squared:

```
# Calculate variance
x_variance = x_sq_deviation / n
y_variance = y_sq_deviation / n
# Calculate covariance
xy_covariance = xy_cross_deviation / n
# Calculate standard deviations
x_std_dev = np.sqrt(x_variance)
y_std_dev = np.sqrt(y_variance)
# Calculate Pearson correlation coefficient (r)
r = xy_covariance / (x_std_dev * y_std_dev)
# Calculate R-squared (R²)
r_squared = r ** 2
```
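The Pearson coefficient can also be cross-checked with `np.corrcoef`, which returns the full correlation matrix:

```python
import numpy as np

x = np.arange(13)
y = np.array([5, 12, 25, 44, 69, 100, 137,
              180, 229, 284, 345, 412, 485])

# Off-diagonal element is the correlation of x and y
r = np.corrcoef(x, y)[0, 1]

print(f'r = {r:.2f}, R² = {r ** 2:.2f}')
# r = 0.97, R² = 0.94
```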

```
# Output of correlation calculation
print(f'sₓ² (variance) = {x_variance:9.2f}')
print(f'sy² (variance) = {y_variance:9.2f}')
print(f'covariance = {xy_covariance:9.2f}')
print(f'sₓ (std dev) = {x_std_dev:9.2f}')
print(f'sy (std dev) = {y_std_dev:9.2f}')
print(f'r (pearson) = {r:9.2f}')
print(f'R² = {r_squared:9.2f}')
print()
```

#### Residuals

Then we compute the residuals, leading to the standard error and the t-value.

```
# Create regression line
y_fit = m_slope * x_observed + b_intercept
y_err = y_observed - y_fit
# Calculate sum of squared residuals
ss_residuals = np.sum(y_err ** 2)
# Calculate degrees of freedom
df = n - 2
# Calculate variance of residuals (MSE)
var_residuals = ss_residuals / df
# Calculate standard error of the slope
std_err_slope = np.sqrt(var_residuals / x_sq_deviation)
# Calculate t-value
t_value = m_slope / std_err_slope
```
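These residual-based figures can be verified independently: `scipy.stats.linregress` reports the same standard error of the slope, so the t-value follows directly:

```python
import numpy as np
from scipy.stats import linregress

x = np.arange(13)
y = np.array([5, 12, 25, 44, 69, 100, 137,
              180, 229, 284, 345, 412, 485])

# linregress returns slope, intercept, r, p, and stderr
res = linregress(x, y)
t_value = res.slope / res.stderr

print(f'SE = {res.stderr:.2f}, t = {t_value:.2f}')
# SE = 3.00, t = 13.33
```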

```
# Output the results
print(f'SSR = ∑ϵ² = {ss_residuals:9.2f}')
print(f'MSE = ∑ϵ²/(n-2) = {var_residuals:9.2f}')
print(f'SE(β₁) = √(MSE/sₓ) = {std_err_slope:9.2f}')
print(f't-value = β̅₁/SE(β₁) = {t_value:9.2f}')
print()
```

#### Output Result

Now we can have our result:

```
n = 13
∑x (total) = 78.00
∑y (total) = 2327.00
x̄ (mean) = 6.00
ȳ (mean) = 179.00
∑(xᵢ-x̄) = 0.00
∑(yᵢ-ȳ) = 0.00
∑(xᵢ-x̄)² = 182.00
∑(yᵢ-ȳ)² = 309218.00
∑(xᵢ-x̄)(yᵢ-ȳ) = 7280.00
m (slope) = 40.00
b (intercept) = -61.00
Equation y = -61.00 + 40.00.x
sₓ² (variance) = 14.00
sy² (variance) = 23786.00
covariance = 560.00
sₓ (std dev) = 3.74
sy (std dev) = 154.23
r (pearson) = 0.97
R² = 0.94
SSR = ∑ϵ² = 18018.00
MSE = ∑ϵ²/(n-2) = 1638.00
SE(β₁) = √(MSE/sₓ) = 3.00
t-value = β̅₁/SE(β₁) = 13.33
```

#### Interactive JupyterLab

You can obtain the interactive `JupyterLab` in the following link:

### Properties Helper

As you can see, the script above is too long for just obtaining the least squares coefficients. We can simply use a Python library to get only what we need, and shorten the script.

We also need to refactor the code for reuse, so first we need to solve this refactoring issue.

#### Handling Multiple Variables

The script above uses a lot of statistical properties. These local variables are scattered, and the worst part is that we do not know in advance which ones we need while doing statistical exploration. This makes the code harder to modularize.

With Python we can use the built-in `locals()` function. With `locals()`, we can bundle all local variables into a dictionary.

```
def get_properties(file_path) -> Dict:
    ...
    return locals()
```

Then, in other Python files, we can load the properties from the `Properties.py` helper and unpack them into local variables. Note that updating `locals()` like this only works reliably at module scope, where `locals()` is the actual global namespace.

```
properties = get_properties("50-samples.csv")
locals().update(properties)
```

Or we can use it in a function.

```
def display(properties: Dict) -> None:
    p = Dict2Obj(properties)
    ...
```

It is more comfortable to use an object than a dictionary.

```
class Dict2Obj:
    def __init__(self, properties_dict):
        self.__dict__.update(properties_dict)
```
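A quick usage example of `Dict2Obj` (repeating the class so the snippet is self-contained):

```python
class Dict2Obj:
    def __init__(self, properties_dict):
        self.__dict__.update(properties_dict)

# Wrap a plain dictionary, then use attribute access
p = Dict2Obj({'n': 13, 'x_mean': 6.0})
print(p.n, p.x_mean)  # 13 6.0
```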

#### The Properties Helper

Let’s create a reusable Python helper providing the basic properties, for use with further calculations in different files.

In this helper we have two functions:

`get_properties()`

`display()`

The beginning part of the `get_properties()` function is very similar to the script above.

```
def get_properties(file_path) -> Dict:
    # Getting Matrix Values
    pairCSV = np.genfromtxt(file_path,
      skip_header=1, delimiter=",", dtype=float)

    # Extract x and y values from CSV data
    x_observed = pairCSV[:, 0]
    y_observed = pairCSV[:, 1]

    # Number of data points
    n = len(x_observed)

    # Calculate sums
    x_sum = np.sum(x_observed)
    y_sum = np.sum(y_observed)

    # Calculate means
    x_mean = np.mean(x_observed)
    y_mean = np.mean(y_observed)
```

Then we can use the standard NumPy methods:

```
    # Calculate variance
    x_variance = np.var(x_observed)
    y_variance = np.var(y_observed)

    # Calculate covariance
    xy_covariance = np.cov(x_observed, y_observed)[0, 1]

    # Calculate standard deviations
    x_std_dev = np.std(x_observed)
    y_std_dev = np.std(y_observed)
```
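One caveat about defaults: `np.var` and `np.std` compute population statistics (`ddof=0`), while `np.cov` defaults to the sample covariance (`ddof=1`). That is why the helper prints a covariance of 606.67, instead of the 560.00 we got from the manual ∑(xᵢ-x̄)(yᵢ-ȳ)/n calculation:

```python
import numpy as np

x = np.arange(13)
y = np.array([5, 12, 25, 44, 69, 100, 137,
              180, 229, 284, 345, 412, 485])

cov_sample = np.cov(x, y)[0, 1]              # divides by n-1
cov_population = np.cov(x, y, ddof=0)[0, 1]  # divides by n

print(f'{cov_sample:.2f}')      # 606.67
print(f'{cov_population:.2f}')  # 560.00
```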

And then we can get the regression using `linregress()` from `scipy.stats`:

```
    # Calculate slope (m), intercept (b),
    # and other regression parameters
    m_slope, b_intercept, r_value, p_value, \
        std_err_slope = linregress(x_observed, y_observed)

    # Create regression line
    y_fit = m_slope * x_observed + b_intercept
    y_err = y_observed - y_fit

    return locals()
```

Then we can display the properties in a different function, separating the model from the view.

```
def display(properties: Dict) -> None:
    p = Dict2Obj(properties)

    # Output basic properties
    print(f'{"n":10s} = {p.n:4d}')
    print(f'∑x (total) = {p.x_sum:7.2f}')
    print(f'∑y (total) = {p.y_sum:7.2f}')
    print(f'x̄ (mean) = {p.x_mean:7.2f}')
    print(f'ȳ (mean) = {p.y_mean:7.2f}')
    print()

    # Output statistics properties
    print(f'sₓ² (variance) = {p.x_variance:9.2f}')
    print(f'sy² (variance) = {p.y_variance:9.2f}')
    print(f'covariance = {p.xy_covariance:9.2f}')
    print(f'sₓ (std dev) = {p.x_std_dev:9.2f}')
    print(f'sy (std dev) = {p.y_std_dev:9.2f}')
    print(f'm (slope) = {p.m_slope:9.2f}')
    print(f'b (intercept) = {p.b_intercept:9.2f}')
    print()

    print(f'Equation y = '
        + f'{p.b_intercept:5.2f} + {p.m_slope:5.2f}.x')
    print()
```

### Using The Helper

We can go further with other statistical properties.

#### Python Source

The source can be obtained here:

#### Loading The Helper

First we load the helper and unpack all the calculated properties into local variables:

```
import numpy as np
# Local Library
from Properties import get_properties, display
# Load properties from Properties.py helper
# and unpack them into local variables
properties = get_properties("50-samples.csv")
display(properties)
locals().update(properties)
```

#### Proceed Further

Then we can proceed with the other calculations:

```
# Calculate R-squared (R²)
r_squared = r_value ** 2
# Output of correlation calculation
print(f'sₓ (std dev) = {x_std_dev:9.2f}')
print(f'sy (std dev) = {y_std_dev:9.2f}')
print(f'r (pearson) = {r_value:9.2f}')
print(f'R² = {r_squared:9.2f}')
print()
```

And finally the t-value, based on the residual calculation.

```
# Create regression line
y_fit = m_slope * x_observed + b_intercept
y_err = y_observed - y_fit
# Calculate variance of residuals (MSE)
var_residuals = np.sum(y_err ** 2) / (n - 2)
# Calculate t-value
t_value = m_slope / std_err_slope
# Output the results
print(f'MSE = ∑ϵ²/(n-2) = {var_residuals:9.2f}')
print(f'SE(β₁) = √(MSE/sₓ) = {std_err_slope:9.2f}')
print(f't-value = β̅₁/SE(β₁) = {t_value:9.2f}')
print()
```

#### Output Result

We can have our result as follows:

```
n = 13
∑x (total) = 78.00
∑y (total) = 2327.00
x̄ (mean) = 6.00
ȳ (mean) = 179.00
sₓ² (variance) = 14.00
sy² (variance) = 23786.00
covariance = 606.67
sₓ (std dev) = 3.74
sy (std dev) = 154.23
m (slope) = 40.00
b (intercept) = -61.00
Equation y = -61.00 + 40.00.x
sₓ (std dev) = 3.74
sy (std dev) = 154.23
r (pearson) = 0.97
R² = 0.94
MSE = ∑ϵ²/(n-2) = 1638.00
SE(β₁) = √(MSE/sₓ) = 3.00
t-value = β̅₁/SE(β₁) = 13.33
```

#### Interactive JupyterLab

You can obtain the interactive `JupyterLab` in the following link:

### Further Library

For the curious ones, we can go further and get to know how a calculation works under the hood. For example, the `p-value` using the `t_distribution`.

#### Python Source

The source can be obtained here:

#### T Distribution

First we need to import the t-distribution. Note that we do not need the `Properties` helper here, to show how easy it is to obtain the value with plain Python.

```
import numpy as np
from scipy.stats import linregress
from scipy.stats import t as t_distribution
```

Then we get the source data as usual.

```
# Getting Matrix Values
pairCSV = np.genfromtxt("50-samples.csv",
  skip_header=1, delimiter=",", dtype=float)

# Extract x and y values from CSV data
x_observed = pairCSV[:, 0]
y_observed = pairCSV[:, 1]

# Number of data points
n = len(x_observed)
```

#### Linear Regression

Then we calculate the linear regression using `linregress`.

```
# Calculate slope (m), intercept (b),
# and other regression parameters
m_slope, b_intercept, r_value, p_value, \
    std_err_slope = linregress(x_observed, y_observed)

# Create regression line
y_fit = m_slope * x_observed + b_intercept
y_err = y_observed - y_fit
```

Then we calculate the `p-value` using `cdf`.

```
# Calculate degrees of freedom
df = n - 2
# Calculate variance of residuals (MSE)
var_residuals = np.sum(y_err ** 2) / (n - 2)
# Calculate t-value
t_value = m_slope / std_err_slope
# Calculate two-tailed p-value
p_value = 2 * (1 - t_distribution.cdf(abs(t_value), df))
# Output the results
print(f'Slope (m): {m_slope:.2f}')
print(f'Standard error of the slope: {std_err_slope:.2f}')
print(f't-value: {t_value:.2f}')
print(f'Two-tailed p-value: {p_value:.10f}')
print()
```
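As a side note, `1 - cdf` loses floating point precision for very large t-values; the survival function `sf` is the numerically safer equivalent:

```python
from scipy.stats import t as t_distribution

# Values from the regression above: t = m/SE = 40/3, df = n-2
t_value, df = 40 / 3, 11

p_subtract = 2 * (1 - t_distribution.cdf(abs(t_value), df))
p_sf = 2 * t_distribution.sf(abs(t_value), df)

# Both agree here; sf stays accurate even for extreme t
print(f'{p_subtract:.10f}')
print(f'{p_sf:.10f}')
```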

#### P Value and T Value

We can also calculate the `p-value` manually, using the critical `t-value`, with `ppf` and `cdf`.

```
# Define the significance level (alpha)
alpha = 0.05  # for a two-tailed test

# Calculate critical t-value
critical_t = t_distribution.ppf(1 - alpha / 2, df)

# Calculate the p-value manually
td = t_distribution.cdf(abs(t_value), df)
if abs(t_value) > critical_t:
    p_value_manual = 2 * (1 - td)
else:
    p_value_manual = 2 * td

print(f'Critical t-value: {critical_t:.10f}')
print(f'Manually calculated p-value: {p_value_manual:.10f}')
```

#### Output Result

We can have our result as follows:

```
Slope (m): 40.00
Standard error of the slope: 3.00
t-value: 13.33
Two-tailed p-value: 0.0000000391
Critical t-value: 2.2009851601
Manually calculated p-value: 0.0000000391
```

That’s all. For further information, please contact your nearest statistician.

#### Interactive JupyterLab

You can obtain the interactive `JupyterLab` in the following link:

### What’s the Next Exciting Step 🤔?

We are now ready to make further use of this Python helper: giving an interpretation of the statistical properties above, through visualization.

Consider continuing your exploration by reading [ Trend - Properties - Visualization ].