Preface
Goal: Explore statistical properties with Julia, modeling the provided data with a linear model.
Julia is so cool that it can express a mathematical function in a form close to the original equation.
However, it is still safer to write code conservatively. The fact that Julia can do this does not mean we have to push the feature every time.
Statistical Properties
We will have a look at how Julia handles statistical properties, starting with least squares.
Manual Calculation
You can check the details of the manual calculation in the source code below.
After reading the data from the CSV file, we extract the x and y values.
using CSV, DataFrames, Printf, Statistics
pairCSV = CSV.read("50-samples.csv", DataFrame)
x_observed = pairCSV.x
y_observed = pairCSV.y
We calculate the number of data points, sums, and means, then print out the basic properties.
n = length(x_observed)
x_sum = sum(x_observed)
y_sum = sum(y_observed)
x_mean = mean(x_observed)
y_mean = mean(y_observed)
@printf("%-10s = %4d\n", "n", n)
@printf("∑x (total) = %7.2f\n", x_sum)
@printf("∑y (total) = %7.2f\n", y_sum)
@printf("x̄ (mean) = %7.2f\n", x_mean)
@printf("ȳ (mean) = %7.2f\n\n", y_mean)
We calculate further: deviations, squared deviations, the cross-deviation, the slope (m), and the intercept (b), then print out the least squares calculation.
x_deviation = x_observed .- x_mean
y_deviation = y_observed .- y_mean
x_sq_deviation = sum(x_deviation .^ 2)
y_sq_deviation = sum(y_deviation .^ 2)
xy_cross_deviation = sum(x_deviation .* y_deviation)
m_slope = xy_cross_deviation / x_sq_deviation
b_intercept = y_mean - m_slope * x_mean
@printf("∑(xᵢ-x̄) = %9.2f\n", sum(x_deviation))
@printf("∑(yᵢ-ȳ) = %9.2f\n", sum(y_deviation))
@printf("∑(xᵢ-x̄)² = %9.2f\n", x_sq_deviation)
@printf("∑(yᵢ-ȳ)² = %9.2f\n", y_sq_deviation)
@printf("∑(xᵢ-x̄)(yᵢ-ȳ) = %9.2f\n", xy_cross_deviation)
@printf("m (slope) = %9.2f\n", m_slope)
@printf("b (intercept) = %9.2f\n\n", b_intercept)
@printf("Equation y = %.2f + %.2f.x\n\n", b_intercept, m_slope)
We calculate further: variance, covariance, standard deviations, the Pearson correlation coefficient (r), and R-squared (R²), then print out the correlation calculations. For simple linear regression, R² is just the square of r.
x_variance = x_sq_deviation / (n-1)
y_variance = y_sq_deviation / (n-1)
xy_covariance = xy_cross_deviation / (n-1)
x_std_dev = sqrt(x_variance)
y_std_dev = sqrt(y_variance)
r = xy_covariance / (x_std_dev * y_std_dev)
r_squared = r^2
@printf("sₓ² (variance) = %9.2f\n", x_variance)
@printf("sy² (variance) = %9.2f\n", y_variance)
@printf("covariance = %9.2f\n", xy_covariance)
@printf("sₓ (std dev) = %9.2f\n", x_std_dev)
@printf("sy (std dev) = %9.2f\n", y_std_dev)
@printf("r (pearson) = %9.2f\n", r)
@printf("R² = %9.2f\n\n", r_squared)
We need to create the regression line, along with the residual errors, using array operations, then calculate the sum of squared residuals the same way.
Based on the degrees of freedom, we calculate further: the variance of the residuals (MSE), the standard error of the slope, and the t-value, then print out the results.
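In symbols, the quantities computed below are:

$$\mathrm{MSE} = \frac{\sum_i \epsilon_i^2}{n - 2}, \qquad \mathrm{SE}(\hat{\beta}_1) = \sqrt{\frac{\mathrm{MSE}}{\sum_i (x_i - \bar{x})^2}}, \qquad t = \frac{\hat{\beta}_1}{\mathrm{SE}(\hat{\beta}_1)}$$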
y_fit = m_slope .* x_observed .+ b_intercept
y_err = y_observed .- y_fit
ss_residuals = sum(y_err .^ 2)
df = n - 2
var_residuals = ss_residuals / df
std_err_slope = sqrt(var_residuals / x_sq_deviation)
t_value = m_slope / std_err_slope
@printf("SSR = ∑ϵ² = %9.2f\n", ss_residuals)
@printf("MSE = ∑ϵ²/(n-2) = %9.2f\n", var_residuals)
@printf("SE(β₁) = √(MSE/sₓ) = %9.2f\n", std_err_slope)
@printf("t-value = β̅₁/SE(β₁) = %9.2f\n\n", t_value)
We can see the result as follows.
❯ julia 51-lq-manual.jl
n = 13
∑x (total) = 78.00
∑y (total) = 2327.00
x̄ (mean) = 6.00
ȳ (mean) = 179.00
∑(xᵢ-x̄) = 0.00
∑(yᵢ-ȳ) = 0.00
∑(xᵢ-x̄)² = 182.00
∑(yᵢ-ȳ)² = 309218.00
∑(xᵢ-x̄)(yᵢ-ȳ) = 7280.00
m (slope) = 40.00
b (intercept) = -61.00
Equation y = -61.00 + 40.00·x
sₓ² (variance) = 15.17
sy² (variance) = 25768.17
covariance = 606.67
sₓ (std dev) = 3.89
sy (std dev) = 160.52
r (pearson) = 0.97
R² = 0.94
SSR = ∑ϵ² = 18018.00
MSE = ∑ϵ²/(n-2) = 1638.00
SE(β₁) = √(MSE/∑(xᵢ-x̄)²) = 3.00
t-value = β̂₁/SE(β₁) = 13.33
You can obtain the interactive JupyterLab in the following link:
GLM Library
Using Built-in Methods
We can simplify the above calculation with built-in methods. We need the GLM library.
using CSV, DataFrames, Printf, Statistics, GLM
Very similar to R, we can use lm() to fit a linear model. From this linear model we can get the coefficients, then extract the slope and intercept, and print out the least squares results.
model = lm(@formula(y ~ x), pairCSV)
coefs = coef(model)
m_slope = coefs[2]
b_intercept = coefs[1]
@printf("m (slope) = %9.2f\n", m_slope)
@printf("b (intercept) = %9.2f\n\n", b_intercept)
@printf("Equation y = %.2f + %.2f.x\n\n", b_intercept, m_slope)
There are also built-in methods to calculate the variance, covariance, standard deviations, and even the Pearson correlation coefficient (r). From r we can manually calculate R-squared (R²).
x_variance = var(x_observed)
y_variance = var(y_observed)
xy_covariance = cov(x_observed, y_observed)
x_std_dev = std(x_observed)
y_std_dev = std(y_observed)
r = cor(x_observed, y_observed)
r_squared = r^2
@printf("sₓ² (variance) = %9.2f\n", x_variance)
@printf("sy² (variance) = %9.2f\n", y_variance)
@printf("covariance = %9.2f\n", xy_covariance)
@printf("sₓ (std dev) = %9.2f\n", x_std_dev)
@printf("sy (std dev) = %9.2f\n", y_std_dev)
@printf("r (pearson) = %9.2f\n", r)
@printf("R² = %9.2f\n\n", r_squared)
From this linear model, we can generate the predicted values along with the residuals, then continue our manual calculation.
y_fit = predict(model)
y_err = residuals(model)
df = n - 2
ss_residuals = sum(y_err .^ 2)
var_residuals = ss_residuals / df
x_deviation = x_observed .- x_mean
x_sq_deviation = sum(x_deviation .^ 2)
std_err_slope = sqrt(var_residuals / x_sq_deviation)
t_value = m_slope / std_err_slope
@printf("SSR = ∑ϵ² = %9.2f\n", ss_residuals)
@printf("MSE = ∑ϵ²/(n-2) = %9.2f\n", var_residuals)
@printf("∑(xᵢ-x̄)² = %9.2f\n", x_sq_deviation)
@printf("SE(β₁) = √(MSE/sₓ) = %9.2f\n", std_err_slope)
@printf("t-value = β̅₁/SE(β₁) = %9.2f\n\n", t_value)
We can see the result as follows.
❯ julia 52-lq-built-in.jl
n = 13
∑x (total) = 78.00
∑y (total) = 2327.00
x̄ (mean) = 6.00
ȳ (mean) = 179.00
m (slope) = 40.00
b (intercept) = -61.00
Equation y = -61.00 + 40.00·x
sₓ² (variance) = 15.17
sy² (variance) = 25768.17
covariance = 606.67
sₓ (std dev) = 3.89
sy (std dev) = 160.52
r (pearson) = 0.97
R² = 0.94
SSR = ∑ϵ² = 18018.00
MSE = ∑ϵ²/(n-2) = 1638.00
∑(xᵢ-x̄)² = 182.00
SE(β₁) = √(MSE/∑(xᵢ-x̄)²) = 3.00
t-value = β̂₁/SE(β₁) = 13.33
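As a cross-check, GLM can also report these statistics directly. Here is a minimal sketch; coeftable, stderror, and r2 are standard GLM.jl accessors.

# Built-in summary: estimates, standard errors, t-values, p-values
println(coeftable(model))
@printf("SE(β₁) = %9.2f\n", stderror(model)[2])
@printf("R² = %9.2f\n", r2(model))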
You can obtain the interactive JupyterLab in the following link:
UTF-8 Variables
Just like any other modern language, Julia accepts UTF-8 characters in variable names. Let's run the experiment below.
First we are going to use xᵢ and yᵢ.
using CSV, DataFrames, Printf, Statistics, GLM
df = CSV.read("50-samples.csv", DataFrame)
xᵢ = df.x
yᵢ = df.y
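In case you are wondering how to type these names: the Julia REPL (and most Julia-aware editors) supports LaTeX-style tab completion.

# Type the abbreviation, then press TAB:
#   \sum<TAB>     → ∑
#   x\bar<TAB>    → x̄
#   x\_i<TAB>     → xᵢ
#   \epsilon<TAB> → ϵ
#   \sqrt<TAB>    → √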
Then we use ∑ and the bar (mean) symbol.
n = length(xᵢ)
∑x = sum(xᵢ)
∑y = sum(yᵢ)
x̄ = mean(xᵢ)
ȳ = mean(yᵢ)
@printf("%-10s = %4d\n", "n", n)
@printf("∑x (total) = %7.2f\n", ∑x)
@printf("∑y (total) = %7.2f\n", ∑y)
@printf("x̄ (mean) = %7.2f\n", x̄)
@printf("ȳ (mean) = %7.2f\n\n", ȳ)
We can also go further with the variable names.
sₓ² = sum((xᵢ .- x̄).^2) / (n - 1)
sʸ² = sum((yᵢ .- ȳ).^2) / (n - 1)
cov = sum((xᵢ .- x̄) .* (yᵢ .- ȳ)) / (n - 1)
sₓ = sqrt(sₓ²)
sʸ = sqrt(sʸ²)
r = cov / (sₓ * sʸ)
r² = r^2
@printf("sₓ² (variance) = %9.2f\n", sₓ²)
@printf("sʸ² (variance) = %9.2f\n", sʸ²)
@printf("covariance = %9.2f\n", cov)
@printf("sₓ (std dev) = %9.2f\n", sₓ)
@printf("sʸ (std dev) = %9.2f\n", sʸ)
@printf("r (pearson) = %9.2f\n", r)
@printf("R² = %9.2f\n\n", r²)
We continue with the regression coefficients.
mᵣ = sum((xᵢ .- x̄) .* (yᵢ .- ȳ)) / sum((xᵢ .- x̄).^2)
bᵣ = ȳ - mᵣ * x̄
@printf("m (slope) = %9.2f\n", mᵣ)
@printf("b (intercept) = %9.2f\n\n", bᵣ)
@printf("Equation y = %.2f + %.2f.x\n\n", bᵣ, mᵣ)
And we continue with the fitted values, residuals, and t-value as well.
ŷᵢ = mᵣ .* xᵢ .+ bᵣ
ϵᵢ = yᵢ .- ŷᵢ
df = n - 2
∑ϵ² = sum(ϵᵢ .^ 2)
MSE = ∑ϵ² / df
SE_β₁ = sqrt(MSE / sum((xᵢ .- x̄).^2))
tᵥ = mᵣ / SE_β₁
@printf("SSR = ∑ϵ² = %9.2f\n", ∑ϵ²)
@printf("MSE = ∑ϵ²/(n-2) = %9.2f\n", MSE)
@printf("SE(β₁) = √(MSE/sₓ) = %9.2f\n", SE_β₁)
@printf("t-value = β̅₁/SE(β₁) = %9.2f\n\n", tᵥ)
And test the result.
❯ julia 53-lq-utf.jl
n = 13
∑x (total) = 78.00
∑y (total) = 2327.00
x̄ (mean) = 6.00
ȳ (mean) = 179.00
sₓ² (variance) = 15.17
sʸ² (variance) = 25768.17
covariance = 606.67
sₓ (std dev) = 3.89
sʸ (std dev) = 160.52
r (pearson) = 0.97
R² = 0.94
m (slope) = 40.00
b (intercept) = -61.00
Equation y = -61.00 + 40.00·x
SSR = ∑ϵ² = 18018.00
MSE = ∑ϵ²/(n-2) = 1638.00
SE(β₁) = √(MSE/∑(xᵢ-x̄)²) = 3.00
t-value = β̂₁/SE(β₁) = 13.33
You can obtain the interactive JupyterLab in the following link:
It looks like it works.
Math Equation
We can also define functions with UTF-8 names. Let's experiment with ∑(x) and √(x). Note that Base already defines √ as an alias for sqrt, so the definition below merely shadows it.
∑(x) = sum(x) # Summation
√(x) = sqrt(x) # Square root
n = length(xᵢ)
∑x = ∑(xᵢ)
∑y = ∑(yᵢ)
x̄ = mean(xᵢ)
ȳ = mean(yᵢ)
Now we can see how close the code looks to the original equations. It looks like we can put the math equations right into the code.
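In equation form:

$$s_x^2 = \frac{\sum_i (x_i - \bar{x})^2}{n - 1}, \qquad \mathrm{cov}(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{n - 1}, \qquad r = \frac{\mathrm{cov}(x, y)}{s_x\, s_y}$$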
sₓ² = ∑((xᵢ .- x̄).^2) / (n - 1)
sʸ² = ∑((yᵢ .- ȳ).^2) / (n - 1)
cov = ∑((xᵢ .- x̄) .* (yᵢ .- ȳ)) / (n - 1)
sₓ = √(sₓ²)
sʸ = √(sʸ²)
r = cov / (sₓ * sʸ)
r² = r^2
This way, we can see more clearly how the code relates to the math.
mᵣ = ∑((xᵢ .- x̄) .* (yᵢ .- ȳ)) / ∑((xᵢ .- x̄).^2)
bᵣ = ȳ - mᵣ * x̄
ŷᵢ = mᵣ .* xᵢ .+ bᵣ
ϵᵢ = yᵢ .- ŷᵢ
Let's continue with the other operations as well.
df = n - 2
∑ϵ² = ∑(ϵᵢ .^ 2)
MSE = ∑ϵ² / df
SE_β₁ = √(MSE / ∑((xᵢ .- x̄).^2))
tᵥ = mᵣ / SE_β₁
And finally, we print out all the statistical properties: the basic properties, the correlation calculations, and the regression coefficients, until we get the t-value.
@printf("%-10s = %4d\n", "n", n)
@printf("∑x (total) = %7.2f\n", ∑x)
@printf("∑y (total) = %7.2f\n", ∑y)
@printf("x̄ (mean) = %7.2f\n", x̄)
@printf("ȳ (mean) = %7.2f\n\n", ȳ)
@printf("sₓ² (variance) = %9.2f\n", sₓ²)
@printf("sʸ² (variance) = %9.2f\n", sʸ²)
@printf("covariance = %9.2f\n", cov)
@printf("sₓ (std dev) = %9.2f\n", sₓ)
@printf("sʸ (std dev) = %9.2f\n", sʸ)
@printf("r (pearson) = %9.2f\n", r)
@printf("R² = %9.2f\n\n", r²)
@printf("m (slope) = %9.2f\n", mᵣ)
@printf("b (intercept) = %9.2f\n\n", bᵣ)
@printf("Equation y = %.2f + %.2f.x\n\n", bᵣ, mᵣ)
@printf("SSR = ∑ϵ² = %9.2f\n", ∑ϵ²)
@printf("MSE = ∑ϵ²/(n-2) = %9.2f\n", MSE)
@printf("SE(β₁) = √(MSE/sₓ) = %9.2f\n", SE_β₁)
@printf("t-value = β̅₁/SE(β₁) = %9.2f\n\n", tᵥ)
Again, let’s test the result.
❯ julia 54-lq-math.jl
n = 13
∑x (total) = 78.00
∑y (total) = 2327.00
x̄ (mean) = 6.00
ȳ (mean) = 179.00
sₓ² (variance) = 15.17
sʸ² (variance) = 25768.17
covariance = 606.67
sₓ (std dev) = 3.89
sʸ (std dev) = 160.52
r (pearson) = 0.97
R² = 0.94
m (slope) = 40.00
b (intercept) = -61.00
Equation y = -61.00 + 40.00·x
SSR = ∑ϵ² = 18018.00
MSE = ∑ϵ²/(n-2) = 1638.00
SE(β₁) = √(MSE/∑(xᵢ-x̄)²) = 3.00
t-value = β̂₁/SE(β₁) = 13.33
You can obtain the interactive JupyterLab in the following link:
This is just UTF-8. Although this works, don't be fooled by the symbols. You could use apples, fruit, vegetables, or any smiley faces instead of Greek letters.
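To prove the point, here is an admittedly silly sketch; the emoji below are perfectly legal Julia identifiers.

🍎 = 40.0               # slope
🍌 = -61.0              # intercept
f(🍊) = 🍎 * 🍊 + 🍌    # y = -61 + 40x
f(6.0)                  # 179.0, the same as ȳ at x̄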
Visualization
Now we can step into the visualization part. We read the data from the CSV file, and extract the x and y values.
using CSV, DataFrames, Plots, GLM
pairCSV = CSV.read("50-samples.csv", DataFrame)
x_observed = pairCSV.x
y_observed = pairCSV.y
Then create a simple scatter plot of the observed data points. This is the base plot.
scatter_plot = scatter(
x_observed, y_observed,
xlabel = "x", ylabel = "y",
title = "Scatter Plot of Observed Data",
markersize = 2,
legend = false,
color = :blue,
grid = false,
size = (800, 400))
In order to draw the regression line, we have to fit a linear model. From this model, we can get the coefficients, then extract the slope and intercept.
model = lm(@formula(y ~ x), pairCSV)
coefs = coef(model)
m_slope = coefs[2]
b_intercept = coefs[1]
Then we add another layer to the plot: the linear regression line.
regression_line = plot!(
x -> m_slope * x + b_intercept,
xlims = extrema(x_observed),
color = :red, linewidth = 2)
Don’t forget to save.
# Save plot as PNG
savefig("55-lq-plot.png")
The resulting plot is shown below:
You can obtain the interactive JupyterLab in the following link:
Additional Properties
Besides the basic properties, we can also calculate additional properties using built-in methods.
As usual, we extract the x and y values from a dataframe read from the CSV data.
using CSV, DataFrames, Statistics, Printf, Distributions
pairCSV = CSV.read("50-samples.csv", DataFrame)
x_observed = pairCSV.x
y_observed = pairCSV.y
From this we can calculate the number of data points, then the maximum, minimum, and range, and print them out.
n = length(x_observed)
x_max = maximum(x_observed)
x_min = minimum(x_observed)
x_range = x_max - x_min
y_max = maximum(y_observed)
y_min = minimum(y_observed)
y_range = y_max - y_min
@printf("x (max, min, range) = (%7.2f, %7.2f, %7.2f)\n",
x_min, x_max, x_range)
@printf("y (max, min, range) = (%7.2f, %7.2f, %7.2f)\n\n",
y_min, y_max, y_range)
With built-in methods, we can also calculate the median and mode. (median comes from the Statistics standard library, while mode is not in Statistics; it comes from StatsBase, whose functions Distributions makes available, which is presumably why Distributions is loaded here.)
# Calculate median
x_median = median(x_observed)
y_median = median(y_observed)
# Calculate mode
x_mode = mode(x_observed)
y_mode = mode(y_observed)
# Output of additional properties
@printf("x median = %9.2f\n", x_median)
@printf("y median = %9.2f\n", y_median)
@printf("x mode = %9.2f\n", x_mode)
@printf("y mode = %9.2f\n\n", y_mode)
Let’s check the result.
❯ julia 61-additional.jl
x (min, max, range) = ( 0.00, 12.00, 12.00)
y (min, max, range) = ( 5.00, 485.00, 480.00)
x median = 6.00
y median = 137.00
x mode = 0.00
y mode = 5.00
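Other handy built-ins exist as well; for example, quantiles. This is just a sketch, with quantile coming from the Statistics standard library.

# First and third quartiles
x_q1 = quantile(x_observed, 0.25)
x_q3 = quantile(x_observed, 0.75)
@printf("x quartiles (Q1, Q3) = (%7.2f, %7.2f)\n", x_q1, x_q3)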
You can obtain the interactive JupyterLab in the following link:
Actually, we can also calculate the kurtosis and skewness, but the results differ from their PSPP counterparts. I won't judge which one is right or wrong; I think the issue is my understanding of how these properties are calculated internally. So I choose to defer that exploration until I can find the right reference.
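For the curious, here is a minimal sketch, assuming StatsBase is available. Note that skewness and kurtosis in StatsBase are population (moment-based) estimators, and kurtosis is excess kurtosis; PSPP reports bias-corrected sample statistics, which may well explain the discrepancy.

using StatsBase, Printf
# Population (moment-based) estimators; kurtosis(x) is *excess* kurtosis
@printf("skewness(y) = %9.2f\n", skewness(y_observed))
@printf("kurtosis(y) = %9.2f\n", kurtosis(y_observed))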
What’s the Next Chapter 🤔?
Let's continue our previous Julia journey with building a class and statistical properties. We will also continue with various plots in Julia, featuring Gadfly.
Consider continuing your exploration with [ Trend - Language - Julia - Part Three ].