Where to Discuss?

Local Group

Preface

Goal: Explore statistic properties with Julia. Providing the data using linear model.

Julia is so cool that it can write mathematical function, with the form close to the original equation.

Julia: Mathematical Equation: Combined

However, it is still safer to write the code, conservatively. The fact that Julia can do this, does not mean that we have to push using the feature everytime.


Statistic Properties

We will have look at how Julia, handle statistical properties. Starting with least square.

Manual Calculation

You can check the detail of the manual calculation in the source code below:

After read data from CSV file, we extract x and y values from CSV data.

using CSV, DataFrames, Printf, Statistics

pairCSV = CSV.read("50-samples.csv", DataFrame)
x_observed = pairCSV.x
y_observed = pairCSV.y

Julia: Statistic Properties: Manual Calculation

We calculate the number of data points, sums and means. then printout the basic properties.

n = length(x_observed)

x_sum = sum(x_observed)
y_sum = sum(y_observed)

x_mean = mean(x_observed)
y_mean = mean(y_observed)

@printf("%-10s = %4d\n", "n", n)
@printf("∑x (total) = %7.2f\n", x_sum)
@printf("∑y (total) = %7.2f\n", y_sum)
@printf("x̄ (mean)   = %7.2f\n", x_mean)
@printf("ȳ (mean)   = %7.2f\n\n", y_mean)

Julia: Statistic Properties: Manual Calculation

Calculate further, from deviations, squared deviations, cross-deviation, slope (m) and intercept (b). then printout the least square calculation.

x_deviation = x_observed .- x_mean
y_deviation = y_observed .- y_mean

x_sq_deviation = sum(x_deviation .^ 2)
y_sq_deviation = sum(y_deviation .^ 2)
xy_cross_deviation = sum(x_deviation .* y_deviation)

m_slope = xy_cross_deviation / x_sq_deviation
b_intercept = y_mean - m_slope * x_mean

@printf("∑(xᵢ-x̄)    = %9.2f\n", sum(x_deviation))
@printf("∑(yᵢ-ȳ)    = %9.2f\n", sum(y_deviation))
@printf("∑(xᵢ-x̄)²   = %9.2f\n", x_sq_deviation)
@printf("∑(yᵢ-ȳ)²   = %9.2f\n", y_sq_deviation)
@printf("∑(xᵢ-x̄)(yᵢ-ȳ)  = %9.2f\n", xy_cross_deviation)
@printf("m (slope)      = %9.2f\n", m_slope)
@printf("b (intercept)  = %9.2f\n\n", b_intercept)

@printf("Equation     y = %.2f + %.2f.x\n\n", b_intercept, m_slope)

Julia: Statistic Properties: Manual Calculation

Calculate further, from variance, covariance, standard deviations, Pearson correlation coefficient (r), and R-squared (R²), then printout the correlation calculations.

x_variance = x_sq_deviation / (n-1)
y_variance = y_sq_deviation / (n-1)
xy_covariance = xy_cross_deviation / (n-1)

x_std_dev = sqrt(x_variance)
y_std_dev = sqrt(y_variance)

r = xy_covariance / (x_std_dev * y_std_dev)
r_squared = r^2

@printf("sₓ² (variance) = %9.2f\n", x_variance)
@printf("sy² (variance) = %9.2f\n", y_variance)
@printf("covariance     = %9.2f\n", xy_covariance)
@printf("sₓ (std dev)   = %9.2f\n", x_std_dev)
@printf("sy (std dev)   = %9.2f\n", y_std_dev)
@printf("r (pearson)    = %9.2f\n", r)
@printf("R²             = %9.2f\n\n", r_squared)

Julia: Statistic Properties: Manual Calculation

We need to create regression line, along with residual error, using array operations. Then calculate sum of squared residuals, also using array operations.

Based on degrees of freedom, calculate further from variance of residuals (MSE), standard error of the slope, and calculate t-value, then printout the output the results.

y_fit = m_slope .* x_observed .+ b_intercept
y_err = y_observed .- y_fit

ss_residuals = sum(y_err .^ 2)
df = n - 2

var_residuals = ss_residuals / df
std_err_slope = sqrt(var_residuals / x_sq_deviation)
t_value = m_slope / std_err_slope

@printf("SSR = ∑ϵ²           = %9.2f\n", ss_residuals)
@printf("MSE = ∑ϵ²/(n-2)     = %9.2f\n", var_residuals)
@printf("SE(β₁)  = √(MSE/sₓ) = %9.2f\n", std_err_slope)
@printf("t-value = β̅₁/SE(β₁) = %9.2f\n\n", t_value)

Julia: Statistic Properties: Manual Calculation

We can see the result as follows.

❯ julia 51-lq-manual.jl
n          =   13
∑x (total) =   78.00
∑y (total) = 2327.00
x̄ (mean)   =    6.00
ȳ (mean)   =  179.00

∑(xᵢ-x̄)    =      0.00
∑(yᵢ-ȳ)    =      0.00
∑(xᵢ-x̄)²   =    182.00
∑(yᵢ-ȳ)²   = 309218.00
∑(xᵢ-x̄)(yᵢ-ȳ)  =   7280.00
m (slope)      =     40.00
b (intercept)  =    -61.00

Equation     y = -61.00 + 40.00.x

sₓ² (variance) =     15.17
sy² (variance) =  25768.17
covariance     =    606.67
sₓ (std dev)   =      3.89
sy (std dev)   =    160.52
r (pearson)    =      0.97
R²             =      0.94

SSR = ∑ϵ²           =  18018.00
MSE = ∑ϵ²/(n-2)     =   1638.00
SE(β₁)  = √(MSE/sₓ) =      3.00
t-value = β̅₁/SE(β₁) =     13.33

Julia: Statistic Properties: Manual Calculation

You can obtain the interactive JupyterLab in this following link:

GLM Library

Using Built-in Method

We can simplified aboce calculation with built-in method.

We need the GLM library.

using CSV, DataFrames, Printf, Statistics, GLM

Julia: Statistic Properties: GLM Library

Very similar with R, we can use lm() to get fit a linear model. From this linear model we can get coefficients, then extract slope and intercept. Then printout the putput of least square calculation.

model = lm(@formula(y ~ x), pairCSV)
coefs = coef(model)
m_slope = coefs[2]
b_intercept = coefs[1]

@printf("m (slope)      = %9.2f\n", m_slope)
@printf("b (intercept)  = %9.2f\n\n", b_intercept)
@printf("Equation     y = %.2f + %.2f.x\n\n", b_intercept, m_slope)

Julia: Statistic Properties: GLM Library

There is also a built-in method to calculate variance, covariance, standard deviations, and even the pearson correlation coefficient (r). From this we can manually calculate R-squared (R²).

x_variance = var(x_observed)
y_variance = var(y_observed)
xy_covariance = cov(x_observed, y_observed)

x_std_dev = std(x_observed)
y_std_dev = std(y_observed)

r = cor(x_observed, y_observed)
r_squared = r^2

@printf("sₓ² (variance) = %9.2f\n", x_variance)
@printf("sy² (variance) = %9.2f\n", y_variance)
@printf("covariance     = %9.2f\n", xy_covariance)
@printf("sₓ (std dev)   = %9.2f\n", x_std_dev)
@printf("sy (std dev)   = %9.2f\n", y_std_dev)
@printf("r (pearson)    = %9.2f\n", r)
@printf("R²             = %9.2f\n\n", r_squared)

Julia: Statistic Properties: GLM Library

From previous this linear model, we can generate predicted values along with residuals. Then we can continue doing our manual calculation.

y_fit = predict(model)
y_err = residuals(model)

df = n - 2
ss_residuals = sum(y_err .^ 2)
var_residuals = ss_residuals / df

x_deviation = x_observed .- x_mean
x_sq_deviation = sum(x_deviation .^ 2)
std_err_slope = sqrt(var_residuals / x_sq_deviation)
t_value = m_slope / std_err_slope

@printf("SSR = ∑ϵ²           = %9.2f\n", ss_residuals)
@printf("MSE = ∑ϵ²/(n-2)     = %9.2f\n", var_residuals)
@printf("∑(xᵢ-x̄)²            = %9.2f\n", x_sq_deviation)
@printf("SE(β₁)  = √(MSE/sₓ) = %9.2f\n", std_err_slope)
@printf("t-value = β̅₁/SE(β₁) = %9.2f\n\n", t_value)

Julia: Statistic Properties: GLM Library

We can see the result as follows.

❯ julia 52-lq-built-in.jl
n          =   13
∑x (total) =   78.00
∑y (total) = 2327.00
x̄ (mean)   =    6.00
ȳ (mean)   =  179.00

m (slope)      =     40.00
b (intercept)  =    -61.00

Equation     y = -61.00 + 40.00.x

sₓ² (variance) =     15.17
sy² (variance) =  25768.17
covariance     =    606.67
sₓ (std dev)   =      3.89
sy (std dev)   =    160.52
r (pearson)    =      0.97
R²             =      0.94

SSR = ∑ϵ²           =  18018.00
MSE = ∑ϵ²/(n-2)     =   1638.00
∑(xᵢ-x̄)²            =    182.00
SE(β₁)  = √(MSE/sₓ) =      3.00
t-value = β̅₁/SE(β₁) =     13.33

Julia: Statistic Properties: GLM Library

You can obtain the interactive JupyterLab in this following link:

UTF-8 Variable

Just like any other modern language, we can use utf-8 as variable name. Let’s have this experiment below:

First we are going to use xᵢ and yᵢ.

using CSV, DataFrames, Printf, Statistics, GLM

df = CSV.read("50-samples.csv", DataFrame)
xᵢ = df.x
yᵢ = df.y

Julia: Statistic Properties: UTF-8 Variable

Then using and mean symbol.

n = length(xᵢ)

∑x = sum(xᵢ)
∑y = sum(yᵢ)

x̄ = mean(xᵢ)
ȳ = mean(yᵢ)

@printf("%-10s = %4d\n", "n", n)
@printf("∑x (total) = %7.2f\n", ∑x)
@printf("∑y (total) = %7.2f\n", ∑y)
@printf("x̄ (mean)   = %7.2f\n", x̄)
@printf("ȳ (mean)   = %7.2f\n\n", ȳ)

Julia: Statistic Properties: UTF-8 Variable

We can also goes further with variable names.

sₓ² = sum((xᵢ .- x̄).^2) / (n - 1)
sʸ² = sum((yᵢ .- ȳ).^2) / (n - 1)
cov = sum((xᵢ .- x̄) .* (yᵢ .- ȳ)) / (n - 1)

sₓ = sqrt(sₓ²)
sʸ = sqrt(sʸ²)

r = cov / (sₓ * sʸ)
r² = r^2

@printf("sₓ² (variance) = %9.2f\n", sₓ²)
@printf("sʸ² (variance) = %9.2f\n", sʸ²)
@printf("covariance     = %9.2f\n", cov)
@printf("sₓ (std dev)   = %9.2f\n", sₓ)
@printf("sʸ (std dev)   = %9.2f\n", sʸ)
@printf("r (pearson)    = %9.2f\n", r)
@printf("R²             = %9.2f\n\n", r²)

Julia: Statistic Properties: UTF-8 Variable

Continue with other variables.

mᵣ = sum((xᵢ .- x̄) .* (yᵢ .- ȳ)) / sum((xᵢ .- x̄).^2)
bᵣ =- mᵣ *@printf("m (slope)      = %9.2f\n", mᵣ)
@printf("b (intercept)  = %9.2f\n\n", bᵣ)
@printf("Equation     y = %.2f + %.2f.x\n\n", bᵣ, mᵣ)

Julia: Statistic Properties: UTF-8 Variable

Continue with other variables as well.

ŷᵢ = mᵣ .* xᵢ .+ bᵣ
ϵᵢ = yᵢ .- ŷᵢ

df = n - 2
∑ϵ² = sum(ϵᵢ .^ 2)

MSE = ∑ϵ² / df
SE_β₁ = sqrt(MSE / sum((xᵢ .- x̄).^2))
tᵥ = mᵣ / SE_β₁

@printf("SSR = ∑ϵ²           = %9.2f\n", ∑ϵ²)
@printf("MSE = ∑ϵ²/(n-2)     = %9.2f\n", MSE)
@printf("SE(β₁)  = √(MSE/sₓ) = %9.2f\n", SE_β₁)
@printf("t-value = β̅₁/SE(β₁) = %9.2f\n\n", tᵥ)

Julia: Statistic Properties: UTF-8 Variable

And test the result.

❯ julia 53-lq-utf.jl
n          =   13
∑x (total) =   78.00
∑y (total) = 2327.00
x̄ (mean)   =    6.00
ȳ (mean)   =  179.00

sₓ² (variance) =     15.17
sʸ² (variance) =  25768.17
covariance     =    606.67
sₓ (std dev)   =      3.89
sʸ (std dev)   =    160.52
r (pearson)    =      0.97
R²             =      0.94

m (slope)      =     40.00
b (intercept)  =    -61.00

Equation     y = -61.00 + 40.00.x

SSR = ∑ϵ²           =  18018.00
MSE = ∑ϵ²/(n-2)     =   1638.00
SE(β₁)  = √(MSE/sₓ) =      3.00
t-value = β̅₁/SE(β₁) =     13.33

Julia: Statistic Properties: UTF-8 Variable

You can obtain the interactive JupyterLab in this following link:

It looks like that it works.

Math Equation

We can also define function with utf-8.

Let’s experiment with ∑(x) and √(x).

∑(x) = sum(x)   # Summation
(x) = sqrt(x)  # Square root

n = length(xᵢ)
∑x = ∑(xᵢ)
∑y = ∑(yᵢ)

x̄ = mean(xᵢ)
ȳ = mean(yᵢ)

Julia: Statistic Properties: Math Equation

Now we can see how close the code looks with the original equation.

It looks like we can put the math equation, right into the code.

sₓ² = ∑((xᵢ .- x̄).^2) / (n - 1)
sʸ² = ∑((yᵢ .- ȳ).^2) / (n - 1)
cov = ∑((xᵢ .- x̄) .* (yᵢ .- ȳ)) / (n - 1)

sₓ = (sₓ²)
sʸ = (sʸ²)

r = cov / (sₓ * sʸ)
r² = r^2

Julia: Statistic Properties: Math Equation

This way, we can see how the code works better.

mᵣ = ∑((xᵢ .- x̄) .* (yᵢ .- ȳ)) / ∑((xᵢ .- x̄).^2)
bᵣ =- mᵣ * x̄

ŷᵢ = mᵣ .* xᵢ .+ bᵣ
ϵᵢ = yᵢ .- ŷᵢ

Julia: Statistic Properties: Math Equation

Let’s continue with other operation as well.

df = n - 2
∑ϵ² = ∑(ϵᵢ .^ 2)

MSE = ∑ϵ² / df
SE_β₁ = (MSE / ∑((xᵢ .- x̄).^2))
tᵥ = mᵣ / SE_β₁

Julia: Statistic Properties: Math Equation

And finally printout all the statistic properties, from basic properties, correlation calculation, regression coefficient until we get the t-value.

@printf("%-10s = %4d\n", "n", n)
@printf("∑x (total) = %7.2f\n", ∑x)
@printf("∑y (total) = %7.2f\n", ∑y)
@printf("x̄ (mean)   = %7.2f\n", x̄)
@printf("ȳ (mean)   = %7.2f\n\n", ȳ)

@printf("sₓ² (variance) = %9.2f\n", sₓ²)
@printf("sʸ² (variance) = %9.2f\n", sʸ²)
@printf("covariance     = %9.2f\n", cov)
@printf("sₓ (std dev)   = %9.2f\n", sₓ)
@printf("sʸ (std dev)   = %9.2f\n", sʸ)
@printf("r (pearson)    = %9.2f\n", r)
@printf("R²             = %9.2f\n\n", r²)

@printf("m (slope)      = %9.2f\n", mᵣ)
@printf("b (intercept)  = %9.2f\n\n", bᵣ)
@printf("Equation     y = %.2f + %.2f.x\n\n", bᵣ, mᵣ)

@printf("SSR = ∑ϵ²           = %9.2f\n", ∑ϵ²)
@printf("MSE = ∑ϵ²/(n-2)     = %9.2f\n", MSE)
@printf("SE(β₁)  = √(MSE/sₓ) = %9.2f\n", SE_β₁)
@printf("t-value = β̅₁/SE(β₁) = %9.2f\n\n", tᵥ)

Again, let’s test the result.

❯ julia 54-lq-math.jl
n          =   13
∑x (total) =   78.00
∑y (total) = 2327.00
x̄ (mean)   =    6.00
ȳ (mean)   =  179.00

sₓ² (variance) =     15.17
sʸ² (variance) =  25768.17
covariance     =    606.67
sₓ (std dev)   =      3.89
sʸ (std dev)   =    160.52
r (pearson)    =      0.97
R²             =      0.94

m (slope)      =     40.00
b (intercept)  =    -61.00

Equation     y = -61.00 + 40.00.x

SSR = ∑ϵ²           =  18018.00
MSE = ∑ϵ²/(n-2)     =   1638.00
SE(β₁)  = √(MSE/sₓ) =      3.00
t-value = β̅₁/SE(β₁) =     13.33

Julia: Statistic Properties: Math Equation

You can obtain the interactive JupyterLab in this following link:

This is just utf-8. Although this works, don’t be a fool of the symbol. You may use apples, fruit, vegetables, or any smiley faces, instead of greek letter.

Visualization

Now we can step into visualization part. We can read data from CSV file, and extract x and y values from CSV data.

using CSV, DataFrames, Plots, GLM

pairCSV = CSV.read("50-samples.csv", DataFrame)
x_observed = pairCSV.x
y_observed = pairCSV.y

Julia: Statistic Properties: Plot Visualization

Then create a simple scatter plot of the observed data points. This is the base plot.

scatter_plot = scatter(
  x_observed, y_observed,
  xlabel = "x", ylabel = "y",
  title = "Scatter Plot of Observed Data",
  markersize = 2,
  legend = false,
  color = :blue,
  grid = false,
  size = (800, 400))

Julia: Statistic Properties: Plot Visualization

In order to calculate regression line, we have to make a fit a linear model. From this model, we can get coefficients, then extract slope and intercept.

model = lm(@formula(y ~ x), pairCSV)

coefs = coef(model)
m_slope = coefs[2]
b_intercept = coefs[1]

Julia: Statistic Properties: Plot Visualization

Then we add another plot, the linear regression.

regression_line = plot!(
  x -> m_slope * x + b_intercept,
  xlims = extrema(x_observed),
  color = :red, linewidth = 2)

Don’t forget to save.

# Save plot as PNG
savefig("55-lq-plot.png")

Julia: Statistic Properties: Plot Visualization

The plot result can be shown as follows:

Julia: Least Square: Simple Plot

You can obtain the interactive JupyterLab in this following link:


Additional Properties

Beside basic properties, we can also calculate additional properties using built-in methods.

As usual we extract x and y values, from dataframe read from CSV data

using CSV, DataFrames, Statistics, Printf, Distributions

pairCSV = CSV.read("50-samples.csv", DataFrame)
x_observed = pairCSV.x
y_observed = pairCSV.y

Julia: Statistic Properties: Additional Properties

From this we can calculate number of data points, then maximum, minimum, and range. And printout the maximum, minimum, and range.

n = length(x_observed)

x_max = maximum(x_observed)
x_min = minimum(x_observed)
x_range = x_max - x_min

y_max = maximum(y_observed)
y_min = minimum(y_observed)
y_range = y_max - y_min

@printf("x (max, min, range) = (%7.2f, %7.2f, %7.2f)\n",
  x_min, x_max, x_range)
@printf("y (max, min, range) = (%7.2f, %7.2f, %7.2f)\n\n",
  y_min, y_max, y_range)

Julia: Statistic Properties: Additional Properties

With built-in methods, we can also calculate median and mode.

# Calculate median
x_median = median(x_observed)
y_median = median(y_observed)

# Calculate mode
x_mode = mode(x_observed)
y_mode = mode(y_observed)

# Output of additional properties
@printf("x median         = %9.2f\n", x_median)
@printf("y median         = %9.2f\n", y_median)
@printf("x mode           = %9.2f\n", x_mode)
@printf("y mode           = %9.2f\n\n", y_mode)

Julia: Statistic Properties: Additional Properties

Let’s check the result.

❯ julia 61-additional.jl
x (max, min, range) = (   0.00,   12.00,   12.00)
y (max, min, range) = (   5.00,  485.00,  480.00)

x median         =      6.00
y median         =    137.00
x mode           =      0.00
y mode           =      5.00

Julia: Statistic Properties: Additional Properties

You can obtain the interactive JupyterLab in this following link:

Actually we can also calculate the kurtosis and skewness, but the result is different with the PSPP counter part. I won’t judge which one is right or wrong, I think the issue is my understanding of the internal calculation of the properties. So I choose to defer my exploration, until I can find the right reference.


What’s the Next Chapter 🤔?

Let’s continue our previous Julia journey, with building class and statistical properties.

Let’s continue our various plot with julia, also featuring gadfly journey.

Consider continuing your exploration with [ Trend - Language - Julia - Part Three ].