Preface
Goal: Explore statistical properties with Julia, modeling the provided data with a linear model.
Julia is so cool that it can express a mathematical function in a form close to the original equation.
However, it is still safer to write code conservatively. The fact that Julia can do this does not mean we have to push the feature every time.
Statistical Properties
We will have a look at how Julia handles statistical properties, starting with least squares.
Manual Calculation
You can check the details of the manual calculation in the source code below.
After reading the data from the CSV file, we extract the x and y values.
using CSV, DataFrames, Printf, Statistics
pairCSV = CSV.read("50-samples.csv", DataFrame)
x_observed = pairCSV.x
y_observed = pairCSV.y
We calculate the number of data points, sums, and means, then print out the basic properties.
n = length(x_observed)
x_sum = sum(x_observed)
y_sum = sum(y_observed)
x_mean = mean(x_observed)
y_mean = mean(y_observed)
@printf("%-10s = %4d\n", "n", n)
@printf("∑x (total) = %7.2f\n", x_sum)
@printf("∑y (total) = %7.2f\n", y_sum)
@printf("x̄ (mean) = %7.2f\n", x_mean)
@printf("ȳ (mean) = %7.2f\n\n", y_mean)
We calculate further: deviations, squared deviations, the cross-deviation, the slope (m), and the intercept (b), then print out the least squares calculation.
x_deviation = x_observed .- x_mean
y_deviation = y_observed .- y_mean
x_sq_deviation = sum(x_deviation .^ 2)
y_sq_deviation = sum(y_deviation .^ 2)
xy_cross_deviation = sum(x_deviation .* y_deviation)
m_slope = xy_cross_deviation / x_sq_deviation
b_intercept = y_mean - m_slope * x_mean
@printf("∑(xᵢ-x̄) = %9.2f\n", sum(x_deviation))
@printf("∑(yᵢ-ȳ) = %9.2f\n", sum(y_deviation))
@printf("∑(xᵢ-x̄)² = %9.2f\n", x_sq_deviation)
@printf("∑(yᵢ-ȳ)² = %9.2f\n", y_sq_deviation)
@printf("∑(xᵢ-x̄)(yᵢ-ȳ) = %9.2f\n", xy_cross_deviation)
@printf("m (slope) = %9.2f\n", m_slope)
@printf("b (intercept) = %9.2f\n\n", b_intercept)
@printf("Equation y = %.2f + %.2f.x\n\n", b_intercept, m_slope)
We calculate further: variance, covariance, standard deviations, the Pearson correlation coefficient (r), and R-squared (R²), then print out the correlation calculations. For simple linear regression, R² is just the square of r.
x_variance = x_sq_deviation / (n-1)
y_variance = y_sq_deviation / (n-1)
xy_covariance = xy_cross_deviation / (n-1)
x_std_dev = sqrt(x_variance)
y_std_dev = sqrt(y_variance)
r = xy_covariance / (x_std_dev * y_std_dev)
r_squared = r^2
@printf("sₓ² (variance) = %9.2f\n", x_variance)
@printf("sy² (variance) = %9.2f\n", y_variance)
@printf("covariance = %9.2f\n", xy_covariance)
@printf("sₓ (std dev) = %9.2f\n", x_std_dev)
@printf("sy (std dev) = %9.2f\n", y_std_dev)
@printf("r (pearson) = %9.2f\n", r)
@printf("R² = %9.2f\n\n", r_squared)
We need to create the regression line, along with the residual errors, using array operations, then calculate the sum of squared residuals the same way.
Based on the degrees of freedom, we calculate further: the variance of the residuals (MSE), the standard error of the slope, and the t-value, then print out the results.
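In symbols, the quantities computed below are:

$$\mathrm{MSE} = \frac{\sum_i \epsilon_i^2}{n - 2}, \qquad \mathrm{SE}(\hat{\beta}_1) = \sqrt{\frac{\mathrm{MSE}}{\sum_i (x_i - \bar{x})^2}}, \qquad t = \frac{\hat{\beta}_1}{\mathrm{SE}(\hat{\beta}_1)}$$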
y_fit = m_slope .* x_observed .+ b_intercept
y_err = y_observed .- y_fit
ss_residuals = sum(y_err .^ 2)
df = n - 2
var_residuals = ss_residuals / df
std_err_slope = sqrt(var_residuals / x_sq_deviation)
t_value = m_slope / std_err_slope
@printf("SSR = ∑ϵ² = %9.2f\n", ss_residuals)
@printf("MSE = ∑ϵ²/(n-2) = %9.2f\n", var_residuals)
@printf("SE(β₁) = √(MSE/sₓ) = %9.2f\n", std_err_slope)
@printf("t-value = β̅₁/SE(β₁) = %9.2f\n\n", t_value)
We can see the result as follows.
❯ julia 51-lq-manual.jl
n = 13
∑x (total) = 78.00
∑y (total) = 2327.00
x̄ (mean) = 6.00
ȳ (mean) = 179.00
∑(xᵢ-x̄) = 0.00
∑(yᵢ-ȳ) = 0.00
∑(xᵢ-x̄)² = 182.00
∑(yᵢ-ȳ)² = 309218.00
∑(xᵢ-x̄)(yᵢ-ȳ) = 7280.00
m (slope) = 40.00
b (intercept) = -61.00
Equation y = -61.00 + 40.00·x
sₓ² (variance) = 15.17
sy² (variance) = 25768.17
covariance = 606.67
sₓ (std dev) = 3.89
sy (std dev) = 160.52
r (pearson) = 0.97
R² = 0.94
SSR = ∑ϵ² = 18018.00
MSE = ∑ϵ²/(n-2) = 1638.00
SE(β₁) = √(MSE/∑(xᵢ-x̄)²) = 3.00
t-value = β̂₁/SE(β₁) = 13.33
You can obtain the interactive JupyterLab in the following link:
GLM Library
Using Built-in Methods
We can simplify the above calculation with built-in methods. We need the GLM library.
using CSV, DataFrames, Printf, Statistics, GLM
Very similar to R, we can use lm() to fit a linear model. From this linear model we can get the coefficients, then extract the slope and intercept, and print out the least squares results.
model = lm(@formula(y ~ x), pairCSV)
coefs = coef(model)
m_slope = coefs[2]
b_intercept = coefs[1]
@printf("m (slope) = %9.2f\n", m_slope)
@printf("b (intercept) = %9.2f\n\n", b_intercept)
@printf("Equation y = %.2f + %.2f.x\n\n", b_intercept, m_slope)
There are also built-in methods to calculate the variance, covariance, standard deviations, and even the Pearson correlation coefficient (r). From r we can manually calculate R-squared (R²).
x_variance = var(x_observed)
y_variance = var(y_observed)
xy_covariance = cov(x_observed, y_observed)
x_std_dev = std(x_observed)
y_std_dev = std(y_observed)
r = cor(x_observed, y_observed)
r_squared = r^2
@printf("sₓ² (variance) = %9.2f\n", x_variance)
@printf("sy² (variance) = %9.2f\n", y_variance)
@printf("covariance = %9.2f\n", xy_covariance)
@printf("sₓ (std dev) = %9.2f\n", x_std_dev)
@printf("sy (std dev) = %9.2f\n", y_std_dev)
@printf("r (pearson) = %9.2f\n", r)
@printf("R² = %9.2f\n\n", r_squared)
From this linear model, we can generate the predicted values along with the residuals, then continue our manual calculation.
y_fit = predict(model)
y_err = residuals(model)
df = n - 2
ss_residuals = sum(y_err .^ 2)
var_residuals = ss_residuals / df
x_deviation = x_observed .- x_mean
x_sq_deviation = sum(x_deviation .^ 2)
std_err_slope = sqrt(var_residuals / x_sq_deviation)
t_value = m_slope / std_err_slope
@printf("SSR = ∑ϵ² = %9.2f\n", ss_residuals)
@printf("MSE = ∑ϵ²/(n-2) = %9.2f\n", var_residuals)
@printf("∑(xᵢ-x̄)² = %9.2f\n", x_sq_deviation)
@printf("SE(β₁) = √(MSE/sₓ) = %9.2f\n", std_err_slope)
@printf("t-value = β̅₁/SE(β₁) = %9.2f\n\n", t_value)
We can see the result as follows.
❯ julia 52-lq-built-in.jl
n = 13
∑x (total) = 78.00
∑y (total) = 2327.00
x̄ (mean) = 6.00
ȳ (mean) = 179.00
m (slope) = 40.00
b (intercept) = -61.00
Equation y = -61.00 + 40.00·x
sₓ² (variance) = 15.17
sy² (variance) = 25768.17
covariance = 606.67
sₓ (std dev) = 3.89
sy (std dev) = 160.52
r (pearson) = 0.97
R² = 0.94
SSR = ∑ϵ² = 18018.00
MSE = ∑ϵ²/(n-2) = 1638.00
∑(xᵢ-x̄)² = 182.00
SE(β₁) = √(MSE/∑(xᵢ-x̄)²) = 3.00
t-value = β̂₁/SE(β₁) = 13.33
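As a cross-check, GLM can also report these statistics directly. Here is a minimal sketch; coeftable, stderror, and r2 are standard GLM.jl accessors.

# Built-in summary: estimates, standard errors, t-values, p-values
println(coeftable(model))
@printf("SE(β₁) = %9.2f\n", stderror(model)[2])
@printf("R² = %9.2f\n", r2(model))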
You can obtain the interactive JupyterLab in the following link:
UTF-8 Variables
Just like any other modern language, Julia accepts UTF-8 characters in variable names. Let's run the experiment below.
First we are going to use xᵢ and yᵢ.
using CSV, DataFrames, Printf, Statistics, GLM
df = CSV.read("50-samples.csv", DataFrame)
xᵢ = df.x
yᵢ = df.y
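In case you are wondering how to type these names: the Julia REPL (and most Julia-aware editors) supports LaTeX-style tab completion.

# Type the abbreviation, then press TAB:
#   \sum<TAB>     → ∑
#   x\bar<TAB>    → x̄
#   x\_i<TAB>     → xᵢ
#   \epsilon<TAB> → ϵ
#   \sqrt<TAB>    → √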
Then we use ∑ and the bar (mean) symbol.
n = length(xᵢ)
∑x = sum(xᵢ)
∑y = sum(yᵢ)
x̄ = mean(xᵢ)
ȳ = mean(yᵢ)
@printf("%-10s = %4d\n", "n", n)
@printf("∑x (total) = %7.2f\n", ∑x)
@printf("∑y (total) = %7.2f\n", ∑y)
@printf("x̄ (mean) = %7.2f\n", x̄)
@printf("ȳ (mean) = %7.2f\n\n", ȳ)
We can also go further with the variable names.
sₓ² = sum((xᵢ .- x̄).^2) / (n - 1)
sʸ² = sum((yᵢ .- ȳ).^2) / (n - 1)
cov = sum((xᵢ .- x̄) .* (yᵢ .- ȳ)) / (n - 1)
sₓ = sqrt(sₓ²)
sʸ = sqrt(sʸ²)
r = cov / (sₓ * sʸ)
r² = r^2
@printf("sₓ² (variance) = %9.2f\n", sₓ²)
@printf("sʸ² (variance) = %9.2f\n", sʸ²)
@printf("covariance = %9.2f\n", cov)
@printf("sₓ (std dev) = %9.2f\n", sₓ)
@printf("sʸ (std dev) = %9.2f\n", sʸ)
@printf("r (pearson) = %9.2f\n", r)
@printf("R² = %9.2f\n\n", r²)
We continue with the regression coefficients.
mᵣ = sum((xᵢ .- x̄) .* (yᵢ .- ȳ)) / sum((xᵢ .- x̄).^2)
bᵣ = ȳ - mᵣ * x̄
@printf("m (slope) = %9.2f\n", mᵣ)
@printf("b (intercept) = %9.2f\n\n", bᵣ)
@printf("Equation y = %.2f + %.2f.x\n\n", bᵣ, mᵣ)
And we continue with the fitted values, residuals, and t-value as well.
ŷᵢ = mᵣ .* xᵢ .+ bᵣ
ϵᵢ = yᵢ .- ŷᵢ
df = n - 2
∑ϵ² = sum(ϵᵢ .^ 2)
MSE = ∑ϵ² / df
SE_β₁ = sqrt(MSE / sum((xᵢ .- x̄).^2))
tᵥ = mᵣ / SE_β₁
@printf("SSR = ∑ϵ² = %9.2f\n", ∑ϵ²)
@printf("MSE = ∑ϵ²/(n-2) = %9.2f\n", MSE)
@printf("SE(β₁) = √(MSE/sₓ) = %9.2f\n", SE_β₁)
@printf("t-value = β̅₁/SE(β₁) = %9.2f\n\n", tᵥ)
And test the result.
❯ julia 53-lq-utf.jl
n = 13
∑x (total) = 78.00
∑y (total) = 2327.00
x̄ (mean) = 6.00
ȳ (mean) = 179.00
sₓ² (variance) = 15.17
sʸ² (variance) = 25768.17
covariance = 606.67
sₓ (std dev) = 3.89
sʸ (std dev) = 160.52
r (pearson) = 0.97
R² = 0.94
m (slope) = 40.00
b (intercept) = -61.00
Equation y = -61.00 + 40.00·x
SSR = ∑ϵ² = 18018.00
MSE = ∑ϵ²/(n-2) = 1638.00
SE(β₁) = √(MSE/∑(xᵢ-x̄)²) = 3.00
t-value = β̂₁/SE(β₁) = 13.33
You can obtain the interactive JupyterLab in the following link:
It looks like it works.
Math Equation
We can also define functions with UTF-8 names. Let's experiment with ∑(x) and √(x). Note that Base already defines √ as an alias for sqrt, so the definition below merely shadows it.
∑(x) = sum(x) # Summation
√(x) = sqrt(x) # Square root
n = length(xᵢ)
∑x = ∑(xᵢ)
∑y = ∑(yᵢ)
x̄ = mean(xᵢ)
ȳ = mean(yᵢ)
Now we can see how close the code looks to the original equations. It looks like we can put the math equations right into the code.
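In equation form:

$$s_x^2 = \frac{\sum_i (x_i - \bar{x})^2}{n - 1}, \qquad \mathrm{cov}(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{n - 1}, \qquad r = \frac{\mathrm{cov}(x, y)}{s_x\, s_y}$$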
sₓ² = ∑((xᵢ .- x̄).^2) / (n - 1)
sʸ² = ∑((yᵢ .- ȳ).^2) / (n - 1)
cov = ∑((xᵢ .- x̄) .* (yᵢ .- ȳ)) / (n - 1)
sₓ = √(sₓ²)
sʸ = √(sʸ²)
r = cov / (sₓ * sʸ)
r² = r^2
This way, we can see more clearly how the code relates to the math.
mᵣ = ∑((xᵢ .- x̄) .* (yᵢ .- ȳ)) / ∑((xᵢ .- x̄).^2)
bᵣ = ȳ - mᵣ * x̄
ŷᵢ = mᵣ .* xᵢ .+ bᵣ
ϵᵢ = yᵢ .- ŷᵢ
Let's continue with the other operations as well.
df = n - 2
∑ϵ² = ∑(ϵᵢ .^ 2)
MSE = ∑ϵ² / df
SE_β₁ = √(MSE / ∑((xᵢ .- x̄).^2))
tᵥ = mᵣ / SE_β₁
And finally, we print out all the statistical properties: the basic properties, the correlation calculations, and the regression coefficients, until we get the t-value.
@printf("%-10s = %4d\n", "n", n)
@printf("∑x (total) = %7.2f\n", ∑x)
@printf("∑y (total) = %7.2f\n", ∑y)
@printf("x̄ (mean) = %7.2f\n", x̄)
@printf("ȳ (mean) = %7.2f\n\n", ȳ)
@printf("sₓ² (variance) = %9.2f\n", sₓ²)
@printf("sʸ² (variance) = %9.2f\n", sʸ²)
@printf("covariance = %9.2f\n", cov)
@printf("sₓ (std dev) = %9.2f\n", sₓ)
@printf("sʸ (std dev) = %9.2f\n", sʸ)
@printf("r (pearson) = %9.2f\n", r)
@printf("R² = %9.2f\n\n", r²)
@printf("m (slope) = %9.2f\n", mᵣ)
@printf("b (intercept) = %9.2f\n\n", bᵣ)
@printf("Equation y = %.2f + %.2f.x\n\n", bᵣ, mᵣ)
@printf("SSR = ∑ϵ² = %9.2f\n", ∑ϵ²)
@printf("MSE = ∑ϵ²/(n-2) = %9.2f\n", MSE)
@printf("SE(β₁) = √(MSE/sₓ) = %9.2f\n", SE_β₁)
@printf("t-value = β̅₁/SE(β₁) = %9.2f\n\n", tᵥ)
Again, let’s test the result.
❯ julia 54-lq-math.jl
n = 13
∑x (total) = 78.00
∑y (total) = 2327.00
x̄ (mean) = 6.00
ȳ (mean) = 179.00
sₓ² (variance) = 15.17
sʸ² (variance) = 25768.17
covariance = 606.67
sₓ (std dev) = 3.89
sʸ (std dev) = 160.52
r (pearson) = 0.97
R² = 0.94
m (slope) = 40.00
b (intercept) = -61.00
Equation y = -61.00 + 40.00·x
SSR = ∑ϵ² = 18018.00
MSE = ∑ϵ²/(n-2) = 1638.00
SE(β₁) = √(MSE/∑(xᵢ-x̄)²) = 3.00
t-value = β̂₁/SE(β₁) = 13.33
You can obtain the interactive JupyterLab in the following link:
This is just UTF-8. Although this works, don't be fooled by the symbols. You could use apples, fruit, vegetables, or any smiley faces instead of Greek letters.
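To prove the point, here is an admittedly silly sketch; the emoji below are perfectly legal Julia identifiers.

🍎 = 40.0               # slope
🍌 = -61.0              # intercept
f(🍊) = 🍎 * 🍊 + 🍌    # y = -61 + 40x
f(6.0)                  # 179.0, the same as ȳ at x̄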
Visualization
Now we can step into the visualization part. We read the data from the CSV file, and extract the x and y values.
using CSV, DataFrames, Plots, GLM
pairCSV = CSV.read("50-samples.csv", DataFrame)
x_observed = pairCSV.x
y_observed = pairCSV.y
Then create a simple scatter plot of the observed data points. This is the base plot.
scatter_plot = scatter(
x_observed, y_observed,
xlabel = "x", ylabel = "y",
title = "Scatter Plot of Observed Data",
markersize = 2,
legend = false,
color = :blue,
grid = false,
size = (800, 400))
In order to draw the regression line, we have to fit a linear model. From this model, we can get the coefficients, then extract the slope and intercept.
model = lm(@formula(y ~ x), pairCSV)
coefs = coef(model)
m_slope = coefs[2]
b_intercept = coefs[1]
Then we add another layer to the plot: the linear regression line.
regression_line = plot!(
x -> m_slope * x + b_intercept,
xlims = extrema(x_observed),
color = :red, linewidth = 2)
Don’t forget to save.
# Save plot as PNG
savefig("55-lq-plot.png")
The resulting plot is shown below:
You can obtain the interactive JupyterLab in the following link:
Additional Properties
Besides the basic properties, we can also calculate additional properties using built-in methods.
As usual, we extract the x and y values from a dataframe read from the CSV data.
using CSV, DataFrames, Statistics, Printf, Distributions
pairCSV = CSV.read("50-samples.csv", DataFrame)
x_observed = pairCSV.x
y_observed = pairCSV.y
From this we can calculate the number of data points, then the maximum, minimum, and range, and print them out.
n = length(x_observed)
x_max = maximum(x_observed)
x_min = minimum(x_observed)
x_range = x_max - x_min
y_max = maximum(y_observed)
y_min = minimum(y_observed)
y_range = y_max - y_min
@printf("x (max, min, range) = (%7.2f, %7.2f, %7.2f)\n",
x_min, x_max, x_range)
@printf("y (max, min, range) = (%7.2f, %7.2f, %7.2f)\n\n",
y_min, y_max, y_range)
With built-in methods, we can also calculate the median and mode. (median comes from the Statistics standard library, while mode is not in Statistics; it comes from StatsBase, whose functions Distributions makes available, which is presumably why Distributions is loaded here.)
# Calculate median
x_median = median(x_observed)
y_median = median(y_observed)
# Calculate mode
x_mode = mode(x_observed)
y_mode = mode(y_observed)
# Output of additional properties
@printf("x median = %9.2f\n", x_median)
@printf("y median = %9.2f\n", y_median)
@printf("x mode = %9.2f\n", x_mode)
@printf("y mode = %9.2f\n\n", y_mode)
Let’s check the result.
❯ julia 61-additional.jl
x (min, max, range) = ( 0.00, 12.00, 12.00)
y (min, max, range) = ( 5.00, 485.00, 480.00)
x median = 6.00
y median = 137.00
x mode = 0.00
y mode = 5.00
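Other handy built-ins exist as well; for example, quantiles. This is just a sketch, with quantile coming from the Statistics standard library.

# First and third quartiles
x_q1 = quantile(x_observed, 0.25)
x_q3 = quantile(x_observed, 0.75)
@printf("x quartiles (Q1, Q3) = (%7.2f, %7.2f)\n", x_q1, x_q3)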
You can obtain the interactive JupyterLab in the following link:
Actually, we can also calculate the kurtosis and skewness, but the results differ from their PSPP counterparts. I won't judge which one is right or wrong; I think the issue is my understanding of how these properties are calculated internally. So I choose to defer that exploration until I can find the right reference.
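For the curious, here is a minimal sketch, assuming StatsBase is available. Note that skewness and kurtosis in StatsBase are population (moment-based) estimators, and kurtosis is excess kurtosis; PSPP reports bias-corrected sample statistics, which may well explain the discrepancy.

using StatsBase, Printf
# Population (moment-based) estimators; kurtosis(x) is *excess* kurtosis
@printf("skewness(y) = %9.2f\n", skewness(y_observed))
@printf("kurtosis(y) = %9.2f\n", kurtosis(y_observed))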
What’s the Next Chapter 🤔?
Let's continue our previous Julia journey with building a class and statistical properties. We will also continue with various plots in Julia, featuring Gadfly.
Consider continuing your exploration with [ Trend - Language - Julia - Part Three ].