Trend - Language - Julia

Preface

Goal: Explore statistical properties with Julia, using a linear model to describe the data.

Julia is that one friend who not only solves complex math but also writes it beautifully in Unicode. It can express mathematical functions in ways that look like they just walked off the page of a textbook. Seriously, it’s like LaTeX and Python had a mathematically gifted child.

Julia: Mathematical Equation: Combined

Readable Code

However… just because we can write:

$\mathbf{y} = X \mathbf{\beta} + \mathbf{\varepsilon}$

It doesn’t mean we should throw Greek letters everywhere, like we’re decorating a fraternity house.

Readable code beats aesthetic code in production.
Unicode is a flex, but clarity is a lifeline.

We statisticians know that beauty matters, but so does readability, especially when debugging code at 2AM while questioning life choices. So while Unicode looks gorgeous, we’ll proceed with a conservative coding approach that still gets the job done and keeps future us from rage-quitting.
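To make the conservative style concrete, here is a minimal sketch of the matrix equation above written with plain ASCII names. The toy data and the names `X`, `beta`, and `epsilon` are assumptions for illustration only; they are not taken from the article's scripts.

```julia
# Hypothetical toy data, not the article's 50-samples.csv
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

# Design matrix X: a column of ones (intercept) next to the predictor
X = hcat(ones(length(x)), x)

# Least-squares solution of y ≈ X * beta via the backslash operator
beta = X \ y

# Residual vector, the epsilon of the equation
epsilon = y - X * beta

println("intercept = ", round(beta[1], digits=2))
println("slope     = ", round(beta[2], digits=2))
```

Nothing here needs a Greek keyboard, yet each line still maps onto one term of the equation.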

Statistic Properties

Let’s peek under the hood of how Julia handles statistical properties. We’ll start with the classic workhorse of regression: least squares. This is where statisticians get misty-eyed, right before they argue about assumptions over coffee.

Manual Calculation

The Spreadsheet Monk Way

The full code is available. You can check the details of the manual calculation in the source code below:

github.com/…/trend/51-lq-manual.jl

After reading data from CSV, we extract our favorite bickering pair: x and y.

using CSV, DataFrames, Printf, Statistics

pairCSV = CSV.read("50-samples.csv", DataFrame)
x_observed = pairCSV.x
y_observed = pairCSV.y

Julia: Statistic Properties: Manual Calculation

We compute the sample size, totals, and means (the boring part of any romantic relationship), then print out the basic properties.

n = length(x_observed)

x_sum = sum(x_observed)
y_sum = sum(y_observed)

x_mean = mean(x_observed)
y_mean = mean(y_observed)

@printf("%-10s = %4d\n", "n", n)
@printf("∑x (total) = %7.2f\n", x_sum)
@printf("∑y (total) = %7.2f\n", y_sum)
@printf("x̄ (mean)   = %7.2f\n", x_mean)
@printf("ȳ (mean)   = %7.2f\n\n", y_mean)

These are the building blocks for everything else. Miss a mean, and the whole regression falls like a badly balanced ANOVA.

Let’s take those deviations for a walk: square them, multiply them, and mash them into slope (m) and intercept (b). The usual statistics hazing ritual. Then we print out the least squares calculation.

x_deviation = x_observed .- x_mean
y_deviation = y_observed .- y_mean

x_sq_deviation = sum(x_deviation .^ 2)
y_sq_deviation = sum(y_deviation .^ 2)
xy_cross_deviation = sum(x_deviation .* y_deviation)

m_slope = xy_cross_deviation / x_sq_deviation
b_intercept = y_mean - m_slope * x_mean

@printf("∑(xᵢ-x̄)    = %9.2f\n", sum(x_deviation))
@printf("∑(yᵢ-ȳ)    = %9.2f\n", sum(y_deviation))
@printf("∑(xᵢ-x̄)²   = %9.2f\n", x_sq_deviation)
@printf("∑(yᵢ-ȳ)²   = %9.2f\n", y_sq_deviation)
@printf("∑(xᵢ-x̄)(yᵢ-ȳ)  = %9.2f\n", xy_cross_deviation)
@printf("m (slope)      = %9.2f\n", m_slope)
@printf("b (intercept)  = %9.2f\n\n", b_intercept)

@printf("Equation     y = %.2f + %.2f.x\n\n", b_intercept, m_slope)

We’re trying to find the line that best hugs our data. Mathematically speaking, least squares is the most socially acceptable line of best fit.
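The same steps can be wrapped into a small reusable function. This is only a sketch: the function name `least_squares` and the toy data are hypothetical, and the article's script keeps everything at top level instead.

```julia
using Statistics

# Closed-form least squares for one predictor; returns (slope, intercept)
function least_squares(x, y)
    x_mean, y_mean = mean(x), mean(y)
    x_dev = x .- x_mean
    y_dev = y .- y_mean
    slope = sum(x_dev .* y_dev) / sum(x_dev .^ 2)
    intercept = y_mean - slope * x_mean
    return slope, intercept
end

# Hypothetical toy data, not the article's samples
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

m, b = least_squares(x, y)
println("y = ", round(b, digits=2), " + ", round(m, digits=2), "x")
```

With the article's own sums, the same two formulas give a slope of 7280 / 182 = 40 and an intercept of 179 - 40 × 6 = -61.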

Now we flex our calculator muscles to find variance, standard deviation, covariance, the Pearson correlation coefficient (r), and the beloved R-squared (R²). The crowd cheers. Then, as usual, we print out the correlation calculations.

x_variance = x_sq_deviation / (n-1)
y_variance = y_sq_deviation / (n-1)
xy_covariance = xy_cross_deviation / (n-1)

x_std_dev = sqrt(x_variance)
y_std_dev = sqrt(y_variance)

r = xy_covariance / (x_std_dev * y_std_dev)
r_squared = r^2

@printf("sₓ² (variance) = %9.2f\n", x_variance)
@printf("sy² (variance) = %9.2f\n", y_variance)
@printf("covariance     = %9.2f\n", xy_covariance)
@printf("sₓ (std dev)   = %9.2f\n", x_std_dev)
@printf("sy (std dev)   = %9.2f\n", y_std_dev)
@printf("r (pearson)    = %9.2f\n", r)
@printf("R²             = %9.2f\n\n", r_squared)

This tells us how tightly the data hugs the trend line. In statistics, we call this “relationship goals.”
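For a single predictor, squaring r gives the same number as the "explained variation" definition of R², namely 1 - SSR/SST. The sketch below checks that on hypothetical toy data; the variable names are illustrative and not from the article's script.

```julia
using Statistics

# Hypothetical toy data, not the article's samples
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Pearson correlation from the standard library
r = cor(x, y)

# Fitted line from the covariance form of the slope
slope = cov(x, y) / var(x)
intercept = mean(y) - slope * mean(x)
y_fit = slope .* x .+ intercept

# R² as one minus residual variation over total variation
ss_residual = sum((y .- y_fit) .^ 2)
ss_total = sum((y .- mean(y)) .^ 2)
r_squared = 1 - ss_residual / ss_total

println("r^2           = ", round(r^2, digits=4))
println("1 - SSR / SST = ", round(r_squared, digits=4))
```

The article's numbers agree in the same way: 1 - 18018 / 309218 rounds to 0.94, matching the printed R².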

Residuals time: we subtract predictions from reality, just like grading our expectations after a family reunion. That gap between expectations and reality is the residual in a regression.

We need to create a regression line, along with the residual errors, using array operations. Then we calculate the sum of squared residuals, also using array operations.

Based on the degrees of freedom (n - 2), we calculate the variance of the residuals (MSE), the standard error of the slope (the square root of MSE divided by ∑(xᵢ-x̄)², which the printed label abbreviates as sₓ), and the t-value. Then we print out the results, again, and again, and again…

y_fit = m_slope .* x_observed .+ b_intercept
y_err = y_observed .- y_fit

ss_residuals = sum(y_err .^ 2)
df = n - 2

var_residuals = ss_residuals / df
std_err_slope = sqrt(var_residuals / x_sq_deviation)
t_value = m_slope / std_err_slope

@printf("SSR = ∑ϵ²           = %9.2f\n", ss_residuals)
@printf("MSE = ∑ϵ²/(n-2)     = %9.2f\n", var_residuals)
@printf("SE(β₁)  = √(MSE/sₓ) = %9.2f\n", std_err_slope)
@printf("t-value = β̅₁/SE(β₁) = %9.2f\n\n", t_value)

This is the beating heart of inference. We’re not just fitting a line. We’re making statistically sound claims.
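Written out with the numbers from the execution result (n = 13, so n - 2 = 11), the chain of calculations is:

```latex
\begin{align*}
\mathrm{MSE} &= \frac{\sum \epsilon_i^2}{n-2} = \frac{18018}{11} = 1638 \\
\mathrm{SE}(\hat{\beta}_1) &= \sqrt{\frac{\mathrm{MSE}}{\sum (x_i-\bar{x})^2}}
  = \sqrt{\frac{1638}{182}} = \sqrt{9} = 3 \\
t &= \frac{\hat{\beta}_1}{\mathrm{SE}(\hat{\beta}_1)} = \frac{40}{3} \approx 13.33
\end{align*}
```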

Execution result:

❯ julia 51-lq-manual.jl
n          =   13
∑x (total) =   78.00
∑y (total) = 2327.00
x̄ (mean)   =    6.00
ȳ (mean)   =  179.00

∑(xᵢ-x̄)    =      0.00
∑(yᵢ-ȳ)    =      0.00
∑(xᵢ-x̄)²   =    182.00
∑(yᵢ-ȳ)²   = 309218.00
∑(xᵢ-x̄)(yᵢ-ȳ)  =   7280.00
m (slope)      =     40.00
b (intercept)  =    -61.00

Equation     y = -61.00 + 40.00.x

sₓ² (variance) =     15.17
sy² (variance) =  25768.17
covariance     =    606.67
sₓ (std dev)   =      3.89
sy (std dev)   =    160.52
r (pearson)    =      0.97
R²             =      0.94

SSR = ∑ϵ²           =  18018.00
MSE = ∑ϵ²/(n-2)     =   1638.00
SE(β₁)  = √(MSE/sₓ) =      3.00
t-value = β̅₁/SE(β₁) =     13.33

📓 The interactive JupyterLab version is available at the following link:

github.com/…/trend/51-lq-manual.ipynb

GLM Library

Using Built-in Method

Now boarding: the GLM train. All manual passengers, please stay seated.

We can simplify the above calculation with built-in methods.

Link to code:

github.com/…/trend/52-lq-built-in.jl

We import GLM, Julia's own statistics power tool.

using CSV, DataFrames, Printf, Statistics, GLM

Julia: Statistic Properties: GLM Library

Like in R, we fit a linear model with lm(). The syntax feels like R, with a fresh paint job and no memory leaks.

model = lm(@formula(y ~ x), pairCSV)
coefs = coef(model)
m_slope = coefs[2]
b_intercept = coefs[1]

@printf("m (slope)      = %9.2f\n", m_slope)
@printf("b (intercept)  = %9.2f\n\n", b_intercept)
@printf("Equation     y = %.2f + %.2f.x\n\n", b_intercept, m_slope)

We still do our ritual variance-covariance dance. This time, we use built-in methods that actually want to be used: variance, covariance, standard deviations, and even the Pearson correlation coefficient (r). From this we can manually calculate R-squared (R²).

x_variance = var(x_observed)
y_variance = var(y_observed)
xy_covariance = cov(x_observed, y_observed)

x_std_dev = std(x_observed)
y_std_dev = std(y_observed)

r = cor(x_observed, y_observed)
r_squared = r^2

@printf("sₓ² (variance) = %9.2f\n", x_variance)
@printf("sy² (variance) = %9.2f\n", y_variance)
@printf("covariance     = %9.2f\n", xy_covariance)
@printf("sₓ (std dev)   = %9.2f\n", x_std_dev)
@printf("sy (std dev)   = %9.2f\n", y_std_dev)
@printf("r (pearson)    = %9.2f\n", r)
@printf("R²             = %9.2f\n\n", r_squared)

These are sanity checks. Even if the model does the heavy lifting, we still want to peek inside, and make sure it’s not drunk on assumptions.
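One such sanity check is confirming that the built-ins use the same n - 1 denominators as the manual version. The sketch below does that on hypothetical toy data; the `manual_*` names are illustrative only.

```julia
using Statistics

# Hypothetical toy data, not the article's samples
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = length(x)

x_dev = x .- mean(x)
y_dev = y .- mean(y)

# Sample statistics with the n - 1 denominator, as in the manual script
manual_var_x = sum(x_dev .^ 2) / (n - 1)
manual_var_y = sum(y_dev .^ 2) / (n - 1)
manual_cov = sum(x_dev .* y_dev) / (n - 1)
manual_r = manual_cov / (sqrt(manual_var_x) * sqrt(manual_var_y))

# Each comparison prints true when the built-in matches the manual value
println(manual_var_x ≈ var(x))
println(manual_cov ≈ cov(x, y))
println(manual_r ≈ cor(x, y))
```

If a population variance is wanted instead, `var(x; corrected=false)` switches the denominator to n.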

From this linear model, we can generate predicted values along with residuals. Then we can continue doing our manual calculation. Just ask politely and GLM serves it on a silver platter.

y_fit = predict(model)
y_err = residuals(model)

df = n - 2
ss_residuals = sum(y_err .^ 2)
var_residuals = ss_residuals / df

x_deviation = x_observed .- x_mean
x_sq_deviation = sum(x_deviation .^ 2)
std_err_slope = sqrt(var_residuals / x_sq_deviation)
t_value = m_slope / std_err_slope

@printf("SSR = ∑ϵ²           = %9.2f\n", ss_residuals)
@printf("MSE = ∑ϵ²/(n-2)     = %9.2f\n", var_residuals)
@printf("∑(xᵢ-x̄)²            = %9.2f\n", x_sq_deviation)
@printf("SE(β₁)  = √(MSE/sₓ) = %9.2f\n", std_err_slope)
@printf("t-value = β̅₁/SE(β₁) = %9.2f\n\n", t_value)
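The fitted model can also report most of these quantities directly, which makes a handy cross-check on the manual arithmetic. The sketch below uses hypothetical toy data in place of the CSV file, and it is an alternative to the article's approach rather than what the script itself does.

```julia
using DataFrames, GLM

# Hypothetical toy data standing in for 50-samples.csv
data = DataFrame(x = [1, 2, 3, 4, 5], y = [2, 4, 5, 4, 5])
model = lm(@formula(y ~ x), data)

# Coefficients and standard errors are ordered (intercept, x)
slope = coef(model)[2]
std_err_slope = stderror(model)[2]
t_value = slope / std_err_slope

println("df        = ", dof_residual(model))
println("SE(slope) = ", round(std_err_slope, digits=4))
println("t-value   = ", round(t_value, digits=4))
println("R²        = ", round(r2(model), digits=4))

# Summary table with estimates, standard errors, t-values, and p-values
println(coeftable(model))
```

Computing the values by hand, as above, is still worthwhile for learning; the accessors are there when we only need the answer.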

Execution result:

❯ julia 52-lq-built-in.jl
n          =   13
∑x (total) =   78.00
∑y (total) = 2327.00
x̄ (mean)   =    6.00
ȳ (mean)   =  179.00

m (slope)      =     40.00
b (intercept)  =    -61.00

Equation     y = -61.00 + 40.00.x

sₓ² (variance) =     15.17
sy² (variance) =  25768.17
covariance     =    606.67
sₓ (std dev)   =      3.89
sy (std dev)   =    160.52
r (pearson)    =      0.97
R²             =      0.94

SSR = ∑ϵ²           =  18018.00
MSE = ∑ϵ²/(n-2)     =   1638.00
∑(xᵢ-x̄)²            =    182.00
SE(β₁)  = √(MSE/sₓ) =      3.00
t-value = β̅₁/SE(β₁) =     13.33

📓 The interactive JupyterLab version is available at the following link:

github.com/…/trend/52-lq-built-in.ipynb

Final Thoughts

Whether we go monk-mode and write everything by hand, or let GLM do the heavy lifting, Julia gives us tools that are not just elegant, but trustworthy.

As long as we understand what’s going on under the hood, we can keep our statistical integrity, and maybe even crack a smile while doing it.

Unicode Symbols as Variables

Making Greek Great Again

Let’s take a detour from dry regressions, and dive into something a bit more fun. Unicode madness.

Julia, unlike our old high-school calculators, is happy to speak Greek. This section explores just how readable and elegant code can be when our variables wear togas and quote Pythagoras.

In short: our math professor’s chalkboard now runs Julia.

Why should our variables be limited to boring x and y, when we could use the entire Greek fraternity? Let’s treat our dataset like a philosophy class, with xᵢ, ȳ, and ∑ϵ² all mingling like Socratic thinkers.

Defining UTF-8 Variables

Yes, Julia lets us use UTF-8 characters for variable names: no arcane flags, no weird syntax. Just copy-paste the Greek. Let’s try the experiment below:

github.com/…/trend/53-lq-utf.jl

Let’s start by loading data and switching to Greek mode. First we are going to use xᵢ and yᵢ.

using CSV, DataFrames, Printf, Statistics, GLM

df = CSV.read("50-samples.csv", DataFrame)
xᵢ = df.x
yᵢ = df.y

Code readability skyrockets when math notation in our code matches the textbook. It’s like translating math into executable form. No decoder ring required.
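As a small taste of where this leads, here is a sketch of the least squares step written with Unicode names. The toy data and the names `x̄`, `ȳ`, `β₀`, and `β₁` are assumptions for illustration and may differ from what the linked script uses.

```julia
using Statistics

# Hypothetical toy data standing in for the CSV columns
xᵢ = [1, 2, 3, 4, 5]
yᵢ = [2, 4, 5, 4, 5]

# Means; in the Julia REPL these are typed as x\bar<TAB> and y\bar<TAB>
x̄ = mean(xᵢ)
ȳ = mean(yᵢ)

# Slope and intercept; typed as \beta<TAB>\_1<TAB> and \beta<TAB>\_0<TAB>
β₁ = sum((xᵢ .- x̄) .* (yᵢ .- ȳ)) / sum((xᵢ .- x̄) .^ 2)
β₀ = ȳ - β₁ * x̄

println("y = ", round(β₀, digits=2), " + ", round(β₁, digits=2), "·x")
```

The formula for β₁ now reads almost exactly like its textbook counterpart, which is the whole appeal.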