Trend - Properties

Trend: Prediction

Trend: Regression

Trend: Enhanced

Trend: Language

Local Group

Preface

Goal: Getting to know statistical properties, using worksheet formula step by step.

When we look at regression equations in textbooks, it can feel like walking into a math dungeon without a torch. But don’t worry, we’re bringing the flashlight and a map. In this article, we’ll take the scary symbols and turn them, into something we can click and calculate in Excel. Or maybe someday Google Sheets if I, the author, have time to learn.

We’ll start with manual calculations, because understanding why something works, makes us smarter than just knowing that it works. Then we’ll show how to use built-in worksheet functions which feel like finding out, that we didn’t need to count rice grains by hand after all.

Anyone can click buttons. But knowing the guts of the formula helps us catch errors, explain our results, and feel smarter at dinner parties (or Zoom meetings).

Complete Worksheet

Yes, it looks long. Yes, it has lots of rows. But once we walk through it section by section, we will see that it is well organized. And don’t let the visual complexity fool you, this monster includes regression and correlation analysis. Two for the price of one!

Trend: Worksheet: Least Square: Complete Worksheet

This article starting with manual calculation using Excel cells. After this section, we will reveal the built-in formulas that do all this faster.

Worksheet Source

Playsheet Artefact

You don’t have to build this from scratch (unless you’re into that). Grab the actual Excel file and tweak, break, and experiment to your heart’s content.

github.com/…/math/trend/examples-properties.xlsx

Known Values from Samples

Example Time! Let’s make those numbers sweat.

Imagine we’re handed a list of data points. Just plain (x, y) pairs. We don’t know where they came from, but they’re ours now, and they need analysis.

Suppose our x_observed-series lives in B7:B19, and our y_observed-series in C7:C19. Here’s what we’re working with:

(x, y) = [(0, 5), (1, 12), (2, 25), (3, 44), (4, 69), (5, 100), (6, 137), (7, 180), (8, 229), (9, 284), (10, 345), (11, 412), (12, 485)]

Trend: Worksheet: Known Values from Samples

We start with the basics: the mean. It is crucial for measuring how much each value wanders off.

$\begin{align*} \bar{x} = \frac{1}{n}\sum\limits_{i=1}^{n} x_i && \bar{y} = \frac{1}{n}\sum\limits_{i=1}^{n} y_i \\ \end{align*}$

You can calculate these with Excel’s humble built-ins: count, sum, and average.

B22=SUM(B7:B19)
C22=SUM(C7:C19)

B26=COUNT(B7:B19)
C26=COUNT(C7:C19)

B30=AVERAGE(B7:B19)
C30=AVERAGE(C7:C19)

Or go DIY and divide the total using the sum and the count.

B30=B22/B26
C30=C22/C26

The result of the statistic properties, total and mean are:

$\begin{align*} n = 13 && \sum x = 78 && \sum y = 2327 && \bar{x} = 6 && \bar{y} = 179 && \end{align*}$

These means will serve as the anchors for every calculation that follows. Like a good coffee shop, everything starts here.

Plot Result

📈 Time to chart it!

Plotting these values helps us see the story our data wants to tell, before we bury it in math. Let’s see how a visualization interpret the mean properties.

Trend: Plot: Variance

Basic Statistic Properties

Now we can subtract, square, and stare at variances like true statisticians.

Let’s calculate how far each x_observed and y_observed, deviate from their respective means. This gives us the foundational ingredients for: variance, covariance, and other buzzwords in further calculation.

Trend: Worksheet: Basic Statistic Properties

For the series range E7:E19 and F7:F19 the formula are:

E7=B7-$B$30
F7=C7-$C$30
...
E19=B19-$B$30
F19=C19-$C$30

Deviations measure “how far off”, each point is from average behavior. We also need to know how many values we have, and adjust for sample size (hello, degrees of freedom).

For samples, the unbiased estimate is (n-1). I’m just following the convention to use (n-1). The math behind is a hard topic, and beyond my knowledge, so I refused to explain.

$n - 1 = 12$

We’re not quite using df (degrees of freedom) yet, but here’s the teaser:

$df = n - k - 1 = 11$

Translated to Excel/Calc, we can write this as:

E26=COUNT(B7:B19)-1
F26=COUNT(C7:C19)-1

E30=COUNT(B7:B19)-2

Least Square Calculation

Let’s regress it

We now compute the least squares regression, which is just a fancy way of drawing the best straight line, through our cloud of data points.

We have discuss in depth about the math behind in previous article. The slope m (from the “famous” line equation y = mx + b) is defined as:

$m = \frac{\sum\limits_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum\limits_{i=1}^{n} (x_i - \bar{x})^2}$

What we need is to find each total these properties:

(xᵢ-x̄)²: how scattered x-values are
(xᵢ-x̄)(yᵢ-ȳ): how x and y vary together

Trend: Worksheet: Least Square Calculation

We need to find the nominator and the denominator.

$\begin{align*} \sum\limits_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) && \sum\limits_{i=1}^{n} (x_i - \bar{x})^2 \\ \end{align*}$

For the series range H7:H19 (xᵢ-x̄)² and I7:I19 (xᵢ-x̄)(yᵢ-ȳ) the formula are:

H7=E7^2
I7=E7*F7
...
H19=E19^2
I19=E19*F19

The total ∑(xᵢ-x̄)² and ∑(xᵢ-x̄)(yᵢ-ȳ) can be defined as:

H22=SUM(H7:H19)
I22=SUM(I7:I19)

The result of the each total are:

$\begin{align*} \sum\limits_{i=1}^{n} (x_i - \bar{x})^2 &= 182 \\ \sum\limits_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) &= 7280 \\ \end{align*}$

Now we got the m (slope) and b (intercept) as:

H26=I22/H22
I26=C30-H26*B30

Which gives us:

$\begin{align*} m (slope) &= 40 \\ b (intercept) &= 61 \\ \end{align*}$

Finally we can get the full regression equation (y = b + mx) as:

H30=CONCATENATE("y = ";TEXT(I26;"#.##0,00");" + ";TEXT(H26;"#.##0,00");".x")

Let’s write it in human form:

$y = -61 + 40x$

This formula gives us a predictive model. Turn any x into an estimated y. It’s like statistical fortune telling, except with math instead of incense.

Again, we can see how easy it is to write down the equation in tabular spreadsheet form.

Plot Result

Time to visualize your best-fit line. Now we will see how well our equation hugs the data.

Trend: Plot: Linear Regression

Variance and Covariance

Let’s now wander into the realm of statistical relationships, where variables whisper secrets to each other via variance and covariance. These aren’t just fancy words to impress our dinner guests, they quantify how our data behaves around its mean, and how two variables dance together.

The equations for sample variance and covariance, for a sample of data series (x, y) is given as follows:

$\begin{align*} s_x^2 &= \frac{1}{n-1}\sum\limits_{i=1}^{n} (x_i - \bar{x})^2 \\ s_y^2 &= \frac{1}{n-1}\sum\limits_{i=1}^{n} (y_i - \bar{y})^2 \\ \text{Cov}(x,y) &= \frac{1}{n-1}\sum\limits_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \\ \end{align*}$

We’re calculating these to understand:

(xᵢ-x̄)² and (yᵢ-ȳ)²: How much variation each variable carries
(xᵢ-x̄)(yᵢ-ȳ): Whether the variables move together or not

For the series range K7:K19 (xᵢ-x̄)², L7:L19 (yᵢ-ȳ)² and M7:M19 (xᵢ-x̄)(yᵢ-ȳ) the formula are:

K7=E7^2
L7=F7^2
M7=E7*F7
...
K19=E19^2
L19=F19^2
M19=E19*F19

This prepares our data for calculating how “spread out” the data is, and whether x and y have any statistical chemistry.

Now for the totals:

$\begin{align*} \sum\limits_{i=1}^{n} (x_i - \bar{x})^2 && \sum\limits_{i=1}^{n} (y_i - \bar{y})^2 && \sum\limits_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \\ \end{align*}$

The total variation of the predictor, ∑(xᵢ-x̄)², (yᵢ-ȳ)² and ∑(xᵢ-x̄)(yᵢ-ȳ) can be defined as:

K22=SUM(K7:K19)
L22=SUM(L7:L19)
M22=SUM(M7:M19)

The result of the each total are:

$\begin{align*} \sum\limits_{i=1}^{n} (x_i - \bar{x})^2 &= 182 \\ \sum\limits_{i=1}^{n} (y_i - \bar{y})^2 &= 309218 \\ \sum\limits_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) &= 7280 \\ \end{align*}$

To normalize these (and avoid bias ) we divide by degrees of freedom (n - 1). Now we got the variance sₓ², sy² and also the covariance as:

K22=K22/E26
L22=L22/F26
M22=M22/E26

The result of the statistic properties are:

$\begin{align*} s_X^2 &= 15.17 \\ s_Y^2 &= 25768.17 \\ \text{Cov}(X,Y) &= 606.67 \\ \end{align*}$

Variance tells us how unpredictable the data is, and covariance gives us a sneak peek into, whether the two variables move together (positive), apart (negative), or ignore each other entirely (zero, the statistical equivalent of ghosting).

The tabular spreadsheet can be shown as follows:

Trend: Worksheet: Correlation Calculation

Oh, and standard deviation is just the square root of variance.

Plot Result

We need a visualization so we can interpret the standard deviation properties against original observed y.

Trend: Plot: Variance

Correlation Calculation

Time to put the cool shades on our statistics. Let’s measure correlation, the standardized version of covariance, that fits nicely between -1 and 1.

From calculated properties above we can continue to standard deviation sₓ, sy, along with the r (pearson) value:

K30=SQRT(K26)
L30=SQRT(L26)
M30=M26/(K30*L30)

This yields:

$\begin{align*} s_x &= \sqrt{s_X^2} &= && 3.89 \\ s_y &= \sqrt{s_Y^2} &= && 160.52 \\ r_{x,y} &= \frac{\text{Cov}(x,y)}{s_x . s_y} &= && 0,97 \\ \end{align*}$

A correlation coefficient of 0.97? That’s about as close to a linear relationship as you can get, without your data actually proposing marriage.

Note that the pearson coefficient can be denoted in different form. Alternative formula for r (same dance, different choreography):

r = cov/sₓsy
r = (∑(xᵢ-x̄).(yᵢ-ȳ))/√(∑(xᵢ-x̄)².∑(yᵢ-ȳ)²)

This way we can calculate the R square (R²) and also the adjusted R square. Note that R² for simple linear sample is the same as r². In this case R² = 0,9417. But the R² might be different for complex cases.

M33=M30^2

Also there is no need any adjusted R², for simple least square population. But for sample we need to adjust with k=1, for linear equation then adjusted R² = 0,9364.

M35=1-(1-M33)*(B26-1)/(B26-1-1)

We can have a look at the result, focusing on this part of following spreadsheet:

Trend: Worksheet: Residual Calculation

R² = 0.94 → “94% of the variance in y is explained by x.” That’s nearly everything, the other 6% probably noise, or maybe just forgot to update their spreadsheets.

Plot Result

We need a visualization so we can interpret the standard deviation properties against predicted y.

Trend: Plot: Variance

Residual Calculation

Finally, we measure how far off our predictions are. Every prediction has regrets, and in statistics, they’re called residuals.

We plug in our linear model from the coefficient of the linear least square,

m (slope)
b (intercept)

We can built our curve fitting equation (y = b + mx), and predict every yᵢ value.

$\begin{align*} \hat{y}_i &= mx_i + b \\ e_i &= y_i - \hat{y}_i \\ \sum\limits_{i=1}^{n} (e_i)^2 &= \sum\limits_{i=1}^{n} (y_i - \hat{y}_i)^2 \\ \end{align*}$

Use the spreadsheet to compute. For the series range O7:O19 fit(xᵢ), P7:P19 ϵᵢ (residual), and Q7:Q19 ϵᵢ² the formula are:

O7=$I$26+$H$26*B7
P7=C7-O7
M7=P7^2
...
O19=$I$26+$H$26*B19
P19=C19-O19
M19=P19^2

Note that we can use any curve fitting formula, such as quadratic, cubic and so on, but for this example we will use our linear one.

Now total residual error. The total of ϵᵢ² can be defined as:

Q22=SUM(Q7:Q19)

The result of the each total are:

$\sum\limits_{i=1}^{n} (e_i)^2 = 18018$

The tabular spreadsheet can be shown as follows:

Trend: Worksheet: Residual Calculation

Residuals show us where our model is off. If the residuals are too wild, the model might just be blowing statistical smoke.

Plot Result

We need a plot to show, how far off the actual y_predicted is, from our model’s predictions.

Trend: Plot: Variance

Plotting residuals is like proofreading our math. Weou don’t always like what we see, but it makes everything better.

Regressions Sum Square

Welcome to the world of SSR, Regression Sum of Squares, or as we statisticians like to call it: “How much our predictions flex on the average.” Here is the RSS:

$\sum\limits_{i=1}^{n} (\hat{y}_i - \bar{y})^2$

This represents the variation explained by our model. The bigger it is (relative to total variance), the more our model is saying, “Hey, I actually know what’s going on!”

Trend: Worksheet: Regressions Sum Square

Naturally, this brings us to SST. Sum of Squares Total, a.k.a. the full buffet of variance:

The SST (Sum Square Total) is

$SST = SSR + RSS$

Mathematically, for each observed yᵢ, we can say:

$\sum\limits_{i=1}^{n} (y_i - \bar{y})^2$

Yes, it’s a love triangle. The total variation (SST) is shared between,

what the model explains (RSS, Regressions Sum Square) and,
what it misses (SSR, Sum of Squares of Residual).

Why does SST = SSR + RSS?

It is easier to understand using the spreadsheet.

This equation is rooted in partitioning variance. Essentially slicing up the total variation in our data into:

RSS = Regression Sum of Squares (explained): The variation due to the model. It’s how much our predicted values ŷᵢ deviate from the mean ȳᵢ. If this is large, our model is picking up real structure in the data.
SSR = Sum of Squares of Residual (unexplained): What’s left over, the part of the variance our model fails to explain, .e., the noisy, chaotic behavior of real-world data. It is how far off our predictions are from the actual yᵢ values. Ideally, this is small.
SST = Total Sum of Squares: The total variation in the dependent variable yᵢ. essentially: “How spread out is y overall, around its mean?”

Our model projects the data onto a line, and splits perfectly between “explained” and “unexplained.”

Back to R²

This is the statistical backbone of R². Want to know how good our regression line is? Start here. It’s the source of model validation.

R² (coefficient of determination) is defined as:

$R^2 = \frac{SSR}{SST} = 1 - \frac{RSS}{SST}$

It answers, “How much of the variation in y is captured by x?”

If R² = 1, our model explains everything. Congratulations, we’ve either modeled reality perfectly, or we’ve committed statistical heresy (i.e., overfitting).
If R² = 0, our model explains nothing, It’s as useful as a horoscope.

t-value and p-value

Truth or Dare?

Now for the dramatic part, the hypothesis testing, using t-value and p-value. We want to know: “Is our slope real or just a fluke?”

First, calculate the degrees of freedom, For linear equation k=1.

$df = n - k - 1 = 11$

In Excel/Calc we can write this as:

E30=COUNT(B7:B19)-2

Degrees of freedom are like free will in data. The more we have, the more trust we can put in our analysis.

Next, enter the Mean Squared Error. It tells us how wrong the model is, on average, for each point:

$MSE = \frac{SSE}{df}$

In Excel/Calc we can write this as:

=Q22/$E$30

Then we compute the Standard Error of the Slope SE(β1) using MSE.

$SE(\beta_1) = \sqrt{\frac{MSE}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$

In Excel/Calc we can write this as:

=SQRT(Q25/K22)

Next, the t-value: This is where we look slope in the eye, and say: “Prove you matter.”

The t-value for the slope coefficient (β₁) is computed, as the ratio of the estimated slope (βˉ₁), to the standard error of the slope (SE(β₁)). This t-value will be used to test the null hypothesis, that the true slope coefficient is equal to zero.

$t = \frac{\bar{\beta}_1}{SE(\beta_1)}$

In Excel/Calc we can write this as:

=H26/Q30

And finally, the p-value, the real drama queen. It’s the probability that our slope is due to random chance. The smaller the p-vlue, the more confident we are, that our model actually means something.

Here is the p-value for two tail test.

$p = 2 \times P(T > |t|)$

This is to complex for manual calculation. so I use built-in formula instead.

In spreadsheet speak:

=T.DIST.2T(Q35;$E$30)

The p-value is our model’s confession. It either says: “"Yep, I’m legit,”, or “Oops, maybe I was just noise.”

Here are the results of this statistical soap opera:

$\begin{align*} df &= 11 \\ MSE &= 1638 \\ SE(\beta_1) &= 3.00 \\ t-value &= 13.33 \\ p-value &= 0,0000000391 \\ \end{align*}$

We have an incredibly significant result. Oour slope isn’t just a good guess, it’s a statistically valid superstar.

Check it out here:

Trend: Worksheet: p value

We have a good confidence in our curve fitting function.

Alternative Worksheet

If you'‘e not into one worksheet layout, don’t worry. Spreadsheet is like statistics: it tolerates many truths.

Here’s an alternative way of laying things out:

Trend: Worksheet: Correlation: Tabular Worksheet

Just remember, the numbers must dance the same. No matter how we rearrange the ballroom.

Built In Formula

The Lazy Genius of Regression

Who needs to crunch numbers by hand, when spreadsheets and Python will gladly do our bidding? Let’s explore how built-in formulas make regression analysis, feel less like statistical calculus, and more like statistical calculus-on-autopilot.

Worksheet Source

Your DIY Regression Playground

Welcome to spreadsheet heaven. You’ll find all the formulas preloaded in this .ods file.

github.com/…/math/trend/examples-properties.ods

Even though everything here can work in Excel, we’re using LibreOffice. Why? Because .ods lets us play with named ranges more flexibly, and frankly, .xlsx sometimes throws a tantrum.

Still prefer Excel? We got you:

github.com/…/math/trend/examples-properties.xlsx

Meet the Data

Let’s revisit our familiar cast of statistical characters:

Trend: Worksheet: Built-in Formula: Data Series

xᵢ observed: B7:B19,
yᵢ observed: C7:C19,
ŷᵢ predicted: D7:D19 calculated from m and b.

Named Range

B7:B19 is Not a Personality

To make the equation less cryptic, we can define named range. This can be done in both Excel and Calc:

Trend: Worksheet: Built-in Formula: Named Range

Let’s give our ranges real names. “x_observed” just sounds more mature than “B7:B19”.

B7:B19: x_observed.
C7:C19: y_observed.
D7:D19: y_predicted.

Think of it as giving our variables, name tags at a regression conference.

Using Array Operation

We can rebuild tabular woksheet from previous article, into a simple one using Excel/Calc formula with the help of SUMSQ formula.

With just using array formulas and built-in functions, we can have direct result into a cell. Because if Excel has a button for it, why type three columns by hand?

Trend: Worksheet: Built-in Formula: Array

Here the predicted value of ŷᵢ in D7:D19, are calculated from m ($G$17) and b ($G$18).

The complete formula can be shown here:

properties	formula
∑x	=SUM(x_observed)
∑y	=SUM(y_observed)
n	=COUNT(x_observed)
n-1	=G9-1
df = n-k-1	=G9-2
x̄ (mean)	=AVERAGE(x_observed)
ȳ (mean)	=AVERAGE(y_observed)
∑(xᵢ-x̄)²	{=SUMSQ(x_observed-$G$12)}
∑(yᵢ-ȳ)²	{=SUMSQ(y_observed-$G$13)}
∑(xᵢ-x̄)(yᵢ-ȳ)	=SUMPRODUCT((x_observed-$G$12),(y_observed-$G$13))
β̅₁ = m (slope)	=G16/G14
β̅₀ = b (intercept)	=G13-G17*G12
sₓ²=∑(xᵢ-x̄)²/(n-1)	=G14/G10
sy²=∑(yᵢ-ȳ)²/(n-1)	=G15/G10
cov=∑(xᵢ-x̄)(yᵢ-ȳ)/(n-1)	=G16/G10
sₓ (std dev)	=SQRT(G19)
sy (std dev)	=SQRT(G20)
r (pearson)=cov/sₓsy	=G21/(G22*G23)
R² for linear is r²	=G24^2
adjusted R²	=1-(1-G25)*G10/G11
SSE = ∑ϵᵢ²	{=SUMSQ(y_observed-D7:D19)}
MSE = SSE/df	=G27/G11
SE(β₁) = std err slope	=SQRT(G28/G14)
t-value = β̅₁/SE(β₁)	=G17/G29
p-value	=T.DIST.2T(G30;G11)

We’re covering all our greatest statistical hits: slope, variance, correlation, R², t-tests, and the always-glamorous p-value.

This condenses a ton of regression math, into digestible spreadsheet logic. Less time formatting cells, more time eating snacks and doing science.

Using Statistic Formula

The Built-in Shortcut

Furthermore we can use statistic formula.

Want to double-check your math? Or just feel like living dangerously by trusting Excel’s statistical engine? Let’s name new ranges for a fresh worksheet:

B7:B19: x_sample.
C7:C19: y_sample.
D7:D19: y_fit.

Then we let built-in stats formulas take over.

Trend: Worksheet: Built-in Formula: Statistic

properties	formula
∑x	=SUM(x_sample)
∑y	=SUM(y_sample)
n	=COUNT(x_sample)
n-1	=G9-1
df = n-k-1	=G9-2
x̄ (mean)	=AVERAGE(x_sample)
ȳ (mean)	=AVERAGE(y_sample)
∑(xᵢ-x̄)²	{=SUMSQ(x_sample-$G$12)}
∑(yᵢ-ȳ)²	{=SUMSQ(y_sample-$G$13)}
∑(xᵢ-x̄)(yᵢ-ȳ)	=SUMPRODUCT((x_sample-$G$12),(y_sample-$G$13))
β̅₁ = m (slope)	=SLOPE(y_sample,x_sample)
β̅₀ = b (intercept)	=INTERCEPT(y_sample,x_sample)
sₓ² (variance)	=VAR.S(x_sample)
sy² (variance)	=VAR.S(y_sample)
covariance	=COVARIANCE.S(x_sample,y_sample)
sₓ (std dev)	=STDEV.S(x_sample)
sy (std dev)	=STDEV.S(y_sample)
r (pearson)	=PEARSON(x_sample,y_sample)
R² coeff	=RSQ(y_sample,x_sample)
adjusted R²	=1-(1-G25)*G10/G11
SSE = ∑ϵᵢ²	{=SUMSQ(y_sample-y_fit)}
MSE = SSE/df	=G27/G11
SE(β₁) = std err slope	=SQRT(G28/G14)
t-value = β̅₁/SE(β₁)	=G17/G29
p-value	=T.DIST.2T(G30;G11)

The parameter required for each statistic formula is clear now. These functions reduce error and save time. Perfect when we’ve had our third coffee, and still can’t remember what ∑(xᵢ - x̄)² stands for.

Pearson Correlation Coeffecient

Measuring Statistical Chemistry

Need to measure the linear love between x and y? Use Pearson’s r correlation coefficient:

=PEARSON(B7:B19;C7:C19)

The formula for pearson is:

$r_{x,y} \frac{\text{Cov}(x,y)}{s_x . s_y} \\$

Which we can write manually in Excel/Calc as:

=COVARIANCE.P(B7:B19, C7:C19) / (STDEV.P(B7:B19) * STDEV.P(C7:C19))

And if you’re feeling masochistic, the Pearson correlation coefficient can also be expressed as follows:

$r = \frac{\sum (x_i - \bar{x}) \cdot (y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \cdot \sum (y_i - \bar{y})^2}}$

Sometimes you want to check if your equation right. Again, the Excel/Calc formula can be rewritten as follows, and having the same result.

= (SUM((B7:B19 - AVERAGE(B7:B19)) * (C7:C19 - AVERAGE(C7:C19)))) 
/ (SQRT(SUM((B7:B19 - AVERAGE(B7:B19))^2) * SUM((C7:C19 - AVERAGE(C7:C19))^2)))

Now you know, how helpful the built-in formula is, compared to this complex formula above.

R Square

The MVP of Model Metrics

Want to know how much of the variation in y is explained by our model? Excel/Calc has a built-in formula for R Square (R²):

=RSQ(C7:C19;B7:B19)

The R Square (R²) for simple linear regression can be expressed as follows:

$\begin{align*} r_{x,y} &= \frac{\text{Cov}(x,y)}{s_x . s_y} \\ R^2 &= r_{x,y}^2 \end{align*}$

Here’s the long-form version. The general R Square (R²) is defined as:

$R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$

Or furthermore, we can go low-level.

$R^2 = 1 - \frac{\sum (y_i - (\hat{\beta}_1 x_i + \hat{\beta}_0))^2}{\sum (y_i - \bar{y})^2}$

Down to the nuts and bolts, the Excel/Calc formula can be rewritten as:

=1-SUM((C7:C19 - (B7:B19*G22+G24))^2)/SUM((C7:C19 - H9)^2)

The way you manage the calculation is depend on your situation. Most of the time we can use simple rsq formula, but sometimes we need to go low level, for use with programming or just intellectual curiosity.

Let’s see how RSQ behave in non-linear curve fitting discussed later on.

Compare and Contrast: Manual vs Built-in

Feel free to compare both methods, side-by-side. Between manual formula and built-in formula as shown below:

Trend: Worksheet: Correlation: Built-in Worksheet

Like a true stats nerd, always double-check our work. But also, enjoy the shortcuts modern tools give us. Regression shouldn’t feel like a root canal.

Now we are ready to make statistical model, in our worksheet to suit our specific needs.

How you make the model is up to you.

What Lies Ahead 🤔?

That method above? Elegant, yes. But let’s not kid ourselves. Naturally, there’s a simpler way.

If we want a spreadsheet model that doesn’t require a PhD. Or three cups of coffee and a blank stare, we’ll need a method more fit for everyday use. Something robust, readable, and less likely to induce formula fatigue.

The easier it is to calculate, the more likely we’ll actually use it. That’s why in the next part of our adventure, we shift gears from cells and formulas, to the comforting embrace of Python. There, we’ll build tools that are both reusable, and less fiddly than spreadsheet gymnastics.

Grab our virtual lab coat and join us in: [ Trend - Properties - Python Tools ].

Trend - Properties - Formula

Preface

Complete Worksheet

Worksheet Source

Known Values from Samples

Plot Result

Basic Statistic Properties

Least Square Calculation

Plot Result

Variance and Covariance

Plot Result

Correlation Calculation

Plot Result

Residual Calculation

Plot Result

Regressions Sum Square

Why does SST = SSR + RSS?

Back to R²

t-value and p-value

Alternative Worksheet

Built In Formula

Worksheet Source

Meet the Data

Named Range

Using Array Operation

Using Statistic Formula

Pearson Correlation Coeffecient

R Square

Compare and Contrast: Manual vs Built-in

What Lies Ahead 🤔?