Preface
Goal: Everyday curve fitting examples using the least squares method.
I’m not a statistician. That’s why I have to be careful when talking about statistics. First I need to cover the math behind the method, then I will continue to spreadsheets, using Calc (which is compatible with Excel), and I will close with Python scripts.
From there we will continue to regression analysis and correlation.
Statistical Notation
Population and Samples
Statistical Properties
Statistical properties differ between populations and samples.
For least squares, the calculation is the same even though the notation is different. For example, the mean is written as $\mu$ for a population and as $\bar{x}$ for a sample.
For correlation and regression, the equations use different degrees of freedom, which means the equations themselves differ. We will talk about this later.
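As a quick illustration (my own note, using standard notation): the population variance divides by the population size $N$, while the sample variance divides by $n - 1$, the degrees of freedom:

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2, \qquad s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$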
1: The Math Behind
Which Least Squares
There are a few approaches to solving least squares:
- Equation from direct derivation
- Equation using Mean
- Using Matrix
All three give the same result, and with a spreadsheet any of them is easy to calculate. You can choose whichever equation suits your case.
Worksheet Source
Playsheet Artefact
The Excel file is available, so you can have fun and modify it as you want.
Slope and Intercept
Consider a series of data points in pairs $(x_i, y_i)$; we want to fit a linear equation to this data series.
For the same $x$, besides the observed value $y$, we also get a predicted value of $y$. So now we have two kinds of $y$: the observed value, which is the input of the calculation, and the predicted value, which is the result of the curve fitting equation.
The fit equation for the predicted y value ($\hat{y}_i$):

$$\hat{y}_i = m x_i + b$$
Residual
The formulas for the slope ($m$) and intercept ($b$) of the least squares regression line can be derived from the principle of minimizing the sum of squared errors (residuals) between the observed values of $y$ and the values predicted by the regression line.

The residual for each observation is the difference between the observed value of $y$ and the predicted value of $y$:

$$e_i = y_i - \hat{y}_i = y_i - (m x_i + b)$$

The sum of squared residuals (SSE) is defined as:

$$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( y_i - (m x_i + b) \right)^2$$

Our goal is to minimize SSE by choosing appropriate values of $m$ and $b$. To find those values, we take the derivatives of SSE with respect to $m$ and $b$ and set them both equal to zero:

$$\frac{\partial SSE}{\partial m} = -2 \sum_{i=1}^{n} x_i \left( y_i - (m x_i + b) \right) = 0$$

$$\frac{\partial SSE}{\partial b} = -2 \sum_{i=1}^{n} \left( y_i - (m x_i + b) \right) = 0$$
Direct Equation
Solving these equations simultaneously gives the values of $m$ and $b$ that minimize SSE, which are the least squares estimates of the slope and intercept.
After some algebraic manipulation, dark magic, and a friendly spirit, the formulas for $m$ and $b$ come out as:

$$m = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left( \sum x_i \right)^2}$$

$$b = \frac{\sum y_i - m \sum x_i}{n}$$
Equation with Mean
We can rewrite the formula further in terms of sample means:

$$m = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$

$$b = \bar{y} - m \bar{x}$$
Matrix Form
The least squares solution for linear regression can also be derived using matrix notation.
In matrix notation, the normal equations are represented as:

$$\begin{bmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{bmatrix} \begin{bmatrix} b \\ m \end{bmatrix} = \begin{bmatrix} \sum y_i \\ \sum x_i y_i \end{bmatrix}$$

To solve for $m$ and $b$, you can use matrix inversion. The solution is given by:

$$\begin{bmatrix} b \\ m \end{bmatrix} = \begin{bmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{bmatrix}^{-1} \begin{bmatrix} \sum y_i \\ \sum x_i y_i \end{bmatrix}$$
These three are the same least squares equations, and they produce the same result. There are other approaches as well in the realm of least squares, but let’s skip them this time.
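Before we work through each form by hand, here is a minimal Python sketch (my own addition, not part of the original worksheet) that computes the slope and intercept with all three forms on the data series used below, and checks that they agree:

```python
import numpy as np

# The data series used throughout this article.
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], dtype=float)
y = np.array([5, 12, 25, 44, 69, 100, 137,
              180, 229, 284, 345, 412, 485], dtype=float)
n = len(x)

# 1. Direct equation: sums only.
m1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) \
   / (n * np.sum(x ** 2) - np.sum(x) ** 2)
b1 = (np.sum(y) - m1 * np.sum(x)) / n

# 2. Equation with means: deviations from the mean.
m2 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b2 = y.mean() - m2 * x.mean()

# 3. Matrix form: solve the normal equations A · [b, m] = v.
A = np.array([[n, np.sum(x)], [np.sum(x), np.sum(x ** 2)]])
v = np.array([np.sum(y), np.sum(x * y)])
b3, m3 = np.linalg.solve(A, v)

# All three agree: m = 40, b = -61.
assert np.allclose([m1, b1], [m2, b2])
assert np.allclose([m1, b1], [m3, b3])
print(f'm = {m1:.2f}, b = {b1:.2f}')
```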
2: Direct Solution
Let’s apply our equation to find the solution:
Known Values
Suppose you have a series of points:
(x, y) = [(0, 5), (1, 12), (2, 25), (3, 44), (4, 69), (5, 100), (6, 137), (7, 180), (8, 229), (9, 284), (10, 345), (11, 412), (12, 485)]
We can calculate the statistical properties:

$$n = 13, \qquad \sum x_i = 78, \qquad \sum y_i = 2327$$

$$\sum x_i y_i = 21242, \qquad \sum x_i^2 = 650$$

Note that we can also write each count and sum explicitly, for example:

$$n = \operatorname{count}(x_i), \qquad \sum x_i = x_1 + x_2 + \dots + x_n$$

In a spreadsheet you can get the count and the sums with the COUNT and SUM formulas, as shown in the tabular calculation below:
Tabular Calculation
Let’s have a look again at our direct equation:

$$m = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left( \sum x_i \right)^2}$$

We can calculate further:

$$n \sum x_i y_i - \sum x_i \sum y_i = 13 \cdot 21242 - 78 \cdot 2327 = 94640$$

$$n \sum x_i^2 - \left( \sum x_i \right)^2 = 13 \cdot 650 - 78^2 = 2366$$

Now we can calculate the m (slope):

$$m = \frac{94640}{2366} = 40$$

and also the b (intercept):

$$b = \frac{\sum y_i - m \sum x_i}{n} = \frac{2327 - 40 \cdot 78}{13} = -61$$

The equation result is:

$$y = 40x - 61$$
With a spreadsheet you can utilize tabular calculation and get the same result.
Now you can see how easy it is to write down the equation in tabular spreadsheet form.
3: Solution Using Mean
This form is handy when you want to combine it with the tabular calculation of regression and correlation.
Tabular Calculation
To calculate the least squares line ($y = mx + b$) for a data series, we need to:
- Calculate the mean of x and y
- Calculate the slope (m)
- Calculate the y-intercept (b)
The final equation for the least squares regression line is:

$$y = mx + b$$

Now let’s calculate the statistical properties for the data points given above:

$$\bar{x} = \frac{78}{13} = 6, \qquad \bar{y} = \frac{2327}{13} = 179$$

And also:

$$\sum (x_i - \bar{x})(y_i - \bar{y}) = 7280, \qquad \sum (x_i - \bar{x})^2 = 182$$

Now we can calculate the m (slope):

$$m = \frac{7280}{182} = 40$$

and the b (intercept):

$$b = \bar{y} - m \bar{x} = 179 - 40 \cdot 6 = -61$$
With a spreadsheet you can utilize tabular calculation and get the same result.
4: Solution Using Matrix
This approach is a suitable basis when you need to extend to non-linear curve fitting.
Tabular Calculation
Let’s have a look again at our matrix equation:

$$\begin{bmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{bmatrix} \begin{bmatrix} b \\ m \end{bmatrix} = \begin{bmatrix} \sum y_i \\ \sum x_i y_i \end{bmatrix}$$

Now we can calculate the m (slope) and b (intercept) using matrix inversion:

$$\begin{bmatrix} b \\ m \end{bmatrix} = \begin{bmatrix} 13 & 78 \\ 78 & 650 \end{bmatrix}^{-1} \begin{bmatrix} 2327 \\ 21242 \end{bmatrix} = \begin{bmatrix} -61 \\ 40 \end{bmatrix}$$
With a spreadsheet you can utilize matrix multiplication and inversion and get the same result.
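For example, assuming the 2×2 matrix of sums sits in E7:F8 and the right-hand side vector in H7:H8 (hypothetical cell ranges, adjust to your own sheet), the array formula would look like:

Properties | Formula |
---|---|
b, m | {=MMULT(MINVERSE(E7:F8);H7:H8)} |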
5: Built-in Formula
With a spreadsheet such as Excel or Calc, this can be made even easier.
Let’s get the formulas for the total, count, and mean of both x and y:
Properties | x Formula | y Formula |
---|---|---|
Total | =SUM(B7:B19) | =SUM(C7:C19) |
Count | =COUNT(B7:B19) | =COUNT(C7:C19) |
Mean | =AVERAGE(B7:B19) | =AVERAGE(C7:C19) |
You can also get the mean value without using the AVERAGE formula. This might look silly, but it is useful for building the model step by step before coding it in a scripting language.
Properties | Formula |
---|---|
Mean x | =SUM(B7:B19)/COUNT(B7:B19) |
Mean y | =SUM(C7:C19)/COUNT(C7:C19) |
With the data above we can get more properties. And don’t forget to press Ctrl+Shift+Enter for array operations.
Properties | Formula |
---|---|
slope (m) | {=SLOPE(C7:C19;B7:B19)} |
intercept (b) | {=INTERCEPT(C7:C19;B7:B19)} |
You can also calculate these manually:
Properties | Formula |
---|---|
slope (m) | =SUM((B7:B19 - G10)*(C7:C19 - H10))/SUM((B7:B19 - G10)^2) |
intercept (b) | =H9-G14*G9 |
That is all for now. We are going to dive into more complex formulas with regression and correlation.
6: Python
Python offers different ways to solve curve fitting:
- numpy.polyfit
- scipy.stats.linregress
- statsmodels.api.OLS
Let’s say we have this data series:
import numpy as np

# Given data
x_values = np.array(
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
y_values = np.array(
    [5, 12, 25, 44, 69, 100, 137,
     180, 229, 284, 345, 412, 485])
Polyfit
Using the numpy library.
import numpy as np
# Perform least squares regression
# for a linear model (polynomial degree 1)
order = 1
mC = np.polyfit(x_values, y_values, deg=order)
m = mC[0]
b = mC[1]
print('Using polyfit')
print(f'Coefficients (b = {b:2,.2f}, m = {m:2,.2f})\n')
Linregress
Using the scipy library.
from scipy import stats
# SciPy: scientific computing built on top of NumPy.
m, b, _, _, _ = stats.linregress(x_values, y_values)
print('Using linregress')
print(f'Coefficients (b = {b:2,.2f}, m = {m:2,.2f})\n')
OLS
Using the statsmodels library.
import statsmodels.api as sm
# Statsmodels: statistical modeling and hypothesis testing.
x = sm.add_constant(x_values)
model = sm.OLS(y_values, x).fit()
b = model.params[0]
m = model.params[1]
print('Using Statsmodels')
print(f'Coefficients (b = {b:2,.2f}, m = {m:2,.2f})\n')
All three produce the same output, as below:
Using polyfit
Coefficients (b = -61.00, m = 40.00)
Using linregress
Coefficients (b = -61.00, m = 40.00)
Using Statsmodels
Coefficients (b = -61.00, m = 40.00)
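As a side note (my own addition), the fitted statsmodels model can also print a full regression report, which will come in handy once we reach regression and correlation:

```python
# Full regression report: coefficients, standard errors, R², and more.
print(model.summary())
```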
There is no need for a fancy chart plot this time.
Interactive JupyterLab
You can obtain the interactive JupyterLab notebook at the following link:
Manual Calculation
Of course you can calculate manually to learn how it works:
First we calculate the basic statistical properties.
# Number of data points
n = len(x_values)
# Calculate sums
sum_x = np.sum(x_values)
sum_y = np.sum(y_values)
# Calculate means
mean_x = np.mean(x_values)
mean_y = np.mean(y_values)
# Output of basic properties
print(f'n = {n:5d}')
print(f'∑x (total) = {sum_x:7.2f}')
print(f'∑y (total) = {sum_y:7.2f}')
print(f'x̄ (mean) = {mean_x:7.2f}')
print(f'ȳ (mean) = {mean_y:7.2f}')
print()
Then we go further:
# Calculate deviations
deviation_x = x_values - mean_x
deviation_y = y_values - mean_y
# Calculate squared deviations
sq_deviation_x = np.sum(deviation_x ** 2)
# Calculate cross-deviation
cross_deviation_xy = np.sum(deviation_x * deviation_y)
# Calculate slope (m) and intercept (b)
slope_m = cross_deviation_xy / sq_deviation_x
intercept_b = mean_y - slope_m * mean_x
print('Manual Calculation')
print(f'Coefficients (b = {intercept_b:2,.2f},'
+ f' m = {slope_m:2,.2f})\n')
With the result as below:
n = 13
∑x (total) = 78.00
∑y (total) = 2327.00
x̄ (mean) = 6.00
ȳ (mean) = 179.00
Manual Calculation
Coefficients (b = -61.00, m = 40.00)
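As a final check (my own addition, reusing x_values, y_values, slope_m, and intercept_b from the code above), we can compute the residuals and the SSE that the least squares method minimizes:

```python
# Predicted values from the fitted line: ŷ = m·x + b
y_pred = slope_m * x_values + intercept_b

# Residuals: observed minus predicted
residuals = y_values - y_pred

# Sum of squared errors: the quantity least squares minimizes
sse = np.sum(residuals ** 2)
print(f'SSE = {sse:.2f}')  # SSE = 18018.00
```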
Interactive JupyterLab
You can obtain the interactive JupyterLab notebook at the following link:
What’s Our Next Move 🤔?
We can describe a mathematical equation in a practical way, rewrite it in a spreadsheet in tabular form, and implement it in a Python script. It is fun, right? What do you think?
From basic statistical properties, we can continue to more statistical properties for use with regression and correlation.
Consider continuing to read [ Trend - Properties - Cheatsheet ].