Preface

Goal: Curve fitting examples for daily use, using the least squares method.

I’m not a statistician. That’s why I have to be careful when I talk about statistics. First, I need to cover the math behind it; then I will continue to spreadsheets, using Calc, which is compatible with Excel; then I will close with a Python script.

From this, we are going to continue to regression analysis and correlation.


1: Statistical Notation

Populations and Samples

Statistical Properties

Statistics differ between populations and samples. For least squares, the calculation is the same, even though the notation is different; for example, the notation for the mean.

For correlation and regression, the equations have different degrees of freedom. This means the equations are different. We will talk about this later.

2: The Math Behind

Which Least Squares?

There are a few approaches to solving least squares:

  1. Equation from direct derivation
  2. Equation using Mean
  3. Using Matrix

Spreadsheet: Workbook with Multiple Worksheets

All three have the same result, and with a spreadsheet we can calculate with any method easily. You can choose whichever equation is suitable for your case.

Worksheet Source

Playsheet Artefact

The Excel file is available, so you can have fun and modify it as you want.

Slope and Intercept

Consider a series of data points in pairs (x, y); we want to fit a linear equation to this data series.

With the same x, from the observed value (x, y), we can get the predicted value of y. Now we have two kinds of y: the observed value as the calculation source, and the predicted value as the result of the curve-fitting equation.

The fit equation for the predicted y value (ŷᵢ):
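$$\hat{y}_i = m x_i + b$$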

Residual

The formula for the slope (m) and intercept (b) of the least squares regression line can be derived from the principle of minimizing the sum of squared errors (residuals) between the observed values of y and the values predicted by the regression line.

The residual for each observation is the difference between the observed value of y and the predicted value of y:
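$$e_i = y_i - \hat{y}_i = y_i - (m x_i + b)$$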

The sum of squared residuals (SSE) is defined as:
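$$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( y_i - (m x_i + b) \right)^2$$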

Our goal is to minimize SSE by choosing appropriate values of m and b. To find these values, we take the derivatives of SSE with respect to m and b and set them both equal to zero:
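$$\frac{\partial SSE}{\partial m} = -2 \sum_{i=1}^{n} x_i \left( y_i - (m x_i + b) \right) = 0$$

$$\frac{\partial SSE}{\partial b} = -2 \sum_{i=1}^{n} \left( y_i - (m x_i + b) \right) = 0$$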

Direct Equation

Solving these equations simultaneously will give us the values of m and b that minimize SSE, which are the least squares estimates of the slope and intercept.

After some algebraic manipulation, dark magic, and friendly spirits, the formulas for m and b can be derived as:
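$$m = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left( \sum x_i \right)^2}$$

$$b = \frac{\sum y_i - m \sum x_i}{n}$$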

Equation with Mean

We can rewrite the formula further, in terms of sample means:
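$$m = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$

$$b = \bar{y} - m \bar{x}$$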

Matrix Form of the System

The least squares solution for linear regression can be derived using matrix notation.

In matrix notation, the equations are represented as:
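$$\begin{bmatrix} \sum x_i^2 & \sum x_i \\ \sum x_i & n \end{bmatrix} \begin{bmatrix} m \\ b \end{bmatrix} = \begin{bmatrix} \sum x_i y_i \\ \sum y_i \end{bmatrix}$$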

To solve for m and b, you can use matrix inversion. The solution is given by:
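$$\begin{bmatrix} m \\ b \end{bmatrix} = \begin{bmatrix} \sum x_i^2 & \sum x_i \\ \sum x_i & n \end{bmatrix}^{-1} \begin{bmatrix} \sum x_i y_i \\ \sum y_i \end{bmatrix}$$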

These three are the same least squares equations, and they produce the same result. There are other approaches as well in the realm of least squares, but let’s skip them this time.


3: Direct Solution

Let’s apply our equation to find the solution:

Known Values

Suppose you have a series of points:

(x, y) = [(0, 5), (1, 12), (2, 25), (3, 44), (4, 69), (5, 100), (6, 137), (7, 180), (8, 229), (9, 284), (10, 345), (11, 412), (12, 485)]

We can calculate the statistical properties:
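$$n = 13, \qquad \sum x_i = 78, \qquad \sum y_i = 2327$$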

Note that we can also write each count and sum as:
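$$n = \operatorname{count}(x), \qquad \sum x = \sum_{i=1}^{n} x_i, \qquad \sum y = \sum_{i=1}^{n} y_i$$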

In a spreadsheet you can use COUNT and SUM as shown below:

Least Square: Worksheet: Sample Series

Tabular Calculation

Let’s have a look again at our direct equation:
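$$m = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left( \sum x_i \right)^2}, \qquad b = \frac{\sum y_i - m \sum x_i}{n}$$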

We can calculate further:
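$$\sum x_i y_i = 21242, \qquad \sum x_i^2 = 650$$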

Now we can calculate the m (slope):
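$$m = \frac{13 \cdot 21242 - 78 \cdot 2327}{13 \cdot 650 - 78^2} = \frac{94640}{2366} = 40$$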

and also the b (intercept):
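$$b = \frac{2327 - 40 \cdot 78}{13} = \frac{-793}{13} = -61$$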

The equation result is:
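$$\hat{y} = 40x - 61$$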

With a spreadsheet you can utilize tabular calculation and get the same result:

Least Square: Worksheet: Direct Equation

Now you can see how easy it is to write down the equation in tabular spreadsheet form.


4: Solution Using Mean

This is easy when you want to combine it with the tabular calculation of regression and correlation.

Tabular Calculation

To calculate the least squares fit (y = mx + b) for a data series, we need to:

  1. Calculate the mean of x and y
  2. Calculate the slope (m)
  3. Calculate the y-intercept (b)

The final equation for the least squares regression line is:
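$$\hat{y} = m x + b, \qquad m = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \qquad b = \bar{y} - m \bar{x}$$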

Now let’s calculate the statistical properties for the given data points above:
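$$\bar{x} = \frac{78}{13} = 6, \qquad \bar{y} = \frac{2327}{13} = 179$$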

And also:
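$$\sum (x_i - \bar{x})^2 = 182, \qquad \sum (x_i - \bar{x})(y_i - \bar{y}) = 7280$$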

Now we can calculate the m (slope):
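$$m = \frac{7280}{182} = 40$$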

and the b (intercept):
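$$b = 179 - 40 \cdot 6 = -61$$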

With a spreadsheet you can utilize tabular calculation and get the same result:

Least Square: Worksheet: Using Means


5: Solution Using Matrix

This is suitable as a basis when you need to extend to non-linear curve fitting.

Tabular Calculation

Let’s have a look again at our matrix equation.
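$$\begin{bmatrix} 650 & 78 \\ 78 & 13 \end{bmatrix} \begin{bmatrix} m \\ b \end{bmatrix} = \begin{bmatrix} 21242 \\ 2327 \end{bmatrix}$$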

Now we can calculate the m (slope) and b (intercept) using matrix inversion:
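$$\begin{bmatrix} m \\ b \end{bmatrix} = \begin{bmatrix} 650 & 78 \\ 78 & 13 \end{bmatrix}^{-1} \begin{bmatrix} 21242 \\ 2327 \end{bmatrix} = \frac{1}{2366} \begin{bmatrix} 13 & -78 \\ -78 & 650 \end{bmatrix} \begin{bmatrix} 21242 \\ 2327 \end{bmatrix} = \begin{bmatrix} 40 \\ -61 \end{bmatrix}$$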

With a spreadsheet you can utilize matrix multiplication and inversion, and get the same result:

Least Square: Worksheet: Using Matrix
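As a minimal sketch of how this looks in Calc or Excel, assuming the 2×2 normal-equation matrix sits in E2:F3 and the right-hand-side vector in H2:H3 (these cell ranges are just an example, adjust them to your own worksheet layout):

{=MMULT(MINVERSE(E2:F3);H2:H3)}

Entered as an array formula with Ctrl+Shift+Enter, this returns the column vector (m, b) directly.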


6: Built-in Formula

With a spreadsheet such as Excel or Calc, this can be easier.

Least Square: Worksheet: Built-in Formula

Let’s get the formulas for the sums of both x and y:

Properties    x Formula             y Formula
Total         =SUM(B7:B19)          =SUM(C7:C19)
Count         =COUNT(B7:B19)        =COUNT(C7:C19)
Mean          =AVERAGE(B7:B19)      =AVERAGE(C7:C19)

You can also get the mean value without using the AVERAGE formula. This might look silly, but it is required to build the model before coding it in a scripting language.

Properties    Formula
Mean x        =SUM(B7:B19)/COUNT(B7:B19)
Mean y        =SUM(C7:C19)/COUNT(C7:C19)

With the data above we can get more properties. Don’t forget to press Ctrl+Shift+Enter for array operations.

Properties      Formula
slope (m)       {=SLOPE(C7:C19;B7:B19)}
intercept (b)   {=INTERCEPT(C7:C19;B7:B19)}

You can also calculate manually.

Properties      Formula
slope (m)       =SUM((B7:B19 - G10)*(C7:C19 - H10))/SUM((B7:B19 - G10)^2)
intercept (b)   =H9-G14*G9

This is all for now. We are going to dive into more complex formulas, with regression and correlation.


7: Python

Python offers different ways to solve curve fitting:

  1. numpy.polyfit
  2. scipy.stats.linregress
  3. statsmodels.api.OLS

Let’s say we have this data series:

import numpy as np

# Given data
x_values = np.array(
  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
y_values = np.array(
  [5, 12, 25, 44, 69, 100, 137,
   180, 229, 284, 345, 412, 485])

Python: Built-In: Curve Fitting: Data Series

Polyfit

Using the numpy library.

import numpy as np

# Perform least squares regression
# for a linear model (polynomial degree 1)
order = 1
mC    = np.polyfit(x_values, y_values, deg=order)
m     = mC[0]
b     = mC[1]

print('Using polyfit')
print(f'Coefficients (b = {b:2,.2f}, m = {m:2,.2f})\n')

Python: Built-In: Curve Fitting: Polyfit

Linregress

Using the scipy library.

from scipy import stats

# SciPy: scientific computing built on top of NumPy.
m, b, _, _, _ = stats.linregress(x_values, y_values)

print('Using linregress')
print(f'Coefficients (b = {b:2,.2f}, m = {m:2,.2f})\n')

Python: Built-In: Least Square: Linear Regression

OLS

Using the statsmodels library.

import statsmodels.api as sm

# Statsmodels: statistical modeling and hypothesis testing.
x = sm.add_constant(x_values)
model = sm.OLS(y_values, x).fit()
b = model.params[0]
m = model.params[1]

print('Using Statsmodels')
print(f'Coefficients (b = {b:2,.2f}, m = {m:2,.2f})\n')

Python: Built-In: Least Square: Stats Models: OLS

All three methods give the same output, as shown below:

Using polyfit
Coefficients (b = -61.00, m = 40.00)

Using linregress
Coefficients (b = -61.00, m = 40.00)

Using Statsmodels
Coefficients (b = -61.00, m = 40.00)

Python: Built-In: Least Square: Output Result

There is no need for a fancy chart plot this time.

Interactive JupyterLab

You can obtain the interactive JupyterLab notebook at the following link:

Manual Calculation

Of course, you can calculate manually to learn how it works.

First, we calculate the basic statistical properties.

# Number of data points
n = len(x_values)

# Calculate sums
sum_x = np.sum(x_values)
sum_y = np.sum(y_values)

# Calculate means
mean_x = np.mean(x_values)
mean_y = np.mean(y_values)

# Output of basic properties
print(f'n = {n:5d}')
print(f'∑x (total) = {sum_x:7.2f}')
print(f'∑y (total) = {sum_y:7.2f}')
print(f'x̄ (mean)   = {mean_x:7.2f}')
print(f'ȳ (mean)   = {mean_y:7.2f}')
print()

Python: Manual Calculation: Least Square: Mean Properties

Then go further:

# Calculate deviations
deviation_x = x_values - mean_x
deviation_y = y_values - mean_y

# Calculate squared deviations
sq_deviation_x = np.sum(deviation_x ** 2)

# Calculate cross-deviation
cross_deviation_xy = np.sum(deviation_x * deviation_y)

# Calculate slope (m) and intercept (b)
slope_m = cross_deviation_xy / sq_deviation_x
intercept_b = mean_y - slope_m * mean_x

print('Manual Calculation')
print(f'Coefficients (b = {intercept_b:2,.2f},'
    + f' m = {slope_m:2,.2f})\n')

Python: Manual Calculation: Least Square: Deviation Properties

With the result as below:

n =    13
∑x (total) =   78.00
∑y (total) = 2327.00
x̄ (mean)   =    6.00
ȳ (mean)   =  179.00

Manual Calculation
Coefficients (b = -61.00, m = 40.00)

Python: Manual Calculation: Least Square: Result Output

Interactive JupyterLab

You can obtain the interactive JupyterLab notebook at the following link:


What’s Our Next Move 🤔?

We can describe a mathematical equation in a practical way, rewrite it in a spreadsheet in tabular form, and implement it in a Python script. It is fun, right? What do you think?

From basic statistical properties, we can continue to more advanced statistical properties, for use with regression and correlation.

Consider continuing to read [ Trend - Properties - Cheatsheet ].