### Preface

Goal: Everyday curve fitting examples, using the least squares method.

I’m not a statistician, so I have to be careful when I talk about statistics. First I need to cover the math behind the method; then I will continue to a spreadsheet, using Calc, which is compatible with Excel; and I will close with a Python script.

From there we will continue to regression analysis and correlation.

### 1: Statistical Notation

#### Population and Samples

Statistical Properties

Statistics differ between populations and samples. For least squares the calculation is the same, even though the notation is different. Take the `Mean` notation, for example.
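In standard notation, the population mean is written with μ and N, while the sample mean uses x̄ and n:

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
```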

For correlation and regression, the equations have different degrees of freedom, which means the equations themselves differ. We will talk about this later.

### 1: The Math Behind

Which Least Square

There are a few approaches to solving least squares:

- Equation from direct derivation
- Equation using Mean
- Using Matrix

All three give the same result, and with a spreadsheet we can calculate with any method easily. You can choose whichever equation suits your case.

#### Worksheet Source

Playsheet Artefact

The Excel file is available, so you can have fun and modify it as you want.

#### Slope and Intercept

Consider a series of data points in pairs (x, y); we want to fit a linear equation to this data series.

With the same x, alongside the observed value (x, y), we can get a predicted value of y. Now we have two kinds of y: the observed value, which serves as the calculation source, and the predicted value, which is the result of the curve fitting equation.

The fit equation for the predicted y value (ŷᵢ):
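Assuming a line with slope m and intercept b:

```latex
\hat{y}_i = m x_i + b
```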

#### Residual

The formula for the slope (m) and intercept (b) of the least squares regression line can be derived from the principle of minimizing the sum of squared errors (residuals) between the observed values of y and the values predicted by the regression line.

The residual for each observation is the difference between the observed value of y and the predicted value of y:
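In symbols, the residual for point i is:

```latex
e_i = y_i - \hat{y}_i = y_i - (m x_i + b)
```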

The sum of squared residuals (SSE) is defined as:
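In the notation above:

```latex
SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( y_i - (m x_i + b) \right)^2
```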

Our goal is to minimize SSE by choosing appropriate values of m and b. To find them, we take the derivatives of SSE with respect to m and b and set both equal to zero:
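Setting both partial derivatives to zero gives the so-called normal equations:

```latex
\frac{\partial SSE}{\partial m} = -2 \sum_{i=1}^{n} x_i \left( y_i - (m x_i + b) \right) = 0
\qquad
\frac{\partial SSE}{\partial b} = -2 \sum_{i=1}^{n} \left( y_i - (m x_i + b) \right) = 0
```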

#### Direct Equation

Solving these equations simultaneously will give us the values of m and b that minimize SSE, which are the least squares estimates of the slope and intercept.

After some algebraic manipulation, dark magic and friendly spirit, the formulas for m and b can be derived as:
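In summation form:

```latex
m = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left( \sum x_i \right)^2}
\qquad
b = \frac{\sum y_i - m \sum x_i}{n}
```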

#### Equation with Mean

We can rewrite the formula further, in terms of sample means:
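With x̄ and ȳ denoting the sample means:

```latex
m = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}
\qquad
b = \bar{y} - m \bar{x}
```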

#### Matrix Form of the System

The least squares solution for linear regression can be derived using matrix notation.

In matrix notation, the equations are represented as:
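Using the same sums as in the direct equation:

```latex
\begin{bmatrix} \sum x_i^2 & \sum x_i \\ \sum x_i & n \end{bmatrix}
\begin{bmatrix} m \\ b \end{bmatrix}
=
\begin{bmatrix} \sum x_i y_i \\ \sum y_i \end{bmatrix}
```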

To solve for m and b, you can use matrix inversion. The solution is given by:
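Provided the coefficient matrix is invertible:

```latex
\begin{bmatrix} m \\ b \end{bmatrix}
=
\begin{bmatrix} \sum x_i^2 & \sum x_i \\ \sum x_i & n \end{bmatrix}^{-1}
\begin{bmatrix} \sum x_i y_i \\ \sum y_i \end{bmatrix}
```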

Those three are the same least squares equations, and they produce the same result. There are other approaches as well in the realm of least squares, but let’s skip them this time.

### 2: Direct Solution

Let’s apply our equation to find the solution:

#### Known Values

Suppose you have a series of points:

(x, y) = [(0, 5), (1, 12), (2, 25), (3, 44), (4, 69), (5, 100), (6, 137), (7, 180), (8, 229), (9, 284), (10, 345), (11, 412), (12, 485)]

We can calculate the statistical properties:
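Counting and summing the thirteen pairs gives:

```latex
n = 13, \qquad \sum x_i = 78, \qquad \sum y_i = 2327
```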

Note that we can also write each `count` as n, and each `sum` using the summation sign ∑.

In a spreadsheet you can use the `COUNT` and `SUM` functions for these.

#### Tabular Calculation

Let’s have a look again at our direct equation
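As derived earlier:

```latex
m = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left( \sum x_i \right)^2}
\qquad
b = \frac{\sum y_i - m \sum x_i}{n}
```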

We can calculate further:
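For the data above, the remaining sums are:

```latex
\sum x_i^2 = 650, \qquad \sum x_i y_i = 21242
```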

Now we can calculate the m (slope):
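Substituting the sums:

```latex
m = \frac{13 \cdot 21242 - 78 \cdot 2327}{13 \cdot 650 - 78^2}
  = \frac{94640}{2366}
  = 40
```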

and also the b (intercept):
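Substituting again:

```latex
b = \frac{2327 - 40 \cdot 78}{13} = \frac{-793}{13} = -61
```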

The equation result is:
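Putting the slope and intercept together:

```latex
y = 40x - 61
```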

With a spreadsheet you can utilize tabular calculation and get the same result.

Now you can see how easy it is to write down the equation in tabular spreadsheet form.

### 3: Solution Using Mean

This method is convenient when you want to combine it with the tabular calculation of regression and correlation.

#### Tabular Calculation

To calculate the least squares line (y = mx + b) for a data series, we need to:

- Calculate the mean of x and y

- Calculate the slope (m)

- Calculate the y-intercept (b)

The final equation for the least squares regression line is:
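In formula form, with x̄ and ȳ the sample means from the first step:

```latex
m = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2},
\qquad
b = \bar{y} - m \bar{x},
\qquad
\hat{y} = m x + b
```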

Now let’s calculate the statistical properties for the given data points above:
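Dividing the totals by the count n = 13:

```latex
\bar{x} = \frac{78}{13} = 6, \qquad \bar{y} = \frac{2327}{13} = 179
```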

And also:
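Summing the squared deviations and the cross deviations over the data:

```latex
\sum (x_i - \bar{x})^2 = 182, \qquad \sum (x_i - \bar{x})(y_i - \bar{y}) = 7280
```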

Now we can calculate the m (slope):
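That is:

```latex
m = \frac{7280}{182} = 40
```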

and the b (intercept):
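And:

```latex
b = 179 - 40 \cdot 6 = -61
```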

With a spreadsheet you can utilize tabular calculation and get the same result.

### 4: Solution Using Matrix

This is suitable as a basis when you need to extend to non-linear curve fitting.

#### Tabular Calculation

Let’s have a look again at our matrix equation.
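Filling in the sums computed earlier (∑x² = 650, ∑x = 78, n = 13, ∑xy = 21242, ∑y = 2327):

```latex
\begin{bmatrix} 650 & 78 \\ 78 & 13 \end{bmatrix}
\begin{bmatrix} m \\ b \end{bmatrix}
=
\begin{bmatrix} 21242 \\ 2327 \end{bmatrix}
```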

Now we can calculate the m (slope) and b (intercept), using matrix inversion.
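Using the 2×2 inverse formula:

```latex
\begin{bmatrix} m \\ b \end{bmatrix}
= \frac{1}{650 \cdot 13 - 78^2}
\begin{bmatrix} 13 & -78 \\ -78 & 650 \end{bmatrix}
\begin{bmatrix} 21242 \\ 2327 \end{bmatrix}
= \begin{bmatrix} 40 \\ -61 \end{bmatrix}
```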

With a spreadsheet you can utilize matrix multiplication and inversion and get the same result.
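As a cross-check, here is a short Python sketch of the same matrix approach, using `numpy` and the data series above; the variable names are my own choice:

```python
import numpy as np

# Same data series as in the worked example above.
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
y = np.array([5, 12, 25, 44, 69, 100, 137,
              180, 229, 284, 345, 412, 485])

# Build the normal equations A @ [m, b] = v.
A = np.array([[np.sum(x ** 2), np.sum(x)],
              [np.sum(x),      len(x)]])
v = np.array([np.sum(x * y), np.sum(y)])

# Solve the 2x2 system; np.linalg.solve is numerically
# safer than explicitly inverting A.
m, b = np.linalg.solve(A, v)
print(f'm = {m:.2f}, b = {b:.2f}')
```

This should recover the same slope and intercept as the hand calculation, m = 40 and b = −61.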

### 5: Built-in Formula

With a spreadsheet such as Excel or Calc, this gets easier.

Let’s get the formulas for the `sum` of both `x` and `y`:

Properties | x Formula | y Formula |
---|---|---|
Total | =SUM(B7:B19) | =SUM(C7:C19) |
Count | =COUNT(B7:B19) | =COUNT(C7:C19) |
Mean | =AVERAGE(B7:B19) | =AVERAGE(C7:C19) |

You can also get the mean value without using the AVERAGE formula. This might look silly, but it is required to build the model before coding it in a scripting language.

Properties | Formula |
---|---|
Mean x | =SUM(B7:B19)/COUNT(B7:B19) |
Mean y | =SUM(C7:C19)/COUNT(C7:C19) |

With the data above we can get more properties. And don’t forget to press `ctrl+shift+enter` for array operations.

Properties | Formula |
---|---|
slope (m) | {=SLOPE(C7:C19;B7:B19)} |
intercept (b) | {=INTERCEPT(C7:C19;B7:B19)} |

You can also calculate manually.

Properties | Formula |
---|---|
slope (m) | =SUM((B7:B19 - G10)*(C7:C19 - H10))/SUM((B7:B19 - G10)^2) |
intercept (b) | =H9-G14*G9 |

That is all for now. Next we will dive into more complex formulas, with regression and correlation.

### 6: Python

Python offers different ways to solve curve fitting:

- numpy.polyfit
- scipy.stats.linregress
- statsmodels.api.OLS

Let’s say we have this data series:

```
import numpy as np

# Given data
x_values = np.array(
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
y_values = np.array(
    [5, 12, 25, 44, 69, 100, 137,
     180, 229, 284, 345, 412, 485])
```

#### Polyfit

Using the `numpy` library.

```
import numpy as np
# Perform least squares regression
# for a linear model (polynomial degree 1)
order = 1
mC = np.polyfit(x_values, y_values, deg=order)
m = mC[0]
b = mC[1]
print('Using polyfit')
print(f'Coefficients (b = {b:2,.2f}, m = {m:2,.2f})\n')
```

#### Linregress

Using the `scipy` library.

```
from scipy import stats
# SciPy: scientific computing built on top of NumPy.
m, b, _, _, _ = stats.linregress(x_values, y_values)
print('Using linregress')
print(f'Coefficients (b = {b:2,.2f}, m = {m:2,.2f})\n')
```

#### OLS

Using the `statsmodels` library.

```
import statsmodels.api as sm
# Statsmodels: statistical modeling and hypothesis testing.
x = sm.add_constant(x_values)
model = sm.OLS(y_values, x).fit()
b = model.params[0]
m = model.params[1]
print('Using Statsmodels')
print(f'Coefficients (b = {b:2,.2f}, m = {m:2,.2f})\n')
```

All three produce the same output, as below:

```
Using polyfit
Coefficients (b = -61.00, m = 40.00)
Using linregress
Coefficients (b = -61.00, m = 40.00)
Using Statsmodels
Coefficients (b = -61.00, m = 40.00)
```

There is no need for a fancy chart plot this time.

#### Interactive JupyterLab

You can obtain the interactive `JupyterLab` from the following link:

#### Manual Calculation

Of course, you can calculate manually to learn how it works.

First we calculate the basic statistical properties.

```
# Number of data points
n = len(x_values)
# Calculate sums
sum_x = np.sum(x_values)
sum_y = np.sum(y_values)
# Calculate means
mean_x = np.mean(x_values)
mean_y = np.mean(y_values)
# Output of basic properties
print(f'n = {n:5d}')
print(f'∑x (total) = {sum_x:7.2f}')
print(f'∑y (total) = {sum_y:7.2f}')
print(f'x̄ (mean) = {mean_x:7.2f}')
print(f'ȳ (mean) = {mean_y:7.2f}')
print()
```

Then go further:

```
# Calculate deviations
deviation_x = x_values - mean_x
deviation_y = y_values - mean_y
# Calculate squared deviations
sq_deviation_x = np.sum(deviation_x ** 2)
# Calculate cross-deviation
cross_deviation_xy = np.sum(deviation_x * deviation_y)
# Calculate slope (m) and intercept (b)
slope_m = cross_deviation_xy / sq_deviation_x
intercept_b = mean_y - slope_m * mean_x
print('Manual Calculation')
print(f'Coefficients (b = {intercept_b:2,.2f},'
+ f' m = {slope_m:2,.2f})\n')
```

With the result as below:

```
n = 13
∑x (total) = 78.00
∑y (total) = 2327.00
x̄ (mean) = 6.00
ȳ (mean) = 179.00
Manual Calculation
Coefficients (b = -61.00, m = 40.00)
```

#### Interactive JupyterLab

You can obtain the interactive `JupyterLab` from the following link:

### What’s Our Next Move 🤔?

We can describe a mathematical equation in a practical way, rewrite it in a spreadsheet in tabular form, and implement it in a Python script. It is fun, right? What do you think?

From these basic statistical properties, we can continue to further properties, for use with regression and correlation.

Consider continuing to read [ Trend - Properties - Cheatsheet ].