Preface
Goal: Goes further from linear regression to polynomial regression. The practical implementation, shows how to implement the theory in python code.
From theoretical foundations that explains why the math works, we are going shift to practical implementation in daily basis with real tools. After spreadsheet, we still have to cover, python tools, and visualizations.
While a ready to use spreadsheet template is, enough for admin to infer the statistical properties, without even doing analysis at all. We need to a ready to use code, that we can creatively modified, for any different cases.
1: Statisctic Properties
We are going to start from minimalist to solve this situation, using numpy manual calculation only, then utilize other library as well.
Since most topic is already covered in previous section, we are going to skip the explanation, and goes to ready to use class, for practical application.
Skeleton
We are going to use two classes. This class is using master-detail relationshiop. One main class, manage three kind of polynomial order.
import ...
class PolynomialOrderAnalyzer:
...
class CurveFitting:
...
def main() -> int:
...
if __name__ == "__main__":
raise SystemExit(main())
The first is the main CurveFitting
class :
class CurveFitting:
def __init__(self, xs, ys : List[int]) -> None:
...
def print_props_general(self) -> None:
...
def process(self) -> None:
...
Then comes the PolynomialOrderAnalyzer class to analyse each order.
class PolynomialOrderAnalyzer:
COEFF_TEXTS = {1: '(a, b)', 2: '(a, b, c)', 3: '(a, b, c, d)'}
ORDER_TEXTS = {1: 'Linear', 2: 'Quadratic', 3: 'Cubic'}
def __init__(self, xs, ys: np.ndarray, order: int) -> None:
...
def calc_props_matrix(self) -> None:
...
def calc_props_mse(self) -> None:
...
def calc_props_t_p_value(self) -> None:
...
Import Statements
We need to include a few import statements.
import numpy as np
from numpy.polynomial import Polynomial
from scipy.stats import t
from sklearn.metrics import r2_score
from typing import List
You can see that we are ging to use a few different library.
Data Series
For example using these series,
you may choose ys1
, ys2
, or ys3
:
xs, ys1, ys2, ys3
0, 5, 5, 5
1, 9, 12, 14
2, 13, 25, 41
3, 17, 44, 98
4, 21, 69, 197
5, 25, 100, 350
6, 29, 137, 569
7, 33, 180, 866
8, 37, 229, 1253
9, 41, 284, 1742
10, 45, 345, 2345
11, 49, 412, 3074
12, 53, 485, 3941
Entry Point
The entry point is simply as below:
if __name__ == "__main__":
raise SystemExit(main())
While the main function itself is just CSV loader, featuring column chooser.
def main() -> int:
# Getting Matrix Values
mCSV = np.genfromtxt("series.csv",
skip_header=1, delimiter=",", dtype=float)
mCSVt = np.transpose(mCSV)
example = CurveFitting(mCSVt[0], mCSVt[3])
example.process()
return 0
CurveFitting: Initialization
We just need to convert he list to np.array, for further use.
class CurveFitting:
def __init__(self, xs, ys : List[int]) -> None:
# Given data
self.xs = np.array(xs)
self.ys = np.array(ys)
CurveFitting: General Properties
Then print general properties.
def print_props_general(self) -> None:
y_mean = np.mean(self.ys)
self.y_sq_deviation = np.sum((self.ys-y_mean) ** 2)
print('General')
print(f'\tȳ (mean) : {y_mean:15,.2f}')
print(f'\t∑(yᵢ-ȳ)² : {self.y_sq_deviation:15,.2f}')
print()
CurveFitting: Process
Here we print all the properties.
def process(self) -> None:
# Print Statistical Properties
self.print_props_general()
for order in [1, 2, 3]:
case = PolynomialOrderAnalyzer(self.xs, self.ys, order)
case.calc_props_matrix()
case.calc_props_mse()
case.calc_props_t_p_value()
print()
Polynomial Order Analyzer: Initialization
Here we take care of each polynomial order.
class PolynomialOrderAnalyzer:
COEFF_TEXTS = {1: '(a, b)', 2: '(a, b, c)', 3: '(a, b, c, d)'}
ORDER_TEXTS = {1: 'Linear', 2: 'Quadratic', 3: 'Cubic'}
def __init__(self, xs, ys: np.ndarray, order: int) -> None:
self.xs = xs
self.ys = ys
self.order = order
self.df = len(self.xs) - order -1
self.mA = None
self.mC = None
self.yp = None
self.diag = None
self.MSE = None
# Display
self.coeff_text = self.COEFF_TEXTS[order]
self.order_text = self.ORDER_TEXTS[order]
Properties Calculation: Matrix
The actual matrix calculation is short.
def calc_props_matrix(self) -> None:
# Perform regression using polynomial.
poly = Polynomial.fit(self.xs, self.ys, deg=self.order)
self.mC = poly.convert().coef
self.yp = poly(self.xs)
# Matrix Operation
self.mA = np.flip(np.vander(self.xs, self.order+1), axis=1)
mAt = self.mA.T
mAt_A = mAt @ self.mA # gram matrix
mI = np.linalg.inv(mAt_A)
self.diag = np.diag(mI)
print(f'Using polyfit : {self.order_text}')
print(f'Coefficients : {self.coeff_text}: '
+ f'{self.mC}')
But for broader overview, we have different kind of calculations. I commented the alterantive calculation.
Design Matrix A
# Design Matrix A
# mA_1 = np.column_stack([np.ones_like(self.xs), self.xs, self.xs**2])
# mA_2 = np.column_stack([self.xs**j for j in range(self.order+1)]) # Design matrix
self.mA = np.flip(np.vander(self.xs, self.order+1), axis=1)
Matrix Operation
mB = self.ys
mAt = self.mA.T
mAt_A = mAt @ self.mA # gram matrix
mAt_B = mAt @ mB
# mC_1 = np.linalg.solve(mAt_A, mAt_B)
# mC_2 = np.polyfit(self.xs, self.ys, deg=order)
# mC_3 = np.linalg.inv(self.mA.T @ self.mA) @ (self.mA.T @ mB)
# mC_4 = np.linalg.lstsq(self.mA, mB, rcond=None)[0]
# print(mC_4)
mI = np.linalg.inv(mAt_A)
self.diag = np.diag(mI)
Getting Prediction Series
poly = Polynomial.fit(self.xs, self.ys, deg=self.order)
self.mC = poly.convert().coef
# yp_1 = np.polyval(mC, self.xs)
self.yp = poly(self.xs)
# y_pred_2 = self.mA @ mC
# print(y_pred_2)
Print The Statistic Properties Header
print(f'Using polyfit : {self.order_text}')
print(f'Coefficients : {self.coeff_text}: '
+ f'{self.mC}')
Properties Calculation: MSE
Mean Square Error
This is how we calculate MSE manually.
def calc_props_mse(self) -> None:
# Calculate SST and SSE
y_mean = np.mean(self.ys)
SST = y_sq_deviation = np.sum((self.ys-y_mean) ** 2)
SSR = np.sum((self.ys - self.yp) ** 2) # ∑ϵᵢ²
R_squared = 1 - (SSR / SST)
# r2 = r2_score(self.ys, yp)
self.MSE = SSR/self.df
print(f'\t∑(yᵢ - ŷᵢ)² : {SSR:15,.2f}')
print(f'\tR² : {R_squared:15,.4f}')
print(f'\tMSE : {self.MSE:15,.2f}')
Note that we can calculate R² using r2_score()
function from sklearn.metrics
.
But why can’t we calculate MSE using mean_squared_error()
function from sklearn.metrics
.
This is because of difference of degree freedom.
This is due of the focus of the sklearn.metrics
is not for the statistical properties,
but designed for general use (e.g., evaluating predictions without knowing model internals).
The DoF in sklearn.metrics
use n as divisor, while we are using (n - k -1).
Properties Calculation: t-value, p-value
The calculation using t distribution is simple.
def calc_props_t_p_value(self) -> None:
# t-value, p-value
SE = np.sqrt(self.MSE * self.diag)
t_value = self.mC/SE
p_value = 2 * (1 - t.cdf(abs(t_value), self.df))
But we need to convert the array inpit, so that this would have nice looks in output.
diag_formatted = [f"{x:15,.4f}" for x in self.diag]
SE_formatted = [f"{x:15,.4f}" for x in SE]
t_v_formatted = [f"{x:15,.2e}" for x in t_value]
p_v_formatted = [f"{x:15,.10f}" for x in p_value]
Now we can finally print the result.
print(f"diag : [" + " ".join(diag_formatted) + "]")
print(f"SE(β) : [" + " ".join(SE_formatted) + "]")
print(f"t_value : [" + " ".join(t_v_formatted) + "]")
print(f"p_value : [" + " ".join(p_v_formatted) + "]")
4: Simple Plot
After printing the result we can continue with usual plot. But this time with emphasis of selected properties.
Skeleton
We have only three new functions here:
class CurveFitting:
def __init__(self, xs, ys : List[int]) -> None:
...
def print_props_general(self) -> None:
...
def calc_plot_all(self) -> None:
...
def add_plot_text(self, plt) -> None:
...
def plot_trend(self, plt) -> None:
...
def draw_plot(self) -> None:
...
def process(self) -> None:
...
Prediction Calculation
The calculation is as simple as this one below:
def calc_plot_all(self) -> None:
self.x_plot = xp = np.linspace(
min(self.xs), max(self.xs), 100)
# Calculate coefficients directly
self.y1_plot = Polynomial.fit(self.xs, self.ys, deg=1)(xp)
self.y2_plot = Polynomial.fit(self.xs, self.ys, deg=2)(xp)
self.y3_plot = Polynomial.fit(self.xs, self.ys, deg=3)(xp)
Plot Trend
The plot is also the same as in previous section, but I put iyt in separate function.
def plot_trend(self, plt) -> None:
# Scatter plot with Seaborn color
sns.scatterplot(
x=self.xs, y=self.ys, color=self.colors[0],
s=100, label='Data points', edgecolor='w', linewidth=0.5)
# Polynomial curves with Seaborn colors
plt.plot(self.x_plot, self.y1_plot, color=self.colors[1],
linewidth=2.5, label='Linear fit')
plt.plot(self.x_plot, self.y2_plot, color=self.colors[2],
linewidth=2.5, label='Quadratic fit')
plt.plot(self.x_plot, self.y3_plot, color=self.colors[3],
linewidth=2.5, label='Cubic fit')
plt.legend(fontsize=10, framealpha=0.9)
plt.xlabel('x', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('Polynomial Regression with Statistics')
plt.grid(True, linestyle='--', alpha=0.3)
plt.tight_layout()
Main Plot Function
Now we can merge that trend plot into the main plot function. this way we can add additional plot later easily.
def draw_plot(self) -> None:
plt.figure()
self.plot_trend(plt)
self.add_plot_text(plt)
plt.tight_layout()
plt.show()
Notes that I add this weird line here:
self.add_plot_text(plt)
Selected Properties
This line add a box for any text. For example, if we want to emphasis, the result of the second order polynomial, We can display the statistical properties easily.
def add_plot_text(self, plt) -> None:
# Add statistics for quadratic fit (example)
poly = Polynomial.fit(self.xs, self.ys, deg=2)
mC = poly.convert().coef
yp = poly(self.xs)
SST = self.y_sq_deviation
SSR = np.sum((self.ys - yp) ** 2)
R_squared = 1 - (SSR / SST)
MSE = SSR / (len(self.xs) - 3) # deg=2 → 3 params (a, b, c)
# Format equation and stats
eqn = f"$y = {mC[0]:.2f} + {mC[1]:.2f}x + {mC[2]:.2f}x^2$"
stats_text = (
f"$R^2 = {R_squared:.3f}$\n"
f"$MSE = {MSE:.2f}$\n"
)
# Place text on plot
plt.text(
0.05, 0.95,
f"{eqn}\n{stats_text}",
transform=plt.gca().transAxes,
ha='left', va='top',
bbox=dict(facecolor='white', alpha=0.8)
)
Execute The Process
Then add the plot to the main process.
def process(self) -> None:
# Print Statistical Properties
self.print_props_general()
for order in [1, 2, 3]:
case = PolynomialOrderAnalyzer(self.xs, self.ys, order)
case.calc_props_matrix()
case.calc_props_mse()
case.calc_props_t_p_value()
print()
self.calc_plot_all()
self.draw_plot()
With the result as below:
It is still the humble old chart, but this time with additional text.
5: Visualization
I actually do not have any idea for the visualization of the statistical properties. But I think we can recap the coefficient result in a simple chart. so people can just read our result easily.
Save Statistic Result
First we need to get the statisri result that we need to compare.
class PolynomialOrderAnalyzer:
...
def calc_props_t_p_value(self) -> None:
...
return {
'SE' : SE,
't-value' : t_value,
'p-value' : p_value
}
No need to add abs(t_value). The direction in positive and negative is matter.
Statistic Plot
Then choose one of the order to be plotted.
def plot_stats(self, plt) -> None:
# Statistics bar plots
order = 2
stats = self.stats[order]
labels = [f"β{i}" for i in range(order+1)]
def stat_bar(ax, values, title, color):
ax.bar(labels, values, color=color)
ax.set_title(title)
ax.grid(True, axis='y', linestyle='--', alpha=0.3)
ax1 = plt.subplot(1, 3, 1)
stat_bar(ax1, stats['SE'], "Standard Error (SE)", 'orange')
ax2 = plt.subplot(1, 3, 2)
stat_bar(ax2, stats['t-value'], "t-values", 'skyblue')
ax3 = plt.subplot(1, 3, 3)
stat_bar(ax3, stats['p-value'], "p-values", 'salmon')
# log scale to better show small p-values
ax3.set_yscale('log')
Main Plot Function
Now we can add the second plot as well.
def draw_plot(self) -> None:
# Plot 1: Trend + Equation
plt.figure()
self.plot_trend(plt)
self.add_plot_text(plt)
# Plot 2: Bar Plots
plt.figure()
self.plot_stats(plt)
plt.show()
I’m still curious how to visualize further. But this time I still have no idea.
What’s the Next Exciting Step 🤔?
Since we also need to visualize the interpretation of statistics properties against the distribution plot curve, then we need to get the basic of making distribution plot curve.
Consider continuing your exploration by reading [ Trend - Visualizing Distribution ].