Preface

Goal: Go further, from linear regression to polynomial regression. The practical implementation shows how to implement the theory in Python code.

From the theoretical foundations that explain why the math works, we are going to shift to practical, day-to-day implementation with real tools. After the spreadsheet, we still have to cover Python tools and visualizations.

While a ready-to-use spreadsheet template is enough for an admin to infer the statistical properties without doing any analysis at all, we need ready-to-use code that we can creatively modify for different cases.


1: Statistical Properties

We are going to start minimally, solving this situation with manual NumPy calculation only, and then utilize other libraries as well.

Since most of the topics are already covered in the previous section, we are going to skip the explanation and go straight to ready-to-use classes for practical application.

Skeleton

We are going to use two classes in a master-detail relationship: one main class that manages three kinds of polynomial order.

import ...

class PolynomialOrderAnalyzer:
    ...

class CurveFitting:
    ...

def main() -> int:
    ...

if __name__ == "__main__":
    raise SystemExit(main())

Python: Statistical Properties: Skeleton

The first is the main CurveFitting class:

class CurveFitting:
    def __init__(self, xs, ys: List[int]) -> None:
        ...

    def print_props_general(self) -> None:
        ...

    def process(self) -> None:
        ...

Then comes the PolynomialOrderAnalyzer class to analyse each order.

class PolynomialOrderAnalyzer:
    COEFF_TEXTS = {1: '(a, b)', 2: '(a, b, c)', 3: '(a, b, c, d)'}
    ORDER_TEXTS = {1: 'Linear', 2: 'Quadratic', 3: 'Cubic'}

    def __init__(self, xs, ys: np.ndarray, order: int) -> None:
        ...
       
    def calc_props_matrix(self) -> None:
        ...

    def calc_props_mse(self) -> None:
        ...

    def calc_props_t_p_value(self) -> None:
        ...

Import Statements

We need to include a few import statements.

Python: Statistical Properties: Import Statement

import numpy as np
from numpy.polynomial import Polynomial
from scipy.stats import t
from sklearn.metrics import r2_score
from typing import List

You can see that we are going to use a few different libraries: NumPy for the matrix work, scipy.stats for the t-distribution, and sklearn.metrics as an optional cross-check for R².

Data Series

For example, using these series, you may choose ys1, ys2, or ys3:

xs, ys1, ys2, ys3
0,  5,   5,   5
1,  9,   12,  14
2,  13,  25,  41
3,  17,  44,  98
4,  21,  69,  197
5,  25,  100, 350
6,  29,  137, 569
7,  33,  180, 866
8,  37,  229, 1253
9,  41,  284, 1742
10, 45,  345, 2345
11, 49,  412, 3074
12, 53,  485, 3941

Python: Manual Calculation Result: Statistical Properties: CSV Source
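
As an aside, an observation about this data: each column follows an exact polynomial, ys1 = 4x + 5, ys2 = 3x² + 4x + 5, and ys3 = 2x³ + 3x² + 4x + 5, so the matching fit order will reproduce its series almost perfectly. A quick sanity check for the cubic column:

import numpy as np

xs  = np.arange(13)
ys3 = np.array([5, 14, 41, 98, 197, 350, 569,
                866, 1253, 1742, 2345, 3074, 3941])

# ys3 is generated by the exact cubic 2x³ + 3x² + 4x + 5
assert np.array_equal(ys3, 2*xs**3 + 3*xs**2 + 4*xs + 5)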

Entry Point

The entry point is as simple as below. The raise SystemExit(main()) idiom hands main's integer return value to the shell as the process exit code:

if __name__ == "__main__":
    raise SystemExit(main())

The main function itself is just a CSV loader, featuring a column chooser: mCSVt[0] holds the xs column, while indexes 1 to 3 select ys1, ys2, or ys3.

Python: Statistical Properties: Entry Point: Main

def main() -> int:
    # Getting Matrix Values
    mCSV = np.genfromtxt("series.csv",
      skip_header=1, delimiter=",", dtype=float)
    mCSVt   = np.transpose(mCSV)

    example = CurveFitting(mCSVt[0], mCSVt[3])
    example.process()

    return 0

CurveFitting: Initialization

We just need to convert the lists to np.array for further use.

Python: Curve Fitting: Initialization

class CurveFitting:
    def __init__(self, xs, ys: List[int]) -> None:
        # Given data
        self.xs = np.array(xs)
        self.ys = np.array(ys)

CurveFitting: General Properties

Then print general properties.

Python: Curve Fitting: General Properties

    def print_props_general(self) -> None:
        y_mean = np.mean(self.ys)
        self.y_sq_deviation = np.sum((self.ys-y_mean) ** 2)
        print('General')
        print(f'\tȳ (mean)   : {y_mean:15,.2f}')
        print(f'\t∑(yᵢ-ȳ)²   : {self.y_sq_deviation:15,.2f}')
        print()

Python: Manual Calculation Result: Statistical Properties: General

CurveFitting: Process

Here we print all the properties.

Python: Curve Fitting: Execute Process

    def process(self) -> None:
        # Print Statistical Properties
        self.print_props_general()
        
        for order in [1, 2, 3]:
            case = PolynomialOrderAnalyzer(self.xs, self.ys, order)
            case.calc_props_matrix()
            case.calc_props_mse()
            case.calc_props_t_p_value()
            print()

Polynomial Order Analyzer: Initialization

Here we take care of each polynomial order. Note the degrees of freedom, df = n - k - 1, where k is the polynomial order: with our 13 data points, that gives 11 for the linear fit, 10 for the quadratic, and 9 for the cubic. This df later serves as the MSE divisor and parameterizes the t-distribution.

Python: Polynomial Order Analyzer: Initialization

class PolynomialOrderAnalyzer:
    COEFF_TEXTS = {1: '(a, b)', 2: '(a, b, c)', 3: '(a, b, c, d)'}
    ORDER_TEXTS = {1: 'Linear', 2: 'Quadratic', 3: 'Cubic'}

    def __init__(self, xs, ys: np.ndarray, order: int) -> None:
        self.xs = xs
        self.ys = ys
        self.order = order
        self.df = len(self.xs) - order - 1

        self.mA   = None
        self.mC   = None
        self.yp   = None
        self.diag = None
        self.MSE  = None

        # Display
        self.coeff_text = self.COEFF_TEXTS[order]
        self.order_text = self.ORDER_TEXTS[order]

Properties Calculation: Matrix

The actual matrix calculation is short.

Python: Polynomial Order Analyzer: Matrix Properties

    def calc_props_matrix(self) -> None:
        # Perform regression using polynomial.
        poly = Polynomial.fit(self.xs, self.ys, deg=self.order)
        self.mC = poly.convert().coef
        self.yp = poly(self.xs)

        # Matrix Operation
        self.mA = np.flip(np.vander(self.xs, self.order+1), axis=1)
        mAt   = self.mA.T
        mAt_A = mAt @ self.mA # gram matrix
        mI   = np.linalg.inv(mAt_A)
        self.diag = np.diag(mI)

        print(f'Using polyfit : {self.order_text}')
        print(f'Coefficients  : {self.coeff_text}: '
            + f'{self.mC}')

Python: Manual Calculation Result: Statistical Properties: Matrix

But for a broader overview, there are different kinds of calculation. I left the alternatives as comments.

Design Matrix A

        # Design Matrix A
        # mA_1 = np.column_stack([np.ones_like(self.xs), self.xs, self.xs**2])
        # mA_2 = np.column_stack([self.xs**j for j in range(self.order+1)])  # Design matrix
        self.mA = np.flip(np.vander(self.xs, self.order+1), axis=1)
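
For intuition, a tiny worked example with hypothetical values (not our series): with xs = [0, 1, 2] and order 2, np.vander produces columns in descending powers, and the flip reorders them to [1, x, x²].

import numpy as np

xs = np.array([0, 1, 2])
mA = np.flip(np.vander(xs, 3), axis=1)
print(mA)
# [[1 0 0]
#  [1 1 1]
#  [1 2 4]]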

Matrix Operation

        mB = self.ys
        mAt   = self.mA.T
        mAt_A = mAt @ self.mA # gram matrix
        mAt_B = mAt @ mB 
        # mC_1 = np.linalg.solve(mAt_A, mAt_B)
        # mC_2 = np.polyfit(self.xs, self.ys, deg=self.order)  # highest degree first
        # mC_3 = np.linalg.inv(self.mA.T @ self.mA) @ (self.mA.T @ mB)
        # mC_4 = np.linalg.lstsq(self.mA, mB, rcond=None)[0]
        # print(mC_4)
        mI   = np.linalg.inv(mAt_A)
        self.diag = np.diag(mI)

Getting Prediction Series

        poly = Polynomial.fit(self.xs, self.ys, deg=self.order)
        self.mC = poly.convert().coef
        # yp_1 = np.polyval(self.mC[::-1], self.xs)  # polyval expects highest degree first
        self.yp = poly(self.xs)
        # y_pred_2 = self.mA @ self.mC
        # print(y_pred_2)

Print the Statistical Properties Header

        print(f'Using polyfit : {self.order_text}')
        print(f'Coefficients  : {self.coeff_text}: '
            + f'{self.mC}')

Properties Calculation: MSE

Mean Squared Error

This is how we calculate MSE manually.

Python: Polynomial Order Analyzer: Error Properties

    def calc_props_mse(self) -> None:
        # Calculate SST and SSE
        y_mean = np.mean(self.ys)
        SST = np.sum((self.ys - y_mean) ** 2)
        SSR = np.sum((self.ys - self.yp) ** 2) # ∑ϵᵢ², sum of squared residuals
        R_squared = 1 - (SSR / SST)
        # r2 = r2_score(self.ys, self.yp)
        self.MSE = SSR / self.df

        print(f'\t∑(yᵢ - ŷᵢ)² : {SSR:15,.2f}')
        print(f'\tR²          : {R_squared:15,.4f}')
        print(f'\tMSE         : {self.MSE:15,.2f}')

Python: Manual Calculation Result: Statistical Properties: Mean Squared Error

Note that we can calculate R² using the r2_score() function from sklearn.metrics. But why can't we calculate MSE using the mean_squared_error() function from sklearn.metrics?

This is because of a difference in degrees of freedom. sklearn.metrics is not focused on statistical properties; it is designed for general use (e.g., evaluating predictions without knowing model internals). Its mean_squared_error() uses n as the divisor, while we are using (n - k - 1).
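
A minimal sketch of the difference, using small hypothetical arrays rather than our series:

import numpy as np
from sklearn.metrics import mean_squared_error

ys = np.array([5.0, 9.0, 13.0, 17.0, 21.0])   # observed
yp = np.array([5.1, 8.9, 13.2, 16.8, 21.0])   # predicted
n, k = len(ys), 1                             # sample size, polynomial order

mse_sklearn = mean_squared_error(ys, yp)            # SSR / n
mse_stats   = np.sum((ys - yp) ** 2) / (n - k - 1)  # SSR / (n - k - 1)
# mse_stats == mse_sklearn * n / (n - k - 1)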

Properties Calculation: t-value, p-value

The calculation using the t-distribution is simple.

Python: Polynomial Order Analyzer: t-value, p-value

    def calc_props_t_p_value(self) -> None:
        # t-value, p-value
        SE   = np.sqrt(self.MSE * self.diag)
        t_value = self.mC/SE
        p_value = 2 * (1 - t.cdf(abs(t_value), self.df))
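
As a side note, SciPy's survival function yields the same two-sided p-value with better numerical precision for very small tails; a drop-in alternative for the last line above:

        # equivalent to 2 * (1 - cdf), numerically safer for tiny p-values
        p_value = 2 * t.sf(np.abs(t_value), self.df)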

But we need to format the array values, so that the output looks nice.

        diag_formatted = [f"{x:15,.4f}" for x in self.diag]
        SE_formatted   = [f"{x:15,.4f}" for x in SE]
        t_v_formatted  = [f"{x:15,.2e}" for x in t_value]
        p_v_formatted  = [f"{x:15,.10f}" for x in p_value]

Now we can finally print the result.

        print(f"diag    : [" + " ".join(diag_formatted) + "]")
        print(f"SE(β)   : [" + " ".join(SE_formatted) + "]")
        print(f"t_value : [" + " ".join(t_v_formatted) + "]")
        print(f"p_value : [" + " ".join(p_v_formatted) + "]")

Python: Manual Calculation Result: Statistical Properties: t-value, p-value


4: Simple Plot

After printing the result, we can continue with the usual plot, but this time with an emphasis on selected properties.

Skeleton

We have four new methods here:

Python: Matplotlib: Curve Fitting: Skeleton

class CurveFitting:
    def __init__(self, xs, ys: List[int]) -> None:
        ...

    def print_props_general(self) -> None:
        ...

    def calc_plot_all(self) -> None:
        ...

    def add_plot_text(self, plt) -> None:
        ...

    def plot_trend(self, plt) -> None:
        ...

    def draw_plot(self) -> None:
        ...

    def process(self) -> None:
        ...
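
Note that the plotting methods below rely on matplotlib, seaborn, and a color palette stored in self.colors, none of which appear in the earlier listing. A minimal sketch of what the initialization is assumed to look like:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List

class CurveFitting:
    def __init__(self, xs, ys: List[int]) -> None:
        # Given data
        self.xs = np.array(xs)
        self.ys = np.array(ys)

        # Assumed: four palette colors, one for the scatter plot
        # and one each for the linear, quadratic, and cubic curves
        self.colors = sns.color_palette("tab10", 4)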

Prediction Calculation

The calculation is as simple as the one below:

Python: Matplotlib: Curve Fitting: Prediction Calculation

    def calc_plot_all(self) -> None:
        self.x_plot = xp = np.linspace(
            min(self.xs), max(self.xs), 100)

        # Calculate coefficients directly
        self.y1_plot = Polynomial.fit(self.xs, self.ys, deg=1)(xp)
        self.y2_plot = Polynomial.fit(self.xs, self.ys, deg=2)(xp)
        self.y3_plot = Polynomial.fit(self.xs, self.ys, deg=3)(xp)

Plot Trend

The plot is also the same as in the previous section, but I put it in a separate function.

Python: Matplotlib: Curve Fitting: Plot Trend

    def plot_trend(self, plt) -> None:
        # Scatter plot with Seaborn color
        sns.scatterplot(
            x=self.xs, y=self.ys, color=self.colors[0], 
            s=100, label='Data points', edgecolor='w', linewidth=0.5)

        # Polynomial curves with Seaborn colors
        plt.plot(self.x_plot, self.y1_plot, color=self.colors[1], 
                linewidth=2.5, label='Linear fit')
        plt.plot(self.x_plot, self.y2_plot, color=self.colors[2], 
                linewidth=2.5, label='Quadratic fit')
        plt.plot(self.x_plot, self.y3_plot, color=self.colors[3], 
                linewidth=2.5, label='Cubic fit')

        plt.legend(fontsize=10, framealpha=0.9)
        plt.xlabel('x', fontsize=12)
        plt.ylabel('y', fontsize=12)
        plt.title('Polynomial Regression with Statistics')
        plt.grid(True, linestyle='--', alpha=0.3)

        plt.tight_layout()

Main Plot Function

Now we can merge that trend plot into the main plot function. This way we can easily add more plots later.

Python: Matplotlib: Curve Fitting: Draw Plot

    def draw_plot(self) -> None:
        plt.figure()
        self.plot_trend(plt)
        self.add_plot_text(plt)

        plt.tight_layout()
        plt.show()

Note that I added this weird-looking line here:

        self.add_plot_text(plt)

Selected Properties

This line adds a box for arbitrary text. For example, if we want to emphasize the result of the second-order polynomial, we can display its statistical properties easily. The transform=plt.gca().transAxes argument places the text in axes-fraction coordinates, so (0.05, 0.95) sits at the top-left corner.

Python: Matplotlib: Curve Fitting: Plot Text

    def add_plot_text(self, plt) -> None:
        # Add statistics for quadratic fit (example)
        poly = Polynomial.fit(self.xs, self.ys, deg=2)
        mC = poly.convert().coef
        yp = poly(self.xs)
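        # self.y_sq_deviation was computed earlier, in print_props_general()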
        SST = self.y_sq_deviation
        SSR = np.sum((self.ys - yp) ** 2)
        R_squared = 1 - (SSR / SST)
        MSE = SSR / (len(self.xs) - 3)  # deg=2 → 3 params (a, b, c)

        # Format equation and stats
        eqn = f"$y = {mC[0]:.2f} + {mC[1]:.2f}x + {mC[2]:.2f}x^2$"
        stats_text = (
            f"$R^2 = {R_squared:.3f}$\n"
            f"$MSE = {MSE:.2f}$\n"
        )

        # Place text on plot
        plt.text(
            0.05, 0.95, 
            f"{eqn}\n{stats_text}",
            transform=plt.gca().transAxes,
            ha='left', va='top',
            bbox=dict(facecolor='white', alpha=0.8)
        )

Execute The Process

Then add the plot to the main process.

    def process(self) -> None:
        # Print Statistical Properties
        self.print_props_general()
        
        for order in [1, 2, 3]:
            case = PolynomialOrderAnalyzer(self.xs, self.ys, order)
            case.calc_props_matrix()
            case.calc_props_mse()
            case.calc_props_t_p_value()
            print()

        self.calc_plot_all()
        self.draw_plot()

With the result as below:

Python: Matplotlib Result: Polynomial Prediction: Statistical Properties

It is still the humble old chart, but this time with additional text.


5: Visualization

I do not actually have a grand idea for visualizing the statistical properties. But I think we can recap the coefficient results in a simple chart, so people can read our results easily.

Save the Statistics Result

First we need to get the statistics result that we want to compare.

Python: Visualization: Curve Fitting: t-value, p-value

class PolynomialOrderAnalyzer:
    ...

    def calc_props_t_p_value(self) -> dict:
        ...

        return {
            'SE'      :  SE,
            't-value' :  t_value,
            'p-value' :  p_value
        }

There is no need to apply abs() to t_value here. The sign, positive or negative, matters.

Statistic Plot

Then choose one of the orders to be plotted.

Python: Visualization: Curve Fitting: Comparing Statistic Properties

    def plot_stats(self, plt) -> None:
        # Statistics bar plots
        order = 2
        stats = self.stats[order]
        labels = [f"β{i}" for i in range(order+1)]

        def stat_bar(ax, values, title, color):
            ax.bar(labels, values, color=color)
            ax.set_title(title)
            ax.grid(True, axis='y', linestyle='--', alpha=0.3)

        ax1 = plt.subplot(1, 3, 1)
        stat_bar(ax1, stats['SE'], "Standard Error (SE)", 'orange')

        ax2 = plt.subplot(1, 3, 2)
        stat_bar(ax2, stats['t-value'], "t-values", 'skyblue')

        ax3 = plt.subplot(1, 3, 3)
        stat_bar(ax3, stats['p-value'], "p-values", 'salmon')

        # log scale to better show small p-values
        ax3.set_yscale('log')
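
The bar plot reads from self.stats, which we have not shown being filled. Presumably the process() loop stores the dict returned by each analyzer, keyed by order; a minimal sketch of that assumption:

    def process(self) -> None:
        # Print Statistical Properties
        self.print_props_general()

        self.stats = {}
        for order in [1, 2, 3]:
            case = PolynomialOrderAnalyzer(self.xs, self.ys, order)
            case.calc_props_matrix()
            case.calc_props_mse()
            self.stats[order] = case.calc_props_t_p_value()
            print()

        self.calc_plot_all()
        self.draw_plot()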

Main Plot Function

Now we can add the second plot as well.

Python: Visualization: Curve Fitting: Drawing Second Plot

    def draw_plot(self) -> None:
        # Plot 1: Trend + Equation
        plt.figure()
        self.plot_trend(plt)
        self.add_plot_text(plt)

        # Plot 2: Bar Plots
        plt.figure()
        self.plot_stats(plt)
        plt.show()

Python: Matplotlib Result: Polynomial Prediction: Compare Statistical Properties

I am still curious about how to visualize this further. But for now, I have no more ideas.


What’s the Next Exciting Step 🤔?

Since we also need to visualize the interpretation of statistical properties against a distribution curve, we first need to cover the basics of making a distribution plot.

Consider continuing your exploration by reading [ Trend - Visualizing Distribution ].