Preface

Goal: Get to know the basics of visualizing a distribution curve.

Since we also need to visualize how statistical properties are interpreted against a distribution curve, we first need the basics of plotting one.

Before we get too excited about t-values and p-values, let’s remember where they live: in the comforting arms of the distribution curve. If statistics had a family tree, the normal distribution would be the well-dressed grandparent, who everyone references but nobody fully understands.

To better interpret our regression output, it’s wise to revisit how distributions work. Starting from scratch, let’s draw them, shade them, and twist them into funny shapes. Behind every t-test is a bell curve just waiting to shine.


Distribution

Ah, plotting! The visual symphony of statistics. Let’s roll up our sleeves and start composing. Crafting a plot is interesting.

Probability Density Function (PDF)

Our first guest: the standard normal distribution. Elegant. Symmetrical. Like a bell that never rings.

Its probability density function (PDF) is:

$$f(x) = \frac{1}{\sqrt{2\pi}} \, e^{-x^{2}/2}$$

This is the benchmark of “normality” in statistics. Most tests, including our regression’s p-values, assume something roughly bell-shaped underneath. Knowing how it looks helps us spot when things go awry.

Normal

Start by generating a set of values along the x-axis:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 1000)

We can implement the equation for the probability density function (PDF) of a standard normal distribution as follows:

y = 1 / np.sqrt(2 * np.pi) * np.exp(-0.5 * x**2)
plt.plot(x, y, color='blue')

🎨 Voilà: the classic bell curve.

Python: Distribution Curve: Probability Density Function

The result of the plot can be visualized as below:

Python: Distribution Curve Plot: Normal

Not in the mood for calculus? Instead of calculating by hand, let scipy.stats do the heavy lifting:

from scipy.stats import norm
y = norm.pdf(x)
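
As a quick sanity check, the hand-written formula and norm.pdf can be compared numerically. The sketch below assumes the same x grid as above; the variable names y_manual, y_scipy, same_curve, and area are illustrative only and not part of the original notebook.

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-5, 5, 1000)

# Hand-written standard normal PDF, same formula as in the text
y_manual = 1 / np.sqrt(2 * np.pi) * np.exp(-0.5 * x**2)

# Library version
y_scipy = norm.pdf(x)

# The two curves should agree to floating-point precision
same_curve = np.allclose(y_manual, y_scipy)

# Trapezoid-rule area under the curve on [-5, 5];
# a PDF integrates to 1, and almost all of the mass lies in this range
area = np.sum((y_manual[:-1] + y_manual[1:]) / 2 * np.diff(x))

print(same_curve, round(area, 4))
```

If the two curves match and the area comes out very close to 1, the manual formula is behaving like a proper density.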

We can play around with it in this Jupyter notebook:

Quantiles

With the normal distribution in hand, we can go further and visualize quantiles. Let’s add some color by slicing the bell into pieces.

y = norm.pdf(x)

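percentiles = [25, 50, 75, 100]
# Inverse CDF gives the distribution's own quantiles (100% maps to +inf);
# np.percentile(x, ...) would only cut the evenly spaced x grid.
quantiles = norm.ppf(np.array(percentiles) / 100)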

plt.plot(x, y, color='black')

Now we can shade regions corresponding to percentiles as follows:

for i, q in enumerate(quantiles):
  plt.fill_between(
    x[x <= q], y[x <= q],
    color=f'C{i}', alpha=0.3)

The terms quartiles and quantiles are related but not identical. Quartiles divide a distribution into four equal parts, with cuts at the 25th, 50th, and 75th percentiles. Quantiles are the general case: cut points that divide it into any number of equal parts.
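
To make the distinction concrete, here is a minimal sketch that computes quartiles and deciles of the standard normal distribution with norm.ppf (SciPy's inverse CDF, which does not appear in the original code) and confirms that each slice holds an equal share of probability. The names quartiles, deciles, round_trip, and band_mass are illustrative.

```python
import numpy as np
from scipy.stats import norm

# Quartiles: three cut points, four equal parts
quartile_probs = np.array([0.25, 0.50, 0.75])
quartiles = norm.ppf(quartile_probs)

# Deciles: nine cut points, ten equal parts
decile_probs = np.arange(1, 10) / 10
deciles = norm.ppf(decile_probs)

# Round trip: the CDF at each cut point returns the original probability
round_trip = norm.cdf(quartiles)

# Probability mass between consecutive quartiles, tails included
edges = np.concatenate(([-np.inf], quartiles, [np.inf]))
band_mass = np.diff(norm.cdf(edges))

print(quartiles)
print(band_mass)
```

Each entry of band_mass should come out as 0.25, which is what "four equal parts" means in probability terms.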

Python: Distribution Curve: Quantiles using percentiles

The result of the plot can be visualized as below:

Python: Distribution Curve Plot: Quantiles

Notebook for tinkering:

Quantiles tell us where our data falls. Great for understanding outliers, or explaining why our boss’s sales numbers are in the 95th percentile.

Kurtosis

Let’s meet our first rebel: kurtosis. It measures whether the tails of a distribution are chunky or skinny.

  • Think of it as how dramatic your data likes to be.

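We’ll use skewnorm.pdf with different scale values to mimic the peaked-versus-flat look usually associated with kurtosis. Strictly speaking, changing scale alters the spread (standard deviation), not the kurtosis: with a=0 every curve here is still a normal distribution, whose excess kurtosis is 0.

Let’s compare them visually. The examples below show distributions with different scale values. To set them apart from the standard normal distribution, the curves are shifted to the right (loc=1).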

from scipy.stats import skewnorm, norm, kurtosis

x = np.linspace(-5, 5, 1000)
y_standard = norm.pdf(x)

y_kurtosis_1 = skewnorm.pdf(x, a=0, loc=1, scale=1)
y_kurtosis_2 = skewnorm.pdf(x, a=0, loc=1, scale=0.5)
y_kurtosis_3 = skewnorm.pdf(x, a=0, loc=1, scale=2)

Python: Distribution Curve: Kurtosis: Calculate

Now we can plot distributions with different levels of kurtosis, on the same plot with the normal distribution.

plt.plot(x, y_standard, label='Standard Normal')

plt.plot(x, y_kurtosis_1, ls='-.',
  label='scale=1 (excess kurtosis = 0)')
plt.plot(x, y_kurtosis_2, ls='-.',
  label='scale=0.5 (narrower, taller)')
plt.plot(x, y_kurtosis_3, ls='-.',
  label='scale=2 (wider, flatter)')
plt.legend()  # show the labels on the plot

Python: Distribution Curve: Kurtosis: Display

The result of the plot can be visualized as below:

Python: Distribution Curve Plot: Kurtosis

Notebook for hands-on play:

High kurtosis means more extreme values. Great for finance. Terrible for blood pressure.
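
To check kurtosis numerically rather than by eye, here is a small sketch using the stats method of SciPy's frozen distributions. It assumes two families that the original post does not use, laplace and uniform, purely as examples of heavier and lighter tails. The helper excess_kurtosis is hypothetical.

```python
from scipy.stats import norm, laplace, uniform

def excess_kurtosis(dist):
    """Theoretical excess kurtosis (normal = 0) of a frozen SciPy distribution."""
    return float(dist.stats(moments='k'))

# Changing scale alone leaves the kurtosis of a normal curve untouched
for scale in (0.5, 1, 2):
    print('normal, scale', scale, excess_kurtosis(norm(loc=1, scale=scale)))

# Different families have genuinely different tails
print('laplace', excess_kurtosis(laplace()))  # heavier tails than normal
print('uniform', excess_kurtosis(uniform()))  # lighter tails than normal
```

The normal curves report the same excess kurtosis at every scale, while the Laplace and uniform distributions land above and below it.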

Skewness

Next up: skewness, the art of asymmetry. If our distribution leans to one side like a suspicious alibi, it’s skewed. We can utilize skewnorm.pdf again to visualize skewness.

The same approach used for kurtosis applies to skewness. The examples below show distributions with different values of the shape parameter a, which sets the direction and strength of the skew (it is not the skewness coefficient itself).

y_skewed_1 = skewnorm.pdf(x, a=-4)
y_skewed_2 = skewnorm.pdf(x, a=2)
y_skewed_3 = skewnorm.pdf(x, a=6)

Let’s draw their curious curves.

Python: Distribution Curve: Skewness: Calculate

plt.plot(x, y_standard, label='Standard Normal')

plt.plot(x, y_skewed_1, ls='--',
         label='Negative Skew (a = -4)')
plt.plot(x, y_skewed_2, ls='--',
         label='Moderate Positive Skew (a = 2)')
plt.plot(x, y_skewed_3, ls='--',
         label='High Positive Skew (a = 6)')
plt.legend()  # show the labels on the plot

Python: Distribution Curve: Skewness: Display

The result of the plot can be visualized as below:

Python: Distribution Curve Plot: Skewness

Notebook for exploration:

Skewness tells us where most data is hiding. Useful when averages lie, like when a few rich folks mess up the “average income”.
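
The shape parameter a passed to skewnorm is not the same number as the skewness coefficient. A minimal sketch, assuming we want the theoretical skewness for each a used above, can ask SciPy directly. The helper skewness_of is hypothetical.

```python
from scipy.stats import skewnorm

def skewness_of(a):
    """Theoretical skewness coefficient of skewnorm with shape parameter a."""
    return float(skewnorm.stats(a, moments='s'))

for a in (-4, 0, 2, 6):
    print(f'a = {a:>2}: skewness = {skewness_of(a):+.3f}')
```

The sign of the result follows the sign of a, and the magnitude grows with |a| but stays below 1 for this family.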

Scaled Skewness

Statistically speaking, it’s not fair to compare plots when one is shouting and the other’s whispering. The skewed curves peak at different heights. For simplicity, we can rescale them so that the heights look the same. Let’s level the playing field.

We’ll scale each skewed curve so that its maximum matches the standard normal’s peak height. This makes the shapes easier to compare. Keep in mind that the scaled curves are no longer true PDFs, since their area is no longer 1.

max_height_standard = np.max(y_standard)

y_skewed_1_scaled = y_skewed_1 \
  / np.max(y_skewed_1) * max_height_standard
y_skewed_2_scaled = y_skewed_2 \
  / np.max(y_skewed_2) * max_height_standard
y_skewed_3_scaled = y_skewed_3 \
  / np.max(y_skewed_3) * max_height_standard

Python: Distribution Curve: Skewness: Scale

The result of the plot can be visualized as below:

Python: Distribution Curve Plot: Skewness: Scaled

Notebook available here:

When comparing shapes, height shouldn’t distract from symmetry (or the lack thereof). Scaling gives a fair visual comparison.
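
The three near-identical scaling statements can be folded into one helper. The sketch below is one possible arrangement, not the notebook's actual code: scale_to_peak and area are hypothetical names. It also checks the side effect of scaling, which is that the area under each curve drops below 1.

```python
import numpy as np
from scipy.stats import norm, skewnorm

def scale_to_peak(y, target_peak):
    """Rescale curve y so that its maximum equals target_peak."""
    return y / np.max(y) * target_peak

x = np.linspace(-5, 5, 1000)
y_standard = norm.pdf(x)
peak = np.max(y_standard)

# One scaled curve per shape parameter used in the skewness section
scaled = {a: scale_to_peak(skewnorm.pdf(x, a=a), peak) for a in (-4, 2, 6)}

def area(y):
    # Trapezoid-rule area under y over the x grid
    return np.sum((y[:-1] + y[1:]) / 2 * np.diff(x))

for a, y in scaled.items():
    print(a, np.isclose(np.max(y), peak), round(area(y), 3))
```

All peaks now match the standard normal, which is what makes the visual comparison fair, while the printed areas show that the scaled curves are for display only.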


What’s the Next Chapter 🤔?

There are also common statistical properties that are not related to trend. In the trend context, let’s call them additional properties. These properties are important for other kinds of statistical analysis.

Distribution is just the appetizer. Now that we know how the data looks, we’re ready to inspect some behind-the-scenes properties that aren’t directly about trendlines but still impact analysis.

Up next: [ Trend - Properties - Additional ]. Because even statistics have bonus features.