Preface
Goal: Visualize the interpretation of statistical properties using Python and Matplotlib.
We are ready to put our statistical properties Python helper to further use: interpreting those properties visually. I think we can read statistical charts better if we understand how to draw them.
I welcome any other useful interpretation, or any feedback. If you think my visualization or interpretation is wrong, please let me know.
Visualizing Interpretation
We can use Matplotlib to visualize the interpretation of statistical properties. Of course not everything can be visualized: some properties are just a number, with no need to be visualized at all.
You have seen some of the plots below in the previous article. This article shows you how to make those plots. If you think my interpretation or calculation is wrong, I welcome any better opinion.
Skeleton
Let's use our previous Properties.py helper.
Instead of hardcoded data, we can set up the source data in a CSV file.
Plots that use the helper above follow this skeleton:
import matplotlib.pyplot as plt

# Local Library
from Properties import get_properties, display

properties = get_properties("50-samples.csv")
display(properties)
locals().update(properties)

def plot() -> int:
    ...
    return 0

if __name__ == "__main__":
    raise SystemExit(plot())
The script returns a zero exit code if everything goes well.
We can reuse this pattern for all our visualizations.
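The Properties.py helper itself is not shown in this article. As a rough idea of what the skeleton relies on, here is a minimal sketch of a function that returns a dictionary with the same variable names used in the plots. The name sketch_properties, the sample-based (ddof=1) statistics, and the tiny inline dataset are my assumptions, not the actual helper.

```python
import numpy as np


def sketch_properties(x, y) -> dict:
    """Hypothetical stand-in for the Properties.py helper."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)

    x_mean, y_mean = x.mean(), y.mean()

    # Assumption: sample statistics (ddof=1); the real helper may differ.
    x_std_dev = x.std(ddof=1)
    y_std_dev = y.std(ddof=1)
    xy_covariance = np.sum((x - x_mean) * (y - y_mean)) / (n - 1)

    # Least-squares slope and intercept
    m_slope = xy_covariance / x.var(ddof=1)
    b_intercept = y_mean - m_slope * x_mean

    return {
        "x_observed": x, "y_observed": y, "n": n,
        "x_mean": x_mean, "y_mean": y_mean,
        "x_std_dev": x_std_dev, "y_std_dev": y_std_dev,
        "xy_covariance": xy_covariance,
        "m_slope": m_slope, "b_intercept": b_intercept,
        "y_fit": m_slope * x + b_intercept,
    }


# Made-up data lying exactly on y = 2x + 1
props = sketch_properties([0, 1, 2, 3], [1, 3, 5, 7])
print(f"slope: {props['m_slope']:.2f}")
print(f"intercept: {props['b_intercept']:.2f}")
```

Note that locals().update(properties) in the skeleton only works because it runs at module level, where locals() and globals() are the same dictionary. Inside a function, reading the values from the dictionary directly is the safer choice.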
Basic Data Series
We can start with a basic data series, using scatter to plot it.
def plot() -> int:
    plt.figure(figsize=(10, 6))

    # Plot the data series
    plt.scatter(x_observed, y_observed, color='blue',
        s=100, label='Data Points')
Then draw the mean as a horizontal line, along with vertical lines that show the deviation from the mean for each observed x.
    # Plot deviation from mean
    plt.axhline(y=y_mean, color='orange',
        linestyle='--', label='Mean of y')
    plt.vlines(x_observed, y_observed, y_mean,
        linestyle='--', color='teal',
        label='Deviation from Mean (y)')
In every chart, we define the decoration, such as labels, legend, title and so on. Then we show the plot.
def plot() -> int:
    ...

    # Chart Decoration
    plt.title('Mean and Deviation')
    plt.xlabel('x')
    plt.ylabel('y')
    plt.legend()
    plt.grid(True)

    plt.show()
    return 0
And then we can plot these data points and the mean (average), along with its (yᵢ - ȳ) interpretation.
Plotting is pretty simple, right?
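To try the steps above without the helper or the CSV file, here is a self-contained sketch. The six data points are made up for illustration, the output filename is arbitrary, and the figure is saved to a file instead of being shown, so it also runs on a machine without a display.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; remove for an interactive window
import matplotlib.pyplot as plt
import numpy as np

# Made-up sample data, not the article's 50-samples.csv
x_observed = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y_observed = np.array([5, 9, 12, 19, 21, 30], dtype=float)

y_mean = y_observed.mean()
deviations = y_observed - y_mean  # the (y_i - y_mean) values

fig, ax = plt.subplots(figsize=(10, 6))

# Data points
ax.scatter(x_observed, y_observed, color='blue',
           s=100, label='Data Points')

# Mean as a horizontal line, deviations as vertical segments
ax.axhline(y=y_mean, color='orange',
           linestyle='--', label='Mean of y')
ax.vlines(x_observed, y_observed, y_mean,
          linestyle='--', color='teal',
          label='Deviation from Mean (y)')

# Chart decoration
ax.set_title('Mean and Deviation')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.legend()
ax.grid(True)

fig.savefig("mean-and-deviation.png")
print("deviations:", deviations)
```

Each vertical segment has the length of one deviation, and by the definition of the mean the deviations always sum to zero.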
Interactive JupyterLab
You can obtain the interactive JupyterLab notebook at the following link:
Standard Deviation
Let's continue with the interpretation of standard deviation relative to the mean. This is the interpretation when there is no information about linear regression at all.
I added my own color palette to enhance the color output. This palette uses Google Material colors.
blueScale = {
    0: '#E3F2FD', 1: '#BBDEFB', 2: '#90CAF9',
    3: '#64B5F6', 4: '#42A5F5', 5: '#2196F3',
    6: '#1E88E5', 7: '#1976D2', 8: '#1565C0',
    9: '#0D47A1'
}
Now we can draw the previous plot, but with nicer colors:
# Plot the data series
plt.scatter(x_observed, y_observed,
    color=blueScale[9], s=100, zorder=5,
    label='Data Points')

# Plot deviation from mean
plt.axhline(y=y_mean, color=blueScale[7],
    linestyle='--', label='Mean of y')
plt.vlines(x_observed, y_observed, y_mean,
    linestyle='--', color=blueScale[5],
    label='Deviation from Mean (y)')
Then add a shaded region to show the standard deviation around the mean.
# Plot shaded region for standard deviation
plt.fill_between(x_observed,
    y_mean - y_std_dev, y_mean + y_std_dev,
    color=blueScale[1], alpha=0.3, zorder=1,
    label='Standard Deviation')

# Annotate the covariance value as text
plt.text(x_mean, max(y_observed),
    f'Covariance: {xy_covariance:.2f}',
    fontsize=12, color=blueScale[9])
Now we can plot the interpretation of standard deviation relative to the mean.
With a simple touch, the plot already looks better, right?
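The shaded band is simply the interval from mean minus one standard deviation to mean plus one standard deviation. A small sketch of the numbers behind it, using made-up data and assuming the sample standard deviation (ddof=1), which may differ from what the helper computes:

```python
import numpy as np

# Made-up sample data, not the article's 50-samples.csv
y_observed = np.array([5, 9, 12, 19, 21, 30], dtype=float)

y_mean = y_observed.mean()
y_std_dev = y_observed.std(ddof=1)  # assumption: sample standard deviation

# Bounds of the shaded region
lower = y_mean - y_std_dev
upper = y_mean + y_std_dev

# Which points fall inside the band?
inside = (y_observed >= lower) & (y_observed <= upper)

print(f"band: [{lower:.2f}, {upper:.2f}]")
print(f"{inside.sum()} of {len(y_observed)} points are inside the band")
```

Because fill_between only shades between the smallest and largest x it is given, the band in the plot stops at the first and last data point. If a band across the full chart width is preferred, Matplotlib's axhspan is one possible alternative.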
Interactive JupyterLab
You can obtain the interactive JupyterLab notebook at the following link:
Mean and Standard Deviation
The chart above is not the only representation; there is another, simpler interpretation. It is not very pretty and looks naive, but it is clear.
First, plot the data points, along with both means as a vertical and a horizontal line.
plt.scatter(x_observed, y_observed,
    color='blue', label='Data Points')
plt.axvline(x=x_mean, color='green',
    linestyle='--', label='Mean of x')
plt.axhline(y=y_mean, color='orange',
    linestyle='--', label='Mean of y')
Then we can plot standard deviation as error bars.
plt.errorbar(x_mean, y_mean,
    xerr=x_std_dev, yerr=y_std_dev,
    fmt='o', color='purple',
    label='Standard Deviation')
Now we can naively plot the interpretation of both the mean and the standard deviation in the middle of the chart.
It is not visually very useful, but now we know that this kind of interpretation exists.
Interactive JupyterLab
You can obtain the interactive JupyterLab notebook at the following link:
Linear Regression
Next, we will plot the linear regression based on our calculated least squares fit.
# Plot the data and regression line
plt.scatter(x_observed, y_observed,
    color=tealScale[9], label='Data Points')
plt.plot(x_observed, y_fit,
    color=tealScale[5], label='Regression Line')
I use the teal color scale from Google Material colors. It has to be defined before the plotting calls above, since they refer to tealScale.
tealScale = {
    0: '#E0F2F1', 1: '#B2DFDB', 2: '#80CBC4',
    3: '#4DB6AC', 4: '#26A69A', 5: '#009688',
    6: '#00897B', 7: '#00796B', 8: '#00695C',
    9: '#004D40'
}
No interpretation this time, just a plain chart:
- Observed y as points, and
- Predicted ŷ = fit(x) as a line.
Very simple. No comment.
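For readers without the helper at hand, the fitted values can be reproduced with NumPy alone. This is a sketch on made-up data: it computes the slope and intercept from the closed-form least squares formulas and cross-checks them against np.polyfit. Whether the helper uses either approach is not confirmed by this article.

```python
import numpy as np

# Made-up sample data, not the article's 50-samples.csv
x_observed = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y_observed = np.array([5, 9, 12, 19, 21, 30], dtype=float)

x_mean = x_observed.mean()
y_mean = y_observed.mean()

# Closed-form least squares: slope = Sxy / Sxx
s_xy = np.sum((x_observed - x_mean) * (y_observed - y_mean))
s_xx = np.sum((x_observed - x_mean) ** 2)
m_slope = s_xy / s_xx
b_intercept = y_mean - m_slope * x_mean

# Cross-check with NumPy's polynomial fit of degree 1,
# which returns the coefficients as [slope, intercept]
m_check, b_check = np.polyfit(x_observed, y_observed, 1)

# Predicted values used for the regression line
y_fit = m_slope * x_observed + b_intercept

print(f"slope: {m_slope:.2f}, intercept: {b_intercept:.2f}")
print(f"polyfit agrees: {np.allclose([m_slope, b_intercept], [m_check, b_check])}")
```

The regression line always passes through the point (x_mean, y_mean), which is why the intercept can be derived from the slope and the two means.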
Interactive JupyterLab
You can obtain the interactive JupyterLab notebook at the following link:
Residual
How about the error (ϵ)? Of course we can draw it using vlines.
# Plot the data and regression line
plt.scatter(x_observed, y_observed,
    color=blueScale[9], label='Data Points')
plt.plot(x_observed, y_fit,
    color=blueScale[5], label='Regression Line')

# Plot residual errors
plt.vlines(x_observed, y_observed, y_fit,
    linestyle='--', color=blueScale[3],
    label='Residual')
The interpretation of the residual, or error (ϵ), is as simple as shown in the plot below:
Simple, but enough to interpret the (yᵢ - ŷᵢ) difference.
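The vertical segments above are the residuals themselves. A short sketch, again on made-up data and using np.polyfit for the fit (my choice for illustration), shows how to compute them and checks two properties of a least squares fit with an intercept: the residuals sum to zero and are uncorrelated with x.

```python
import numpy as np

# Made-up sample data, not the article's 50-samples.csv
x_observed = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y_observed = np.array([5, 9, 12, 19, 21, 30], dtype=float)

# Fit a straight line and compute predicted values
m_slope, b_intercept = np.polyfit(x_observed, y_observed, 1)
y_fit = m_slope * x_observed + b_intercept

# Residuals: observed minus predicted
residuals = y_observed - y_fit

# Sum of squared residuals, the quantity least squares minimizes
ss_residual = np.sum(residuals ** 2)

print("residuals:", np.round(residuals, 2))
print(f"sum of residuals: {residuals.sum():.6f}")
print(f"sum of squared residuals: {ss_residual:.2f}")
```

A residual plot where the segments are mostly on one side of the line for some range of x would hint that a straight line is not a good model for the data.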
Interactive JupyterLab
You can obtain the interactive JupyterLab notebook at the following link:
Standard Deviation
How about the interpretation of standard deviation relative to the predicted values of the regression line?
First, we need to draw our data series and the regression line.
# Plot the data and regression line
plt.scatter(x_observed, y_observed,
    color=tealScale[9], label='Data Points')
plt.plot(x_observed, y_fit,
    color=tealScale[5], label='Regression Line')
Then plot the standard deviation both above and below the fitted trend line.
plt.plot(x_observed, y_fit + y_std_dev,
    c=tealScale[1], linestyle='--')
plt.plot(x_observed, y_fit - y_std_dev,
    c=tealScale[1], linestyle='--',
    label='Regression ± Standard Deviation')
Then we fill a shaded region, between upper and lower bounds:
plt.fill_between(x_observed,
    y_fit - y_std_dev, y_fit + y_std_dev,
    color=tealScale[1], alpha=0.3,
    label='Standard Deviation')
With the settings above, we can plot the interpretation of standard deviation relative to the fitted trend line.
This should be pretty cool.
Interactive JupyterLab
You can obtain the interactive JupyterLab notebook at the following link:
Standard Error with Level of Confidence
It is basically the same as the shaded region of standard deviation, but instead of just the standard error, we can add a confidence interval.
This confidence interval can be estimated from the OLS fit. For example, let's use a 95% confidence level, which corresponds to a critical value of approximately 1.96.
import numpy as np

def get_CI() -> float:
    # Create regression line
    y_fit = m_slope * x_observed + b_intercept
    y_err = y_observed - y_fit

    # Calculate variance of residuals (MSE)
    var_residuals = np.sum(y_err ** 2) / (n - 2)

    # Standard error of the residuals
    SE = np.sqrt(var_residuals)

    # Half-width of the band at 95% confidence
    # (1.96 is the normal critical value)
    return 1.96 * SE
Then use this calculation to fill the shaded region.
# Fill between upper and lower bounds
CI = get_CI()
plt.fill_between(x_observed,
    y_fit - CI, y_fit + CI,
    color=tealScale[1], alpha=0.3,
    label='Standard Error')
With the confidence interval calculated above, we can plot the interpretation of the standard error of the fitted trend line.
Ultimately, the choice of plot depends on the specific interpretation and communication goals of your analysis. Consult a statistician to get the most valid visual interpretation.
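The band computed by get_CI() has the same width everywhere. As a point of comparison only, and not as the method used in this article, the textbook confidence band for the fitted mean varies with x: it is narrowest near the mean of x and widens toward the edges. The sketch below computes both on made-up data. It keeps the 1.96 normal critical value for simplicity; for small samples a t-distribution quantile would be larger.

```python
import numpy as np

# Made-up sample data, not the article's 50-samples.csv
x_observed = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y_observed = np.array([5, 9, 12, 19, 21, 30], dtype=float)
n = len(x_observed)

m_slope, b_intercept = np.polyfit(x_observed, y_observed, 1)
y_fit = m_slope * x_observed + b_intercept

# Residual standard error, as in get_CI()
s = np.sqrt(np.sum((y_observed - y_fit) ** 2) / (n - 2))

# Constant half-width, as in get_CI()
ci_const = 1.96 * s

# Pointwise half-width for the fitted mean:
# s * sqrt(1/n + (x - x_mean)^2 / Sxx)
x_mean = x_observed.mean()
s_xx = np.sum((x_observed - x_mean) ** 2)
se_fit = s * np.sqrt(1 / n + (x_observed - x_mean) ** 2 / s_xx)
ci_mean = 1.96 * se_fit

for x, half_width in zip(x_observed, ci_mean):
    print(f"x={x:.0f}: pointwise {half_width:.2f}, constant {ci_const:.2f}")
```

To draw the pointwise version, pass y_fit - ci_mean and y_fit + ci_mean to fill_between in place of the constant bounds.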
Interactive JupyterLab
You can obtain the interactive JupyterLab notebook at the following link:
Other Tools: Seaborn
Matplotlib is not the only tool. There is also seaborn, which renders really cool graphics.
Add sns to the import clause.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Getting Matrix Values
pairCSV = np.genfromtxt("50-samples.csv",
    skip_header=1, delimiter=",", dtype=float)
# Extract x and y values from CSV data
x_observed = pairCSV[:, 0]
y_observed = pairCSV[:, 1]
And then this cool one-liner, sns.regplot, does the rest.
# Scatter plot with regression line
plt.figure(figsize=(8, 6))
sns.regplot(x=x_observed, y=y_observed)
plt.title('Scatter Plot with Regression Line')
plt.xlabel('x')
plt.ylabel('y')
plt.grid(True)
plt.show()
And you will get the plot instantly.
At the time of writing, I don't know how seaborn calculates the curve, so I will not explore it further here.
All I know is that this plot is cool, and it has a pretty color palette too.
Interactive JupyterLab
You can obtain the interactive JupyterLab notebook at the following link:
At Last
I welcome any other useful interpretation, or any feedback.
What’s the Next Exciting Step 🤔?
Since we also need to visualize the interpretation of statistical properties against a distribution curve, we first need to learn the basics of plotting a distribution curve.
Consider continuing your exploration by reading [ Trend - Visualizing Distribution ].