Preface

Goal: Pretty statistics visualization with Seaborn, with an example script for each plot.

We need a tool that makes it easy to visualize statistical properties. Fortunately, the seaborn library provides ready-to-use charts made specifically for statistics.

Each example chart in this article comes with its source code. There is no step-by-step tutorial here, as there are already plenty of tutorials on the internet. Our focus is on what you can do with Seaborn in relation to statistical properties.

Note that real-life data analysis is more complex, so real scripts will also be more complex than these simple examples.

Let’s take a tour, and enjoy the view of each chart.


Visualizing Linear Regression

Yes, we are still talking about trends.

Data Series

Instead of just one series, we use three series: ys1, ys2, and ys3:

xs, ys1, ys2, ys3
0,  5,   5,   5
1,  9,   12,  14
2,  13,  25,  41
3,  17,  44,  98
4,  21,  69,  197
5,  25,  100, 350
6,  29,  137, 569
7,  33,  180, 866
8,  37,  229, 1253
9,  41,  284, 1742
10, 45,  345, 2345
11, 49,  412, 3074
12, 53,  485, 3941

Python: Seaborn: Statistical Properties: CSV Source
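The numbers in the table happen to follow simple polynomials in xs: ys1 matches 5 + 4x, ys2 adds 3x², and ys3 adds a further 2x³. That is an observation about the values, not something the data source states. Under that assumption, the sketch below regenerates the same CSV text, in case you do not have series.csv at hand.

```python
import io

import numpy as np

# x values 0..12, as in the table above
xs = np.arange(13)

# Polynomials that reproduce the three columns of the table
ys1 = 5 + 4 * xs
ys2 = ys1 + 3 * xs**2
ys3 = ys2 + 2 * xs**3

# Build the CSV text in memory; save it as "series.csv" if needed
table = np.column_stack([xs, ys1, ys2, ys3])
buffer = io.StringIO()
np.savetxt(buffer, table, fmt="%d", delimiter=", ",
  header="xs, ys1, ys2, ys3", comments="")
csv_text = buffer.getvalue()

print(csv_text)
```

The spacing differs slightly from the listing above, but the header and the values are the same.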

Regression Plot

Plotting a linear regression is straightforward. You can plot all three series at once in one figure.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Getting Matrix Values
pairCSV = np.genfromtxt("series.csv",
  skip_header=1, delimiter=",", dtype=float)

# Extract x and y values from CSV data
xs, ys1, ys2, ys3 = pairCSV.T

# Scatter plot with regression line
plt.figure(figsize=(8, 6))
sns.regplot(x=xs, y=ys1, label='ys1')
sns.regplot(x=xs, y=ys2, label='ys2')
sns.regplot(x=xs, y=ys3, label='ys3')

plt.legend()
plt.show()

Python: Seaborn: Linear Regression: Regression Plot
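To see which straight line a linear regression draws through each series, the same least squares fit can be computed with NumPy. This is only a sketch for checking the numbers, with the table values typed in by hand. It uses np.polyfit and is not seaborn's internal code.

```python
import numpy as np

# Values copied from the data series table
xs = np.arange(13)
ys1 = np.array([5, 9, 13, 17, 21, 25, 29,
  33, 37, 41, 45, 49, 53])
ys2 = np.array([5, 12, 25, 44, 69, 100, 137,
  180, 229, 284, 345, 412, 485])
ys3 = np.array([5, 14, 41, 98, 197, 350, 569,
  866, 1253, 1742, 2345, 3074, 3941])

# Ordinary least squares fit of a straight line for each series
fits = {}
for name, ys in [('ys1', ys1), ('ys2', ys2), ('ys3', ys3)]:
  slope, intercept = np.polyfit(xs, ys, deg=1)
  fits[name] = (slope, intercept)
  print(f"{name}: y = {slope:.2f} x + {intercept:.2f}")
```

Only ys1 is actually linear, so for ys2 and ys3 the fitted line is a rough summary of a curved trend.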

The result of the plot can be visualized as below:

Python: Visualization with Seaborn: Regression Plot

You can obtain the interactive JupyterLab in this following link:

Or, if you wish, you can have three subplots in one figure, with the help of tight layout.

Prepare the data first: get the matrix values, then extract the x and y values from the CSV data.

pairCSV = np.genfromtxt("series.csv",
  skip_header=1, delimiter=",", dtype=float)

xs, ys1, ys2, ys3 = pairCSV.T

Create the subplots, and define a seaborn color palette. You can specify the number of colors here.

# Creating subplots
fig, axs = plt.subplots(1, 3, figsize=(12, 4))

palette = sns.color_palette("husl", 3)

Then plot each scatter plot with its regression line.

pairs = zip([ys1, ys2, ys3], ['ys1', 'ys2', 'ys3'])

for i, (ys, title) in enumerate(pairs):
  sns.regplot(x=xs, y=ys,
    ax=axs[i], color=palette[i])

  axs[i].set_title(title)
  axs[i].set_xlabel('x')
  axs[i].set_ylabel('y')

plt.tight_layout()
plt.show()

Python: Seaborn: Linear Regression: Regression Plot

The resulting plot is shown below, all in the soft colors of the husl palette, which arguably look nicer than the matplotlib defaults.

Python: Visualization with Seaborn: Multiple Regression Plot

Linear Model Plot

LM: Linear Model

We can make the code above simpler with lmplot.

With a pandas dataframe, we can read data from the CSV directly. But beware: the header has spaces after each comma, so we need to strip the leading spaces from the column names.

Before using the dataframe, we need to transform it to long format for the linear model plot. We can do this using the melt method from pandas.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("series.csv") \
  .rename(columns=lambda x: x.strip())

df_melted = pd.melt(df,
  id_vars='xs', var_name='y', value_name='value')

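Then we can draw the scatter plot with regression lines. Since lmplot is a figure-level function that creates its own figure, the size is set with height and aspect instead of plt.figure. For convenience, I also adjust the top margin a bit, so that a title fits in a small figure.

sns.lmplot(x='xs', y='value',
  data=df_melted, hue='y',
  height=6, aspect=8/6)

plt.subplots_adjust(top=0.9)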

Python: Seaborn: Linear Regression: Linear Model Plot
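One detail worth knowing about lmplot is that it opens its own figure and returns a FacetGrid, so a plt.figure call placed before it would only leave an extra empty figure behind. The sketch below checks this on the first five rows of the table. The inlined CSV text, the Agg backend, and the use of skipinitialspace (an alternative to stripping the column names by hand) are choices made for this illustration only.

```python
import io

import matplotlib
matplotlib.use("Agg")  # off-screen backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# First five rows of the table, inlined to keep the sketch self-contained
csv_text = """xs, ys1, ys2, ys3
0,  5,   5,   5
1,  9,   12,  14
2,  13,  25,  41
3,  17,  44,  98
4,  21,  69,  197
"""

# skipinitialspace drops the blanks after each comma, header included
df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)
df_melted = pd.melt(df,
  id_vars='xs', var_name='y', value_name='value')

# No plt.figure() here: lmplot opens its own figure
g = sns.lmplot(data=df_melted, x='xs', y='value', hue='y',
  height=4, aspect=1.5, ci=None)

print(type(g).__name__)
print(len(plt.get_fignums()))
```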

The result of the plot can be visualized as below:

Python: Visualization with Seaborn: Linear Model Plot

Facet Grid

Grid of Plot

Instead of using subplots, we can arrange our plots in a grid. I give you two examples: one with a separate y-axis for each facet, and the other with a y-axis shared by all facets.

First we need to get the matrix values, then convert them to a pandas dataframe. To use the data with FacetGrid, we need to melt the dataframe to long format.

pairCSV = np.genfromtxt("series.csv",
  skip_header=1, delimiter=",", dtype=float)

cols_all = ['xs', 'ys1', 'ys2', 'ys3']
cols_sel = ['ys1', 'ys2', 'ys3']

df = pd.DataFrame(pairCSV, columns=cols_all)

df_melted = pd.melt(df,
  id_vars='xs', var_name='y', value_name='value')

We need to create a FacetGrid with one row and three columns, with a separate y-axis for each facet (sharey=False). Then we can map regplot to each facet.

g = sns.FacetGrid(df_melted,
  col='y', col_wrap=3, height=4, sharey=False)

g.map_dataframe(sns.regplot,
  x='xs', y='value', color='b')

To give each facet its own color, we can iterate over the selected columns and draw regplot again on the axes of each facet, on top of the blue plot from map_dataframe.

In the iteration, we filter the dataframe subset for each ys category, and pick the color for that category from sns.color_palette.

for ax, ys_name in zip(g.axes.flat, cols_sel):
  df_subset = df_melted[
    df_melted['y'] == ys_name]

  color = sns.color_palette("husl", 3)[
    cols_sel.index(ys_name)]

  sns.regplot(x='xs', y='value',
    data=df_subset, ax=ax,
    color=color)

Python: Seaborn: Linear Regression: Facet Grid

The resulting plot is shown below. With sharey=False, each facet has its own y-axis.

Python: Visualization with Seaborn: Facet Grid

If you want, you can share the same y-axis across the facets instead.

With a pandas dataframe, we can read data from the CSV directly. Do not forget to strip leading spaces from the column names. Now we define the selected columns for the ys series.

df = pd.read_csv("series.csv") \
  .rename(columns=lambda x: x.strip())
  
cols_sel = ['ys1', 'ys2', 'ys3']

As usual, we melt the DataFrame to long format for FacetGrid. Then we can create a FacetGrid with seaborn, this time with sharey=True.

df_melted = df.melt(
  id_vars='xs', value_vars=cols_sel)

g = sns.FacetGrid(df_melted,
  col='variable', col_wrap=3,
  sharex=False, sharey=True)

Like the previous example, we iterate over the selected columns and draw regplot on each facet.

for ax, col in zip(g.axes.flatten(), cols_sel):
  df_subset = df.melt(
    id_vars='xs', value_vars=col)

  color = sns.color_palette("husl", 3)[
    cols_sel.index(col)]

  sns.regplot(x='xs', y='value',
    data=df_subset, ax=ax, color=color)

Python: Seaborn: Linear Regression: Facet Grid
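The effect of the sharey setting can be checked directly on the y-limits of the facets. The sketch below builds one grid with sharey=True and one with sharey=False from the first five rows of the table, then compares the limits. The inlined data, the Agg backend, and the use of scatterplot instead of regplot are simplifications for this illustration.

```python
import io

import matplotlib
matplotlib.use("Agg")  # off-screen backend, so no display is needed
import pandas as pd
import seaborn as sns

# First five rows of the table, inlined to keep the sketch self-contained
csv_text = """xs, ys1, ys2, ys3
0,  5,   5,   5
1,  9,   12,  14
2,  13,  25,  41
3,  17,  44,  98
4,  21,  69,  197
"""

df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)
df_melted = pd.melt(df,
  id_vars='xs', var_name='y', value_name='value')

# Same data, two grids: shared versus independent y-axis
g_shared = sns.FacetGrid(df_melted, col='y', sharey=True)
g_shared.map_dataframe(sns.scatterplot, x='xs', y='value')

g_free = sns.FacetGrid(df_melted, col='y', sharey=False)
g_free.map_dataframe(sns.scatterplot, x='xs', y='value')

# Collect the y-limits of every facet
shared_limits = [ax.get_ylim() for ax in g_shared.axes.flat]
free_limits = [ax.get_ylim() for ax in g_free.axes.flat]

print(shared_limits)
print(free_limits)
```

A shared axis makes the three series comparable at a glance, while independent axes let the flatter ys1 series fill its own facet.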

The result of the plot can be visualized as below:

Python: Visualization with Seaborn: Facet Grid


Visualizing Statistics Properties

We have four plots with almost identical settings:

  1. Boxplot
  2. Violinplot
  3. Swarmplot
  4. Stripplot

Preparing Dataframe

These plots require the same data preparation. As usual, you can either read the dataframe with pandas directly, or use numpy’s np.genfromtxt.

First we need to get the matrix values, then convert them to a pandas dataframe. For use with these four kinds of plot, we need to melt the dataframe to long format.

pairCSV = np.genfromtxt("series.csv",
  skip_header=1, delimiter=",", dtype=float)

cols_all = ['xs', 'ys1', 'ys2', 'ys3']
df = pd.DataFrame(pairCSV, columns=cols_all)

df_melted = pd.melt(df,
  id_vars='xs', var_name='y', value_name='value')

Python: Seaborn: Statistics Properties: Preparing Dataframe
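If the long format is unfamiliar, a small example may help. The sketch below melts only the first three rows of the table, so that the whole result fits on screen. The inlined CSV text and skipinitialspace are choices made for this illustration.

```python
import io

import pandas as pd

# First three rows of the table, inlined to keep the sketch self-contained
csv_text = """xs, ys1, ys2, ys3
0,  5,   5,   5
1,  9,   12,  14
2,  13,  25,  41
"""

# Wide format: one column per series
df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

# Long format: one row per (xs, series name, value)
df_melted = pd.melt(df,
  id_vars='xs', var_name='y', value_name='value')

print(df)
print(df_melted)
```

Each of the three wide rows becomes three long rows, one per series, which is the shape that the x='y', y='value' arguments in the plots below expect.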

Box Plot

The box plot is the most common visualization.

Creating a boxplot is as simple as below:

plt.figure(figsize=(8, 6))
sns.boxplot(x='y', y='value', data=df_melted)

Python: Seaborn: Statistics Properties: Box Plot
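The statistics behind a box can be computed by hand: the quartiles give the box, and the whiskers reach the last data points within 1.5 times the interquartile range. The sketch below does this for ys3 with NumPy. The 1.5 factor is the usual convention and is assumed here rather than taken from the article.

```python
import numpy as np

# ys3 values copied from the data series table
ys3 = np.array([5, 14, 41, 98, 197, 350, 569,
  866, 1253, 1742, 2345, 3074, 3941], dtype=float)

# The box: first quartile, median, third quartile
q1, median, q3 = np.percentile(ys3, [25, 50, 75])
iqr = q3 - q1

# The fences: anything outside is drawn as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# The whiskers end at the most extreme points inside the fences
inside = ys3[(ys3 >= lower_fence) & (ys3 <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = ys3[(ys3 < lower_fence) | (ys3 > upper_fence)]

print(q1, median, q3)
print(whisker_low, whisker_high, outliers)
```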

The result of the plot can be visualized as below:

Python: Visualization with Seaborn: Box Plot

Violin Plot

The violin plot is a close relative of the box plot. The outline on each side is a kernel density estimate, which is basically the sum of normal distributions centered on each data point.

Creating a violinplot is also simple.

plt.figure(figsize=(8, 6))
sns.violinplot(x='y', y='value', data=df_melted)

Python: Seaborn: Statistics Properties: Violin Plot

The result of the plot can be visualized as below:

Python: Visualization with Seaborn: Violin Plot

Swarm Plot

There are other visualizations as well.

We can define the colors for the swarmplot, adjusting the number of colors as needed, so that each category gets a different color.

colors = sns.color_palette("husl", 3)

plt.figure(figsize=(8, 6))
sns.swarmplot(x='y', y='value',
  hue='y', data=df_melted, palette=colors)

Python: Seaborn: Statistics Properties: Swarm Plot

The result of the plot can be visualized as below:

Python: Visualization with Seaborn: Swarm Plot

Strip Plot

This looks like the swarm plot, but with a small random offset (jitter) for each dot, so we can see where the data points overlap each other.

Just like with swarmplot, we can define the colors, adjusting the number of colors as needed, so we can create the stripplot with different colors.

colors = sns.color_palette("husl", 3)

plt.figure(figsize=(8, 6))
sns.stripplot(x='y', y='value', data=df_melted,
  hue='y', palette=colors, dodge=True)

Python: Seaborn: Statistics Properties: Strip Plot

The result of the plot can be visualized as below:

Python: Visualization with Seaborn: Strip Plot


Visualizing Distribution

Compared to matplotlib, visualizing a distribution is much easier with seaborn.

KDE Plot

Kernel Density Estimation

The KDE curve is the sum of one normal (Gaussian) kernel per data point in a series, scaled so that the total area is one.

As usual, we prepare the data first, then set the seaborn decoration such as the style, and define a color palette for the KDE plot, with as many colors as you need.

df = pd.read_csv("series.csv") \
  .rename(columns=lambda x: x.strip())

sns.set_style("whitegrid")
palette = sns.color_palette("husl", 3)

plt.figure(figsize=(8, 6))

And create a KDE plot for each ys category, with a legend for the labels.

for i, col in enumerate(['ys1', 'ys2', 'ys3']):
  sns.kdeplot(data=df[col],
    color=palette[i], label=col)

plt.legend()
plt.show()

Python: Seaborn: Visualizing Distribution: KDE Plot
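The idea of a KDE as a sum of normal curves can be written out in a few lines of NumPy. The sketch below puts one Gaussian bump on each ys1 value and averages them. The bandwidth uses Scott's rule of thumb, which is an assumption for this illustration, so the curve is not guaranteed to match the seaborn output exactly.

```python
import numpy as np

# ys1 values copied from the data series table
ys1 = np.array([5, 9, 13, 17, 21, 25, 29,
  33, 37, 41, 45, 49, 53], dtype=float)

# Bandwidth from Scott's rule of thumb (assumed for this sketch)
n = ys1.size
bandwidth = ys1.std(ddof=1) * n ** (-1 / 5)

# Evaluation grid, padded so that the tails are covered
grid = np.linspace(
  ys1.min() - 4 * bandwidth, ys1.max() + 4 * bandwidth, 400)

# One Gaussian bump per data point, then the average of all bumps
z = (grid[:, None] - ys1[None, :]) / bandwidth
kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
density = kernels.mean(axis=1)

# The area under the curve should be close to one
dx = grid[1] - grid[0]
area = density.sum() * dx

print(bandwidth, area)
```

A smaller bandwidth gives a bumpier curve, and a larger one gives a smoother curve.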

The result of the plot can be visualized as below:

Python: Visualization with Seaborn: KDE Plot

If you wish, you can customize the style with other parameters.

df = pd.read_csv("series.csv") \
  .rename(columns=lambda x: x.strip())
df_melted = pd.melt(df, id_vars='xs',
  var_name='Category', value_name='Value')

sns.set_style("darkgrid")

plt.figure(figsize=(8, 6))

Then we can create the KDE plot for all categories with a single call.

sns.kdeplot(data=df_melted,
  x='Value', hue='Category', palette='deep',
  alpha=0.7, multiple='stack', linewidth=2)

Python: Seaborn: Visualizing Distribution: KDE Plot

The result of the plot can be visualized as below:

Python: Visualization with Seaborn: KDE Plot

Rug Plot

Sometimes all you need is just the ticks. You can do this with a rug plot.

As usual, we need to melt the dataframe to long format for rugplot.

df = pd.read_csv("series.csv")

df_melted = pd.melt(df, id_vars='xs',
  var_name='Category', value_name='Value')

For decoration purposes, we define a color palette for the rug plots, with one color less than the number of columns, because ‘xs’ is excluded.

palette = sns.color_palette(
  "husl", len(df.columns) - 1)  

plt.figure(figsize=(8, 6))

Then we can create a rug plot for each category, with the ‘xs’ column excluded.

for i, col in enumerate(df.columns[1:]):
  df_subset = df_melted[df_melted['Category'] == col]
  sns.rugplot(data=df_subset, x='Value',
    color=palette[i], label=col, alpha=0.7)

Python: Seaborn: Visualizing Distribution: Rug Plot

The resulting plot is shown below. This looks like an empty chart at first, but you can see the ticks along the bottom of the figure.

Python: Visualization with Seaborn: Rug Plot

Histogram Plot

The histogram is a very basic plot, and it is also available in matplotlib. So what is so special about this histogram?

With seaborn, we can overlay a KDE curve on the histogram.

As usual, we need to prepare the data. Then select the columns ys1, ys2, and ys3, and create a figure.

df = pd.read_csv("series.csv") \
  .rename(columns=lambda x: x.strip())

cols_selected = ['ys1', 'ys2', 'ys3']

plt.figure(figsize=(8, 6))

This way we can plot a histogram for the selected columns with histplot.

sns.histplot(data=df[cols_selected],
  kde=True, element='step',
  multiple='layer', palette='husl')

Python: Seaborn: Visualizing Distribution: Histogram Plot
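Underneath the bars, a histogram is just a count of how many values fall into each bin. The sketch below counts the ys1 values with NumPy. The choice of four bins is arbitrary for this illustration, and seaborn picks its own bins by default, so the bars in the figure may be cut differently.

```python
import numpy as np

# ys1 values copied from the data series table
ys1 = np.array([5, 9, 13, 17, 21, 25, 29,
  33, 37, 41, 45, 49, 53], dtype=float)

# Four equal-width bins between the smallest and largest value
counts, edges = np.histogram(ys1, bins=4)

# Print each bin with the number of values inside it
for left, right, count in zip(edges[:-1], edges[1:], counts):
  print(f"{left:5.1f} to {right:5.1f}: {count}")
```

Since ys1 is evenly spaced, the counts are nearly flat, which is why its KDE curve looks like a wide plateau.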

The result of the plot can be visualized as below:

Python: Visualization with Seaborn: Histogram Plot

Distribution Plot

This is similar to the plot above, but displot is a figure-level function: it draws the histogram, and can add the KDE curve and the rug ticks to the same figure.

As above, we need to select the columns ys1, ys2, and ys3.

df = pd.read_csv("series.csv") \
  .rename(columns=lambda x: x.strip())

cols_selected = ['ys1', 'ys2', 'ys3']
df_selected = df[cols_selected]

Let’s decorate the figure as usual, by defining a color palette for the displot. There is no plt.figure call here, because displot creates its own figure; the size is set with height and aspect.

palette = sns.color_palette(
  "husl", len(cols_selected))

Now we can create the displot for the selected columns.

sns.displot(data=df_selected,
  kind='hist', rug=True, kde=True,
  palette=palette, alpha=0.7, multiple='layer',
  height=6, aspect=8/6)

Python: Seaborn: Visualizing Distribution: Distribution Plot

The result of the plot can be visualized as below:

Python: Visualization with Seaborn: Distribution Plot


Further Visualization

We can combine different information in one figure. For example, the two plots below have marginal plots on the top and right sides.

Joint Plot

The first approach is using a single plot function, and putting the marginal settings inside its arguments.

For example, we can use seaborn’s jointplot to create a scatter plot with a regression line, with histograms and KDE curves on the margins.

df = pd.read_csv("series.csv") \
  .rename(columns=lambda x: x.strip())

sns.jointplot(data=df, x='xs', y='ys3',
  kind='reg', marginal_kws={'fill': True})

Python: Seaborn: Further Visualization: Joint Plot

The result of the plot can be visualized as below:

Python: Visualization with Seaborn: Joint Plot

Joint Grid

The second approach is using a grid, so we can customize the marginals manually.

First we need to create a JointGrid object. Then draw the regression plot in the center, and box plots on the marginal axes.

df = pd.read_csv("series.csv") \
  .rename(columns=lambda x: x.strip())

g = sns.JointGrid(data=df, x='xs', y='ys3')
g.plot_joint(sns.regplot)
g.plot_marginals(sns.boxplot)

Python: Seaborn: Further Visualization: Joint Grid

The result of the plot can be visualized as below:

Python: Visualization with Seaborn: Joint Grid


What Comes Next 🤔?

We can now visualize statistical properties in a practical way.

I would really like to explore PSPPire. PSPP is a free, open source alternative to SPSS.

Consider continuing your exploration with [ Trend - Properties - PSPPire ].