Preface
Goal: Explore statistical plot visualization in Julia. The data is provided using a linear model.
There are multiple plotting libraries in Julia, including StatsPlots, Gadfly, and VegaLite. I haven't explored them deeply.
Distribution
We can start with the normal distribution.
The pdf (probability density function) can be used to calculate the corresponding y-values for the standard normal distribution.
Normal Distribution
We need to make a data series.
Using the Distributions library, we can generate data points for the x-axis, then calculate the corresponding y-values for a standard normal distribution.
using StatsPlots, Distributions
x = range(-5, 5, length=1000)
y = pdf.(Normal(), x)
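As a quick sanity check on what pdf returns, the sketch below compares it with the closed-form standard normal density, exp(-x^2 / 2) / sqrt(2π). The helper name phi is ours, introduced only for illustration.

```julia
using Distributions

# Closed-form standard normal density (hypothetical helper, for comparison only).
phi(x) = exp(-x^2 / 2) / sqrt(2π)

x = range(-5, 5, length=1000)
y = pdf.(Normal(), x)   # broadcast pdf over every x value

# Largest gap between the library values and the formula.
println(maximum(abs.(y .- phi.(x))))
```

The printed gap should be at floating-point noise level, which confirms that y really is the bell curve evaluated at each x.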
From these x and y series, we can plot the normal distribution, along with the labels and title.
plot(
x, y, fillrange = zero(x), fillalpha = 0.35,
color=:black,
label="Standard Normal Distribution", lw=1)
xlabel!("x")
ylabel!("Density")
title!("Standard Normal Distribution with Quantiles")
The plot result can be shown as follows:
Normal Distribution with Quantiles
No luck
I had no luck visualizing quantiles with Julia.
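One possible way forward, as a minimal sketch rather than a finished solution: compute the quartiles with quantile from Distributions, then mark them with vertical lines using vline! from Plots. The choice of the 25%, 50%, and 75% levels is our assumption.

```julia
using StatsPlots, Distributions

d = Normal()
x = range(-5, 5, length=1000)

# Quartiles of the standard normal (levels chosen for illustration).
qs = quantile.(d, [0.25, 0.5, 0.75])

plot(x, pdf.(d, x), color=:black, label="Standard Normal", lw=1)
# One dashed vertical line per quantile.
vline!(qs, color=:red, linestyle=:dash, label="Quartiles")
xlabel!("x")
ylabel!("Density")
```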
Kurtosis
With the pdf method, we can simulate kurtosis and skewness.
We start by generating data points for the x-axis and calculating the corresponding y-values for the standard normal distribution.
using StatsPlots, Distributions
x = range(-5, 5, length=1000)
y_standard = pdf.(Normal(), x)
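Let's make examples of distributions with different levels of kurtosis. Strictly speaking, every normal distribution has an excess kurtosis of 0, so the curves below only vary the standard deviation to mimic a sharper or flatter peak.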
- Standard normal distribution (Kurtosis = 0)
- Lower kurtosis
- Higher kurtosis
y_kurtosis_1 = pdf.(Normal(1, 1), x)
y_kurtosis_2 = pdf.(Normal(1, 0.5), x)
y_kurtosis_3 = pdf.(Normal(1, 2), x)
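As a quick check on the kurtosis values themselves, we can ask Distributions directly. This is a small sketch: kurtosis returns the excess kurtosis, and the Uniform and Laplace distributions are our own additions for contrast, not part of the plot.

```julia
using Distributions

# Shifting or rescaling a normal distribution leaves its excess kurtosis unchanged.
for d in (Normal(), Normal(1, 0.5), Normal(1, 2))
    println(d, " => ", kurtosis(d))
end

# Distributions whose excess kurtosis really differs from zero.
println(kurtosis(Uniform(0, 1)))   # lighter tails: negative
println(kurtosis(Laplace(0, 1)))   # heavier tails: positive
```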
Make our first plot, using the standard normal distribution.
# Plot the standard normal distribution as the baseline
plot(
x, y_standard, color=:black,
label="Standard Normal",
title = "Normal Distribution "
* "with Different Kurtosis",
xlabel = "x", ylabel = "Density",
)
Then add each level of kurtosis to the same plot.
# distributions with different levels of kurtosis
plot!(
x, y_kurtosis_1, color=:red,
label="Standard Kurtosis = 0",
linestyle=:dash,
)
plot!(
x, y_kurtosis_2, color=:green,
label="Lower Kurtosis",
linestyle=:dash,
)
plot!(
x, y_kurtosis_3, color=:blue,
label="Higher Kurtosis",
linestyle=:dash,
)
The plot result can be shown as follows:
Skewness
The same approach can be applied to skewness.
using StatsPlots, Distributions
x = range(-5, 5, length=1000)
y_standard = pdf.(Normal(), x)
Let's make examples of distributions with different skewness parameters. Each curve is a skew-normal density, 2 φ(x) Φ(αx), where the shape parameter α sets the direction and strength of the skew.
- Negative skewness (α = -4)
- Moderate positive skewness (α = 2)
- High positive skewness (α = 6)
y_skewed_1 = (2 .* pdf.(Normal(), x)
.* cdf.(Normal(), -4 .* x))
y_skewed_2 = (2 .* pdf.(Normal(), x)
.* cdf.(Normal(), 2 .* x))
y_skewed_3 = (2 .* pdf.(Normal(), x)
.* cdf.(Normal(), 6 .* x))
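To see what a skewness parameter does, here is a minimal sketch of the skew-normal density 2 φ(x) Φ(αx), checked numerically. The helper name skewnormal_pdf and the α values -4, 2, and 6 are assumptions for illustration. A valid density should have an area close to 1, and the sign of its mean should follow the sign of α.

```julia
using Distributions

# Skew-normal density with shape parameter alpha (hypothetical helper).
skewnormal_pdf(x, alpha) = 2 * pdf(Normal(), x) * cdf(Normal(), alpha * x)

x = range(-8, 8, length=4001)
dx = step(x)

for alpha in (-4, 2, 6)
    y = skewnormal_pdf.(x, alpha)
    area = sum(y) * dx            # Riemann sum, should be close to 1
    mean_x = sum(x .* y) * dx     # negative for alpha < 0, positive for alpha > 0
    println("alpha = ", alpha,
            ", area ≈ ", round(area, digits=4),
            ", mean ≈ ", round(mean_x, digits=4))
end
```

Note that multiplying the whole curve by an extra constant would break the area check, so the skew has to come from α inside the cdf.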
Make our first plot, using the standard normal distribution.
plot(
x, y_standard, color=:black,
label="Standard Normal",
title = "Normal Distribution "
* "with Different Skewness",
xlabel = "x", ylabel = "Density",
)
Then add each distribution with a different skewness parameter to the same plot.
plot!(
x, y_skewed_1, color=:red,
label="Negative Skewness = -4",
linestyle=:dash,
)
plot!(
x, y_skewed_2, color=:green,
label="Moderate Positive Skewness = 2",
linestyle=:dash,
)
plot!(
x, y_skewed_3, color=:blue,
label="High Positive Skewness = 6",
linestyle=:dash,
)
The plot result can be shown as follows:
Multiple Series
From the perspective of visualization, we can display different series either in one plot or in separate plots using a grid.
Regression
This can be done in these four steps.
- Scatter plot for each series.
- Line plot for each series.
- Calculate the standard error of each series.
- Add a shaded region for the standard error of each series.
In total, this draws six plots: three scatter plots and three line plots.
As usual, read the data from the CSV file into a dataframe, then extract x and each of the y values from the CSV data.
using CSV, DataFrames, Statistics, StatsPlots, ColorSchemes
df = CSV.read("series.csv", DataFrame, types=Dict())
rename!(df, Symbol.(strip.(string.(names(df)))))
xs = df.xs
ys1 = df.ys1
ys2 = df.ys2
ys3 = df.ys3
Scatter plot for each series, without regression lines.
scatter(
xs, ys1, label="ys1",
seriestype=:scatter, color=:red,
legend=:topright)
scatter!(
xs, ys2, label="ys2",
seriestype=:scatter, color=:green)
scatter!(
xs, ys3, label="ys3",
seriestype=:scatter, color=:blue)
Calculate the standard error of each series.
se1 = std(ys1) / sqrt(length(ys1))
se2 = std(ys2) / sqrt(length(ys2))
se3 = std(ys3) / sqrt(length(ys3))
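The formula above is the standard error of the mean: the sample standard deviation divided by the square root of the sample size. A tiny sketch with made-up numbers (not taken from series.csv), using a hypothetical helper named standard_error:

```julia
using Statistics

# Standard error of the mean: sample standard deviation over sqrt(n).
standard_error(v) = std(v) / sqrt(length(v))

ys = [2.0, 4.0, 4.0, 6.0, 9.0]   # made-up values
println(standard_error(ys))
```

Because this is a single number per series, the ribbon drawn from it has the same width at every x.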
Also define a color scheme for shading.
colors = ColorSchemes.magma.colors
Line plot for each series, along with a shaded region drawn using ribbon, representing the standard error of each series.
plot!(
xs, ys1, label="", color=colors[1],
ribbon=(se1, se1), fillalpha=0.3)
plot!(
xs, ys2, label="", color=colors[2],
ribbon=(se2, se2), fillalpha=0.3)
plot!(
xs, ys3, label="", color=colors[3],
ribbon=(se3, se3), fillalpha=0.3)
The plot result can be shown as follows:
Combined
In some cases, it is better to separate the results, for example when each series needs its own y-axis scale.
As usual, read the data from the CSV file into a dataframe, then extract x and each of the y values from the CSV data. Calculate the standard error for each y series. Also consider the aesthetics by defining a color scheme for shading.
using CSV, DataFrames, Statistics, StatsPlots, ColorSchemes
df = CSV.read("series.csv", DataFrame, types=Dict())
rename!(df, Symbol.(strip.(string.(names(df)))))
xs = df.xs
ys1 = df.ys1
ys2 = df.ys2
ys3 = df.ys3
se1 = std(ys1) / sqrt(length(ys1))
se2 = std(ys2) / sqrt(length(ys2))
se3 = std(ys3) / sqrt(length(ys3))
colors = ColorSchemes.magma.colors
From this we can draw a plot for each series: ys1, ys2, and ys3.
plot1 = scatter(
xs, ys1, label="ys1",
seriestype=:scatter, color=:red)
...
plot2 = scatter(
xs, ys2, label="ys2",
seriestype=:scatter, color=:green)
...
plot3 = scatter(
xs, ys3, label="ys3",
seriestype=:scatter, color=:blue)
...
Now we can combine plots into a single figure.
plot_combined = plot(
plot1, plot2, plot3, layout=(1, 3))
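The same idea can be written as a loop. This is a minimal sketch under our own assumptions: the data is synthetic instead of being read from series.csv, and the names labels, ys_all, series_colors, and subplots are hypothetical. Each subplot gets its own y-axis scale automatically.

```julia
using StatsPlots, Statistics

# Synthetic stand-in data; the post reads the real values from series.csv.
xs = 0:12
labels = ["ys1", "ys2", "ys3"]
ys_all = [5 .+ 2 .* xs, 100 .+ xs .^ 2, 1000 .- 30 .* xs]
series_colors = [:red, :green, :blue]

# Build one subplot per series: scatter points, then a line with an error ribbon.
subplots = map(zip(labels, ys_all, series_colors)) do (name, ys, c)
    se = std(ys) / sqrt(length(ys))
    p = scatter(xs, ys, label=name, color=c)
    plot!(p, xs, ys, label="", color=c, ribbon=se, fillalpha=0.3)
    p
end

# Arrange the three subplots side by side.
plot_combined = plot(subplots..., layout=(1, 3), size=(900, 300))
```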
The plot result can be shown as follows:
Statistical Properties: StatsPlots
Three series in one axis plot
As you can see from the previous statistical properties, we can analyze the data for each series.
For example, we can consider just the y-series and obtain the mean, median, and mode, as well as the minimum, maximum, range, and quantiles.
We can use StatsPlots for a simple box plot and violin plot.
But we require Gadfly to draw a swarm plot.
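Before plotting, the properties listed above can be computed directly. A small sketch with made-up values, assuming the StatsBase package is installed for mode (the rest comes from the Statistics standard library):

```julia
using Statistics, StatsBase

ys = [2, 4, 4, 6, 9]   # made-up values, not from series.csv

println("mean      = ", mean(ys))
println("median    = ", median(ys))
println("mode      = ", mode(ys))   # mode comes from StatsBase
lo, hi = extrema(ys)
println("min, max  = ", (lo, hi), ", range = ", hi - lo)
println("quartiles = ", quantile(ys, [0.25, 0.5, 0.75]))
```

These are the same numbers that a box plot encodes visually: the median as the center line and the quartiles as the box edges.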
StatsPlots: Box Plot
There is a boxplot method in StatsPlots.
We need to read data from CSV file, then extract the columns ys1, ys2, and ys3.
using CSV, DataFrames, StatsPlots
df = CSV.read("series.csv", DataFrame)
rename!(df, Symbol.(strip.(string.(names(df)))))
data = [df.ys1, df.ys2, df.ys3]
And call boxplot directly to create a box plot using StatsPlots.
boxplot(data,
labels = ["ys1", "ys2", "ys3"],
linecolor = :black,
legend = false,
xlabel = "Variable",
ylabel = "Value",
title = "Box Plot for ys1, ys2, and ys3",
grid = false)
The plot result can be shown as follows:
StatsPlots: Violin Plot
There is also a violin method in StatsPlots.
With the same data, we can call the method directly to create a violin plot using StatsPlots.
violin(data,
labels = ["ys1", "ys2", "ys3"],
linecolor = :black,
legend = false,
xlabel = "Variable",
ylabel = "Value",
title = "Violin Plot for ys1, ys2, and ys3",
grid = false)
The plot result can be shown as follows:
Unfortunately, I can't draw a swarm plot or a strip plot using StatsPlots, so I'm looking for something else.
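One thing that may be worth trying first: StatsPlots also has a dotplot recipe, which draws the individual observations per group and comes close to a strip plot. The sketch below is untested against series.csv; the random data and the names data and groups are made up for illustration.

```julia
using StatsPlots

# Made-up samples standing in for ys1, ys2, and ys3.
data = [randn(50) .+ 5, randn(50) .* 2 .+ 8, randn(50) .+ 12]
groups = repeat(["ys1", "ys2", "ys3"], inner=50)

# One column of dots per group label.
dotplot(groups, vcat(data...),
    marker = (:black, stroke(0)),
    legend = false,
    xlabel = "Variable",
    ylabel = "Value",
    title = "Dot Plot for ys1, ys2, and ys3")
```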
Statistical Properties: Gadfly
Three series in one axis plot
Gadfly: Box Plot
To draw a box plot with Gadfly, we need to import Cairo and Fontconfig.
using CSV, DataFrames, Gadfly
import Cairo, Fontconfig
We need to melt the DataFrame to long format.
df = CSV.read("series.csv", DataFrame)
rename!(df, Symbol.(strip.(string.(names(df)))))
df_long = stack(df, Not(:xs))
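To see what melting does, here is a small sketch on a made-up table with the same column names. The xs column is kept as the identifier, while every other column is turned into rows with a variable name and a value.

```julia
using DataFrames

# Tiny made-up table with the same column names as series.csv.
df = DataFrame(xs = [0, 1], ys1 = [5, 9], ys2 = [5, 12], ys3 = [10, 13])

# Keep xs as the identifier; melt all other columns into (variable, value) rows.
df_long = stack(df, Not(:xs))
println(df_long)
```

The long format is what Gadfly expects: one column to map to the x-axis and color (variable) and one column to map to the y-axis (value).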
And use Geom.boxplot directly to create a box plot using Gadfly.
box_plot = Gadfly.plot(
df_long,
x=:variable,
y=:value,
color=:variable,
Geom.boxplot(),
Guide.xlabel("Variable"),
Guide.ylabel("Value"),
Guide.title("Box Plot for ys1, ys2, and ys3"),
Theme(
key_position = :top,
boxplot_spacing = 100px,
background_color = "white",
)
)
The plot result can be shown as follows:
Gadfly: Violin Plot
The violin plot in Gadfly also needs Cairo and Fontconfig to be imported.
With the same data, we can use Geom.violin directly to create a violin plot using Gadfly.
violin_plot = Gadfly.plot(
df_long,
x=:variable,
y=:value,
color=:variable,
Geom.violin,
Guide.xlabel("Variable"),
Guide.ylabel("Value"),
Guide.title("Violin Plot for ys1, ys2, and ys3"),
Coord.cartesian(ymin=0),
Scale.y_continuous(minvalue=0),
Theme(
key_position=:top,
default_color="purple",
background_color="white",
panel_stroke=colorant"gray",
minor_label_font_size=10pt,
major_label_font_size=12pt,
)
)
The plot result can be shown as follows:
Gadfly: Swarm Plot
To draw a swarm plot, we need the same Gadfly.plot call as the box plot, but with Geom.beeswarm() as the geometry.
swarm_plot = Gadfly.plot(
df_long,
x=:variable,
y=:value,
color=:variable,
Geom.beeswarm(),
Guide.xlabel("Variable"),
Guide.ylabel("Value"),
Guide.title("Swarm Plot for ys1, ys2, and ys3"),
Theme(
key_position = :top,
background_color = "white",
)
)
The plot result can be shown as follows:
Statistical Properties: Distribution
Just like in the previous plots, we can analyze the y-values, but this time by the frequency of each series.
KDE Plot
Kernel Density Estimation
A KDE shows the distribution of the frequency well.
This complex task can be done easily with the density method from StatsPlots.
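To make the idea behind KDE concrete, here is a hand-rolled sketch: place a small Gaussian bump on every observation and average the bumps. The helper name kde_at, the sample values, and the bandwidth h = 1.0 are all assumptions for illustration. A plotting library normally picks the bandwidth for you, so its curve will not match this one exactly.

```julia
using Distributions, Statistics

# Gaussian kernel density estimate at x0 with bandwidth h (hypothetical helper).
kde_at(x0, data, h) = mean(pdf.(Normal(), (x0 .- data) ./ h)) / h

data = [1.0, 2.0, 2.5, 4.0, 7.0]   # made-up sample
h = 1.0                            # bandwidth chosen by hand

xgrid = range(-5, 15, length=2001)
density_values = kde_at.(xgrid, Ref(data), h)

# A density should integrate to roughly 1.
println(sum(density_values) * step(xgrid))
```

A smaller h gives a spikier curve that follows each observation, while a larger h smooths the observations into one broad hill.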
We need to melt the DataFrame to long format.
using CSV, DataFrames, StatsPlots
df = CSV.read("series.csv", DataFrame)
rename!(df, Symbol.(strip.(string.(names(df)))))
df_long = stack(df, Not(:xs))
Now we can create the KDE plot using StatsPlots.
kde_plot = density(
df_long.value,
group = df_long.variable,
fillalpha = 0.7,
legend = :topright,
xlabel = "Value",
ylabel = "Density",
title = "KDE Plot for ys1, ys2, and ys3",
lw = 2, # Line width
α = 0.5 # Opacity
)
The plot result can be shown as follows:
Rug Plot
No Luck
You know, I still have no luck drawing this plot in Julia. I'd better come back to it later on.
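One workaround that might be worth trying, as an untested sketch: draw the density curve, then add one small vertical tick per observation at y = 0 by using scatter! with the :vline marker shape. The sample values and the name obs are made up.

```julia
using StatsPlots

obs = [1.0, 2.0, 2.5, 4.0, 7.0]   # made-up sample

# Density curve first, then one vertical tick per observation along y = 0.
density(obs, label="KDE", xlabel="Value", ylabel="Density")
scatter!(obs, zeros(length(obs)),
    marker = :vline, markersize = 10,
    color = :black, label = "Rug")
```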
Histogram
This looks like the most common chart for beginners.
This simple task can be done easily with the histogram method from StatsPlots.
using CSV, DataFrames, StatsPlots
df = CSV.read("series.csv", DataFrame)
rename!(df, Symbol.(strip.(string.(names(df)))))
df_long = stack(df, Not(:xs))
Now we can create the histogram using StatsPlots, with an attempt at custom colors.
hist_plot = histogram(
df_long.value,
group = df_long.variable,
bins = collect(0:50:maximum(df_long.value)),
linecolor = :black,
fillalpha = 0.7,
color = :Set1,
xlabel = "Value",
ylabel = "Frequency",
title = "Histogram Plot for ys1, ys2, and ys3",
legend = :topleft
)
The plot result can be shown as follows:
Unfortunately, I still don't know how to use custom colors.
Well, I should learn more. I'll do it later, when I've got the time.
Marginal
No Luck
Unfortunately, I have to explore more.
I apologize.
What’s the Next Chapter 🤔?
You can obtain the interactive JupyterLab notebook at the following link:
- [github.com/…/trend/.ipynb]
We can visualize statistical properties in a practical way.
Besides statistical analysis with Python, R, and Julia, we can go further to TypeScript and Go, so you can integrate it with your application seamlessly.
But currently I'm pretty busy with my job.
Conclusion
It is fun, right?
What do you think?
Farewell. We shall meet again.