Preface
Goal: Explore visualization in the R programming language with ggplot2, and keep a collection of example plots at your fingertips.
Let’s continue our previous ggplot2 journey.
It is easy once we embrace the grammar.
Distribution
We can start with the normal distribution.
The dnorm function can be used to calculate the corresponding y-values for the standard normal distribution.
y <- dnorm(x)
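As a quick sanity check, the values returned by dnorm can be compared against the closed-form density of the standard normal distribution, exp(-x^2 / 2) / sqrt(2 * pi). This is only a minimal sketch, and the helper name manual_density is made up for illustration.

```r
# Grid of x values, the same range used later in this article
x <- seq(-5, 5, length.out = 1000)

# Density from the built-in function
y <- dnorm(x)

# Hypothetical helper: the closed-form standard normal density
manual_density <- function(x) exp(-x^2 / 2) / sqrt(2 * pi)

# Both approaches should agree up to floating point error
print(all.equal(y, manual_density(x)))

# The density peaks at x = 0 with height 1 / sqrt(2 * pi), about 0.3989
print(dnorm(0))
```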
Normal Distribution
geom_line
We start by loading the required library.
Then we generate data points for the x-axis,
use dnorm to calculate the corresponding y-values for a standard normal distribution,
and put both into a data frame for plotting.
library(ggplot2)
x <- seq(-5, 5, length.out = 1000)
y <- dnorm(x)
df <- data.frame(x = x, y = y)
This way we can plot the normal distribution using geom_line.
Then we adjust the decoration such as the grid, labels and title.
And finally we save the plot as a PNG.
plot <- ggplot(df, aes(x = x, y = y)) +
  geom_line(color = "black")
plot <- plot +
  theme_minimal() +
  theme(
    text = element_text(size = 4),
    panel.grid = element_blank()) +
  labs(
    x = "x", y = "Density",
    title = paste0(
      "Standard Normal ",
      "Distribution with Quantiles"))
ggsave("63-normal.png", plot,
  width = 800, height = 400, units = "px")
Normal Distribution with Quantiles
geom_area
We can add quantile regions to the plot above.
First we calculate the quantiles, based on the defined percentile marks.
Note that quantile(x) works on the x grid itself, so the cut points are evenly spaced over the range from -5 to 5.
percentiles <- c(25, 50, 75, 100)
quantiles <- quantile(x, probs = percentiles / 100)
Then we add the shaded regions corresponding to these percentiles to the plot grammar. This can be done by using geom_area.
for (i in seq_along(quantiles)) {
  plot <- plot + geom_area(
    data = subset(df, x <= quantiles[i]),
    aes(x = x, y = y),
    fill = i, alpha = 0.3)
}
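The quantile call above operates on the x grid. If the cut points should instead follow the probability mass of the standard normal distribution, qnorm is one option. The following is a sketch of that alternative, not the approach used in this article. The 100th percentile is left out here because qnorm(1) is infinite.

```r
# Grid used for plotting, as in the article
x <- seq(-5, 5, length.out = 1000)

# Percentiles of interest (100 is omitted: qnorm(1) is Inf)
percentiles <- c(25, 50, 75)

# Quantiles of the evenly spaced grid
grid_quantiles <- quantile(x, probs = percentiles / 100)

# Quantiles of the standard normal distribution itself
normal_quantiles <- qnorm(percentiles / 100)

print(grid_quantiles)
print(round(normal_quantiles, 4))

# pnorm is the inverse of qnorm, so this recovers the probabilities
print(pnorm(normal_quantiles))
```

With the grid, the cut points land at -2.5, 0 and 2.5, while the distribution-based cut points sit much closer to the center.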
An interactive JupyterLab notebook accompanies this example.
Kurtosis
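With the dnorm function, we can illustrate kurtosis and skewness.
Let’s make examples of distributions with different levels of kurtosis.
Strictly speaking, every normal distribution has an excess kurtosis of 0. Changing sd only makes the curve narrower or wider, so the curves below are a visual approximation of lower and higher kurtosis.
- Standard normal distribution (Kurtosis = 0)
- Lower kurtosis
- Higher kurtosis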
y_standard <- dnorm(x)
df_standard <- data.frame(x = x, y = y_standard)
y_kurtosis_1 <- dnorm(x, mean = 1, sd = 1)
y_kurtosis_2 <- dnorm(x, mean = 1, sd = 0.5)
y_kurtosis_3 <- dnorm(x, mean = 1, sd = 2)
df_kurtosis_1 <- data.frame(x = x, y = y_kurtosis_1)
df_kurtosis_2 <- data.frame(x = x, y = y_kurtosis_2)
df_kurtosis_3 <- data.frame(x = x, y = y_kurtosis_3)
Then add geom_line for each level of kurtosis to the plot grammar.
plot <- ggplot() +
  geom_line(data = df_standard, color = "black",
    aes(x = x, y = y), linewidth = 0.2) +
  geom_line(data = df_kurtosis_1,
    aes(x = x, y = y), color = "red",
    linetype = "dashed", linewidth = 0.2) +
  geom_line(data = df_kurtosis_2,
    aes(x = x, y = y), color = "green",
    linetype = "dashed", linewidth = 0.2) +
  geom_line(data = df_kurtosis_3,
    aes(x = x, y = y), color = "blue",
    linetype = "dashed", linewidth = 0.2) +
  labs(x = "x", y = "Density",
    title = paste0(
      "Normal Distribution ",
      "with Different Kurtosis")) +
  # Note: this scale only takes effect when linetype is mapped inside aes()
  scale_linetype_manual(
    values = c("solid", "dashed", "dashed", "dashed"),
    labels = c(
      "Standard Normal", "Standard Kurtosis = 0",
      "Lower Kurtosis", "Higher Kurtosis")) +
  theme_minimal() +
  theme(
    text = element_text(size = 4))
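The curves above differ only in mean and sd. To see what that means for the kurtosis value itself, we can compute the excess kurtosis numerically. This is a rough sketch: the helper excess_kurtosis is hypothetical, it relies on the base R integrate function, and the Student t distribution is brought in only as a contrast with heavier tails.

```r
# Hypothetical helper: excess kurtosis of a density function,
# computed by numerical integration over the whole real line
excess_kurtosis <- function(density_fn) {
  # Mean of the distribution
  mu <- integrate(function(x) x * density_fn(x), -Inf, Inf)$value
  # Variance (second central moment)
  m2 <- integrate(function(x) (x - mu)^2 * density_fn(x), -Inf, Inf)$value
  # Fourth central moment
  m4 <- integrate(function(x) (x - mu)^4 * density_fn(x), -Inf, Inf)$value
  m4 / m2^2 - 3
}

# Normal curves with different sd: excess kurtosis stays near 0
k_narrow <- excess_kurtosis(function(x) dnorm(x, mean = 1, sd = 0.5))
k_wide <- excess_kurtosis(function(x) dnorm(x, mean = 1, sd = 2))

# Student t with 10 degrees of freedom has heavier tails,
# with a theoretical excess kurtosis of 6 / (10 - 4) = 1
k_heavy <- excess_kurtosis(function(x) dt(x, df = 10))

print(round(c(narrow = k_narrow, wide = k_wide, heavy = k_heavy), 3))
```

So a narrower or wider normal curve looks more or less peaked, but a distribution from a different family is needed to actually change the kurtosis.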
Skewness
The same approach can be applied to skewness.
Let’s make examples of distributions with different skewness parameters.
- Negative skewness
- Moderate positive skewness
- High positive skewness
y_standard <- dnorm(x)
df_standard <- data.frame(x = x, y = y_standard)
# Skew-normal density: 2 * dnorm(x) * pnorm(alpha * x),
# where alpha is the shape parameter that controls the skew
y_skewed_1 <- 2 * dnorm(x) * pnorm(-4 * x)
y_skewed_2 <- 2 * dnorm(x) * pnorm(2 * x)
y_skewed_3 <- 2 * dnorm(x) * pnorm(6 * x)
df_skewed_1 <- data.frame(x = x, y = y_skewed_1)
df_skewed_2 <- data.frame(x = x, y = y_skewed_2)
df_skewed_3 <- data.frame(x = x, y = y_skewed_3)
Then again add geom_line for each skewed distribution to the plot grammar.
plot <- ggplot() +
  geom_line(data = df_standard, color = "black",
    aes(x = x, y = y), linewidth = 0.2) +
  geom_line(data = df_skewed_1,
    aes(x = x, y = y), color = "red",
    linetype = "dashed", linewidth = 0.2) +
  geom_line(data = df_skewed_2,
    aes(x = x, y = y), color = "green",
    linetype = "dashed", linewidth = 0.2) +
  geom_line(data = df_skewed_3,
    aes(x = x, y = y), color = "blue",
    linetype = "dashed", linewidth = 0.2) +
  labs(x = "x", y = "Density",
    title = "Normal Distribution with Different Skewness") +
  # Note: this scale only takes effect when linetype is mapped inside aes()
  scale_linetype_manual(
    values = c("solid", "dashed", "dashed", "dashed"),
    labels = c(
      "Standard Normal",
      "Negative Skewness = -4",
      "Moderate Positive Skewness = 2",
      "High Positive Skewness = 6")) +
  theme_minimal() +
  theme(
    text = element_text(size = 4))
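The labels in the plot mention the values -4, 2 and 6. One common way to get curves driven by such a parameter is the skew-normal density, 2 * dnorm(x) * pnorm(alpha * x). The sketch below assumes that family, with a hypothetical helper named dskewnormal, and checks two things: that each curve is still a proper density, and in which direction the mean moves.

```r
# Hypothetical helper: skew-normal density with shape parameter alpha
# (alpha = 0 gives back the standard normal density)
dskewnormal <- function(x, alpha) 2 * dnorm(x) * pnorm(alpha * x)

for (alpha in c(-4, 2, 6)) {
  # A proper density integrates to 1
  area <- integrate(dskewnormal, -Inf, Inf, alpha = alpha)$value
  # The sign of the mean shows the direction of the shift
  mean_value <- integrate(
    function(x) x * dskewnormal(x, alpha), -Inf, Inf)$value
  cat(sprintf("alpha = %g  area = %.4f  mean = %.4f\n",
    alpha, area, mean_value))
}
```

A negative alpha moves the mean to the left of zero, and a positive alpha moves it to the right. Note that alpha is a shape parameter, not the skewness coefficient itself.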
Trend: Multiple
From the perspective of visualization, we can display different series either in one plot, or in separate plots arranged in a grid.
Geom Smooth
Instead of calculating the linear model manually,
we can utilize geom_smooth to plot the fitted curve.
geom_smooth can also draw the standard error band around the fit.
Let’s plot an example of this case.
We can start the plot area with geom_point,
then add a regression line by using geom_smooth,
for each of ys1, ys2 and ys3.
plot <- ggplot(data, aes(x = xs)) +
  geom_point(
    aes(x = xs, y = ys1),
    size = 0.5, color = "firebrick") +
  geom_smooth(
    aes(x = xs, y = ys1), method = "lm",
    se = TRUE, color = "firebrick",
    linewidth = 0.2) +
  ...
We can also add a nice solarized theme.
To obtain this you need the ggthemes library.
  labs(x = "x", y = "y",
    title = "Scatter Plot with Regression Lines") +
  theme_solarized() +
  scale_color_solarized() +
  theme(
    text = element_text(size = 4))
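Since the snippets above are fragments, here is a self-contained sketch of the same idea. The data frame is made up for illustration (the real one comes from an earlier part of this series), only two series are drawn, and theme_minimal stands in for the solarized theme so that ggthemes is not required.

```r
library(ggplot2)

# Hypothetical data: in the article, `data` comes from an earlier part
set.seed(42)
data <- data.frame(xs = 0:12)
data$ys1 <- 5 + 2.0 * data$xs + rnorm(nrow(data), sd = 3)
data$ys2 <- 40 - 1.5 * data$xs + rnorm(nrow(data), sd = 3)

plot <- ggplot(data, aes(x = xs)) +
  geom_point(aes(y = ys1), size = 0.5, color = "firebrick") +
  geom_smooth(aes(y = ys1), method = "lm", formula = y ~ x,
    se = TRUE, color = "firebrick", linewidth = 0.2) +
  geom_point(aes(y = ys2), size = 0.5, color = "steelblue") +
  geom_smooth(aes(y = ys2), method = "lm", formula = y ~ x,
    se = TRUE, color = "steelblue", linewidth = 0.2) +
  labs(x = "x", y = "y",
    title = "Scatter Plot with Regression Lines") +
  theme_minimal()

# The same straight line, fitted manually for comparison
fit <- lm(ys1 ~ xs, data = data)
print(coef(fit))
```

The line drawn by geom_smooth with method = "lm" is the same least squares fit that lm returns, so the manual fit is only needed when we want the coefficients as numbers.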
Grid Extra
In some cases, it is better to separate the results into several plots,
for example if you want a different y-axis scale for each series.
To obtain this you need the gridExtra library.
Let’s arrange the plots horizontally using gridExtra.
grid_plot <- grid.arrange(
  plot_y1, plot_y2, plot_y3, ncol = 3)
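The three plots passed to grid.arrange are not shown in this part, so the following is a sketch of how they might be built. The data and the helper make_plot are hypothetical. It also uses arrangeGrob, the sibling of grid.arrange in gridExtra that builds the layout without drawing it.

```r
library(ggplot2)
library(gridExtra)

# Hypothetical data with series on very different scales
set.seed(1)
data <- data.frame(xs = 0:12)
data$ys1 <- 5 + 2 * data$xs + rnorm(nrow(data))
data$ys2 <- 500 - 30 * data$xs + rnorm(nrow(data), sd = 20)
data$ys3 <- 0.1 * data$xs^2 + rnorm(nrow(data), sd = 0.5)

# Hypothetical helper: one scatter plot per series
make_plot <- function(column, color) {
  ggplot(data, aes(x = xs, y = .data[[column]])) +
    geom_point(size = 0.5, color = color) +
    labs(x = "x", y = column) +
    theme_minimal()
}

plot_y1 <- make_plot("ys1", "firebrick")
plot_y2 <- make_plot("ys2", "steelblue")
plot_y3 <- make_plot("ys3", "darkgreen")

# arrangeGrob builds the layout without drawing it,
# which is convenient when the result is only saved to a file
grid_plot <- arrangeGrob(plot_y1, plot_y2, plot_y3, ncol = 3)
print(class(grid_plot))
```

Each panel keeps its own y-axis scale, which is the main reason to split the series. The resulting object can be passed to ggsave like a regular plot.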
Statistic Properties: One Axis Plot
Three series in one axis plot
As seen in the previous part on statistical properties,
we can analyze the data for each series.
For example, we can consider just the y-series,
and obtain the mean, median, mode,
and also the minimum, maximum, range, and quantiles.
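The properties listed above can be computed with base R. This is a small sketch with made-up sample values. Base R has no function for the statistical mode, so stat_mode is a hypothetical helper.

```r
# Hypothetical sample values standing in for one y-series
ys <- c(5, 9, 9, 20, 30)

# Hypothetical helper: pick the most frequent value as the mode
stat_mode <- function(values) {
  counts <- table(values)
  as.numeric(names(counts)[which.max(counts)])
}

properties <- list(
  mean = mean(ys),
  median = median(ys),
  mode = stat_mode(ys),
  minimum = min(ys),
  maximum = max(ys),
  range = max(ys) - min(ys),
  quantiles = quantile(ys, probs = c(0.25, 0.5, 0.75))
)
print(properties)
```

Note that the built-in range function returns the minimum and maximum as a pair, so the width of the range is computed here as their difference.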
Long Format
Melt
To visualize multiple y-series,
we need to melt the series into long format
by piping the data frame to the gather function.
The gather function is available in the tidyr library.
series_longer <- series %>%
  gather(key = "y", value = "value", -xs)
You can check the result by printing the melted series with print.
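To see what the melting step does, here is a sketch with a tiny made-up data frame. It also shows pivot_longer, the newer tidyr function for the same reshaping, as an alternative to gather.

```r
library(tidyr)

# Hypothetical wide data frame: one column per series
series <- data.frame(
  xs = 0:3,
  ys1 = c(5, 9, 13, 17),
  ys2 = c(40, 38, 35, 31),
  ys3 = c(0, 1, 4, 9))

# Melt with gather, as in the article
series_longer <- series %>%
  gather(key = "y", value = "value", -xs)

# The same reshaping with pivot_longer (rows come out in a different order)
series_pivoted <- series %>%
  pivot_longer(cols = -xs, names_to = "y", values_to = "value")

print(head(series_longer))
print(dim(series_pivoted))
```

Four rows with three series become twelve rows, one for each combination of xs and series name.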
Box Plot
The most common way to visualize this is the box plot.
We can utilize geom_boxplot to get the plot.
plot <- ggplot(
    series_longer,
    aes(x = y, y = value, fill = y)) +
  geom_boxplot(color = "black", linewidth = 0.2) +
  ...
Let’s use custom colors for this example, where soft_colors is a vector of color values.
  scale_fill_manual(values = soft_colors) +
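Putting the fragments together, a complete box plot might look like the sketch below. The long-format data and the palette are made up for illustration, since the soft_colors used in this article is not shown here.

```r
library(ggplot2)

# Hypothetical long-format data with three series
set.seed(7)
series_longer <- data.frame(
  y = rep(c("ys1", "ys2", "ys3"), each = 30),
  value = c(
    rnorm(30, mean = 100, sd = 20),
    rnorm(30, mean = 150, sd = 40),
    rnorm(30, mean = 120, sd = 10)))

# Hypothetical palette standing in for the article's soft_colors
soft_colors <- c(ys1 = "#F4A6A6", ys2 = "#A6C8F4", ys3 = "#A6F4B5")

plot <- ggplot(
    series_longer,
    aes(x = y, y = value, fill = y)) +
  geom_boxplot(color = "black", linewidth = 0.2) +
  scale_fill_manual(values = soft_colors) +
  labs(x = "Series", y = "Value") +
  theme_minimal()

# Whisker ends, box edges and median, as computed by ggplot2
box_stats <- layer_data(plot, 1)[,
  c("ymin", "lower", "middle", "upper", "ymax")]
print(box_stats)
```

Naming the palette entries after the series keeps each color tied to its series, even if the order of the levels changes.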
Violin Plot
A violin plot shows more of the shape of each distribution.
We can utilize geom_violin to get the plot.
plot <- ggplot(
    series_longer,
    aes(x = y, y = value, fill = y)) +
  geom_violin(color = "black", linewidth = 0.2) +
  ...
Swarm Plot
This leaves us with other options such as the swarm plot and the strip plot.
We can approximate a swarm plot using jitter inside geom_point.
plot <- ggplot(
    series_longer,
    aes(x = y, y = value, color = y)) +
  geom_point(
    position = position_jitterdodge(
      jitter.width = 0.3, jitter.height = 0),
    size = 0.5) +
  ...
Strip Plot
We can get a strip plot using geom_jitter.
plot <- ggplot(
    series_longer,
    aes(x = y, y = value, color = y)) +
  geom_jitter(
    width = 0.3, height = 0, size = 0.5) +
  ...
Statistic Properties: Distribution
Just like the previous four plots, we can analyse the y-axis values, but this time by the frequency of each series.
KDE Plot
Kernel Density Estimation
KDE shows the frequency distribution well.
This complex task can be done easily with geom_density.
plot <- ggplot(
    series_longer,
    # Category names the series; it is called y in the gather example
    aes(x = value, fill = Category)) +
  geom_density(alpha = 0.7, color = NA) +
  ...
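To get a feeling for what kernel density estimation produces, base R offers the density function. The sketch below uses made-up sample values with two clusters and inspects the estimate as plain numbers.

```r
# Hypothetical sample values standing in for one series
set.seed(3)
values <- c(
  rnorm(60, mean = 100, sd = 15),
  rnorm(40, mean = 170, sd = 10))

# density() performs kernel density estimation,
# with a Gaussian kernel by default
kde <- density(values)

# The bandwidth chosen for the kernel
print(kde$bw)

# The estimated curve should enclose an area close to 1 (trapezoidal rule)
area <- sum(diff(kde$x) * (head(kde$y, -1) + tail(kde$y, -1)) / 2)
print(area)

# Location of the highest peak of the estimate
print(kde$x[which.max(kde$y)])
```

A smaller bandwidth gives a more wiggly curve and a larger one gives a smoother curve. The bw argument of geom_density plays the same role.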
Rug Plot
We can also simply show the rug plot using geom_rug.
plot <- ggplot(
    series_longer,
    # geom_rug draws lines, so it is colored through color rather than fill
    aes(x = value, color = Category)) +
  geom_rug(alpha = 0.5, sides = "b") +
  ...
Histogram
The histogram is probably the most common chart for beginners.
But geom_histogram offers more than the basic histogram.
plot <- ggplot(
    series_longer,
    aes(x = value, fill = Category)) +
  geom_histogram(
    binwidth = 50, linewidth = 0.2,
    alpha = 0.7, color = "black") +
  ...
Statistic Properties: Marginal
We can go one step further and show the distribution of each single axis right on its own axis, using ggMarginal from the ggExtra library.
Density Example
For example, we can add a marginal density plot. Let’s start with the usual plot,
then add the marginal density plot.
p_with_margins <- ggMarginal(
  p, type = "density", linewidth = 0.2)
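The base plot p is not shown in this part. As a sketch, it can be any scatter plot made with geom_point. The data below is made up for illustration.

```r
library(ggplot2)
library(ggExtra)

# Hypothetical data: two correlated variables
set.seed(11)
data <- data.frame(xs = rnorm(200, mean = 50, sd = 10))
data$ys <- 0.8 * data$xs + rnorm(200, sd = 6)

# The usual scatter plot, used as the base for the margins
p <- ggplot(data, aes(x = xs, y = ys)) +
  geom_point(size = 0.5, color = "firebrick") +
  theme_minimal()

# Add density curves along both axes
p_with_margins <- ggMarginal(p, type = "density")
print(class(p_with_margins))
```

The top margin shows the distribution of xs and the right margin shows the distribution of ys, each aligned with its own axis.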
Histogram Example
We can also use a different marginal plot, such as a histogram.
p_with_margins <- ggMarginal(
  p, type = "histogram",
  color = "black", fill = alpha("#FFD700", 0.1),
  linewidth = 0.1)
What’s the Next Chapter 🤔?
We can visualize statistical properties in a practical way.
Besides Python and R for statistical analysis, we can take a peek at Julia as a future programming language, and also TypeScript and Go, so you can integrate the analysis with your application seamlessly.
Consider continuing your exploration with [ Trend - Language - Julia - Part One ].
Conclusion
It is fun, right?
What do you think?
Farewell. We shall meet again.