Preface
Goal: Explore visualization in the R programming language with ggplot2, fitting the data with a linear model.
The thing about R is that it leans toward the data aspect rather than the coding aspect. This felt weird to me at first as a coder, so I avoided R in the beginning, but then I came to love how R works.
Just like Python's seaborn, the R programming language is equipped with the powerful ggplot2. Just like Python's polyfit, there is the powerful lm. And just like Python, R is also considered easy to learn.
Of course I have a lot of questions about R. Instead of asking the R community directly, I chose to explore R first and build a bunch of working examples, so I can answer questions in the R community whenever a member needs a working example.
Preparation
Of course you need R installed on your system. There is no need for RStudio, but there are a few things to consider.
Library
The scripts provided here start from the very basics, and you will need additional libraries from time to time. You can install the packages from the R terminal.
install.packages("readr")
install.packages("ggplot2")
install.packages("ggthemes")
You might prefer tidyverse for convenience, but I'd rather choose one library at a time, to get more understanding.
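If you do go the tidyverse route, the whole collection can be installed in one step:

install.packages("tidyverse")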
Jupyter Lab
You also need to activate the kernel for R.
IRkernel::installspec()
This is optional.
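Note that the IRkernel package itself must already be installed. If it is missing, install it from the R terminal first:

install.packages("IRkernel")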
Data Series Samples
I provide a minimal case for visualization. With only two example datasets, we can make many kinds of visualizations. This way, you don't need to adapt to new data for each visualization, and you can reuse the R code as well, minimizing rethinking at each step.
The first one uses multiple series, suitable for experimenting with melting a dataframe.
xs, ys1, ys2, ys3
0, 5, 5, 5
1, 9, 12, 14
2, 13, 25, 41
3, 17, 44, 98
4, 21, 69, 197
5, 25, 100, 350
6, 29, 137, 569
7, 33, 180, 866
8, 37, 229, 1253
9, 41, 284, 1742
10, 45, 345, 2345
11, 49, 412, 3074
12, 53, 485, 3941
[R: ggplot2: Statistical Properties: CSV Source][017-vim-series]
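Since this first dataset exists for melting experiments, here is a minimal sketch of what that reshaping could look like. I use tidyr's pivot_longer here as one option; the older reshape2 melt works as well.

library(readr)
library(tidyr)

data <- read_csv("series.csv", show_col_types = FALSE)

# Reshape from wide (xs, ys1, ys2, ys3)
# to long (xs, series, value)
long <- pivot_longer(data,
  cols = c(ys1, ys2, ys3),
  names_to = "series",
  values_to = "value")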
And here is the simple one, for statistical properties such as least squares.
x,y
0,5
1,12
2,25
3,44
4,69
5,100
6,137
7,180
8,229
9,284
10,345
11,412
12,485
I use the word samples to distinguish from the population, since the calculation results would be different.
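For instance, R's built-in var() and sd() use the sample formulas, dividing by n - 1 rather than by n:

y <- c(5, 12, 25, 44, 69)

# Sample variance: divides by n - 1
var(y)

# Population variance: divides by n
mean((y - mean(y))^2)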
Trend: LM Model
Linear Model
Let’s get it started.
Vector
The array in R is called a vector.
# Given data
x_values <- c(
  0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
y_values <- c(
  5, 14, 41, 98, 197, 350, 569, 866,
  1253, 1742, 2345, 3074, 3941)
Let’s say we have our regression model as:

y = ax^3 + bx^2 + cx + d
Let’s solve the model using lm(). First we need to define the order of the curve fitting, then perform cubic regression using lm(). With the lm_model object we can get the coefficients, but for printing we need to reverse their order to match the equation above. At last, we can print the coefficients with cat.
order <- 3
lm_model <- lm(y_values ~
  poly(x_values, order, raw = TRUE))

coefficients <- coef(lm_model)
coefficients <- coefficients[
  length(coefficients):1]

cat("Coefficients (a, b, c, d):\n\t",
  coefficients, "\n")
This should give the result below:
❯ Rscript 01-lm-vector.r
Coefficients (a, b, c, d):
2 3 4 5
It is so predictable, right? The fit recovers exactly the coefficients that generated the ys3 series: y = 2x^3 + 3x^2 + 4x + 5.
You can obtain the interactive JupyterLab in the following link:
Reading from CSV
Let’s continue, this time reading from a CSV file instead of a hardcoded vector. We can utilize the built-in read.csv function to read the data from the CSV file, then extract the x values and y values from the data frame.
data <- read.csv("series.csv")
x_values <- data$xs
y_values <- data$ys3
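You can optionally inspect what read.csv loaded:

# Peek at the first rows and the column types
head(data)
str(data)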
The result is exactly the same as before.
You can obtain the interactive JupyterLab in the following link:
Using Readr
For a more complex case, we can utilize the readr library. First we need to load the required readr library. Then read the data from the CSV file into a dataframe. Then create variable shortcuts by extracting the x values and y values.
library(readr)

data <- read_csv(
  "series.csv",
  show_col_types = FALSE)
column_spec <- spec(data)

x_values <- data$xs
y_values <- data$ys3
You can retrieve the column specifications, and print them if you need to inspect the inferred types.
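For example:

# Print the column specification inferred by readr
print(column_spec)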
You can obtain the interactive JupyterLab in the following link:
Different Order of LM
We can repeat the above code for each different order, or make it simpler: we can write a generic function so the process is not repetitive. This function performs the regression using lm(). It also defines named vectors to map order numbers to coefficient labels and curve types. It then gets the coefficients, reverses their order to match the equation above, and finally prints the result.
calc_coeff <- function(x_values, y_values, order) {
  lm_model <- lm(y_values ~
    poly(x_values, order, raw = TRUE))

  coeff_text <- c(
    "(a, b)" = 1, "(a, b, c)" = 2, "(a, b, c, d)" = 3)
  order_text <- c(
    "Linear" = 1, "Quadratic" = 2, "Cubic" = 3)

  cat(paste("Using lm_model :",
    names(order_text)[order], "\n"))

  coefficients <- coef(lm_model)
  coefficients <- coefficients[
    length(coefficients):1]

  cat("Coefficients ",
    names(coeff_text)[order], ":\n\t",
    coefficients, "\n")
}
This way we can calculate the coefficients for different orders and different series.
library(readr)

data <- read_csv(
  "series.csv",
  show_col_types = FALSE)

calc_coeff(data$xs, data$ys1, 1)
calc_coeff(data$xs, data$ys2, 2)
calc_coeff(data$xs, data$ys3, 3)
With the result as below:
❯ Rscript 04-lm-merge.r
Using lm_model : Linear
Coefficients (a, b) :
4 5
Using lm_model : Quadratic
Coefficients (a, b, c) :
3 4 5
Using lm_model : Cubic
Coefficients (a, b, c, d) :
2 3 4 5
You can obtain the interactive JupyterLab in the following link:
Trend: Built-in Plot
R provides a built-in plot with no additional library. It is rather limited, but enough to get started with plotting.
Default Output
The default result is Rplot.pdf, but we can save to png instead, for example:
# Open PNG graphics device
png("11-lm-line.png", width = 800, height = 400)
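Keep in mind that the device stays open until it is closed. After all plotting commands have run, close it so the file is actually written to disk:

# ... plotting commands go here ...

# Close the graphics device to flush the PNG to disk
dev.off()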
Linear Equation
We can start with plotting the data points.
plot(
  x_values, y_values,
  pch = 16, col = "blue",
  xlab = "x", ylab = "y",
  main = "Straight line fitting")
And continue with lines from precalculated plot values. The y values come from the regression line previously fitted by lm() into lm_model.
x_plot <- seq(
  min(x_values), max(x_values),
  length.out = 100)
y_plot <- predict(
  lm_model,
  newdata = data.frame(x_values = x_plot))

lines(x_plot, y_plot, col = "red")
We can also add a decorative legend to communicate the visual result.
legend("topright",
  legend = c("Data points", "Linear Equation"),
  col = c("blue", "red"),
  pch = c(16, NA), lty = c(NA, 1))
The plot result can be shown as follows:
You can obtain the interactive JupyterLab in the following link:
Straight Line
We can also utilize abline to add the linear regression line to the plot, so we don't have to generate the y_plot values manually. Note that abline draws a straight line from an intercept and a slope, so this shortcut only applies to the first-order model.
abline(lm_model, col = "red")
The plot result can be shown as follows:
You can obtain the interactive JupyterLab in the following link:
Quadratic Curve
From this, we can repeat the process for quadratic curve fitting. All we need to do is use a different order for the given data, then change a few minor things, such as the title and legend text. And that's all.
order <- 2
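Beyond that one line, a minimal sketch of the full quadratic variant could look like this, assuming x_values and y_values now hold the xs and ys2 series, and using a placeholder output file name:

lm_model <- lm(y_values ~
  poly(x_values, order, raw = TRUE))

# Placeholder file name for the quadratic plot
png("12-lm-quadratic.png", width = 800, height = 400)

plot(
  x_values, y_values,
  pch = 16, col = "blue",
  xlab = "x", ylab = "y",
  main = "Quadratic curve fitting")

x_plot <- seq(
  min(x_values), max(x_values),
  length.out = 100)
y_plot <- predict(
  lm_model,
  newdata = data.frame(x_values = x_plot))
lines(x_plot, y_plot, col = "red")

dev.off()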
The plot result can be shown as follows:
You can obtain the interactive JupyterLab in the following link:
Cubic Curve
The same applies for the cubic curve: change the given data and the order, and make minor decorative changes. That simple.
order <- 3
The plot result can be shown as follows:
You can obtain the interactive JupyterLab in the following link:
Trend: ggplot2
For complex cases, we require the ggplot2 library. But the thing is, we need to understand that this plotting has its own grammar.
Linear Equation
Let’s try a straight line. To keep the plot structure simple, let’s put the model outside. We need to generate values for the regression line, then apply the result to create a data frame for ggplot2.
x_plot <- seq(
  min(x_values), max(x_values),
  length.out = 100)
y_plot <- predict(
  lm_model,
  newdata = data.frame(x_values = x_plot))

data <- data.frame(x = x_values, y = y_values)
Now we are ready for the view: plotting using ggplot2. As you can see, there are a lot of plus signs here. It is like one object stacked on top of another, all in one ggplot2 figure.
# ggplot2 must be loaded for the grammar below
library(ggplot2)

plot <- ggplot(data, aes(x = x, y = y)) +
  geom_point(aes(color = "Data Points"), size = 0.5) +
  geom_line(
    data = data.frame(x = x_plot, y = y_plot),
    aes(x, y, color = "Linear Equation"),
    linewidth = 0.2) +
  labs(
    x = "x", y = "y",
    title = "Straight line fitting") +
  theme_minimal() +
  theme(legend.position = "right",
    legend.text = element_text(size = 2),
    text = element_text(size = 4)) +
  scale_color_manual(
    name = "Plot",
    breaks = c(
      "Data Points",
      "Linear Equation"),
    values = c(
      "Data Points" = "red",
      "Linear Equation" = "black")) +
  guides(
    color = guide_legend(
      override.aes = list(
        shape = c(16, NA), linetype = c(0, 1)
      )))
Do not forget to save to png for convenience. I'm using a specific size (width x height), so I can use the result directly in my blog article.
# Save plot as PNG
ggsave("14-lm-gg-line.png",
  plot, width = 800, height = 400, units = "px")
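Note that with units = "px", ggsave still converts the size through its dpi argument (300 by default), which is why the theme above uses such tiny font sizes. A roughly equivalent export in inches would be:

# Roughly the same canvas in inches at the default 300 dpi
ggsave("14-lm-gg-line.png",
  plot, width = 800 / 300, height = 400 / 300, units = "in")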
The plot result can be shown as follows:
You can obtain the interactive JupyterLab in the following link:
Quadratic Curve
By changing the given data, the order, and a few minor decorations, we can apply the same ggplot2 grammar: stack the smaller plot parts and sum them all into the plot variable. And finally save the png based on this generated plot variable.
order <- 2
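Beyond that one line, here is a minimal sketch of how the quadratic variant could look, again assuming x_values and y_values hold the xs and ys2 series, and trimming the decorations for brevity:

lm_model <- lm(y_values ~
  poly(x_values, order, raw = TRUE))

x_plot <- seq(
  min(x_values), max(x_values),
  length.out = 100)
y_plot <- predict(
  lm_model,
  newdata = data.frame(x_values = x_plot))
data <- data.frame(x = x_values, y = y_values)

plot <- ggplot(data, aes(x = x, y = y)) +
  geom_point(aes(color = "Data Points"), size = 0.5) +
  geom_line(
    data = data.frame(x = x_plot, y = y_plot),
    aes(x, y, color = "Quadratic Equation"),
    linewidth = 0.2) +
  labs(
    x = "x", y = "y",
    title = "Quadratic curve fitting") +
  theme_minimal()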
The plot result can be shown as follows:
You can obtain the interactive JupyterLab in the following link:
Cubic Curve
The same applies for the cubic. You can see the details in the source code.
order <- 3
The plot result can be shown as follows:
You can obtain the interactive JupyterLab in the following link:
Easy peasy, right?
What’s the Next Chapter 🤔?
Let’s continue our R journey with building classes and statistical properties. Consider continuing your exploration with [Trend - Language - R - Part Two].