Preface

Goal: Explore visualization in the R programming language with ggplot2, fitting the data with a linear model.

The thing about R is that it puts me more in the data aspect than in the coding aspect. This felt weird to me as a coder, so I avoided R at first, but then I came to love how R works.

Just as Python has seaborn, the R programming language is equipped with the powerful ggplot2. Just as Python has polyfit, R has the powerful lm. And just like Python, R is considered easy to learn.

Sure, I have a lot of questions about R. Instead of asking the R community directly, I chose to explore R first and build a bunch of working examples, so that I can answer questions in the R community whenever a member needs a working example.


Preparation

Of course, you need R installed on your system. RStudio is not required, but there are a few things to consider.

Library

The scripts provided here start from the very basics, and you will need additional libraries from time to time. You can install the packages from the R terminal.

install.packages("readr")
install.packages("ggplot2")
install.packages("ggthemes")

You might prefer tidyverse for convenience, but I’d rather pick one library at a time to gain more understanding.

Jupyter Lab

You also need to register the R kernel for Jupyter. This assumes the IRkernel package is already installed.

IRkernel::installspec()

This is optional.

Data Series Samples

I provide a minimal case for visualization. With only two example data sets, we can make many kinds of visualization. This way, you don’t need to adapt to new data for each visualization, and you can reuse the R code as well, with minimal rethinking at each step.

The first one uses multiple series, suitable for experimenting with melting a dataframe.

xs, ys1, ys2, ys3
0,  5,   5,   5
1,  9,   12,  14
2,  13,  25,  41
3,  17,  44,  98
4,  21,  69,  197
5,  25,  100, 350
6,  29,  137, 569
7,  33,  180, 866
8,  37,  229, 1253
9,  41,  284, 1742
10, 45,  345, 2345
11, 49,  412, 3074
12, 53,  485, 3941

[R: ggplot2: Statistical Properties: CSV Source][017-vim-series]

And here is the simple one, for statistical properties such as least squares.

x,y
0,5
1,12
2,25
3,44
4,69
5,100
6,137
7,180
8,229
9,284
10,345
11,412
12,485

I use the word samples to distinguish them from the population, since the calculation results differ between the two.
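To make that difference concrete, here is a minimal sketch comparing the two variance formulas on the y column above. The variable names are mine, for illustration only. R's built-in var() divides by n - 1 (the sample formula), so the population version is computed by hand.

```r
# y column of the second data set, hardcoded for a self-contained example
y <- c(5, 12, 25, 44, 69, 100, 137, 180, 229, 284, 345, 412, 485)
n <- length(y)

# Sample variance: var() divides the sum of squares by (n - 1)
var_sample <- var(y)

# Population variance: divide the same sum of squares by n
var_population <- sum((y - mean(y))^2) / n

cat("Sample variance     :", var_sample, "\n")
cat("Population variance :", var_population, "\n")
```

The two numbers differ only by the factor (n - 1) / n, which matters for a small data set like this one.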


Trend: LM Model

Linear Model

Let’s get started.

Vector

In R, the basic array is called a vector.

# Given data
x_values <- c(
  0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
y_values <- c(
  5, 14, 41, 98, 197, 350, 569, 866, 
  1253, 1742, 2345, 3074, 3941)

R: Trend: LM Model: Vector

Let’s say we have our regression equation, here a cubic polynomial, as:

$$ y = ax^3 + bx^2 + cx + d $$

Let’s solve the linear model using lm(). First we define the order of the curve fitting, then perform the cubic regression using lm(). From the lm_model object we can get the coefficients. Since coef() returns them starting from the intercept, we reverse the order to match (a, b, c, d). Finally, we print the coefficients with cat.

order <- 3

lm_model <- lm(y_values ~
  poly(x_values, order, raw = TRUE))

coefficients <- coef(lm_model)
coefficients <- coefficients[
  length(coefficients):1]

cat("Coefficients (a, b, c, d):\n\t",
  coefficients, "\n")

This should produce the result below:

❯ Rscript 01-lm-vector.r
Coefficients (a, b, c, d):
         2 3 4 5 

It is so predictable, right?
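If the reversed order feels easy to mix up, one option is to label the coefficients before printing. This is only a sketch: coeff_named is a name I made up, and the labels a to d simply follow the (a, b, c, d) convention used in the output above.

```r
x_values <- c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
y_values <- c(5, 14, 41, 98, 197, 350, 569, 866,
              1253, 1742, 2345, 3074, 3941)

lm_model <- lm(y_values ~ poly(x_values, 3, raw = TRUE))

# coef() returns a named vector, starting with "(Intercept)"
coeff <- coef(lm_model)
print(names(coeff))

# Reverse to highest power first, then attach readable labels
coeff_named <- setNames(rev(unname(coeff)), c("a", "b", "c", "d"))
print(round(coeff_named, 6))
```

With names attached, each value can be picked by label, for example coeff_named["a"], instead of by position.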

The interactive JupyterLab notebook is available at the following link:

Reading from CSV

Let’s continue, this time reading from a CSV file instead of hardcoded vectors. We can use the built-in read.csv function to read the data.

We need to extract the x values and y values from the data frame.

data <- read.csv("series.csv")

x_values <- data$xs
y_values <- data$ys3

R: Trend: LM Model: CSV

The result is exactly the same as before.
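One thing worth knowing: data$some_name silently returns NULL when the column does not exist in the data frame, so a typo in a column name only shows up later. Here is a small self-contained sketch that checks the columns right after reading. It writes a shortened copy of the series to a temporary file, so it does not depend on series.csv being present; the temporary path and the four-row data are assumptions made for the example.

```r
# Build a shortened copy of the series and write it to a temporary CSV
series <- data.frame(
  xs  = c(0, 1, 2, 3),
  ys1 = c(5, 9, 13, 17),
  ys2 = c(5, 12, 25, 44),
  ys3 = c(5, 14, 41, 98))
csv_path <- tempfile(fileext = ".csv")
write.csv(series, csv_path, row.names = FALSE)

# Read it back and inspect the structure
data <- read.csv(csv_path)
str(data)

# Fail early if an expected column is missing
stopifnot(all(c("xs", "ys3") %in% names(data)))

x_values <- data$xs
y_values <- data$ys3
```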

The interactive JupyterLab notebook is available at the following link:

Using Readr

For a more complex case, we can utilize the readr library.

First we load the required readr library. Then we read the data from the CSV file into a dataframe, and create shortcut variables by extracting the x values and y values.

library(readr)

data <- read_csv(
  "series.csv",
  show_col_types = FALSE)

column_spec <- spec(data)

x_values <- data$xs
y_values <- data$ys3

R: Trend: LM Model: readr

You can retrieve the column specifications, and print them if you need to inspect them.

The interactive JupyterLab notebook is available at the following link:

Different Order of LM

We can repeat the code above for a different order, or make it simpler.

We can write a generic function so that the process is not repetitive.

This function performs the regression using lm(). It also defines named vectors that map order numbers to curve types. It then gets the coefficients, reverses their order to match the equation above, and finally prints the coefficients.

calc_coeff <- function(x_values, y_values, order) {
  lm_model <- lm(y_values ~ 
    poly(x_values, order, raw = TRUE))

  coeff_text <- c(
    "(a, b)" = 1, "(a, b, c)" = 2, "(a, b, c, d)" = 3)
  order_text <- c(
    "Linear" = 1, "Quadratic" = 2, "Cubic" = 3)

  cat(paste("Using lm_model :",
    names(order_text)[order], "\n"))

  coefficients <- coef(lm_model)
  coefficients <- coefficients[
    length(coefficients):1]

  cat("Coefficients ",
    names(coeff_text)[order], ":\n\t",
    coefficients, "\n")
}

R: Trend: LM Model: Merge

This way we can calculate the coefficients for different orders and for different series.

library(readr)
data <- read_csv(
  "series.csv",
  show_col_types = FALSE)

calc_coeff(data$xs, data$ys1, 1)
calc_coeff(data$xs, data$ys2, 2)
calc_coeff(data$xs, data$ys3, 3)

With the result as below:

❯ Rscript 04-lm-merge.r
Using lm_model : Linear 
Coefficients  (a, b) :
         4 5 
Using lm_model : Quadratic 
Coefficients  (a, b, c) :
         3 4 5 
Using lm_model : Cubic 
Coefficients  (a, b, c, d) :
         2 3 4 5 
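Printing is fine for a quick look, but a function that returns the coefficients is easier to reuse in later steps. Below is a sketch of such a variant. The name fit_coeff is hypothetical and not part of the original script, and the three series are hardcoded so the example runs without the CSV file.

```r
# Hypothetical helper: fit a polynomial of the given order
# and return the coefficients, highest power first
fit_coeff <- function(x_values, y_values, order) {
  lm_model <- lm(y_values ~ poly(x_values, order, raw = TRUE))
  unname(rev(coef(lm_model)))
}

xs  <- 0:12
ys1 <- c(5, 9, 13, 17, 21, 25, 29, 33, 37, 41, 45, 49, 53)
ys2 <- c(5, 12, 25, 44, 69, 100, 137, 180, 229, 284, 345, 412, 485)
ys3 <- c(5, 14, 41, 98, 197, 350, 569, 866,
         1253, 1742, 2345, 3074, 3941)

linear    <- fit_coeff(xs, ys1, 1)
quadratic <- fit_coeff(xs, ys2, 2)
cubic     <- fit_coeff(xs, ys3, 3)

# Round before printing to hide floating point noise
print(round(cubic, 6))
```

Keeping the calculation separate from the printing also makes the result easy to check in a test.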

The interactive JupyterLab notebook is available at the following link:


Trend: Built-in Plot

R provides built-in plotting with no additional library. It is rather limited, but enough to get started with plotting.

Default Output

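The default output is Rplots.pdf. But we can save to PNG instead by opening a PNG device before plotting. Remember to call dev.off() after the last plotting command, so the file gets written. For example: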

# Open PNG graphics device
png("11-lm-line.png", width = 800, height = 400)

Linear Equation

We can start with plotting the data points.

plot(
  x_values, y_values,
  pch = 16, col = "blue",
  xlab = "x", ylab = "y",
  main = "Straight line fitting")

R: Trend: Built-in Plot: Linear Equation

Then we continue with a line from precalculated plot values. The y values come from the regression line, using the lm_model fitted earlier with lm().

x_plot <- seq(
  min(x_values), max(x_values),
  length.out = 100)
y_plot <- predict(
  lm_model,
  newdata = data.frame(x_values = x_plot))

lines(x_plot, y_plot, col = "red")
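A detail that is easy to miss in predict(): the column in newdata has to carry the same name as the variable in the model formula, here x_values. The sketch below, using the linear ys1 series hardcoded, shows what I would expect with a mismatched name. The y_wrong variable exists only for this illustration.

```r
x_values <- 0:12
y_values <- c(5, 9, 13, 17, 21, 25, 29, 33, 37, 41, 45, 49, 53)
lm_model <- lm(y_values ~ poly(x_values, 1, raw = TRUE))

x_plot <- seq(min(x_values), max(x_values), length.out = 100)

# Matching name: one prediction per grid point
y_plot <- predict(
  lm_model,
  newdata = data.frame(x_values = x_plot))

# Mismatched name: predict() cannot find x_values in newdata,
# falls back to the original vector, and raises a warning
y_wrong <- suppressWarnings(predict(
  lm_model,
  newdata = data.frame(x = x_plot)))

cat("Grid points :", length(x_plot), "\n")
cat("Predictions :", length(y_plot), "\n")
cat("Mismatched  :", length(y_wrong), "\n")
```

When the lengths disagree, lines(x_plot, y_wrong) would fail, so the warning is worth reading rather than suppressing in a real script.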

We can also add a legend to communicate the visual result.

legend("topright",
  legend = c("Data points", "Linear Equation"),
  col = c("blue", "red"),
  pch = c(16, NA), lty = c(NA, 1))

The plot result can be shown as follows:

R: Trend: Built-in Plot: Linear Equation

The interactive JupyterLab notebook is available at the following link:

Straight Line

We can also use abline to add the linear regression line to the plot, so we don’t have to generate the y_plot values manually.

abline(lm_model, col = "red")

The plot result can be shown as follows:

R: Trend: Built-in Plot: Linear Equation: Alternative

The interactive JupyterLab notebook is available at the following link:

Quadratic Curve

From this, we can repeat the steps for quadratic curve fitting. All we need to do is use a different order for the specific given data, then change a few minor things such as the title and legend text. And that’s all.

order <- 2
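Put together, the whole quadratic script might look like the sketch below. It is assembled from the steps above, not copied from the original source file. I assume the ys2 series, hardcoded here, and the output file name and the legend position are my own choices.

```r
x_values <- 0:12
y_values <- c(5, 12, 25, 44, 69, 100, 137, 180, 229, 284, 345, 412, 485)

order <- 2
lm_model <- lm(y_values ~ poly(x_values, order, raw = TRUE))

# Hypothetical output location, written to the session temp directory
out_file <- file.path(tempdir(), "lm-quadratic.png")
png(out_file, width = 800, height = 400)

# Data points
plot(
  x_values, y_values,
  pch = 16, col = "blue",
  xlab = "x", ylab = "y",
  main = "Quadratic curve fitting")

# Fitted curve on a fine grid
x_plot <- seq(min(x_values), max(x_values), length.out = 100)
y_plot <- predict(
  lm_model,
  newdata = data.frame(x_values = x_plot))
lines(x_plot, y_plot, col = "red")

legend("topleft",
  legend = c("Data points", "Quadratic curve"),
  col = c("blue", "red"),
  pch = c(16, NA), lty = c(NA, 1))

# Close the device so the PNG file is written
invisible(dev.off())
```

Note that abline() only draws straight lines, so for the quadratic and cubic cases the lines() approach with predicted values is the one to use.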

The plot result can be shown as follows:

R: Trend: Built-in Plot: Quadratic Curve

The interactive JupyterLab notebook is available at the following link:

Cubic Curve

The same applies to the cubic curve: change the given data and the order, and make minor decorative changes. That simple.

order <- 3

The plot result can be shown as follows:

R: Trend: Built-in Plot: Cubic Curve

The interactive JupyterLab notebook is available at the following link:


Trend: ggplot2

For complex cases, we need the ggplot2 library. But the thing is, we need to understand that plotting has its own grammar.

Linear Equation

Let’s try for a straight line.

To keep the plot structure simple, let’s put the model outside. We need to generate values for the regression line, then use the result to create a data frame for ggplot2.

x_plot <- seq(
  min(x_values), max(x_values),
  length.out = 100)
y_plot <- predict(
  lm_model,
  newdata = data.frame(x_values = x_plot))

data <- data.frame(x = x_values, y = y_values)

Now we are ready for the view. Plot using ggplot2. As you can see, there are a lot of plus signs here. It is like objects stacked on top of one another, all in one ggplot2 figure.

plot <- ggplot(data, aes(x = x, y = y)) +
  geom_point(aes(color="Data Points"), size = 0.5) +
  geom_line(
    data = data.frame(x = x_plot, y = y_plot),
    aes(x, y, color="Linear Equation"),
    linewidth = 0.2) +
  labs(
    x = "x", y = "y",
    title = "Straight line fitting") +
  theme_minimal() +
  theme(legend.position = "right",
        legend.text = element_text(size = 2),
        text = element_text(size = 4)) +
  scale_color_manual(
    name = "Plot",
    breaks = c(
      "Data Points",
      "Linear Equation"),
    values = c(
      "Data Points"="red",
      "Linear Equation"="black")) +
  guides(
    color = guide_legend(
      override.aes = list(
        shape = c(16, NA), linetype = c(0, 1)
    )))

Do not forget to save to PNG for convenience. I’m using a specific size (width x height), so I can use the result directly in my blog article.

# Save plot as PNG
ggsave("14-lm-gg-line.png",
  plot, width = 800, height = 400, units = "px")

The plot result can be shown as follows:

R: Trend: ggplot2: Linear Equation

The interactive JupyterLab notebook is available at the following link:

Quadratic Curve

By changing the given data and the order, plus minor decorative changes, we can apply the same ggplot2 grammar: stack the smaller plot objects, and sum them all into the plot variable. Finally, save the PNG based on this generated plot variable.

order <- 2
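For reference, a trimmed-down sketch of the quadratic version could look like this. It assumes ggplot2 is installed and uses the ys2 series, hardcoded. The data frame names, the output file name, and the simplified theme are my own choices. The legend tweaks from the linear example (theme sizes and guides) are left out to keep it short.

```r
library(ggplot2)

x_values <- 0:12
y_values <- c(5, 12, 25, 44, 69, 100, 137, 180, 229, 284, 345, 412, 485)

order <- 2
lm_model <- lm(y_values ~ poly(x_values, order, raw = TRUE))

# Values for the fitted curve on a fine grid
x_plot <- seq(min(x_values), max(x_values), length.out = 100)
y_plot <- predict(
  lm_model,
  newdata = data.frame(x_values = x_plot))

# One data frame per layer
points_df <- data.frame(x = x_values, y = y_values)
curve_df  <- data.frame(x = x_plot, y = y_plot)

plot <- ggplot(points_df, aes(x = x, y = y)) +
  geom_point(aes(color = "Data Points")) +
  geom_line(data = curve_df, aes(color = "Quadratic Curve")) +
  labs(
    x = "x", y = "y",
    title = "Quadratic curve fitting") +
  theme_minimal() +
  scale_color_manual(
    name = "Plot",
    values = c(
      "Data Points" = "red",
      "Quadratic Curve" = "black"))

# Hypothetical output location, written to the session temp directory
out_file <- file.path(tempdir(), "lm-gg-quadratic.png")
ggsave(out_file, plot, width = 800, height = 400, units = "px")
```

Each layer receives its own data frame here, which is why geom_line() can draw 100 grid points while geom_point() draws only the 13 samples.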

The plot result can be shown as follows:

R: Trend: ggplot2: Quadratic Curve

The interactive JupyterLab notebook is available at the following link:

Cubic Curve

The same applies to the cubic curve. You can see the details in the source code.

order <- 3

The plot result can be shown as follows:

R: Trend: ggplot2: Cubic Curve

The interactive JupyterLab notebook is available at the following link:

Easy peasy right?


What’s the Next Chapter 🤔?

Let’s continue our R journey by building a class and exploring statistical properties.

Consider continuing your exploration with [ Trend - Language - R - Part Two ].