Where to Discuss?

Local Group

Preface

Goal: Explore visualization in the R programming language with ggplot2, providing the data using a linear model.

Let’s continue our previous R journey with building a class and exploring statistical properties.


Trend: Building Class

Real-life experience requires you to manage your code. This tidying is often done by creating a class. The tedious process usually starts by extracting functions first, then moving them into a class.

Refactor to Function

Let’s move the plain code above into functions. The problem with functions is that we need a bunch of variables for input and output between them.

First, we need to load the required libraries.

library(readr)
library(ggplot2)

Then put the code into functions one at a time. For example, this generate_regression function performs regression using lm(), generates values for the regression line, and finally returns a list containing all relevant objects.

generate_regression <- function(x_values, y_values, order) {
  ...

  return(list(
    order = order,
    x_values = x_values,
    y_values = y_values,
    lm_model = lm_model,
    x_plot = x_plot,
    y_plot = y_plot
  ))
}
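The body above is elided; here is a minimal sketch of what it might contain, assuming a raw polynomial fit with poly() and a predicted line over an evenly spaced grid:

generate_regression <- function(x_values, y_values, order) {
  # Fit a polynomial of the given order,
  # using raw (non-orthogonal) coefficients
  lm_model <- lm(y_values ~ poly(x_values, order, raw = TRUE))

  # Evaluate the fit over a smooth grid of x values
  x_plot <- seq(min(x_values), max(x_values), length.out = 100)
  y_plot <- predict(lm_model,
    newdata = data.frame(x_values = x_plot))

  return(list(
    order = order,
    x_values = x_values,
    y_values = y_values,
    lm_model = lm_model,
    x_plot = x_plot,
    y_plot = y_plot
  ))
}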

Then we can use the regression data to show the coefficients.

show_coeff <- function(regression_data) {
  order    <- regression_data$order
  lm_model <- regression_data$lm_model

  ...
}
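The body is elided here as well; a hedged sketch that simply prints the fitted coefficients could look like this:

show_coeff <- function(regression_data) {
  order    <- regression_data$order
  lm_model <- regression_data$lm_model

  # Print the coefficients of the fitted polynomial
  cat(sprintf("Coefficients for order %d:\n", order))
  print(coef(lm_model))
  cat("\n")
}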

Also, wrap the plotting code in a create_plot function.

create_plot <- function(
    regression_data, title, label, pngfile) {

  x_values <- regression_data$x_values
  y_values <- regression_data$y_values
  x_plot   <- regression_data$x_plot
  y_plot   <- regression_data$y_plot

  ...
}
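A possible body for create_plot, assuming label is used as the legend entry for the regression line and the figure is written with ggsave():

create_plot <- function(
    regression_data, title, label, pngfile) {

  x_values <- regression_data$x_values
  y_values <- regression_data$y_values
  x_plot   <- regression_data$x_plot
  y_plot   <- regression_data$y_plot

  # Scatter the observed points,
  # then overlay the fitted regression line
  plot <- ggplot() +
    geom_point(
      aes(x = x_values, y = y_values),
      color = "black") +
    geom_line(
      aes(x = x_plot, y = y_plot, color = label),
      linewidth = 1) +
    scale_color_manual(values = "red") +
    labs(title = title, x = "x", y = "y", color = NULL)

  ggsave(pngfile, plot = plot,
    width = 8, height = 4, dpi = 100)
}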

R: Trend: Building Class: Function

And finally, call each function in the main script, the outer block of the code.

regress_1 <- generate_regression(
  series$xs, series$ys1, 1)
regress_2 <- generate_regression(
  series$xs, series$ys2, 2)
regress_3 <- generate_regression(
  series$xs, series$ys3, 3)

show_coeff(regress_1)
create_plot(regress_1,
  "Straight line fitting",
  "Linear Equation",
  "14-lm-gg-line.png")

...

R: Trend: Building Class: Function

You can obtain the interactive JupyterLab at the following link:

Merging The Plot

Let’s have another case. Instead of making a different plot for each series, this time we merge the different polynomial orders into one plot for a single series. This can be done using code with the skeleton below:

generate_regressions <- function(x_values, y_values) {
  ...
}

show_coeff <- function(order, lm_model) {
  ...
}

create_plot <- function(regression_data) {
  ...
}
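The bodies are again elided. A hedged sketch of generate_regressions, assuming it fits all three orders against the same series (the lm_model_1 through lm_model_3 names match the calls below):

generate_regressions <- function(x_values, y_values) {
  # Fit polynomials of order one, two, and three
  lm_model_1 <- lm(y_values ~ poly(x_values, 1, raw = TRUE))
  lm_model_2 <- lm(y_values ~ poly(x_values, 2, raw = TRUE))
  lm_model_3 <- lm(y_values ~ poly(x_values, 3, raw = TRUE))

  # Shared grid for all three regression lines
  x_plot <- seq(min(x_values), max(x_values), length.out = 100)
  grid   <- data.frame(x_values = x_plot)

  return(list(
    x_values = x_values, y_values = y_values,
    lm_model_1 = lm_model_1,
    lm_model_2 = lm_model_2,
    lm_model_3 = lm_model_3,
    x_plot = x_plot,
    y_plot_1 = predict(lm_model_1, newdata = grid),
    y_plot_2 = predict(lm_model_2, newdata = grid),
    y_plot_3 = predict(lm_model_3, newdata = grid)
  ))
}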

R: Trend: Building Class: Merge

Now we can call the functions in the main script: get the regression data, perform the linear regression, and plot.

regress <- generate_regressions(
    series$xs, series$ys3)

show_coeff(1, regress$lm_model_1)
show_coeff(2, regress$lm_model_2)
show_coeff(3, regress$lm_model_3)
plot <- create_plot(regress)

R: Trend: Building Class: Merge

The plot result can be shown as follows. You can see that the legend lost one color: the green one for the quadratic curve does not show well. You can solve this by removing the guides, as sketched after the figure.

R: Trend: Building Class: Plain Merge
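A hedged one-liner for that fix, assuming the combined plot object is named plot:

# Drop the colour guide so the legend
# no longer shows the broken entry
plot <- plot + guides(colour = "none")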

You can obtain the interactive JupyterLab at the following link:

Again, the problem with functions is that we need a bunch of variables for input and output between them.

Using R6 Class

Instead of functions, we can wrap them all into a class. R has a few kinds of class implementations; one of them is R6.

Let’s see how it goes with this skeleton.

library(R6)

CurveFitting <- R6Class("CurveFitting",

  public = list(
    xs  = NULL, ys  = NULL,
    lm1 = NULL, lm2 = NULL, lm3 = NULL,
    xp  = NULL, yp1 = NULL, yp2 = NULL, yp3 = NULL,

    initialize = function(xs, ys) {
      self$xs <- xs
      self$ys <- ys
    },

    generate_regressions = function() {
      ...
    },

    show_coeff = function(order, lm_model) {
      ...
    },

    show_coeffs = function() {
      self$show_coeff(1, self$lm1)
      self$show_coeff(2, self$lm2)
      self$show_coeff(3, self$lm3)
    },

    create_plot = function() {
      ...
    }
  )
)
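The method bodies are elided; as a hedged sketch, generate_regressions could fill the public fields like this (same poly() fits as before):

    generate_regressions = function() {
      xs <- self$xs
      ys <- self$ys

      # Fit polynomials of order one to three
      self$lm1 <- lm(ys ~ poly(xs, 1, raw = TRUE))
      self$lm2 <- lm(ys ~ poly(xs, 2, raw = TRUE))
      self$lm3 <- lm(ys ~ poly(xs, 3, raw = TRUE))

      # Evaluate each fit over a smooth grid
      self$xp <- seq(min(xs), max(xs), length.out = 100)
      grid    <- data.frame(xs = self$xp)
      self$yp1 <- predict(self$lm1, newdata = grid)
      self$yp2 <- predict(self$lm2, newdata = grid)
      self$yp3 <- predict(self$lm3, newdata = grid)
    }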

R: Trend: Building Class: R6 Class

Now we can instantiate the CurveFitting class in the main script, then perform the linear regression, display the coefficients, and finally create the plot.

curve_fitting <- CurveFitting$new(series$xs, series$ys3)
curve_fitting$generate_regressions()
curve_fitting$show_coeffs()
curve_fitting$create_plot()

R: Trend: Building Class: R6 Class

The plot result can be shown as follows:

R: Trend: Building Class: R6 Class

You can obtain the interactive JupyterLab at the following link:

Built-in Plot (alternative)

As an alternative, you can use R's built-in plot() instead of ggplot2.

    create_plot = function() {
      # Open PNG graphics device
      png("19-lm-merge-alt.png",
        width = 800, height = 400)

      # Plot data points
      plot(self$xs, self$ys,
        col = "black", pch = 20,
        xlab = "x", ylab = "y",
        main = "Polynomial Curve Fitting")

      # Plot regression lines
      lines(self$xp, self$yp1, col = "red", lwd = 2)
      lines(self$xp, self$yp2, col = "green", lwd = 2)
      lines(self$xp, self$yp3, col = "blue", lwd = 2)

      # Add legend
      legend("topright",
         legend = c("Data Points", "Linear Equation",
                    "Quadratic Curve", "Cubic Curve"),
         col = c("black", "red", "green", "blue"),
         pch = c(20, NA, NA, NA), lwd = c(NA, 2, 2, 2))
    }

R: Trend: Building Class: R6 Class

The plot result can be shown as follows:

R: Trend: Building Class: R6 Class

You can obtain the interactive JupyterLab at the following link:

Note that not everything should be heavily structured. For simple things, we'd better take shortcuts where possible, as will be shown in the next R article.


Statistic Properties

We will have a look at how R handles statistical properties, starting with least squares.

Manual Calculation

You can check the details of the manual calculation in the source code below:

...
# Calculate deviations
x_deviation <- x_observed - x_mean
y_deviation <- y_observed - y_mean

# Calculate squared deviations
x_sq_deviation <- sum(x_deviation^2)
y_sq_deviation <- sum(y_deviation^2)
...
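The elided remainder presumably derives the slope and intercept from these sums. A minimal sketch, reusing x_mean and y_mean from the elided part above:

# Cross deviation: sum of (x - mean)(y - mean)
xy_deviation <- sum(x_deviation * y_deviation)

# Least-squares slope and intercept
m_slope     <- xy_deviation / x_sq_deviation
b_intercept <- y_mean - m_slope * x_mean

cat(sprintf("m (slope)     = %9.2f\n", m_slope))
cat(sprintf("b (intercept) = %9.2f\n", b_intercept))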

R: Statistic Properties: Manual

You can obtain the interactive JupyterLab at the following link:

Using Built-in Method

You can also check the details of the built-in method in the source code below:

...
# Calculate slope (m) and intercept (b)
# by fitting the model once and extracting both
lm_fit <- lm(y_observed ~ x_observed)
m_slope     <- lm_fit$coefficients[2]
b_intercept <- lm_fit$coefficients[1]
...

R: Statistic Properties: Built-in

Well, not everything has a built-in method. I must admit I haven't explored R's statistical libraries thoroughly.
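That said, the standard error and t-value shown in the output below do have a built-in route: summary() on a fitted model provides them. A minimal, hedged example:

lm_fit <- lm(y_observed ~ x_observed)

# Coefficient table: estimate, std. error,
# t value, and p value for each term
print(summary(lm_fit)$coefficients)

# R squared is also available directly
print(summary(lm_fit)$r.squared)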

❯ Rscript 52-lq-built-in.r
n          =   13
∑x (total) =   78.00
∑y (total) = 2327.00
x̄ (mean)   =    6.00
ȳ (mean)   =  179.00

∑(xᵢ-x̄)    =      0.00
∑(yᵢ-ȳ)    =      0.00
∑(xᵢ-x̄)²   =    182.00
∑(yᵢ-ȳ)²   = 309218.00
∑(xᵢ-x̄)(yᵢ-ȳ)  =   7280.00
m (slope)      =     40.00
b (intercept)  =    -61.00

Equation     y = -61.00 + 40.00.x

sₓ² (variance) =     15.17
sy² (variance) =  25768.17
covariance     =    606.67
sₓ (std dev)   =      3.89
sy (std dev)   =    160.52
r (pearson)    =      0.97
R²             =      0.94

SSR = ∑ϵ²           =  18018.00
MSE = ∑ϵ²/(n-2)     =   1638.00
SE(β₁)  = √(MSE/sₓ) =      3.00
t-value = β̅₁/SE(β₁) =     13.33

You can obtain the interactive JupyterLab at the following link:

The result is the same as the Python counterpart.

Visualization

However, I guess the lm above is enough to calculate the regression line. Then we can combine the scatter plot and the regression line.

We can start with the usual scatter plot.
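A minimal sketch of that scatter plot, assuming the x_observed and y_observed vectors from the calculation above:

library(ggplot2)

# Plot the observed points as a plain scatter
scatter_plot <- ggplot(
    data.frame(x = x_observed, y = y_observed),
    aes(x = x, y = y)) +
  geom_point(color = "black") +
  labs(x = "x", y = "y")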

R: Statistic Properties: Plot

Then add this part to the plot grammar.

regression_line <- geom_abline(
  intercept = b_intercept, slope = m_slope,
  color = "red")

plot <- scatter_plot + regression_line

R: Statistic Properties: Plot

The plot result can be shown as follows:

R: Least Square: Simple Plot

You can obtain the interactive JupyterLab at the following link:

Additional Properties

We can also calculate additional properties. But in order to calculate the kurtosis and skewness, we need the e1071 library.

library(e1071)

x_kurtosis <- kurtosis(x_observed)
y_kurtosis <- kurtosis(y_observed)

x_skewness <- skewness(x_observed)
y_skewness <- skewness(y_observed)

cat(sprintf('x kurtosis     = %9.2f\n', x_kurtosis))
cat(sprintf('y kurtosis     = %9.2f\n', y_kurtosis))
cat(sprintf('x skewness     = %9.2f\n', x_skewness))
cat(sprintf('y skewness     = %9.2f\n\n', y_skewness))

R: Statistic Properties: Additional

The result is not quite the same as the Python counterpart.

❯ Rscript 61-additional.r
x (max, min, range) = (   0.00,   12.00,   12.00 )
y (max, min, range) = (   5.00,  485.00,  480.00 )

x median       =      6.00
y median       =    137.00
x mode         =      0.00
y mode         =      5.00

x kurtosis     =     -1.48
y kurtosis     =     -1.22
x skewness     =      0.00
y skewness     =      0.54

x SE kurtosis  =      1.19
y SE kurtosis  =      1.19
x SE skewness  =      0.62
y SE skewness  =      0.62

You can obtain the interactive JupyterLab at the following link:

As you can see, the kurtosis and the skewness are different compared with PSPP. Just be aware of it.

I still don't know the exact reason for this difference. One possible cause: e1071's kurtosis() and skewness() implement several estimator formulas through their type argument, and the default differs from the SPSS-style estimator that PSPP likely uses.


What’s the Next Chapter 🤔?

Let’s continue our previous ggplot2 journey.

Consider continuing your exploration with [ Trend - Language - R - Part Three ].