### Preface

Goal: Explore R Programming language visualization with ggplot2. Providing the data using linear model.

Let’s continue our previous `R`

journey,
with building class and statistical properties.

### Trend: Building Class

Real life experience requriee you to manage your code. This tidying process often be done by creating a class. This tedious process usually started by making the function first, then move them to a class.

#### Refactor to Function

Let’s make move the palin code above to a function. The problem with function is we need a bunch variables, for input and output between function.

First we need to load required libraries.

```
library(readr)
library(ggplot2)
```

And put the code to function one at a time,
for example this `generate_regression`

function,
that performing regression using `lm()`

,
generate values for the regression line,
and finally return a list containing all relevant objects.

```
generate_regression <- function(x_values, y_values, order) {
...
return(list(
order = order,
x_values = x_values,
y_values = y_values,
lm_model = lm_model,
x_plot = x_plot,
y_plot = y_plot
))
}
```

Then we can use the regression data, to show the coefficient.

```
show_coeff <- function(regression_data) {
order <- regression_data$order
lm_model <- regression_data$lm_model
...
}
```

Also wrap the plot in a `create_plot`

function.

```
create_plot <- function(
regression_data, title, label, pngfile) {
x_values <- regression_data$x_values
y_values <- regression_data$y_values
x_plot <- regression_data$x_plot
y_plot <- regression_data$y_plot
...
}
```

And finally call each functions, in main script, the outer block of the code.

```
regress_1 <- generate_regression(
series$xs, series$ys1, 1)
regress_2 <- generate_regression(
series$xs, series$ys2, 2)
regress_3 <- generate_regression(
series$xs, series$ys3, 3)
show_coeff(regress_1)
create_plot(regress_1,
"Straight line fitting",
"Linear Equation",
"14-lm-gg-line.png")
...
```

You can obtain the interactive `JupyterLab`

in this following link:

#### Merging The Plot

Let’s have another case. Instead of making different plot for different series, This time merge the different order for just one series. This can be done by using code with skeleton as below:

```
generate_regressions <- function(x_values, y_values) {
...
}
show_coeff <- function(order, lm_model) {
...
}
create_plot <- function(regression_data) {
...
}
```

Now we can call the function in main script. Get the regression data. And perform linear regression and plot.

```
regress <- generate_regressions(
series$xs, series$ys3)
show_coeff(1, regress$lm_model_1)
show_coeff(2, regress$lm_model_2)
show_coeff(3, regress$lm_model_3)
plot <- create_plot(regress)
```

The plot result can be shown as follows. You can see that in the legend, it lost one color. The green one for quadratic curve legend does not shown well. You can solve this by removing the guides.

You can obtain the interactive `JupyterLab`

in this following link:

The problem with function is we need a bunch variables, for input and output between function.

#### Using R6 Class

Instead of function me can wrap them all into a class.
`R`

has a few kind of class implementation,
one of them is `R6`

Let’s see how it goes with this skeleton.

```
CurveFitting <- R6Class("CurveFitting",
public = list(
xs = NULL, ys = NULL,
lm1 = NULL, lm2 = NULL, lm3 = NULL,
xp = NULL, yp1 = NULL, yp2 = NULL, yp3 = NULL,
initialize = function(xs, ys) {
self$xs <- xs
self$ys <- ys
},
generate_regressions = function() {
...
},
show_coeff = function(order, lm_model) {
...
},
show_coeffs = function() {
self$show_coeff(1, self$lm1)
self$show_coeff(2, self$lm2)
self$show_coeff(3, self$lm3)
},
create_plot = function() {
...
}
)
)
```

Now we can instantiate the `CurveFitting`

class in main script.
Then perform linear regression, display, finally and create the plot.

```
curve_fitting <- CurveFitting$new(series$xs, series$ys3)
curve_fitting$generate_regressions()
curve_fitting$show_coeffs()
curve_fitting$create_plot()
```

The plot result can be shown as follows:

You can obtain the interactive `JupyterLab`

in this following link:

#### Built in Plot (alternative)

As an alternative,
you can use built-in plot instead of `ggplot2`

.

```
create_plot = function() {
# Open PNG graphics device
png("19-lm-merge-alt.png",
width = 800, height = 400)
# Plot data points
plot(self$xs, self$ys,
col = "black", pch = 20,
xlab = "x", ylab = "y",
main = "Polynomial Curve Fitting")
# Plot regression lines
lines(self$xp, self$yp1, col = "red", lwd = 2)
lines(self$xp, self$yp2, col = "green", lwd = 2)
lines(self$xp, self$yp3, col = "blue", lwd = 2)
# Add legend
legend("topright",
legend = c("Data Points", "Linear Equation",
"Quadratic Curve", "Cubic Curve"),
col = c("black", "red", "green", "blue"),
pch = c(20, NA, NA, NA), lwd = c(NA, 2, 2, 2))
}
```

The plot result can be shown as follows:

You can obtain the interactive `JupyterLab`

in this following link:

Note that, not everything should be well structured.
For simple thing, we’d better use shortcut as possible,
as will be shown in the next `R`

article.

### Statistic Properties

We will have look at how `R`

,
handle statistical properties.
Starting with least square.

#### Manual Calculation

You can check the detail of the manual calculation in the source code below:

```
...
# Calculate deviations
x_deviation <- x_observed - x_mean
y_deviation <- y_observed - y_mean
# Calculate squared deviations
x_sq_deviation <- sum(x_deviation^2)
y_sq_deviation <- sum(y_deviation^2)
...
```

You can obtain the interactive `JupyterLab`

in this following link:

#### Using Built-in Method

You can also check the detail of the built-in method in the source code below:

```
...
# Calculate slope (m) and intercept (b)
m_slope <- lm(y_observed ~
x_observed)$coefficients[2]
b_intercept <- lm(y_observed ~
x_observed)$coefficients[1]
...
```

Well, not everything has built-in method. I must admit I haven’t explore R’s statistical library thoroughly.

```
❯ Rscript 52-lq-built-in.r
n = 13
∑x (total) = 78.00
∑y (total) = 2327.00
x̄ (mean) = 6.00
ȳ (mean) = 179.00
∑(xᵢ-x̄) = 0.00
∑(yᵢ-ȳ) = 0.00
∑(xᵢ-x̄)² = 182.00
∑(yᵢ-ȳ)² = 309218.00
∑(xᵢ-x̄)(yᵢ-ȳ) = 7280.00
m (slope) = 40.00
b (intercept) = -61.00
Equation y = -61.00 + 40.00.x
sₓ² (variance) = 15.17
sy² (variance) = 25768.17
covariance = 606.67
sₓ (std dev) = 3.89
sy (std dev) = 160.52
r (pearson) = 0.97
R² = 0.94
SSR = ∑ϵ² = 18018.00
MSE = ∑ϵ²/(n-2) = 1638.00
SE(β₁) = √(MSE/sₓ) = 3.00
t-value = β̅₁/SE(β₁) = 13.33
```

You can obtain the interactive `JupyterLab`

in this following link:

The result is the same with the python counterpart.

#### Visualization

However, I guess the `lm`

above is,
enough to calculate regression line.
And then we can combine scatter plot and regression line.

We can start with usual scatter plot.

Then add this part the to plot grammar.

```
regression_line <- geom_abline(
intercept = b_intercept, slope = m_slope,
color = "red")
plot <- scatter_plot + regression_line
```

The plot result can be shown as follows:

You can obtain the interactive `JupyterLab`

in this following link:

#### Additional Properties

We can also calculate additional properties.
But in order to calculate kurtosis and skewness,
we need this `e1071`

library.

```
x_kurtosis <- kurtosis(x_observed)
y_kurtosis <- kurtosis(y_observed)
x_skewness <- skewness(x_observed)
y_skewness <- skewness(y_observed)
cat(sprintf('x kurtosis = %9.2f\n', x_kurtosis))
cat(sprintf('y kurtosis = %9.2f\n', y_kurtosis))
cat(sprintf('x skewness = %9.2f\n', x_skewness))
cat(sprintf('y skewness = %9.2f\n\n', y_skewness))
```

The result is not really the same with the python counterpart.

```
❯ Rscript 61-additional.r
x (max, min, range) = ( 0.00, 12.00, 12.00 )
y (max, min, range) = ( 5.00, 485.00, 480.00 )
x median = 6.00
y median = 137.00
x mode = 0.00
y mode = 5.00
x kurtosis = -1.48
y kurtosis = -1.22
x skewness = 0.00
y skewness = 0.54
x SE kurtosis = 1.19
y SE kurtosis = 1.19
x SE skewness = 0.62
y SE skewness = 0.62
```

You can obtain the interactive `JupyterLab`

in this following link:

As you can see, the kurtosis and the skewness is different, compared with PSPP. Just be aware of it.

I still don’t know the reason of this differences.

### What’s the Next Chapter 🤔?

Let’s continue our previous `ggplot2`

journey.

Consider continuing your exploration with [ Trend - Language - R - Part Three ].