Preface
Goal: Explore visualization in the R programming language with ggplot2, providing the data using a linear model.
Let’s continue our previous R journey, this time building a class and exploring statistical properties.
Trend: Building a Class
Real-life experience requires you to manage your code. This tidying is often done by creating a class. The tedious process usually starts by writing functions first, then moving them into a class.
Refactor to Functions
Let’s move the plain code above into functions. The problem with functions is that we need a bunch of variables for input and output between them.
First, we need to load the required libraries.
library(readr)
library(ggplot2)
Then put the code into functions one at a time. For example, this generate_regression function performs the regression using lm(), generates values for the regression line, and finally returns a list containing all relevant objects.
generate_regression <- function(x_values, y_values, order) {
...
return(list(
order = order,
x_values = x_values,
y_values = y_values,
lm_model = lm_model,
x_plot = x_plot,
y_plot = y_plot
))
}
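The elided part can be filled in like this sketch, assuming the polynomial fit uses poly() and the regression line is sampled with predict(); the grid size of 100 is my own choice:

# Fit a polynomial of the given order using lm()
lm_model <- lm(y_values ~ poly(x_values, order, raw = TRUE))

# Sample the fitted curve on a fine grid for the plot line
x_plot <- seq(min(x_values), max(x_values), length.out = 100)
y_plot <- predict(lm_model,
  newdata = data.frame(x_values = x_plot))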
Then we can use the regression data to show the coefficients.
show_coeff <- function(regression_data) {
order <- regression_data$order
lm_model <- regression_data$lm_model
...
}
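The elided part can be as simple as printing the fitted coefficients; a minimal sketch:

# Print the fitted coefficients for this polynomial order
cat(sprintf("Coefficients (order %d):\n", order))
print(coef(lm_model))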
Also wrap the plot in a create_plot function.
create_plot <- function(
regression_data, title, label, pngfile) {
x_values <- regression_data$x_values
y_values <- regression_data$y_values
x_plot <- regression_data$x_plot
y_plot <- regression_data$y_plot
...
}
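As a sketch, the elided part could build the grammar from a scatter layer plus a line layer, then save the result with ggsave(); the theme and output size here are my own assumptions:

# Scatter plot of the observations plus the fitted line
plot <- ggplot() +
  geom_point(data = data.frame(x = x_values, y = y_values),
    aes(x = x, y = y), color = "black") +
  geom_line(data = data.frame(x = x_plot, y = y_plot),
    aes(x = x, y = y, color = label)) +
  labs(title = title, x = "x", y = "y", color = "") +
  theme_minimal()

# Save to file, then return the plot object
ggsave(pngfile, plot, width = 8, height = 4)
plot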
And finally, call each function in the main script, the outer block of the code.
regress_1 <- generate_regression(
series$xs, series$ys1, 1)
regress_2 <- generate_regression(
series$xs, series$ys2, 2)
regress_3 <- generate_regression(
series$xs, series$ys3, 3)
show_coeff(regress_1)
create_plot(regress_1,
"Straight line fitting",
"Linear Equation",
"14-lm-gg-line.png")
...
You can obtain the interactive JupyterLab at the following link:
Merging The Plot
Let’s have another case. Instead of making a different plot for each series, this time we merge the different orders into just one plot, for a single series. This can be done using code with the skeleton below:
generate_regressions <- function(x_values, y_values) {
...
}
show_coeff <- function(order, lm_model) {
...
}
create_plot <- function(regression_data) {
...
}
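A possible body for generate_regressions, assuming it fits all three orders at once and shares one plotting grid; the list field names follow the main script below:

generate_regressions <- function(x_values, y_values) {
  # Fit first, second, and third order polynomials
  lm_model_1 <- lm(y_values ~ poly(x_values, 1, raw = TRUE))
  lm_model_2 <- lm(y_values ~ poly(x_values, 2, raw = TRUE))
  lm_model_3 <- lm(y_values ~ poly(x_values, 3, raw = TRUE))

  # One shared grid for all three fitted curves
  x_plot <- seq(min(x_values), max(x_values), length.out = 100)
  newdata <- data.frame(x_values = x_plot)

  return(list(
    x_values = x_values, y_values = y_values,
    lm_model_1 = lm_model_1,
    lm_model_2 = lm_model_2,
    lm_model_3 = lm_model_3,
    x_plot = x_plot,
    y_plot_1 = predict(lm_model_1, newdata),
    y_plot_2 = predict(lm_model_2, newdata),
    y_plot_3 = predict(lm_model_3, newdata)
  ))
}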
Now we can call the functions in the main script: get the regression data, then perform the regression and plot.
regress <- generate_regressions(
series$xs, series$ys3)
show_coeff(1, regress$lm_model_1)
show_coeff(2, regress$lm_model_2)
show_coeff(3, regress$lm_model_3)
plot <- create_plot(regress)
The plot result is shown as follows. You can see that the legend lost one color: the green entry for the quadratic curve does not show up well. You can solve this by removing the guides.
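In ggplot2, removing the guide is a one-liner; for example:

# Drop the colour guide so the clipped legend entry disappears
plot <- plot + guides(color = "none")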
You can obtain the interactive JupyterLab at the following link:
As noted earlier, the problem with functions is that we need a bunch of variables for input and output between them.
Using R6 Class
Instead of functions, we can wrap them all into a class. R has a few kinds of class implementations; one of them is R6. Let’s see how it goes with this skeleton.
library(R6)

CurveFitting <- R6Class("CurveFitting",
public = list(
xs = NULL, ys = NULL,
lm1 = NULL, lm2 = NULL, lm3 = NULL,
xp = NULL, yp1 = NULL, yp2 = NULL, yp3 = NULL,
initialize = function(xs, ys) {
self$xs <- xs
self$ys <- ys
},
generate_regressions = function() {
...
},
show_coeff = function(order, lm_model) {
...
},
show_coeffs = function() {
self$show_coeff(1, self$lm1)
self$show_coeff(2, self$lm2)
self$show_coeff(3, self$lm3)
},
create_plot = function() {
...
}
)
)
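Inside the class, the variables that used to travel between functions become self$ fields. A sketch of the generate_regressions method, under the same assumptions as the function version:

generate_regressions = function() {
  xs <- self$xs
  ys <- self$ys

  # Store results in fields instead of returning a list
  self$lm1 <- lm(ys ~ poly(xs, 1, raw = TRUE))
  self$lm2 <- lm(ys ~ poly(xs, 2, raw = TRUE))
  self$lm3 <- lm(ys ~ poly(xs, 3, raw = TRUE))

  # One shared grid for the fitted curves
  self$xp <- seq(min(xs), max(xs), length.out = 100)
  newdata <- data.frame(xs = self$xp)
  self$yp1 <- predict(self$lm1, newdata)
  self$yp2 <- predict(self$lm2, newdata)
  self$yp3 <- predict(self$lm3, newdata)
}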
Now we can instantiate the CurveFitting class in the main script, then perform the regression, display the coefficients, and finally create the plot.
curve_fitting <- CurveFitting$new(series$xs, series$ys3)
curve_fitting$generate_regressions()
curve_fitting$show_coeffs()
curve_fitting$create_plot()
The plot result is shown as follows:
You can obtain the interactive JupyterLab at the following link:
Built-in Plot (alternative)
As an alternative, you can use the built-in plot function instead of ggplot2.
create_plot = function() {
# Open PNG graphics device
png("19-lm-merge-alt.png",
width = 800, height = 400)
# Plot data points
plot(self$xs, self$ys,
col = "black", pch = 20,
xlab = "x", ylab = "y",
main = "Polynomial Curve Fitting")
# Plot regression lines
lines(self$xp, self$yp1, col = "red", lwd = 2)
lines(self$xp, self$yp2, col = "green", lwd = 2)
lines(self$xp, self$yp3, col = "blue", lwd = 2)
# Add legend
legend("topright",
legend = c("Data Points", "Linear Equation",
"Quadratic Curve", "Cubic Curve"),
col = c("black", "red", "green", "blue"),
pch = c(20, NA, NA, NA), lwd = c(NA, 2, 2, 2))
# Close the graphics device so the PNG file is written
dev.off()
}
The plot result is shown as follows:
You can obtain the interactive JupyterLab at the following link:
Note that not everything should be well structured. For simple things, we’d better use shortcuts where possible, as will be shown in the next R article.
Statistical Properties
We will have a look at how R handles statistical properties, starting with least squares.
Manual Calculation
You can check the detail of the manual calculation in the source code below:
...
# Calculate deviations
x_deviation <- x_observed - x_mean
y_deviation <- y_observed - y_mean
# Calculate squared deviations
x_sq_deviation <- sum(x_deviation^2)
y_sq_deviation <- sum(y_deviation^2)
...
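Continuing the sketch, the slope and intercept then follow from these deviations via the usual least-squares formulas:

# Cross deviation, then the least-squares estimates
xy_deviation <- sum(x_deviation * y_deviation)
m_slope <- xy_deviation / x_sq_deviation
b_intercept <- y_mean - m_slope * x_mean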
You can obtain the interactive JupyterLab at the following link:
Using Built-in Method
You can also check the detail of the built-in method in the source code below:
...
# Calculate slope (m) and intercept (b)
m_slope <- lm(y_observed ~ x_observed)$coefficients[2]
b_intercept <- lm(y_observed ~ x_observed)$coefficients[1]
...
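That said, several of the properties in the output below do have base R counterparts, for example:

# Base R helpers for a few of the printed properties
x_variance <- var(x_observed)              # sₓ²
y_variance <- var(y_observed)              # sy²
xy_covariance <- cov(x_observed, y_observed)
r_pearson <- cor(x_observed, y_observed)   # r (pearson)
r_squared <- r_pearson^2                   # R²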
Well, not everything has a built-in method. I must admit I haven’t explored R’s statistical libraries thoroughly.
❯ Rscript 52-lq-built-in.r
n = 13
∑x (total) = 78.00
∑y (total) = 2327.00
x̄ (mean) = 6.00
ȳ (mean) = 179.00
∑(xᵢ-x̄) = 0.00
∑(yᵢ-ȳ) = 0.00
∑(xᵢ-x̄)² = 182.00
∑(yᵢ-ȳ)² = 309218.00
∑(xᵢ-x̄)(yᵢ-ȳ) = 7280.00
m (slope) = 40.00
b (intercept) = -61.00
Equation y = -61.00 + 40.00.x
sₓ² (variance) = 15.17
sy² (variance) = 25768.17
covariance = 606.67
sₓ (std dev) = 3.89
sy (std dev) = 160.52
r (pearson) = 0.97
R² = 0.94
SSR = ∑ϵ² = 18018.00
MSE = ∑ϵ²/(n-2) = 1638.00
SE(β₁) = √(MSE/sₓ) = 3.00
t-value = β̅₁/SE(β₁) = 13.33
You can obtain the interactive JupyterLab at the following link:
The result is the same as the Python counterpart.
Visualization
However, I guess the lm above is enough to calculate the regression line. Then we can combine the scatter plot and the regression line, starting with the usual scatter plot.
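A minimal sketch of that scatter plot, assuming the same x_observed and y_observed as before:

scatter_plot <- ggplot(
    data.frame(x = x_observed, y = y_observed),
    aes(x = x, y = y)) +
  geom_point(color = "black") +
  labs(x = "x", y = "y")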
Then add this part to the plot grammar.
regression_line <- geom_abline(
intercept = b_intercept, slope = m_slope,
color = "red")
plot <- scatter_plot + regression_line
The plot result is shown as follows:
You can obtain the interactive JupyterLab at the following link:
Additional Properties
We can also calculate additional properties. But in order to calculate the kurtosis and the skewness, we need the e1071 library.
library(e1071)

x_kurtosis <- kurtosis(x_observed)
y_kurtosis <- kurtosis(y_observed)
x_skewness <- skewness(x_observed)
y_skewness <- skewness(y_observed)
cat(sprintf('x kurtosis = %9.2f\n', x_kurtosis))
cat(sprintf('y kurtosis = %9.2f\n', y_kurtosis))
cat(sprintf('x skewness = %9.2f\n', x_skewness))
cat(sprintf('y skewness = %9.2f\n\n', y_skewness))
The result is not quite the same as the Python counterpart.
❯ Rscript 61-additional.r
x (max, min, range) = ( 0.00, 12.00, 12.00 )
y (max, min, range) = ( 5.00, 485.00, 480.00 )
x median = 6.00
y median = 137.00
x mode = 0.00
y mode = 5.00
x kurtosis = -1.48
y kurtosis = -1.22
x skewness = 0.00
y skewness = 0.54
x SE kurtosis = 1.19
y SE kurtosis = 1.19
x SE skewness = 0.62
y SE skewness = 0.62
You can obtain the interactive JupyterLab at the following link:
As you can see, the kurtosis and the skewness are different compared with PSPP. Just be aware of it. I still don’t know the reason for this difference.
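One thing worth checking: according to the e1071 documentation, skewness() and kurtosis() select among several algorithms via a type argument, and SPSS-family tools use the bias-corrected type 2 formulas, so passing type = 2 may reconcile the numbers:

# type = 2 selects the bias-corrected formulas,
# reportedly the ones used by SPSS-like tools
x_kurtosis <- kurtosis(x_observed, type = 2)
x_skewness <- skewness(x_observed, type = 2)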
What’s the Next Chapter 🤔?
Let’s continue our previous ggplot2 journey.
Consider continuing your exploration with [ Trend - Language - R - Part Three ].