Preface

Goal: Explore visualization in the R programming language with ggplot2, fitting the provided data using linear models.

R You Ready?

Welcome to the R-side of our trend trilogy. If this were a movie, this would be where we zoom in on a shy but powerful character. The kind who looks awkward in crowds but aces calculus in their sleep.

Let’s be honest. For many of us who came from a coding-first background, R feels… different. Less like a programming language, more like a statistics whisperer with strong opinions on plotting aesthetics. At first, I tiptoed around it. Then I made a few plots. And suddenly, I was in love.

R may be quirky, but like that brilliant friend who labels their spice jars alphabetically, it rewards patience. And let’s talk power:

  • Like Python’s seaborn, R has ggplot2, a charting ninja with a PhD in grammar.

  • Like NumPy’s polyfit, R has lm(), a humble function that delivers linear models with confidence intervals and style.

  • Like Python, R is beginner-friendly. Until it isn’t. Then it’s mysterious. But then you learn one more trick, and it all clicks again.

Instead of asking every question in the R forums (and being hit with “Have you read the docs?”), I took the engineer’s path: build a bunch of working examples first. Now, I can answer questions and feel smug while doing it.


Preparation

Gearing Up

Before we dive in, let’s do the statistical equivalent of sharpening our chisels and lining up our rulers.

Of course you need R installed on your system. No need for RStudio, but there are a few things to consider.

📦 Library Check

We’ll start simple. No tidyverse buffet yet. We’re going à la carte for understanding’s sake. Install only what we need.

The scripts provided here start from the very basics, and you will need to pull in additional libraries from time to time. You can install the packages from the R console.

install.packages("readr")
install.packages("ggplot2")
install.packages("ggthemes")

Yes, tidyverse is like installing a whole kitchen set. But for now, we’re just learning how to fry an egg, not host a cooking show. I’d simply add one library at a time, to build understanding step by step.

Jupyter Lab

Jupyter + R = ❤️

Prefer to work in a Jupyter Lab environment like a civilized data nerd? Great! We’ll want to activate the R kernel:

IRkernel::installspec()
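If the IRkernel package itself is not installed yet, the setup is a two-step affair; a minimal sketch of the one-time setup, run from a plain R session:

# Install the R kernel for Jupyter (one-time setup)
install.packages("IRkernel")

# Register the kernel so it shows up in the Jupyter launcher
IRkernel::installspec()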

Totally optional, but nice when we like mixing Markdown with R code, and pretending we’re writing a thesis.

Data Series Samples

One Table to Rule Them All

Why do we care about having a tiny dataset with only a few values? Because small data = fewer distractions. We want to focus on the plotting and analysis, not debugging typos in row 2748.

We will use two example datasets:

  1. 📈 Series Data for Plotting Practice

This one’s got multiple series—great for trying out melt, layering lines, and watching trends grow exponentially, like our panic before a stats exam.

xs, ys1, ys2, ys3
0,  5,   5,   5
1,  9,   12,  14
2,  13,  25,  41
3,  17,  44,  98
4,  21,  69,  197
5,  25,  100, 350
6,  29,  137, 569
7,  33,  180, 866
8,  37,  229, 1253
9,  41,  284, 1742
10, 45,  345, 2345
11, 49,  412, 3074
12, 53,  485, 3941

R: ggplot2: Statistical Properties: CSV Source

Useful when we want to get fancy and explore the art of beautiful chaos. Save it as series.csv; the scripts below read from that file.

  2. 📏 Sample Data for Regression

This one is plain and perfect for linear regression, least squares, and other noble pursuits.

x,y
0,5
1,12
2,25
3,44
4,69
5,100
6,137
7,180
8,229
9,284
10,345
11,412
12,485

Yes, we call it samples instead of population, because we’re honest statisticians, and don’t want to go to academic jail.


Trend: LM Model

Linear Model

Let us summon the power of lm() and enter the arcane art of modeling. Today, we predict the future using the most noble of tools: the line (and its polynomial siblings).

Vector

In the mystical land of R, arrays are known as vectors. Think of them as the “data sushi rolls” that hold everything together.

# Given data
x_values <- c(
  0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
y_values <- c(
  5, 14, 41, 98, 197, 350, 569, 866, 
  1253, 1742, 2345, 3074, 3941)

R: Trend: LM Model: Vector

This is the classic “x and y walked into a scatterplot” setup.

We aim to fit our values using a linear regression model. Mathematically, it looks like this: y = a·x + b.

But why stop at a line when we can escalate this to curves with extra flair? Let us unleash the lm() function with a polynomial twist.

First we define the order of the curve fit, then perform cubic regression using lm(). From the lm_model object we can get the coefficients. For printing, we reverse their order so the highest power comes first, matching the cubic form y = a·x³ + b·x² + c·x + d. At last, we can print the coefficients with cat().

order <- 3

lm_model <- lm(y_values ~
  poly(x_values, order, raw = TRUE))

coefficients <- coef(lm_model)
coefficients <- coefficients[
  length(coefficients):1]

cat("Coefficients (a, b, c, d):\n\t",
  coefficients, "\n")

R: Trend: LM Model: Vector

This should output something along the lines of:

❯ Rscript 01-lm-vector.r
Coefficients (a, b, c, d):
         2 3 4 5 

It is so predictable, right? With the confidence of a stats professor grading on a curve.
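Predictable indeed: the ys3 series was generated from y = 2x³ + 3x² + 4x + 5, so the fit recovers the coefficients exactly. A quick sanity check at x = 1: 2 + 3 + 4 + 5 = 14, which matches the second data point.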

Need something more interactive? Check the JupyterLab version:

Starting with hardcoded vectors helps us test the waters. We build intuition without file I/O headaches.

Reading from CSV

Let’s switch gears from hardcoded vectors to file reading. Because eventually, all our precious data ends up in CSVs, the duct tape of data science. We can use the built-in read.csv function to read data from a CSV file.

We need to extract x values and y values from the data frame.

data <- read.csv("series.csv")

x_values <- data$xs
y_values <- data$ys3

R: Trend: LM Model: CSV
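To double-check what we loaded before modeling, a quick peek never hurts:

# Inspect the first rows and the structure of the data frame
head(data, 3)
str(data)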

Same data, different delivery method. It’s like ordering the same meal, dine-in versus takeaway.

Interactive version here:

Real-world datasets rarely arrive as perfectly typed vectors. CSVs are the bridge between the wild data world and our cozy R environment.

Using Readr

Now for the fancier way, with the readr library. We upgrade from public transport to a private data limousine. It’s faster, more flexible, and supports better metadata handling.

First we load the required readr library. Then we read the data from the CSV file into a data frame, and create variable shortcuts by extracting the x values and y values.

library(readr)

data <- read_csv(
  "series.csv",
  show_col_types = FALSE)

column_spec <- spec(data)

x_values <- data$xs
y_values <- data$ys3

R: Trend: LM Model: readr

We can retrieve the column specifications, and print them if we need to inspect them. The spec() function helps us peek into how R interprets each column, no surprises allowed.
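For instance, printing the stored specification shows how each column was parsed; a minimal sketch, with the output assumed for this all-numeric CSV:

# Print the column specification captured by spec()
print(column_spec)
#> cols(
#>   xs = col_double(),
#>   ys1 = col_double(),
#>   ys2 = col_double(),
#>   ys3 = col_double()
#> )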

JupyterLab version available here:

In large or complex projects, readr improves data loading reliability and error messages. More stats, less drama.

Different Order of LM

Why settle for one flavor of linear modeling, when we can sample the whole polynomial buffet? We could repeat the code above for each order, or we can make it simpler.

Let’s write a reusable function so we don’t feel like a broken record.

This function performs linear regression using lm(). It also defines named vectors that map order numbers to curve types. We get the coefficients and reverse their order to match the equations above, and finally print the result.

calc_coeff <- function(x_values, y_values, order) {
  lm_model <- lm(y_values ~ 
    poly(x_values, order, raw = TRUE))

  coeff_text <- c(
    "(a, b)" = 1, "(a, b, c)" = 2, "(a, b, c, d)" = 3)
  order_text <- c(
    "Linear" = 1, "Quadratic" = 2, "Cubic" = 3)

  cat(paste("Using lm_model :",
    names(order_text)[order], "\n"))

  coefficients <- coef(lm_model)
  coefficients <- coefficients[
    length(coefficients):1]

  cat("Coefficients ",
    names(coeff_text)[order], ":\n\t",
    coefficients, "\n")

R: Trend: LM Model: Merge

This way we can calculate coefficients for different orders and for different series. Now let’s throw in some real data and try multiple models.

library(readr)
data <- read_csv(
  "series.csv",
  show_col_types = FALSE)

calc_coeff(data$xs, data$ys1, 1)
calc_coeff(data$xs, data$ys2, 2)
calc_coeff(data$xs, data$ys3, 3)

R: Trend: LM Model: Merge

Expected output:

❯ Rscript 04-lm-merge.r
Using lm_model : Linear 
Coefficients  (a, b) :
         4 5 
Using lm_model : Quadratic 
Coefficients  (a, b, c) :
         3 4 5 
Using lm_model : Cubic 
Coefficients  (a, b, c, d) :
         2 3 4 5 

R: Trend: LM Model: Merge

JupyterLab edition available here:

We can test different models and complexity levels with ease. This is critical when the true relationship is hiding behind layers of curve-fitting suspense. Note that the fits again recover the generating polynomials exactly: ys1 = 4x + 5, ys2 = 3x² + 4x + 5, and ys3 = 2x³ + 3x² + 4x + 5.

That wraps up the core machinery of linear modeling in R. We started with simple vectors, took a CSV detour, leveled up with readr, and built a flexible regression toolkit.

And remember, in statistics, it’s all about fitting in. Even if that means adding a few extra powers of x just to impress the plot.


Trend: Built-in Plot

When in doubt, plot it out.

Sometimes we do not need shiny visuals or fancy libraries. Base R's plotting functions may look like they were designed in the late ’90s, because they were. But they get the job done with minimal fuss. It is like the trusty wrench in a toolbox: not flashy, but reliable.

It is rather limited, but enough to get started with plotting.

Default Output

By default, R quietly saves the result to a file called Rplots.pdf. It is polite like that, but we might prefer PNG for easier embedding or sharing. Let us switch gears:

# Open PNG graphics device
png("11-lm-line.png", width = 800, height = 400)

Choosing our output format early saves the embarrassment of emailing a 4MB PDF when all we needed was a lightweight image for our blog post.

Linear Equation

Let us start with the basics: plotting the data points. No drama. No packages. Just pixels and points.

plot(
  x_values, y_values,
  pch = 16, col = "blue",
  xlab = "x", ylab = "y",
  main = "Straight line fitting")

R: Trend: Built-in Plot: Linear Equation

And we continue with lines, from precalculated plot values. The y values come from the regression line previously fitted by lm() into lm_model.

x_plot <- seq(
  min(x_values), max(x_values),
  length.out = 100)
y_plot <- predict(
  lm_model,
  newdata = data.frame(x_values = x_plot))

lines(x_plot, y_plot, col = "red")

R: Trend: Built-in Plot: Linear Equation

Then we communicate the visual result. For those of us who like clarity, or want to impress our thesis advisor, we add a legend:

legend("topright",
  legend = c("Data points", "Linear Equation"),
  col = c("blue", "red"),
  pch = c(16, NA), lty = c(NA, 1))

The plot result can be shown as follows:

R: Trend: Built-in Plot: Linear Equation

You can explore the interactive JupyterLab version here:

A regression without a plot is like a punchline without a joke. Visuals make trends obvious, and suspicious outliers even more so.

Straight Line

For those who believe in minimalism and shortcuts (i.e., statisticians during finals week), we can use abline() to draw the regression line directly from the model onto the plot, so we don’t have to generate the y_plot values manually.

abline(lm_model, col = "red")

The plot result can be shown as follows:

R: Trend: Built-in Plot: Linear Equation: Alternative

Interactive Jupyter Notebook:

abline() is like Ctrl+C for plotting regression lines. Quick. Dirty. Efficient. Use it when elegance is less important than speed. One caveat: abline() only draws straight lines, so it works for first-order fits only; for the polynomial fits below we are back to predict() and lines().

Quadratic Curve

What if the data curves a little? Like life, sometimes it is not linear. All we need to do is increase the polynomial order, and adjust the decorations accordingly.

order <- 2
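Since the rest of the script is unchanged, here is a minimal sketch of the full quadratic version, assembled from the same pieces we used for the straight line; the output file name is an assumption:

# Quadratic fit over the ys2 series, plotted with base R
data <- read.csv("series.csv")
x_values <- data$xs
y_values <- data$ys2

order <- 2
lm_model <- lm(y_values ~
  poly(x_values, order, raw = TRUE))

png("12-lm-quadratic.png", width = 800, height = 400)
plot(x_values, y_values,
  pch = 16, col = "blue",
  xlab = "x", ylab = "y",
  main = "Quadratic curve fitting")

x_plot <- seq(
  min(x_values), max(x_values),
  length.out = 100)
y_plot <- predict(
  lm_model,
  newdata = data.frame(x_values = x_plot))
lines(x_plot, y_plot, col = "red")

legend("topright",
  legend = c("Data points", "Quadratic Curve"),
  col = c("blue", "red"),
  pch = c(16, NA), lty = c(NA, 1))
dev.off()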

The plot result can be shown as follows:

R: Trend: Built-in Plot: Quadratic Curve

Interactive Jupyter Notebook:

Quadratic regression captures parabolic trends, essential when things speed up or slow down in a curve, like population growth or our anxiety curve before a deadline.

Cubic Curve

When linear and quadratic just do not cut it, cubic fits come to the rescue. More flexible. More wiggly. More impressive looking in presentations.

order <- 3

The plot result can be shown as follows:

R: Trend: Built-in Plot: Cubic Curve

You can obtain the interactive JupyterLab version at the following link:

Cubic fits can capture subtle turning points in data, though if the model starts oscillating wildly, it may be a cry for help. Statistically speaking, that’s called overfitting. Socially speaking, it’s called trying too hard.


Trend: ggplot2

Plotting with the elegance of a violin plot at a tuxedo gala.

When our data gets more expressive, and our plots need to level up from “quick sketch” to “conference-ready”, we turn to the ever-fancy ggplot2. But be warned: ggplot2 is not just a library, it is a grammar. A syntax ballet of layers and aesthetics.

This plotting library really does have a grammar of its own.

Linear Equation

Let us begin our first dance with a simple straight line.

Before we build the plot, we prepare the stage. We generate prediction values from our linear model, and organize both raw data and predicted values into proper data.frame structures.

x_plot <- seq(
  min(x_values), max(x_values),
  length.out = 100)
y_plot <- predict(
  lm_model,
  newdata = data.frame(x_values = x_plot))

data <- data.frame(x = x_values, y = y_values)

R: Trend: ggplot2: Linear Equation

Now the real show begins. Plot using ggplot2. We assemble plot components piece by piece, using the + operator. As you can see, there are a lot of plus signs here. It is like one object stacked on another, all in one ggplot2 figure. The statistical version of LEGO bricks.

library(ggplot2)

plot <- ggplot(data, aes(x = x, y = y)) +
  geom_point(aes(color="Data Points"), size = 0.5) +
  geom_line(
    data = data.frame(x = x_plot, y = y_plot),
    aes(x, y, color="Linear Equation"),
    linewidth = 0.2) +
  labs(
    x = "x", y = "y",
    title = "Straight line fitting") +
  theme_minimal() +
  theme(legend.position = "right",
        legend.text = element_text(size = 2),
        text = element_text(size = 4)) +
  scale_color_manual(
    name = "Plot",
    breaks = c(
      "Data Points",
      "Linear Equation"),
    values = c(
      "Data Points"="red",
      "Linear Equation"="black")) +
  guides(
    color = guide_legend(
      override.aes = list(
        shape = c(16, NA), linetype = c(0, 1)
    )))

R: Trend: ggplot2: Linear Equation

And finally, we save the plot as a PNG, using pixel-perfect dimensions fit for a blog, report, or that paper we are totally going to submit before the deadline.

# Save plot as PNG
ggsave("14-lm-gg-line.png",
  plot, width = 800, height = 400, units = "px")

R: Trend: ggplot2: Linear Equation

The plot result can be shown as follows:

R: Trend: ggplot2: Linear Equation

You can obtain the interactive JupyterLab version at the following link:

With ggplot2, our plot becomes both readable and customizable. It gives us control over every element, ideal when we want clarity without sacrificing style.

Quadratic Curve

Now, we raise the degree of complexity, literally.

By tweaking the data, increasing the model’s order, and adjusting the aesthetic elements just a tad, we can reuse our ggplot2 structure for a curvier scenario.

order <- 2
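Beyond the order, only a couple of lines actually change; a minimal sketch of the delta, assuming the ggplot2 template from the linear example stays in place:

# Refit with a quadratic model and regenerate the predicted curve
order <- 2
lm_model <- lm(y_values ~
  poly(x_values, order, raw = TRUE))
y_plot <- predict(
  lm_model,
  newdata = data.frame(x_values = x_plot))

# In the plotting template, rename the series label to match,
# e.g. "Linear Equation" becomes "Quadratic Curve"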

The plot result can be shown as follows:

R: Trend: ggplot2: Quadratic Curve

You can obtain the interactive JupyterLab version at the following link:

Real-world data often bends and twists. Quadratic fits help us capture those subtle curves, without spiraling into polynomial madness, yet.

Cubic Curve

And finally, for datasets with extra flair (or drama), we apply a cubic fit. Just as before, we update the order and reuse the same plotting template.

order <- 3

The plot result can be shown as follows:

R: Trend: ggplot2: Cubic Curve

You can obtain the interactive JupyterLab version at the following link:

Cubic fits let us catch turning points and inflection. Great for trends that change direction. But let’s not get carried away. Beyond cubic, it often stops being insight and starts being noise.

Easy peasy? Quite so, once we see ggplot2 for what it really is: not a plotting tool, but a grammar for visual storytelling. One where each + means and also, and each aes() is our secret decoder ring.


What’s the Next Chapter 🤔?

Our ggplot2 adventure has drawn a neat little line, or curve, to a temporary stop. But as statisticians, we know the story doesn’t end at a good plot.

Next, we dive into R’s model building, where models think in terms of classes and distributions. Yes, it’s time for some character development, where our data points stop being just dots, and start acting like they belong to something bigger.

Curious about regression’s more sociable cousin? The one that cares about categories and not just numbers? Then grab a fresh cup of coffee and head over to: [ Trend - Language - R - Part Two ].