Preface

Goal: Explore visualization in the R programming language with ggplot2, and put a collection of example plots at your fingertips.

We’re back at it with our beloved ggplot2. It is easy if we can embrace the grammar.

Like any well-behaved grammar, once we know the rules, the plots practically write themselves. Only here the commas are aesthetics, the clauses are geoms, and the punchlines are distributions.

Let’s continue building our visual intuition. The more we speak in plots, the more persuasive we sound in late-night data debates. And let’s be honest, some of us argue better with graphs than with words.

When you have a pet, you can write a poem about it.
When you have a data pet, visualization is the poem.

👉 Remember: ggplot2 is not just a plotting library. It’s a grammar for visual storytelling. Let’s make our data sing, or at least hum with statistical elegance.


Distribution

Let’s start with the classic, the normal distribution. Statisticians love it. Students fear it. Algorithms assume it. We may not always get normally distributed data in real life, but that does not stop us from dreaming in bell curves.

The dnorm() function gives us the y-values (probability densities) for each x on the standard normal distribution curve.

y <- dnorm(x)
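As a quick sanity check, here is a small sketch of what dnorm() returns at a few points. The names x_check, y_check, and peak are ours, used only for illustration.

```r
# dnorm() returns the height of the standard normal curve at each x
x_check <- c(-1, 0, 1)
y_check <- dnorm(x_check)

# The curve peaks at x = 0 with height 1 / sqrt(2 * pi)
peak <- 1 / sqrt(2 * pi)

# The curve is symmetric, so the heights at -1 and 1 are equal
print(round(y_check, 4))
print(round(peak, 4))
```

The values are densities, not probabilities: they describe the height of the curve, and only areas under it are probabilities.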

Normal Distribution

geom_line

We start by loading the essential gear (required libraries), generating x values across the curve, and calculating y with dnorm(). Then we wrap it all in a tidy data.frame, so ggplot2 can do what it does best: turn numbers into art.

library(ggplot2)

x <- seq(-5, 5, length.out = 1000)
y <- dnorm(x)

df <- data.frame(x = x, y = y)

Once we have our data frame, we build the plot. geom_line() does the drawing, and theme_minimal() keeps things as clean as our assumptions. We can add labels and save the output. This is how we go from raw probability to polished PNG.

plot <- ggplot(df, aes(x = x, y = y)) +
  geom_line(color = "black")

plot <- plot +
  theme_minimal() +
  theme(
    text = element_text(size = 4),
    panel.grid = element_blank()) + 

  labs(
    x = "x", y = "Density",
    title = paste0("Standard Normal ",
      "Distribution with Quantiles"))

ggsave("63-normal.png", plot,
  width = 800, height = 400, units = "px")

Normal Distribution with Quantiles

geom_area

We enhance the normal plot by showing quantiles. Why? Because sometimes we need to explain to others where “most of the data lives” without quoting the full standard deviation manifesto.

We define the percentiles, then let quantile() do the slicing.

percentiles <- c(25, 50, 75, 100)
quantiles <- quantile(x, probs = percentiles / 100)

R: Distribution: Quantiles
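One detail worth checking: quantile(x, ...) summarises the evenly spaced grid x, not the bell curve itself, so its cut points sit at regular steps along the axis. If we want the cut points of the normal distribution, base R's qnorm() is the theoretical counterpart. A small sketch comparing the two follows; the names probs, grid_cuts, and normal_cuts are illustrative, and the 100% level is left out because qnorm(1) is Inf.

```r
x <- seq(-5, 5, length.out = 1000)
probs <- c(0.25, 0.50, 0.75)

# Quantiles of the grid values: evenly spread between -5 and 5
grid_cuts <- quantile(x, probs = probs)

# Quantiles of the standard normal distribution itself
normal_cuts <- qnorm(probs)

print(grid_cuts)
print(round(normal_cuts, 4))
```

Either set of cut points can be passed to the shading loop; the choice only changes where the shaded regions end.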

Then we add shaded regions, one per percentile, to the plot grammar. This can be done with geom_area.

for (i in seq_along(quantiles)) {
  plot <- plot + geom_area(
    data = subset(df,x <= quantiles[i]),
    aes(x = x, y = y),
    fill = i, alpha = 0.3)
}

R: Distribution: Quantiles

The plot result can be shown as follows:

R: Distribution: Normal Distribution with Quantiles

👉 Want to play with it interactively?

Adding quantiles makes the plot speak, visually hinting where the data clusters. This is useful for storytelling, teaching, and winning arguments about whether someone’s test score is “really above average.”

Kurtosis

The pointiness of our curve

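With the dnorm() function, we can sketch curves with different peakedness and, later, different skewness.

We explore different levels of kurtosis. Technically it is about tails and peaks, but informally, it tells us how extreme our data feels, like comparing drama queens to stoics in distribution form. Strictly speaking, every normal curve has an excess kurtosis of 0, so changing sd below only changes the spread and the peak height. We use it here as a visual stand-in for flatter and pointier shapes.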

Let’s make examples of distributions with different levels of kurtosis. We generate standard, flat, and pointy curves.

  1. Standard normal distribution (Kurtosis = 0)
  2. Lower kurtosis
  3. Higher kurtosis

y_standard <- dnorm(x)
df_standard <- data.frame(x = x, y = y_standard)

y_kurtosis_1 <- dnorm(x, mean = 1, sd = 1)
y_kurtosis_2 <- dnorm(x, mean = 1, sd = 0.5)
y_kurtosis_3 <- dnorm(x, mean = 1, sd = 2)

df_kurtosis_1 <- data.frame(x = x, y = y_kurtosis_1)
df_kurtosis_2 <- data.frame(x = x, y = y_kurtosis_2)
df_kurtosis_3 <- data.frame(x = x, y = y_kurtosis_3)

R: Distribution: Kurtosis
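To see what the numbers say, here is a minimal sketch that approximates excess kurtosis from a density evaluated on a grid. The helper excess_kurtosis() and the wider grid are our own assumptions for illustration, not part of the original script. The logistic curve from dlogis() is included only as a heavier-tailed comparison.

```r
# Approximate excess kurtosis of a density sampled on an even grid
excess_kurtosis <- function(x, y) {
  w <- y / sum(y)                 # normalise heights into weights
  mu <- sum(w * x)                # mean
  m2 <- sum(w * (x - mu)^2)       # variance
  m4 <- sum(w * (x - mu)^4)       # fourth central moment
  m4 / m2^2 - 3
}

# A wide grid so the tails are not cut off
grid <- seq(-40, 40, length.out = 40001)

k_narrow <- excess_kurtosis(grid, dnorm(grid, mean = 1, sd = 0.5))
k_wide <- excess_kurtosis(grid, dnorm(grid, mean = 1, sd = 2))
k_heavy <- excess_kurtosis(grid, dlogis(grid))

print(round(c(narrow = k_narrow, wide = k_wide, heavy = k_heavy), 3))
```

Both normal curves come out at roughly 0 no matter the sd, while the logistic curve sits near 1.2. A taller peak alone does not mean higher kurtosis.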

Now we draw them by adding one geom_line layer per level of kurtosis to the plot grammar. Each line shows how the same mean can hide wildly different stories.

plot <- ggplot() +
  # Solid black reference curve: the standard normal
  geom_line(data = df_standard, color = "black",
    aes(x = x, y = y), linewidth = 0.2) +
  # Dashed curves: same mean (1), different sd
  geom_line(data = df_kurtosis_1,
    aes(x = x, y = y), color = "red",
    linetype = "dashed", linewidth = 0.2) +
  geom_line(data = df_kurtosis_2,
    aes(x = x, y = y), color = "green",
    linetype = "dashed", linewidth = 0.2) +
  geom_line(data = df_kurtosis_3,
    aes(x = x, y = y), color = "blue",
    linetype = "dashed", linewidth = 0.2) +
  labs(x = "x", y = "Density",
    title = paste0("Normal Distribution ",
      "with Different Kurtosis")) +
  # This scale only takes effect when linetype is mapped inside aes()
  scale_linetype_manual(
    values = c("solid", "dashed", "dashed", "dashed"),
    labels = c(
      "Standard Normal", "Standard Kurtosis = 0",
      "Lower Kurtosis", "Higher Kurtosis")) +
  theme_minimal() +
  theme(
    text = element_text(size = 4))

R: Distribution: Kurtosis

The plot result can be shown as follows:

R: Distribution: Kurtosis
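If we want a legend, the labels need to live in the data. A hedged sketch of one way to do it: stack the curves into one long data frame and map color and linetype inside aes(). The column name curve, the object names, and the sd-based labels are ours, chosen for illustration.

```r
library(ggplot2)

x <- seq(-5, 5, length.out = 1000)

# Stack the curves into one long data frame with a label column
df_curves <- rbind(
  data.frame(x = x, y = dnorm(x), curve = "Standard Normal"),
  data.frame(x = x, y = dnorm(x, mean = 1, sd = 0.5), curve = "sd = 0.5"),
  data.frame(x = x, y = dnorm(x, mean = 1, sd = 2), curve = "sd = 2")
)

# Mapping color and linetype inside aes() is what produces a legend
plot_legend <- ggplot(df_curves,
    aes(x = x, y = y, color = curve, linetype = curve)) +
  geom_line(linewidth = 0.2) +
  scale_linetype_manual(values = c(
    "Standard Normal" = "solid",
    "sd = 0.5" = "dashed",
    "sd = 2" = "dashed")) +
  labs(x = "x", y = "Density") +
  theme_minimal()
```

With the label mapped as an aesthetic, scale_linetype_manual() has something to act on, and ggplot2 builds the legend on its own.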

👉 Want to run the numbers live?

Why care about kurtosis? Because reality is messy. Sometimes data behaves like a chill Gaussian. Sometimes it spikes like a stressed-out grad student on deadline week.

Skewness

When data leans—left, right, or dramatically off-center

Now let’s shift our focus to skewness. It tells us whether our data leans like a suspicious p-hacking attempt.

Let’s make examples of distributions with different skewness parameters. We simulate several skewness levels: negative, moderate, and highly positive.

  1. Negative skewness
  2. Moderate positive skewness
  3. High positive skewness

y_standard <- dnorm(x)
df_standard <- data.frame(x = x, y = y_standard)

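# Skew-normal density: 2 * dnorm(x) * pnorm(alpha * x),
# with the shape alpha taken from the legend labels (-4, 2, 6)
y_skewed_1 <- 2 * dnorm(x) * pnorm(-4 * x)
y_skewed_2 <- 2 * dnorm(x) * pnorm(2 * x)
y_skewed_3 <- 2 * dnorm(x) * pnorm(6 * x)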

df_skewed_1 <- data.frame(x = x, y = y_skewed_1)
df_skewed_2 <- data.frame(x = x, y = y_skewed_2)
df_skewed_3 <- data.frame(x = x, y = y_skewed_3)

R: Distribution: Skewness
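A skew-normal curve can be written as 2 * dnorm(x) * pnorm(alpha * x), where the shape parameter alpha controls the lean. The sketch below wraps that formula in a helper and checks two things numerically: the curve still encloses an area of 1, and the mean moves in the direction of the skew. The names dskew, area_neg, area_pos, and mean_pos are hypothetical.

```r
# Skew-normal density: alpha > 0 gives positive (right) skew,
# alpha < 0 gives negative (left) skew, alpha = 0 is plain dnorm()
dskew <- function(x, alpha) {
  2 * dnorm(x) * pnorm(alpha * x)
}

# A valid density integrates to 1, whatever alpha is
area_neg <- integrate(dskew, -Inf, Inf, alpha = -4)$value
area_pos <- integrate(dskew, -Inf, Inf, alpha = 6)$value

# The mean shifts toward the side with the longer tail
mean_pos <- integrate(function(x) x * dskew(x, 6), -Inf, Inf)$value

print(round(c(area_neg, area_pos, mean_pos), 4))
```

Multiplying a density by a constant such as 2 would break the area check, which makes this a handy guard when experimenting with custom curves.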

Then we draw them in context by adding one geom_line layer per skewed distribution to the plot grammar. When done right, the plot reveals which side the data whispers its secrets to.

plot <- ggplot() +
  geom_line(data = df_standard, color = "black",
    aes(x = x, y = y), linewidth = 0.2) +
  geom_line(data = df_skewed_1,
    aes(x = x, y = y), color = "red",
    linetype = "dashed", linewidth = 0.2) +
  geom_line(data = df_skewed_2,
    aes(x = x, y = y), color = "green",
    linetype = "dashed", linewidth = 0.2) +
  geom_line(data = df_skewed_3,
    aes(x = x, y = y), color = "blue",
    linetype = "dashed", linewidth = 0.2) +
  labs(x = "x", y = "Density",
    title = "Normal Distribution with Different Skewness") +
  scale_linetype_manual(
    values = c("solid", "dashed", "dashed", "dashed"),
    labels = c(
      "Standard Normal",
      "Negative Skewness = -4",
      "Moderate Positive Skewness = 2",
      "High Positive Skewness = 6")) +
  theme_minimal() +
  theme(
    text = element_text(size = 4))

R: Distribution: Skewness

The plot result can be shown as follows:

R: Distribution: Skewness

👉 Try bending the curve yourself:

Understanding skewness helps us avoid misleading averages. Sometimes the “average” income includes Jeff Bezos, and that shifts everything right.


Trend: Multiple

When in doubt, regress everything. Then doubt the regression.

Sometimes, we need to show multiple trends at once. Maybe to compare patterns, maybe to confuse our boss. Either way, ggplot2 lets us put several series in one plot, or spread them neatly across a grid. It’s like choosing between a dinner buffet or a tasting menu.

Geom Smooth

Linear model with standard errors: the tuxedo of trendlines.

Why manually calculate a regression when geom_smooth() can do the heavy lifting and wear a standard error cloak while at it? Think of it as our personal assistant for drawing statistically proper lines.

Let’s begin with a scatter plot using geom_point(), then let geom_smooth() do its thing, plotting linear fits for ys1, ys2, and ys3.

plot <- ggplot(data, aes(x = xs)) +
  # Scatter points for the first series
  geom_point(
    aes(x = xs, y = ys1),
    size = 0.5, color = "firebrick") +
  # Linear fit with a standard error band
  geom_smooth(
    aes(x = xs, y = ys1), method = "lm",
    se = TRUE, color = "firebrick",
    linewidth = 0.2) +
  ...

R: Trend: Multiple: geom smooth

To make it visually elegant (and slightly hipster), we can apply the solarized theme from the ggthemes library. Plots deserve style too.

  labs(x = "x", y = "y",
    title = "Scatter Plot with Regression Lines") +
  theme_solarized() +
  scale_color_solarized() +
  theme(
    text = element_text(size = 4))

R: Trend: Multiple: geom smooth

Here’s the final output, where our data points are loud and the regression lines are proud:

R: Trend: Multiple: Geom Smooth
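For readers who want something to paste and run, here is a self-contained sketch of the same idea. The data frame is synthetic and the colors for the second and third series are our own picks; only the xs, ys1, ys2, ys3 column names follow the text. theme_minimal() stands in for the solarized theme so that ggthemes is not required.

```r
library(ggplot2)

# Hypothetical data: three noisy linear series sharing one x axis
set.seed(42)
xs <- 0:12
data <- data.frame(
  xs = xs,
  ys1 = 5 + 2.0 * xs + rnorm(length(xs), sd = 2),
  ys2 = 40 - 1.5 * xs + rnorm(length(xs), sd = 2),
  ys3 = 20 + 0.5 * xs + rnorm(length(xs), sd = 2)
)

# One point layer and one lm smoother per series
plot_trend <- ggplot(data, aes(x = xs)) +
  geom_point(aes(y = ys1), size = 0.5, color = "firebrick") +
  geom_smooth(aes(y = ys1), method = "lm", formula = y ~ x,
    se = TRUE, color = "firebrick", linewidth = 0.2) +
  geom_point(aes(y = ys2), size = 0.5, color = "steelblue") +
  geom_smooth(aes(y = ys2), method = "lm", formula = y ~ x,
    se = TRUE, color = "steelblue", linewidth = 0.2) +
  geom_point(aes(y = ys3), size = 0.5, color = "darkgreen") +
  geom_smooth(aes(y = ys3), method = "lm", formula = y ~ x,
    se = TRUE, color = "darkgreen", linewidth = 0.2) +
  labs(x = "x", y = "y",
    title = "Scatter Plot with Regression Lines") +
  theme_minimal()

# The same straight line, fitted by hand with lm()
slope_ys1 <- unname(coef(lm(ys1 ~ xs, data = data))[2])
```

Passing formula = y ~ x explicitly keeps geom_smooth() from printing its message about the default formula.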

Get interactive and tinker in JupyterLab:

These plots help us compare multiple trend patterns on a single canvas. Regression isn’t just prediction, it’s persuasion in visual form.

Grid Extra

Not all trends get along on the same y-axis.

Sometimes plotting everything together makes the trends look like they’re in a statistical street fight. When axes clash or y-axis scales diverge, it’s time to separate the series using gridExtra.

Here, we use gridExtra to arrange the individual trend plots side by side:

grid_plot <- grid.arrange(
  plot_y1, plot_y2, plot_y3, ncol = 3)

The result is a civilized separation of trends, each plot gets its own y-axis, no need to share or compromise:

R: Trend: Multiple: grid extra

The plot result can be shown as follows:

R: Trend: Multiple: Grid Extra
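A runnable sketch of the grid layout follows, under a few assumptions of ours: the data frame is synthetic, make_panel() is a hypothetical helper, and arrangeGrob() is used in place of grid.arrange() because it builds the same layout without drawing it right away.

```r
library(ggplot2)
library(gridExtra)

# Hypothetical series on very different y scales
df <- data.frame(xs = 1:10)
df$ys1 <- df$xs * 2 + sin(df$xs)
df$ys2 <- df$xs^2 * 10
df$ys3 <- 1000 - df$xs * 50 + 20 * cos(df$xs)

# Helper (our own) that builds one scatter + lm panel per series
make_panel <- function(column, color) {
  ggplot(df, aes(x = xs, y = .data[[column]])) +
    geom_point(size = 0.5, color = color) +
    geom_smooth(method = "lm", formula = y ~ x,
      color = color, linewidth = 0.2) +
    labs(x = "x", y = column) +
    theme_minimal()
}

plot_y1 <- make_panel("ys1", "firebrick")
plot_y2 <- make_panel("ys2", "steelblue")
plot_y3 <- make_panel("ys3", "darkgreen")

# Three panels in one row, each with its own y axis
grid_plot <- arrangeGrob(plot_y1, plot_y2, plot_y3, ncol = 3)
```

The resulting object can be drawn with grid::grid.draw(grid_plot) or handed to ggsave() as the plot argument.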

Explore it yourself in JupyterLab:

Multiple trends in one frame are useful, until the y-axis scale distorts the story. Grid arrangements help us show the data fairly, without statistical squabbling.


Statistic Properties: One Axis Plot

Three series in one axis plot

Three series walk into a plot… and immediately argue about the median.

Before we start visualizing, let’s remember: statistics are like cats (cat the pet, not the Linux command). They don’t always behave the way we expect, but with the right box (or violin), we can get them to show their shape.

In this section, we wrangle multiple y-series, and explore ways to compare their distributions using one axis. This is our statistical group therapy session, where each series gets a chance to express itself.

As the earlier look at statistical properties showed, we can analyze the data for each series. For example, we can consider just one y-series and obtain its mean, median, and mode, along with the minimum, maximum, range, and quantiles.
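As a refresher, those properties can be collected in a few lines of base R. The vector ys below is made-up data, and the mode is computed by counting values because base R has no built-in statistical mode function.

```r
# Hypothetical y-series; in practice this is one column of the data
ys <- c(5, 9, 12, 12, 17, 21, 24, 30, 30, 30, 41)

# Count occurrences and pick the most frequent value as the mode
counts <- table(ys)
ys_mode <- as.numeric(names(counts)[which.max(counts)])

stats <- list(
  mean = mean(ys),
  median = median(ys),
  mode = ys_mode,
  min = min(ys),
  max = max(ys),
  range = max(ys) - min(ys),
  quantiles = quantile(ys, probs = c(0.25, 0.50, 0.75))
)

str(stats)
```

These are the same numbers the plots in this section encode visually: the box plot shows the median and quartiles, and the jittered points show every raw value.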

Long Format

Melt

We melt the data, not the analyst.

When plotting multiple y-series in one go, we need to reshape the data from wide to long. Why? Because ggplot2 prefers tidy data: each row is one observation, each column one variable. Wide format is for spreadsheets, long format is for plots and parties.

We use gather() from the tidyr package. Think of it as inviting all your series to one common table.

series_longer <- series %>%
  gather(key = "y", value = "value", -xs)

R: Statistic Properties: Long Format

You can check the result by printing the reshaped series.
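Here is a tiny worked example of the reshape, using a made-up wide table named series with the same column layout. It also shows pivot_longer(), the newer tidyr function for the same job, as an alternative and not as a claim about what the original script uses.

```r
library(tidyr)

# Hypothetical wide data: one x column and three y-series
series <- data.frame(
  xs = 1:3,
  ys1 = c(10, 20, 30),
  ys2 = c(11, 21, 31),
  ys3 = c(12, 22, 32)
)

# Same call as in the text: every column except xs is stacked
series_longer <- series %>%
  gather(key = "y", value = "value", -xs)

# pivot_longer() expresses the same reshape with named arguments
series_pivot <- series %>%
  pivot_longer(cols = -xs, names_to = "y", values_to = "value")

print(series_longer)
```

Three rows and three y-columns become nine rows, one per (xs, series) pair. The series name lands in the y column and the number in the value column.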

Melting lets us treat all y-series equally. Ideal for side-by-side comparison in single-axis plots.

Box Plot

The box plot: where medians get the spotlight, and outliers are gently shamed.

The classic box plot summarizes data with quartiles and outliers. It’s like a résumé for our series: here’s the median, here’s the range, and oh, those dots? We don’t talk about those.

The most common way to visualize this is the box plot. We can utilize geom_boxplot to get the plot.

plot <- ggplot(
    series_longer,
    aes(x = y, y = value, fill = y)) +
  geom_boxplot(color = "black", linewidth= 0.2) +
  ...

R: Statistic Properties: Box Plot

Let’s use custom colors for this example.

  scale_fill_manual(values = soft_colors) +

The plot result can be shown as follows:

R: Statistic Properties: One Axis: Box Plot
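Putting the pieces together, a complete box plot might look like the sketch below. The long-format data is simulated, and the soft_colors palette is an illustrative guess, since the original color values are not shown in the text.

```r
library(ggplot2)

# Hypothetical long-format data with the same columns as series_longer
set.seed(1)
series_longer <- data.frame(
  y = rep(c("ys1", "ys2", "ys3"), each = 30),
  value = c(rnorm(30, 100, 10), rnorm(30, 150, 25), rnorm(30, 120, 5))
)

# Illustrative palette, keyed by series name
soft_colors <- c(ys1 = "#AEC6CF", ys2 = "#FFB347", ys3 = "#B39EB5")

plot_box <- ggplot(series_longer,
    aes(x = y, y = value, fill = y)) +
  geom_boxplot(color = "black", linewidth = 0.2) +
  scale_fill_manual(values = soft_colors) +
  labs(x = "Series", y = "Value") +
  theme_minimal()

# Minimum, quartiles, and maximum per series, as plain numbers
quartiles <- tapply(series_longer$value, series_longer$y, quantile)
```

Naming the palette entries after the series keeps each color tied to its series even if the factor order changes.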

You can open the interactive JupyterLab notebook at the following link:

Box plots are the go-to for comparing distributions. Quick, informative, and a favorite of statisticians who want to look serious.

Violin Plot

When a box plot learns to dance.

Box plots are informative, but sometimes we want more flair. Enter the violin plot: a combination of box plot and kernel density. Now we get to see the shape of the data, not just its summary.

A richer way to visualize this is the violin plot. We can use geom_violin to get it.

plot <- ggplot(
    series_longer,
    aes(x = y, y = value, fill = y)) +
  geom_violin(color = "black", linewidth= 0.2) +
 ...

R: Statistic Properties: Violin Plot

The plot result can be shown as follows:

R: Statistic Properties: One Axis: Violin Plot

You can open the interactive JupyterLab notebook at the following link:

Violin plots show the full distribution. Great for spotting multimodal patterns or hidden spikes. Like a box plot with emotional range.

Swarm Plot

Dot by dot, truth emerges.

Sometimes we just want to see all the raw values, without summarizing them away. The swarm plot (via jitter in geom_point) lets each data point speak for itself. No data left behind.

This leaves us with other options, such as the swarm plot and the strip plot. We can get a swarm plot by using jitter inside geom_point.

plot <- ggplot(
    series_longer,
    aes(x = y, y = value, color = y)) +
  geom_point(
    position = position_jitterdodge(
      jitter.width = 0.3, jitter.height = 0),
    size = 0.5) +
  ...

R: Statistic Properties: Swarm Plot

And here it is. The plot result can be shown as follows:

R: Statistic Properties: One Axis: Swarm Plot

You can open the interactive JupyterLab notebook at the following link:

Swarm plots are perfect when sample sizes are small, or distributions are non-traditional. Plus, it just looks cool.

Strip Plot

Minimalism, but make it jitter.

Strip plots are similar to swarm plots but even simpler. Here, we use geom_jitter() to spread out points horizontally, so they don’t overlap like penguins in a pile.

We can get strip plot using geom_jitter.

plot <- ggplot(
    series_longer,
    aes(x = y, y = value, color = y)) +
  geom_jitter(
    width = 0.3, height = 0, size = 0.5) +
  ...

R: Statistic Properties: Strip Plot

The plot result can be shown as follows:

R: Statistic Properties: One Axis: Strip Plot

You can open the interactive JupyterLab notebook at the following link:

Strip plots keep it simple. Good for exploratory views, and for when we want to embrace a bit of chaos but still understand the story.


Statistic Properties: Distribution

Let there be frequencies, and let them plot themselves.

We’ve met our y-series in various boxy, violiny, and jittery forms. Now it’s time to ask: how do these values distribute themselves along the number line? Are they huddled near the mean? Wandering wildly like unsupervised p-values?

We’ll explore that with some of our favorite tools for visualizing distributions. Just like the previous four plots, we analyse the y-values, but this time by the frequency of each series.

KDE Plot

Kernel Density Estimation

The KDE plot gives us a smoothed-out version of the histogram. It estimates the probability density function of the variable. Which sounds fancy, but really just means: “Here’s where the data likes to hang out.”

And thanks to geom_density, we don’t even need to bring calculus to the party.

A KDE shows the shape of the frequency distribution well. This complex task can be done easily with geom_density.

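# Category is the key column of the long-format data
# (the gather() call earlier names that column y)
plot <- ggplot(
    series_longer,
    aes(x = value, fill = Category)) +
  geom_density(alpha = 0.7, color = NA) +
  ...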

R: Statistic Properties: KDE Plot

And here’s the resulting plot:

R: Statistic Properties: Distribution: KDE Plot
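To get a feel for the numbers behind a KDE curve, base R's density() can serve as a rough stand-in. The sample below is simulated, and the names values, kde, and area are ours.

```r
# Hypothetical sample with two clusters
set.seed(3)
values <- c(rnorm(100, mean = 100, sd = 15),
            rnorm(100, mean = 200, sd = 20))

# density() returns a grid of x positions and estimated densities
kde <- density(values)

# A density curve should enclose an area of roughly 1
dx <- diff(kde$x[1:2])
area <- sum(kde$y) * dx

print(round(area, 3))
```

The y-axis of a KDE plot is therefore a density, not a count: tall peaks mark where values are concentrated, and the total area stays close to 1 however many observations there are.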

You can open the interactive JupyterLab notebook at the following link:

KDE plots help us compare shapes of distributions without worrying about bin size. They give a smooth overview that highlights trends, like a box plot with a PhD in smoothness.

Rug Plot

The rug plot is about minimalism. It doesn’t summarize, smooth, or bin. It simply places a tiny tick mark for each observation. Elegant. Humble. Underrated.

We can also simply show the rug plot using geom_rug.

plot <- ggplot(
    series_longer,
    # Rug ticks are line segments, so they take color rather than fill
    aes(x = value, color = Category)) +
  geom_rug(alpha = 0.5, sides = "b") +
  ...

R: Statistic Properties: Rug Plot

And here’s the tick parade:

R: Statistic Properties: Distribution: Rug Plot

You can open the interactive JupyterLab notebook at the following link:

Rug plots show the raw locations of all data points. Great for spotting clustering, gaps, or if our data has secretly turned binary when we weren’t looking.

Histogram

The classic.

Histograms are still one of the clearest ways to show frequency distributions. But geom_histogram is more than a basic histogram for beginners. In R, geom_histogram lets us tweak bins, colors, and aesthetics to make our plots look good and tell the truth. A rare combo.

plot <- ggplot(
    series_longer,
    aes(x = value, fill = Category)) +
  geom_histogram(
    binwidth = 50, linewidth = 0.2,
    alpha = 0.7, color = "black") +
  ...

R: Statistic Properties: Histogram

And here’s the result:

R: Statistic Properties: Distribution: Histogram

You can open the interactive JupyterLab notebook at the following link:

Histograms offer a quick glance at the frequency of values in intervals. Perfect for spotting skewness, symmetry, and when the data is just plain weird.


Statistic Properties: Marginal

At this point, we have sliced and diced our data in every direction, except the literal margins. Time to fix that. Marginal plots allow us to peek at each axis on its own turf. Think of it as giving the x and y axes their solo performance, with ggMarginal from the ggExtra library as our stage manager.

We can analyse each axis on its own, right along its margin, using ggMarginal from the ggExtra library.

Density Example

We start with a classic 2D plot. So far so good.

For example, we can add marginal density plots. Let’s start with the usual plot.

R: Statistic Properties: Marginal: Density
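The code for the base plot is not reproduced in the text, so here is a hypothetical sketch of what such a starting point could look like. The data frame df_xy and its columns are simulated; only the object name p follows the ggMarginal calls in this section.

```r
library(ggplot2)

# Hypothetical data standing in for the series used in this section
set.seed(7)
df_xy <- data.frame(xs = rnorm(200, mean = 50, sd = 10))
df_xy$ys <- 2 * df_xy$xs + rnorm(200, sd = 15)

# A plain scatter plot is the starting point for marginal plots
p <- ggplot(df_xy, aes(x = xs, y = ys)) +
  geom_point(size = 0.5, color = "steelblue") +
  labs(x = "x", y = "y") +
  theme_minimal()
```

Any scatter plot built with geom_point can play the role of p, and the marginal plots are then computed from the same x and y values.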

Now we let each axis whisper its own secrets, by adding marginal density plots. This gives a quick glimpse of distribution on both axes, without adding clutter to the main scene.

p_with_margins <- ggMarginal(
  p, type = "density", linewidth = 0.2)

R: Statistic Properties: Marginal: Density

The final result is a polished plot that combines the big picture with fine details:

R: Statistic Properties: Marginal:

Explore the interactive version here:

Marginal plots help us connect the dots between joint and marginal distributions. In other words, we see not just where the points are, but how they behave on each axis. For statisticians, that’s like switching from grayscale to technicolor.

Histogram Example

We’re not married to density. Let’s switch it up and try histograms instead.

We can use marginal histograms to get a more “binned” perspective. This is the histogram’s time to shine—on the sidelines.

p_with_margins <- ggMarginal(
  p, type = "histogram",
  color = "black", fill = alpha("#FFD700", 0.1),
  linewidth = 0.1)

R: Statistic Properties: Marginal: Histogram

Here’s how the final result looks:

R: Statistic Properties: Marginal:

And the interactive version is ready for your curiosity:

Sometimes we want to see the raw frequency count. Histograms add structure to the axis-wise distribution, making it easier to spot skew, outliers, or whether the data suffers from “the dreaded left shoulder syndrome.”


What’s the Next Chapter 🤔?

So far, we’ve explored statistical properties with R, sliced data from every angle, and smoothed it like a well-behaved normal curve. All in all, not bad for a plotting party hosted by statisticians.

Numbers deserve more than being left alone in a spreadsheet. We’ve shown how to visualize data distributions in practical, meaningful ways. From box plots to violin plots to marginal density overlays, we now have tools that not only inform, but also impress during awkward coffee break conversations.

But where do we go from here?

While Python and R are the old reliable friends of the statistical world, the programming party doesn’t end there. There’s Julia: fast, modern, and already practicing its keynote speech for the future. Then there’s TypeScript and Go, bridging the world of analysis with scalable applications that refuse to crash under pressure.

Learning statistical visualization across multiple languages means we are not just tool users. We are toolmakers. We future-proof our workflows and avoid the classic trap: “I have this great insight, but it only runs on my laptop.”

So if you’re curious what comes next, consider peeking into: 👉 [ Trend - Language - Julia - Part One ].

In data science, like in life, flexibility is the best regression line. It’s time to see how the stats game changes when we switch gears.