Article Series

This article series discusses more than 30 different programming languages. Please read the overview before you read any of the details.

Preface

Goal: A practical case to collect unique record fields using GNU R.

GNU R has a different beauty compared with Julia. We can simply utilize a dataframe to handle records.

We have to switch our paradigm to the dataframe, instead of a general coding approach. Hence we move closer to the statistical approach that GNU R is made for.

Reference Reading

Source Examples

You can obtain source examples here:


Common Use Case

Task: Get the unique tag string

Please read the overview for more detail.

Prepopulated Data

Songs and Poetry

library(tibble)

tibble(
  title = "Cantaloupe Island",
  tags  = list(list("60s", "jazz"))
) %>% add_row(
  title = "Let It Be",
  tags  = list(list("60s", "rock"))
) %>% add_row(
  title = "Knockin' on Heaven's Door",
  tags  = list(list("70s", "rock"))
) %>% add_row(
  title = "Emotion",
  tags  = list(list("70s", "pop"))
) %>% add_row(
  title = "The River"
) -> dataframe

GNU R Solution

The Answer

There might be many ways to do things in GNU R. One of them is the one-liner below:

library(tibble)
library(purrr)

load("songs.RData")

(dataframe %>% 
  dplyr::filter(.data$tags != "NULL")
)$tags %>% flatten %>% unique %>% paste

Enough with the introduction. At this point we should go straight to coding.

Environment

No special setup is needed. Just run and voila..!

You can utilize RStudio. But for the simple case in this article, any text editor is sufficient.

An optional package called purrr needs to be installed manually.


1: Data Structure

We are going to use lists throughout this article.

Simple List

Before building a struct, I begin with a simple list.

tags <- list("rock", "jazz", "rock", "pop", "pop")

paste(tags)

It is easy to dump a variable in GNU R using paste. The result is similar to the string below:

$ Rscript 01-tags.r
[1] "rock" "jazz" "rock" "pop"  "pop" 

While print shows the complete structure of the data, paste just converts each element to a string and returns the result as a character vector.
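To make that difference concrete, here is a minimal sketch using only base R. The names as_text and joined are illustrative, not part of the original scripts.

```r
tags <- list("rock", "jazz", "rock", "pop", "pop")

# paste() coerces every list element to character,
# so the list structure is gone and a character vector remains.
as_text <- paste(tags)
is.character(as_text)

# With collapse, the pieces are joined into one single string.
joined <- paste(tags, collapse = ", ")
print(joined)
```

This is why paste is handy for a quick dump, while print is the better choice when the structure itself matters.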

The Record Structure

We can build the record structure using a dataframe. The convenient way to add record-like data is by using tibble. First we set up two columns, title and tags, as shown in the code below:

library(tibble)

dataframe <- tibble(
  title = "Cantaloupe Island",
  tags  = list(list("60s", "jazz"))
)

print(dataframe)

With the result similar to the dataframe below:

$ Rscript 02-dataframe.r
# A tibble: 1 x 2
  title             tags      
  <chr>             <list>    
1 Cantaloupe Island <list [2]>

GNU R: Dataframe with only one song record

We can explore more about dataframe.

paste(head(dataframe))

paste(names(dataframe))

paste(dataframe["tags"])

paste(dataframe$tags)

With the result similar to the strings below:

[1] "Cantaloupe Island"             "list(list(\"60s\", \"jazz\"))"
[1] "title" "tags" 
[1] "list(list(\"60s\", \"jazz\"))"
[1] "list(\"60s\", \"jazz\")"

Issue with List in Rows

Why double list 🤔?

You might spot a strange declaration here:

dataframe <- tibble(
  title = "Cantaloupe Island",
  tags  = list(list("60s", "jazz"))
)

This double list is a workaround. If we declare as below:

dataframe <- tibble(
  title = "Cantaloupe Island",
  tags  = list("60s", "jazz")
)

Each element of the list is treated as one row and the title is recycled, so the list will be expanded into two rows of data.
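The row counts can be checked directly. This is a small sketch assuming the tibble package is installed. The names expanded and wrapped are made up for the comparison.

```r
library(tibble)

# Single list: each element becomes its own row, the title is recycled.
expanded <- tibble(
  title = "Cantaloupe Island",
  tags  = list("60s", "jazz")
)

# Double list: the outer list has one element, hence one row.
wrapped <- tibble(
  title = "Cantaloupe Island",
  tags  = list(list("60s", "jazz"))
)

print(nrow(expanded))
print(nrow(wrapped))
```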

Rows of Song Struct

Meet The Songs Rows

From just karaoke, we can go pro with recording studio. From simple data, we can build a structure to solve our task.

As has been said, the convenient way to add a row is by using tibble.

library(tibble)

dataframe <- tibble(
  title = "Cantaloupe Island",
  tags  = list(list("60s", "jazz"))
)

dataframe <- add_row(dataframe,
  title = "Let It Be",
  tags  = list(list("60s", "rock"))
)

dataframe <- dataframe %>% add_row(
  title = "Knockin' on Heaven's Door",
  tags  = list(list("70s", "rock"))
)

add_row(dataframe,
  title = "Emotion",
  tags  = list(list("70s", "pop"))
) -> dataframe

dataframe %>% add_row(
  title = "The River"
) -> dataframe

dataframe %>% print

With the result similar to the dataframe below:

$ Rscript 03-add-row.r
# A tibble: 5 x 2
  title                     tags      
  <chr>                     <list>    
1 Cantaloupe Island         <list [2]>
2 Let It Be                 <list [2]>
3 Knockin' on Heaven's Door <list [2]>
4 Emotion                   <list [2]>
5 The River                 <NULL>    

You can spot that there are four ways to add a row and store the result. They combine:

  • Two assignment operators: leftward <- and rightward ->.
  • Two calling styles: a plain add_row(dataframe, ...) call, and the pipe %>% operator.

Maybe Null

We need an example case that is not too simple.

GNU R handles this NULL value natively in a statistical context. I won’t go into the detail. Let’s just let GNU R handle this with its own approach.
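For the curious, the empty cell can still be inspected. A minimal sketch, assuming tibble is installed, with a shortened two-row dataframe named songs for illustration:

```r
library(tibble)

songs <- tibble(
  title = "Emotion",
  tags  = list(list("70s", "pop"))
)

# No tags given: the list column gets an empty cell for this row.
songs <- add_row(songs, title = "The River")

# The empty cell is a NULL, which has zero length.
print(is.null(songs$tags[[2]]))
print(lengths(songs$tags))
```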


2: Separating Module

Since we need to reuse the songs rows multiple times, it is a good idea to separate the record from logic.

Songs Module

The code can be shown as below:

library(tibble)

tibble(
  title = "Cantaloupe Island",
  tags  = list(list("60s", "jazz"))
) %>% add_row(
  title = "Let It Be",
  tags  = list(list("60s", "rock"))
) %>% add_row(
  title = "Knockin' on Heaven's Door",
  tags  = list(list("70s", "rock"))
) %>% add_row(
  title = "Emotion",
  tags  = list(list("70s", "pop"))
) %>% add_row(
  title = "The River"
) -> dataframe

GNU R: The Songs Module Containing Dataframe

The pipe %>% syntactic sugar, available after loading tibble, is very nice.

Using Songs Module

Now we can have a very short code.

source("my-songs.r")

dataframe %>% print

The result is exactly the same as the dataframe above.

GNU R: Using Songs Module


3: External Data

RData and CSV

By design, GNU R can handle large amounts of external data. Why don’t we try it now?

RData: Write

We can save the state of the object into a file.

source("my-songs.r")

save(dataframe, file = "songs.RData")

This way, the current object can be stored in a file, to be used later in another script.

RData: Read

This is how we read it in another script.

library(tibble)

load("songs.RData")
dataframe %>% print

The object name persists. No new assignment is needed. We can continue from where we left off.
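The round trip can be sketched in a single script with base R only. The temporary path and the small songs dataframe are assumptions for illustration. The article itself uses songs.RData.

```r
songs <- data.frame(title = c("Let It Be", "Emotion"))

path <- tempfile(fileext = ".RData")  # hypothetical temporary location
save(songs, file = path)

rm(songs)                             # forget the object completely

# load() restores the object under its original name,
# and returns the restored names as a character vector.
restored_names <- load(path)

print(restored_names)
print(songs)
```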

CSV: Write

There is another way, using CSV. But unfortunately our list values, once written out, already contain commas. So we cannot save them into a regular comma-separated CSV.

A workaround is to save this with write.csv2 instead. The csv2 variant uses ; as the separator.

The list in each row is itself also an issue. A workaround is to convert the dataframe to a matrix first.

source("my-songs.r")

matrix <- as.matrix(dataframe)
matrix %>% print

matrix %>% write.csv2("./songs.csv")

The resulting matrix has the form below:

$ Rscript 05-write-csv.r
     title                       tags  
[1,] "Cantaloupe Island"         List,2
[2,] "Let It Be"                 List,2
[3,] "Knockin' on Heaven's Door" List,2
[4,] "Emotion"                   List,2
[5,] "The River"                 NULL  

CSV: Output

The CSV output can be shown as below:

"";"title";"tags"
"1";Cantaloupe Island;list("60s", "jazz")
"2";Let It Be;list("60s", "rock")
"3";Knockin' on Heaven's Door;list("70s", "rock")
"4";Emotion;list("70s", "pop")
"5";The River;NULL

CSV: Read

We can read the CSV using code below.

library(tibble)

songs_df <- read.csv2(file = "./songs.csv")
songs_df["tags"] %>% paste

CSV: Caveat

CSV is good for plain records, especially for data interchange between applications. But it is not recommended for storing list objects.

Have a look at the output of the read below:

[1] "1:5"                                                                                               
[2] "c(\"Cantaloupe Island\", \"Let It Be\", \"Knockin' on Heaven's Door\", \"Emotion\", \"The River\")"
[3] "c(\"list(60s, jazz)\", \"list(60s, rock)\", \"list(70s, rock)\", \"list(70s, pop)\", \"NULL\")"    

What is missing here is that the tags rows are stored as strings instead of lists. This becomes troublesome for further processing.
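If CSV is still required, one possible workaround is to join the tags into a delimited string before writing, then split them again after reading. This is not what the article's scripts do. It is only a sketch in base R, assuming that no tag ever contains the | character. The names flat, path and restored_tags are hypothetical.

```r
tags <- list(c("60s", "jazz"), character(0))

# Join each row's tags into one plain string, such as "60s|jazz".
flat <- data.frame(
  title = c("Cantaloupe Island", "The River"),
  tags  = vapply(tags, paste, character(1), collapse = "|"),
  stringsAsFactors = FALSE
)

path <- tempfile(fileext = ".csv")  # hypothetical temporary location
write.csv2(flat, path, row.names = FALSE)

# After reading, split the strings back into one vector per row.
restored      <- read.csv2(path, stringsAsFactors = FALSE)
restored_tags <- strsplit(restored$tags, "|", fixed = TRUE)

print(restored_tags[[1]])
```

The trade-off is an extra convention that every reader of the file must know, which is why RData remains the simpler choice here.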


4: Extracting Fields

Walk, Filter, Drop/Keep

Walk

Map

Sometimes we need to examine each record: not the dataframe as a whole, but each row separately. To achieve this we can utilize the walk function from the purrr library.

source("my-songs.r")
library(purrr)

dataframe$tags %>%
  walk(function(current) {
    current %>% as.list %>% paste %>% print
  })

With the result similar to the sequential lines below:

$ Rscript 07-purrr.r
[1] "60s"  "jazz"
[1] "60s"  "rock"
[1] "70s"  "rock"
[1] "70s" "pop"
character(0)
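For comparison, the same row-by-row visit can be sketched without purrr, using a plain for loop in base R. The shortened tags list and the name as_text_rows are illustrative only.

```r
tags <- list(list("60s", "jazz"), list("70s", "pop"), NULL)

# Visit each row's tags one at a time, only for the printing side effect.
for (current in tags) {
  print(paste(current))
}

# The same conversion, collected instead of printed.
as_text_rows <- lapply(tags, paste)
```

The NULL row prints as character(0), just as in the walk version.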

Filter

To get rid of the row without tags data, we utilize dplyr::filter.

source("my-songs.r")

songs_df <- dataframe %>% 
  dplyr::filter(.data$tags != "NULL")

songs_df %>% print

With the result similar to the dataframe below:

$ Rscript 08-filter.r
# A tibble: 4 x 2
  title                     tags      
  <chr>                     <list>    
1 Cantaloupe Island         <list [2]>
2 Let It Be                 <list [2]>
3 Knockin' on Heaven's Door <list [2]>
4 Emotion                   <list [2]>

Just be careful to use the fully qualified name dplyr::filter, or we may end up using filter from another library, such as stats::filter.

GNU R: Extracting Fields: Filter
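The comparison against the string "NULL" relies on how GNU R converts list cells to text. An alternative sketch, not taken from the original scripts, is to test the length of each cell with base lengths. It assumes tibble is installed, and the names has_tags and with_tags are made up.

```r
library(tibble)

songs <- tibble(
  title = c("Let It Be", "The River"),
  tags  = list(list("60s", "rock"), NULL)
)

# A NULL cell has length zero, so this marks the rows that carry tags.
has_tags  <- lengths(songs$tags) > 0
with_tags <- songs[has_tags, ]

print(with_tags)
```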

Drop/Keep

Although not required, we can shrink the dataframe, to have only specific columns.

library(tibble)

songs_df <- read.csv2(file = "./songs.csv")

songs_df <- songs_df %>% 
  dplyr::filter(.data$tags != "NULL")

keeps <- "tags"
tags_df <- songs_df[ , keeps, drop = FALSE]

tags_df %>% print

With the result similar to the lines of string below, loaded from CSV:

$ Rscript 09-keep.r
             tags
1 list(60s, jazz)
2 list(60s, rock)
3 list(70s, rock)
4  list(70s, pop)

5: Finishing The Task

Flatten, Unique, Oneliner Pipe

Flatten

There is already a flatten function in the purrr library:

source("my-songs.r")
library(purrr)

songs_df <- dataframe %>% 
  dplyr::filter(.data$tags != "NULL")

tags_df <- songs_df[ , "tags", drop = FALSE]

flattened <- flatten(tags_df$tags)

flattened %>% paste

With the result similar to the string below:

$ Rscript 10-flatten.r
[1] "60s"  "jazz" "60s"  "rock" "70s"  "rock" "70s"  "pop" 

We don’t really need to drop columns. We can safely remove this line:

tags_df <- songs_df[ , "tags", drop = FALSE]

Unique

There is also a unique function in base R. So I guess our problem is solved completely.

source("my-songs.r")
library(purrr)

songs_df <- dataframe %>% 
  dplyr::filter(.data$tags != "NULL")

flattened <- flatten(songs_df$tags)

distinct <- unique(flattened)

distinct %>% paste

With the result similar to the string below:

$ Rscript 11-unique.r
[1] "60s"  "jazz" "rock" "70s"  "pop" 

Oneliner

Clean up

Using a chain of %>% pipe operators, we can transform the code above into a single one-liner statement.

library(tibble)
library(purrr)

load("songs.RData")

(dataframe %>% 
  dplyr::filter(.data$tags != "NULL")
)$tags %>% flatten %>% unique %>% paste

The code is now simple, and clear. We can understand exactly what it does, by just reading the code.

GNU R: Finishing The Task
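As a cross-check, the same answer can be sketched in base R alone, without tibble, dplyr or purrr. Here the tags are typed in directly as a plain list, which is an assumption made to keep the example self-contained.

```r
tags <- list(
  list("60s", "jazz"),
  list("60s", "rock"),
  list("70s", "rock"),
  list("70s", "pop"),
  NULL
)

# unlist() flattens the nested lists and silently drops the NULL,
# then unique() keeps the first occurrence of each tag.
distinct <- unique(unlist(tags))

print(distinct)
```

Since unlist drops the NULL by itself, no filter step is needed in this variant.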


What is Next 🤔?

Consider continuing with [ Nim - Playing with Records ].