Where to Discuss?

Local Group

Preface

Goal: Transform data with Panda’s dataframe, from plain table to pivot.

I enjoy geekily transforming data using list comprehension. However, when it comes to real-world problem-solving, a more straightforward method is often necessary, one that won’t overwhelm beginners.

Something that won’t cause beginner too much headache.

Managing Local Library

Naming Space

For tutorial purposes, I organize the development process gradually. For example:

├── 11-pandas.py
├── 12-model-view.py
├── 13-pivot.py
├── 14-writer.py
├── 15-deco.py
├── 16-pivot.py
├── 17-reader.py
├── 18-reader-writer.py

I name the files for the classes, with the same number as the main file. In this case for 11-pandas.py I have these two files:

  • lib/TableSample11.py
  • lib/PivotSample11.py
.
├── 11-pandas.py
└── lib
    ├── PivotSample11.py
    └── TableSample11.py

However, the classes inside the files have no numbers. In this case, the class names are:

  • class TableSample:
  • class PivotSample:

Now, I can call the local library as below:

from lib.TableSample11 import TableSample
from lib.PivotSample11 import PivotSample

When the class became mature and can be shared along multiple script, I remove the number, for example:

  • lib/TableSample.py
  • lib/PivotSample.py

with shared class, I can avoid too maintaining too many classes.


Transformation Abstraction

We can envision the data morphing, through a diagram and simple statements.

The Data Set

The raw data set can be represented as:

The Operations

Even with Pandas library, we can capture the essence of the operations in a more abstract form.

  1. Obtaining the Pivot, with a magical panda function.
  1. Creating a new index, for sorting later, based on date/time.
  1. Sorting by the result, using new index.

Now we are ready for the actual code.

Morphism Diagram

We can also visualize the process as below:

Pivot: Dataframe: Pandas: Morphism Diagram

This diagrammatic algebra is a simplified diagram, without considering reindexing using datetime parameter, while sorting the pivot result.


Simple CSV Reader

Let’s begin with a straightforward process, by reading the CSV and importing it into a dataframe at once.

Pivot: Dataframe: Pandas: Class Diagram

The class structure might seem complex, but the actual code is concise.

Directory Structure

We are also going to organize the class in its own lib directory.

.
├── 11-pandas.py
└── lib
    ├── PivotSample11.py
    └── TableSample11.py

Main Script

I prefer to keep the main script short. Long scripts tend to be messy, and with a shorter main script, I can manage the flow better without losing focus.

The class skeleton is as follows:

#!/usr/bin/env python3

# Local Library
from lib.TableSample11 import TableSample
from lib.PivotSample11 import PivotSample

def main() -> None:
  source_csv = 'sample-data.csv'

  table_sample = TableSample(source_csv)
  table_sample.process()

  pivot_sample = PivotSample(table_sample.df_table)
  pivot_sample.process()

if __name__ == "__main__":
  main()

Pivot: Dataframe: Pandas: Main

I use table_sample.df_table property, as a feed to pivot class instance.


Table Sample Class

The simple TableSample class is concise, and quite self-explanatory.

import pandas as pd
from pandas import DataFrame

class TableSample:
  def __init__(self, filename: str) -> None:
    # save initial parameter
    self.filename = filename

  def process(self) -> None:
    # Load data into a DataFrame
    self.df_table = pd.read_csv(self.filename)

    # Display the header
    print("Header:", self.df_table.columns)

    # Display the data
    print("Data:")
    print(self.df_table)

Pivot: Dataframe: Pandas: TableSample: Class

With the result as shown below:

❯ ./11-pandas.py
Header: Index(['Number', 'Date', 'Fruit'], dtype='object')
Data:
     Number        Date       Fruit
0         1  19/02/2017      Orange
1         2  19/02/2017       Grape
2         3  19/02/2017  Strawberry
3         4  19/02/2017      Orange
4         5  19/02/2017       Apple
..      ...         ...         ...
750     751  05/04/2017  Strawberry
751     752  05/04/2017  Strawberry
752     753  05/04/2017       Mango
753     754  05/04/2017  Strawberry
754     755  05/04/2017      Banana

Pivot: Dataframe: Pandas: TableSample: Result

Just keep an eye on the index. We will manipulate with the index later.


Pivot Sample Class

The simple PivotSample class is also succint.

import pandas as pd

from datetime import datetime
from pandas import DataFrame

class PivotSample:
  def __init__(self, df_table: DataFrame) -> None:
    # save initial parameter
    self.df_table = df_table

  def process(self) -> None:
    # Perform pivot operations, for example:
    pivot_table = self.df_table.pivot_table(
      index='Date', columns='Fruit',
      aggfunc='count', fill_value=0)

    # Sort the index by both month and day
    pivot_table.index = pivot_table. \
      index.to_series().apply(lambda x:
        datetime.strptime(str(x), "%d/%m/%Y"))
    self.pivot_table = pivot_table.sort_index()

    print("Pivot Table:")
    print(self.pivot_table)

Pivot: Dataframe: Pandas: PivotSample: Class

With the result as shown below:

❯ ./11-pandas.py
Pivot Table:
           Number         ...                  
Fruit       Apple Banana  ... Orange Strawberry
Date                      ...                  
2017-02-19      1      3  ...      3          4
2017-02-20      0      4  ...      2          2
2017-02-21      2      1  ...      2          3
2017-02-22      4      1  ...      4          5
...
...
2017-04-02      1      1  ...      5          1
2017-04-03      4      0  ...      3          5
2017-04-04      0      2  ...      3          3
2017-04-05      3      1  ...      2          4

[46 rows x 6 columns]

It is going to display a significant amount of data.

Pivot: Dataframe: Pandas: PivotSample: Result

And so on… continue…

Pivot: Dataframe: Pandas: PivotSample: Result

If you prefer, you can also keep an eye of the index, using self.pivot_table.columns.

Header: MultiIndex([('Number',      'Apple'),
            ('Number',     'Banana'),
            ('Number',      'Grape'),
            ('Number',      'Mango'),
            ('Number',     'Orange'),
            ('Number', 'Strawberry')],
           names=[None, 'Fruit'])

How does it works?

How is the data transformed behind the scene? It is easier to understand, by examininging how the data evolves.

First, we utilize the built-in pivot maker from the dataframe.

    # Perform pivot operations, for example:
    pivot_table = self.df_table.pivot_table(
      index='Date', columns='Fruit',
      aggfunc='count', fill_value=0)

Pivot: Dataframe: Pandas: PivotSample: Build

This results in a grouped but unsorted date, as shown below:

❯ ./11-pandas.py
Pivot Table:
           Number         ...                  
Fruit       Apple Banana  ... Orange Strawberry
Date                      ...                  
01/03/2017      1      3  ...      3          4
01/04/2017      4      6  ...      3          4
02/03/2017      1      1  ...      2          4
02/04/2017      1      1  ...      5          1
03/03/2017      1      4  ...      5          2
03/04/2017      4      0  ...      3          5
04/03/2017      1      3  ...      0          4
04/04/2017      0      2  ...      3          3
05/03/2017      1      4  ...      4          4
05/04/2017      3      1  ...      2          4
06/03/2017      5      2  ...      3          2
07/03/2017      4      2  ...      3          2

Pivot: Dataframe: Pandas: PivotSample: Unsorted Result

The actual index is:

❯ ./11-pandas.py
Pivot Table:
Index(['01/03/2017', '01/04/2017', '02/03/2017', '02/04/2017', '03/03/2017',
       '03/04/2017', '04/03/2017', '04/04/2017', '05/03/2017', '05/04/2017',
       '06/03/2017', '07/03/2017', '08/03/2017', '09/03/2017', '10/03/2017',
       '11/03/2017', '12/03/2017', '13/03/2017', '14/03/2017', '15/03/2017',
       '16/03/2017', '17/03/2017', '18/03/2017', '19/02/2017', '19/03/2017',
       '20/02/2017', '20/03/2017', '21/02/2017', '21/03/2017', '22/02/2017',
       '22/03/2017', '23/02/2017', '23/03/2017', '24/02/2017', '24/03/2017',
       '25/02/2017', '25/03/2017', '26/02/2017', '26/03/2017', '27/02/2017',
       '27/03/2017', '28/02/2017', '28/03/2017', '29/03/2017', '30/03/2017',
       '31/03/2017'],
      dtype='object', name='Date')

Pivot: Dataframe: Pandas: PivotSample: Unsorted Index

With the help of lambda, strptime, and friendly spirit. we can sort the dataframe by date.

    # Sort the index by both month and day
    pivot_table.index = pivot_table. \
      index.to_series().apply(lambda x:
        datetime.strptime(str(x), "%d/%m/%Y"))
    self.pivot_table = pivot_table.sort_index()

Pivot: Dataframe: Pandas: PivotSample: Sort

❯ ./11-pandas.py
Pivot Table:
DatetimeIndex(['2017-02-19', '2017-02-20', '2017-02-21', '2017-02-22',
               '2017-02-23', '2017-02-24', '2017-02-25', '2017-02-26',
               '2017-02-27', '2017-02-28', '2017-03-01', '2017-03-02',
               '2017-03-03', '2017-03-04', '2017-03-05', '2017-03-06',
               '2017-03-07', '2017-03-08', '2017-03-09', '2017-03-10',
               '2017-03-11', '2017-03-12', '2017-03-13', '2017-03-14',
               '2017-03-15', '2017-03-16', '2017-03-17', '2017-03-18',
               '2017-03-19', '2017-03-20', '2017-03-21', '2017-03-22',
               '2017-03-23', '2017-03-24', '2017-03-25', '2017-03-26',
               '2017-03-27', '2017-03-28', '2017-03-29', '2017-03-30',
               '2017-03-31', '2017-04-01', '2017-04-02', '2017-04-03',
               '2017-04-04', '2017-04-05'],
              dtype='datetime64[ns]', name='Date', freq=None)

Pivot: Dataframe: Pandas: PivotSample: Sorted Index

This forms the foundation of our pivot table.


What’s Our Next Move 🤔?

From simple dataframe, we can continue to custom dataframe, such as adding custom total column and custom total row. We will learn to access the matrix cell inside the dataframe using index.

After exploring the simple dataframe, we can proceed to a custom dataframe, incorporating features like adding custom total columns and custom total rows. We will delve into accessing matrix cells within the dataframe using indices.

Consider continuing to read [ Pivot - Model - Custom Class ].