Reshape Operations

Reshaping data is a key part of preparing it for analysis, enabling the transformation of datasets into different structures to meet analysis requirements. This can involve changing the layout from wide (more columns) to long formats (more rows), turning categories into separate columns, or organizing columns into key-value pairs. These changes are vital for effective data analysis, creating visual representations, and building models.

Various data reshaping techniques, such as folding and pivoting, will be discussed to equip users with the ability to effectively rearrange their datasets as needed.

Fold

The "Fold" operation takes a column from the dataset and breaks it down into two new columns, forming key-value pairs:

  • The first new column contains the names of the original columns (keys).
  • The second new column holds the data from those columns (values).

This technique is particularly handy for organizing categorical data into a format that's easier to analyze and visualize.

In the provided example from the PC dataset, which includes columns like USUBJID, PCSTRESN (concentration), and PCTPT (hourly POSTDOSE label), the focus is on data from Period 1 Day 1, filtered by the VISIT column. The fold operation is applied to the PCTPT column.

After folding, two new columns are generated:

  • key: A new column created to serve as a unique identifier, with an index starting from 1.
  • variable: A column populated with the names of the original columns from which the fold was performed.

The original PCTPT column is then transformed into a value column, containing the data associated with each key. This reorganization makes the dataset more flexible for further analysis and visualization tasks.

Pivot

The "Pivot" function changes the structure of a dataset by turning unique values from a selected column into new columns. This increases the width of the dataset and is particularly useful for organizing and analyzing categorical data.

In this operation, the dataset is expanded horizontally. For example, if pivoting is based on the value column (which contains the original PCTPT data), and PCSTRESN values are used to fill the new columns, with USUBJID serving as the unique identifier:

In this case, each USUBJID value, having 12 records, will spread across 12 new columns, placing all PCSTRESN values in a single row for each USUBJID. This means every unique USUBJID gets its own row with corresponding data spread out into new columns.

Alternatively, if key is used as the unique identifier:

Since each key value is unique for every row, the dataset will display one PCSTRESN value per row, creating a more extended set of columns. This method turns each row's key-value pair into a separate column, effectively distributing single data points across multiple columns in the dataset.

Pivoting is a way to reformat data, making it easier to analyze patterns and trends, especially when dealing with categories and repeated measures within the dataset.