Reshape Operations
Reshaping data is a key part of preparing it for analysis, enabling the transformation of datasets into different structures to meet analysis requirements. This can involve changing the layout from wide (more columns) to long formats (more rows), turning categories into separate columns, or organizing columns into key-value pairs. These changes are vital for effective data analysis, creating visual representations, and building models.
Various data reshaping techniques, such as folding and pivoting, will be discussed to equip users with the ability to effectively rearrange their datasets as needed.
Fold
The "Fold" operation takes a column from the dataset and breaks it down into two new columns, forming key-value pairs:
- The first new column contains the names of the original columns (keys).
- The second new column holds the data from those columns (values).
This technique is particularly handy for organizing categorical data into a format that's easier to analyze and visualize.
In the provided example from the PC
dataset, which includes columns like USUBJID
, PCSTRESN
(concentration), and PCTPT
(hourly POSTDOSE label), the focus is on data from Period 1 Day 1
, filtered by the VISIT
column. The fold operation is applied to the PCTPT
column.
After folding, two new columns are generated:
key
: A new column created to serve as a unique identifier, with an index starting from 1.variable
: A column populated with the names of the original columns from which the fold was performed.
The original PCTPT
column is then transformed into a value
column, containing the data associated with each key. This reorganization makes the dataset more flexible for further analysis and visualization tasks.
Pivot
The "Pivot" function changes the structure of a dataset by turning unique values from a selected column into new columns. This increases the width of the dataset and is particularly useful for organizing and analyzing categorical data.
In this operation, the dataset is expanded horizontally. For example, if pivoting is based on the value
column (which contains the original PCTPT
data), and PCSTRESN
values are used to fill the new columns, with USUBJID
serving as the unique identifier:
In this case, each USUBJID
value, having 12 records, will spread across 12 new columns, placing all PCSTRESN
values in a single row for each USUBJID
. This means every unique USUBJID
gets its own row with corresponding data spread out into new columns.
Alternatively, if key
is used as the unique identifier:
Since each key
value is unique for every row, the dataset will display one PCSTRESN
value per row, creating a more extended set of columns. This method turns each row's key-value pair into a separate column, effectively distributing single data points across multiple columns in the dataset.
Pivoting is a way to reformat data, making it easier to analyze patterns and trends, especially when dealing with categories and repeated measures within the dataset.