Set Operations
Set operations are key methods in data wrangling that help you manage and organize your data effectively. Think of each dataset as a collection or 'set' of information. Set operations allow you to merge these sets together, find common data between them, and remove duplicates or unwanted information. These techniques are crucial for combining multiple sources of data, ensuring data uniqueness, and refining your datasets for analysis. By using set operations, you can streamline your data preparation process, making it easier to achieve accurate and meaningful analysis results.
Types of Set Operations
- Concat (Concatenation)
- Union
- Intersect
- Except
We will be using 2 sample datasets from PC
and EX
pc_for_set
: Consists of 4 columns, USUBJID
, DOMAIN
(PC
), PCSTRESN
, VISIT
, ADTM
(DateTime(PCDTC)
)
ex_for_set
: Consists of 4 columns, USUBJID
, DOMAIN
(EX
), EXDOSE
, VISIT
, ADTM
(DateTime(EXSTDTC)
)
Concat
The
Concat
operation merges two or more datasets by lining them up one after the other based on specified columns.This is useful when you need to combine datasets end-to-end, creating a longer list of records.
For example, if the primary dataset is
pc_for_set
and the secondary isex_for_set
, concatenating them will result in a single dataset where the records fromex_for_set
follow those frompc_for_set
.The combined dataset will have
1632
rows:1536
from PC and96
from EX.The columns like
USUBJID
,VISIT
, andADTM
are aligned and stacked vertically throughout the combined dataset.Ordering the combined dataset by
USUBJID
andADTM
will organize the data so that records fromEX
andPC
are sequentially arranged.
This approach allows for a seamless integration of datasets, providing a continuous flow of records based on the chosen columns.
Union
The
Union
operation merges two datasets into one, keeping only distinct records and common columns.To successfully perform a union, the datasets should have the same columns. For example, unique columns like
PCSTRESN
in thePC
dataset andEXDOSE
in theEX
dataset need to be excluded before the operation.After the union operation, the resulting dataset will have
1632
rows, combining the unique records from both datasets. The common columns between the datasets are unified in the merged dataset.Sorting (using
orderby
) the merged dataset will display the unified records in an organized sequence.
This process effectively combines the data from both sources into a single set, ensuring no repetitions and maintaining the integrity of the common columns.
Intersect
The
intersect
operation finds and returns only the records that are common to both datasets, essentially filtering for exact matches across all rows.This operation is useful for pinpointing records that appear in both datasets, helping identify shared data points.
Before performing an intersection, ensure that only the common columns are present in both datasets. Unique columns like
PCSTRESN
from thePC
dataset,EXDOSE
from theEX
dataset, and any non-matching columns likeDOMAIN
should be excluded.In this case, if the first record of the first
VISIT
is the only common entry across both datasets and considering there are 24 subjects with 4 visits each, the intersecting set will result in 24 unique records that are common between the two datasets.
This method filters out all but the overlapping data, providing a dataset that consists solely of records found in both of the original datasets.
Except
The
Except
operation identifies and returns records that are present in the primary dataset but absent in the secondary dataset, essentially filtering out the common elements.This function is useful for isolating unique records in one dataset or removing elements that are shared with another dataset.
This operation acts as the converse of an intersection, focusing on differences rather than commonalities.
For the operation to work correctly, the datasets involved must have matching columns. Therefore, any non-matching columns, such as
PCSTRESN
from thePC
dataset andEXDOSE
from theEX
dataset, or columns likeDOMAIN
with no common values, should be eliminated.If the intersection of these datasets yields 24 records, the
Except
operation on the primary dataset (assuming it has 1536 rows) will result in1512
records (1536 - 24 = 1512
). These are the records unique to the primary dataset, excluding the common elements found in both datasets.
This method effectively segregates the exclusive data of the primary dataset, providing a clear view of what is unique to it compared to the secondary dataset.