Set Operations
Set operations are key methods in data wrangling that help you manage and organize your data effectively. Think of each dataset as a collection or 'set' of information. Set operations allow you to merge these sets together, find common data between them, and remove duplicates or unwanted information. These techniques are crucial for combining multiple sources of data, ensuring data uniqueness, and refining your datasets for analysis. By using set operations, you can streamline your data preparation process, making it easier to achieve accurate and meaningful analysis results.
Types of Set Operations
- Concat (Concatenation)
- Union
- Intersect
- Except
We will be using 2 sample datasets from PC and EX
pc_for_set : Consists of 4 columns, USUBJID, DOMAIN (PC), PCSTRESN, VISIT, ADTM (DateTime(PCDTC))
ex_for_set : Consists of 4 columns, USUBJID, DOMAIN (EX), EXDOSE, VISIT, ADTM (DateTime(EXSTDTC))
Concat
The
Concatoperation merges two or more datasets by lining them up one after the other based on specified columns.This is useful when you need to combine datasets end-to-end, creating a longer list of records.

For example, if the primary dataset is
pc_for_setand the secondary isex_for_set, concatenating them will result in a single dataset where the records fromex_for_setfollow those frompc_for_set.The combined dataset will have
1632rows:1536from PC and96from EX.The columns like
USUBJID,VISIT, andADTMare aligned and stacked vertically throughout the combined dataset.Ordering the combined dataset by
USUBJIDandADTMwill organize the data so that records fromEXandPCare sequentially arranged.
This approach allows for a seamless integration of datasets, providing a continuous flow of records based on the chosen columns.
Union
The
Unionoperation merges two datasets into one, keeping only distinct records and common columns.To successfully perform a union, the datasets should have the same columns. For example, unique columns like
PCSTRESNin thePCdataset andEXDOSEin theEXdataset need to be excluded before the operation.
After the union operation, the resulting dataset will have
1632rows, combining the unique records from both datasets. The common columns between the datasets are unified in the merged dataset.Sorting (using
orderby) the merged dataset will display the unified records in an organized sequence.
This process effectively combines the data from both sources into a single set, ensuring no repetitions and maintaining the integrity of the common columns.
Intersect
The
intersectoperation finds and returns only the records that are common to both datasets, essentially filtering for exact matches across all rows.This operation is useful for pinpointing records that appear in both datasets, helping identify shared data points.
Before performing an intersection, ensure that only the common columns are present in both datasets. Unique columns like
PCSTRESNfrom thePCdataset,EXDOSEfrom theEXdataset, and any non-matching columns likeDOMAINshould be excluded.
In this case, if the first record of the first
VISITis the only common entry across both datasets and considering there are 24 subjects with 4 visits each, the intersecting set will result in 24 unique records that are common between the two datasets.
This method filters out all but the overlapping data, providing a dataset that consists solely of records found in both of the original datasets.
Except
The
Exceptoperation identifies and returns records that are present in the primary dataset but absent in the secondary dataset, essentially filtering out the common elements.This function is useful for isolating unique records in one dataset or removing elements that are shared with another dataset.
This operation acts as the converse of an intersection, focusing on differences rather than commonalities.
For the operation to work correctly, the datasets involved must have matching columns. Therefore, any non-matching columns, such as
PCSTRESNfrom thePCdataset andEXDOSEfrom theEXdataset, or columns likeDOMAINwith no common values, should be eliminated.
If the intersection of these datasets yields 24 records, the
Exceptoperation on the primary dataset (assuming it has 1536 rows) will result in1512records (1536 - 24 = 1512). These are the records unique to the primary dataset, excluding the common elements found in both datasets.
This method effectively segregates the exclusive data of the primary dataset, providing a clear view of what is unique to it compared to the secondary dataset.