Set Operations

Set operations are key methods in data wrangling that help you manage and organize your data effectively. Think of each dataset as a collection or 'set' of information. Set operations allow you to merge these sets together, find common data between them, and remove duplicates or unwanted information. These techniques are crucial for combining multiple sources of data, ensuring data uniqueness, and refining your datasets for analysis. By using set operations, you can streamline your data preparation process, making it easier to achieve accurate and meaningful analysis results.

Types of Set Operations

  1. Concat (Concatenation)
  2. Union
  3. Intersect
  4. Except

We will be using 2 sample datasets from PC and EX pc_for_set : Consists of 4 columns, USUBJID, DOMAIN (PC), PCSTRESN, VISIT, ADTM (DateTime(PCDTC))

ex_for_set : Consists of 4 columns, USUBJID, DOMAIN (EX), EXDOSE, VISIT, ADTM (DateTime(EXSTDTC))

Concat

  • The Concat operation merges two or more datasets by lining them up one after the other based on specified columns.

  • This is useful when you need to combine datasets end-to-end, creating a longer list of records.

  • For example, if the primary dataset is pc_for_set and the secondary is ex_for_set, concatenating them will result in a single dataset where the records from ex_for_set follow those from pc_for_set.

  • The combined dataset will have 1632 rows: 1536 from PC and 96 from EX.

  • The columns like USUBJID, VISIT, and ADTM are aligned and stacked vertically throughout the combined dataset.

  • Ordering the combined dataset by USUBJID and ADTM will organize the data so that records from EX and PC are sequentially arranged.

This approach allows for a seamless integration of datasets, providing a continuous flow of records based on the chosen columns.

Union

  • The Union operation merges two datasets into one, keeping only distinct records and common columns.

  • To successfully perform a union, the datasets should have the same columns. For example, unique columns like PCSTRESN in the PC dataset and EXDOSE in the EX dataset need to be excluded before the operation.

  • After the union operation, the resulting dataset will have 1632 rows, combining the unique records from both datasets. The common columns between the datasets are unified in the merged dataset.

  • Sorting (using orderby) the merged dataset will display the unified records in an organized sequence.

This process effectively combines the data from both sources into a single set, ensuring no repetitions and maintaining the integrity of the common columns.

Intersect

  • The intersect operation finds and returns only the records that are common to both datasets, essentially filtering for exact matches across all rows.

  • This operation is useful for pinpointing records that appear in both datasets, helping identify shared data points.

  • Before performing an intersection, ensure that only the common columns are present in both datasets. Unique columns like PCSTRESN from the PC dataset, EXDOSE from the EX dataset, and any non-matching columns like DOMAIN should be excluded.

  • In this case, if the first record of the first VISIT is the only common entry across both datasets and considering there are 24 subjects with 4 visits each, the intersecting set will result in 24 unique records that are common between the two datasets.

This method filters out all but the overlapping data, providing a dataset that consists solely of records found in both of the original datasets.

Except

  • The Except operation identifies and returns records that are present in the primary dataset but absent in the secondary dataset, essentially filtering out the common elements.

  • This function is useful for isolating unique records in one dataset or removing elements that are shared with another dataset.

  • This operation acts as the converse of an intersection, focusing on differences rather than commonalities.

  • For the operation to work correctly, the datasets involved must have matching columns. Therefore, any non-matching columns, such as PCSTRESN from the PC dataset and EXDOSE from the EX dataset, or columns like DOMAIN with no common values, should be eliminated.

  • If the intersection of these datasets yields 24 records, the Except operation on the primary dataset (assuming it has 1536 rows) will result in 1512 records (1536 - 24 = 1512). These are the records unique to the primary dataset, excluding the common elements found in both datasets.

This method effectively segregates the exclusive data of the primary dataset, providing a clear view of what is unique to it compared to the secondary dataset.