Join Operation

The Join operation is a fundamental technique in data manipulation, allowing users to combine data from two or more datasets based on common columns. This operation is essential for integrating data from different sources or tables to perform comprehensive analyses or generate new insights.

The Join operation merges datasets by matching values in specified columns, creating a unified dataset that combines information from all sources involved. Different types of joins dictate how matching is performed and how unmatched records are handled.

Types of Joins:

Inner
Outer
Left
Right
Cross
Semi
Outer

ex_for_join = Sample of EX with 3 columns, USUBJID, EXSTDTC, EXDOSE.

pc_for_join = Sample of PC with 3 columns, USUBJID, PCTRESN, PCDTC.

Inner join

An inner join combines rows from two datasets where there are matching values in both. It ensures that only the records with corresponding entries in both datasets are included in the result.
It retains all columns from both datasets, focusing on the common columns to establish a match.
This method is ideal for merging data that exists in both datasets, ensuring no unmatched records are included.
For example, using the pc_for_join [1536 X 3] as the primary dataset and ex_for_join [96 X 4] as the secondary dataset, an inner join will proceed as follows:
- Choose Join Type: Select 'inner' to perform an inner join.
- Choose Secondary Dataset: Pick ex_for_join [96 X 4] as the dataset to join with the primary dataset.
- Choose Primary Dataset Columns: Select columns USUBJID, PCDTC from the primary dataset.
- Secondary Dataset Columns: Choose USUBJID, EXSTDTC from the secondary dataset.

In this join, PCDTC from the primary dataset and EXSTDTC from the secondary dataset are treated as common columns. The joined dataset will use PCDTC as the common column. The outcome will be a dataset where each row represents a unique match found, with the total rows equalling the unique entries in the EX dataset, resulting in 24 for this example.

Outer join

An outer join combines data from two datasets. It includes all rows from each dataset, pairing up rows that have matching values and also keeping rows that don't match from both datasets.
This type of join is useful when you want to retain all the data from both datasets, showing both the matched and unmatched records.
Considering an example where each subject in the PC dataset has one matching record in the EX dataset, but 3 records in EX and 63 records in PC are unmatched. Given there are 24 subjects:
- The total number of rows in the resulting dataset would be the sum of matched and unmatched records, calculated as (3 + 63 + 1) * 24, which equals 1608.
- To easily view the matched and unmatched rows for each subject, the resulting dataset can be sorted using Orderby :USUBJID.

This approach ensures all data from both sources is presented, highlighting how each record from one dataset corresponds to records in the other, if at all.

Left join

A left join combines all rows to the primary (left) dataset with the matching rows from the secondary (right) dataset.
If there are rows in the primary dataset that don’t have a match in the secondary dataset will appear with missing values in the columns from the secondary dataset.
This join type is useful for retaining the complete set of records in the primary dataset while including the corresponding matches from the secondary dataset.

In this context, the resulting dataset will have the same number of rows as the primary dataset, which is 1536. This ensures that all data from the primary dataset is preserved, with matched data from the secondary dataset added where available.

Right join

A right join combines all rows to the secondary (right) dataset with the matching rows from the primary (left) dataset.
If there are rows in the secondary dataset that do not find a match in the primary dataset, these rows will appear with missing values in the columns corresponding to the primary dataset.
This join type is appropriate when the goal is to keep every record from the secondary dataset while including matching records from the primary dataset.

For the right join, the resulting dataset will have the same number of rows as the secondary dataset, 96 in this case. This approach ensures that the entire dataset from the right (secondary) side is maintained in the output, with data from the left (primary) dataset merged in where it corresponds.

Cross join

A cross join produces the Cartesian product of two datasets, combining each row from the primary (left) dataset with every row from the secondary (right) dataset.
This join type results in a dataset that contains all possible combinations of rows from both datasets. If the primary dataset has m rows and the secondary dataset has n rows, the resulting dataset will contain m * n rows.
A cross join is useful in scenarios where every combination of the records from the two datasets is needed, such as generating test cases or performing exhaustive analysis.

For a cross join, if the primary dataset PC_for join has 1536 rows and the secondary dataset EX_for join has 96 rows, the resulting dataset will contain 1536 * 96 = 147456 rows, encompassing all combinations of the two datasets.

Note

Use with caution, as it can result in a large number of records.

Semi join

A semi join retrieves rows from the primary (left) dataset that have corresponding matches in the secondary (right) dataset.
This join type doesn’t actually include the data from the secondary dataset; it only checks if a match exists.
The semi join is useful for filtering the primary dataset to include only those records that have a matching entry in the secondary dataset.

The resulting dataset from a semi join will have the same number of rows as would be found in an inner join, but it will contain only the columns from the primary dataset. This means that the semi join effectively filters the primary dataset based on the matching criteria established with the secondary dataset.

Anti join

An anti join selects rows from the primary (left) dataset that do not have a corresponding match in the secondary (right) dataset.
This type of join is used to exclude rows from the primary dataset that have matching entries in the secondary dataset, essentially filtering the primary dataset for non-matching records.

The resulting dataset from an anti join will contain the number of rows from the primary dataset that do not have matches in the secondary dataset. For instance, if the primary dataset has 1536 rows and the semi join (representing matching rows) has 24 rows, the anti join will result in 1512 rows (1536 - 24 = 1512). This effectively shows all the unique records in the primary dataset that aren’t found in the secondary dataset.