Join Operation
The Join
operation is a fundamental technique in data manipulation, allowing users to combine data from two or more datasets based on common columns. This operation is essential for integrating data from different sources or tables to perform comprehensive analyses or generate new insights.
The Join
operation merges datasets by matching values in specified columns, creating a unified dataset that combines information from all sources involved. Different types of joins dictate how matching is performed and how unmatched records are handled.
Types of Joins:
- Inner
- Outer
- Left
- Right
- Cross
- Semi
- Outer
ex_for_join
= Sample of EX
with 3 columns, USUBJID
, EXSTDTC
, EXDOSE
.
pc_for_join
= Sample of PC
with 3 columns, USUBJID
, PCTRESN
, PCDTC
.
Inner join
An inner join combines rows from two datasets where there are matching values in both. It ensures that only the records with corresponding entries in both datasets are included in the result.
It retains all columns from both datasets, focusing on the common columns to establish a match.
This method is ideal for merging data that exists in both datasets, ensuring no unmatched records are included.
For example, using the
pc_for_join [1536 X 3]
as the primary dataset andex_for_join [96 X 4]
as the secondary dataset, an inner join will proceed as follows:- Choose Join Type: Select 'inner' to perform an inner join.
- Choose Secondary Dataset: Pick
ex_for_join [96 X 4]
as the dataset to join with the primary dataset. - Choose Primary Dataset Columns: Select columns
USUBJID
,PCDTC
from the primary dataset. - Secondary Dataset Columns: Choose
USUBJID
,EXSTDTC
from the secondary dataset.
In this join, PCDTC
from the primary dataset and EXSTDTC
from the secondary dataset are treated as common columns. The joined dataset will use PCDTC
as the common column. The outcome will be a dataset where each row represents a unique match found, with the total rows equalling the unique entries in the EX
dataset, resulting in 24
for this example.
Outer join
An outer join combines data from two datasets. It includes all rows from each dataset, pairing up rows that have matching values and also keeping rows that don't match from both datasets.
This type of join is useful when you want to retain all the data from both datasets, showing both the matched and unmatched records.
Considering an example where each subject in the
PC
dataset has one matching record in theEX
dataset, but3
records inEX
and63
records inPC
are unmatched. Given there are24
subjects:- The total number of rows in the resulting dataset would be the sum of matched and unmatched records, calculated as
(3 + 63 + 1) * 24
, which equals1608
. - To easily view the matched and unmatched rows for each subject, the resulting dataset can be sorted using
Orderby :USUBJID
.
- The total number of rows in the resulting dataset would be the sum of matched and unmatched records, calculated as
This approach ensures all data from both sources is presented, highlighting how each record from one dataset corresponds to records in the other, if at all.
Left join
A left join combines all rows to the primary (left) dataset with the matching rows from the secondary (right) dataset.
If there are rows in the primary dataset that don’t have a match in the secondary dataset will appear with
missing
values in the columns from the secondary dataset.This join type is useful for retaining the complete set of records in the primary dataset while including the corresponding matches from the secondary dataset.
In this context, the resulting dataset will have the same number of rows as the primary dataset, which is 1536
. This ensures that all data from the primary dataset is preserved, with matched data from the secondary dataset added where available.
Right join
A right join combines all rows to the secondary (right) dataset with the matching rows from the primary (left) dataset.
If there are rows in the secondary dataset that do not find a match in the primary dataset, these rows will appear with
missing
values in the columns corresponding to the primary dataset.This join type is appropriate when the goal is to keep every record from the secondary dataset while including matching records from the primary dataset.
For the right join, the resulting dataset will have the same number of rows as the secondary dataset, 96
in this case. This approach ensures that the entire dataset from the right (secondary) side is maintained in the output, with data from the left (primary) dataset merged in where it corresponds.
Cross join
A cross join produces the Cartesian product of two datasets, combining each row from the primary (left) dataset with every row from the secondary (right) dataset.
This join type results in a dataset that contains all possible combinations of rows from both datasets. If the primary dataset has m rows and the secondary dataset has
n
rows, the resulting dataset will containm * n
rows.A cross join is useful in scenarios where every combination of the records from the two datasets is needed, such as generating test cases or performing exhaustive analysis.
For a cross join, if the primary dataset PC_for join
has 1536
rows and the secondary dataset EX_for join
has 96
rows, the resulting dataset will contain 1536 * 96 = 147456
rows, encompassing all combinations of the two datasets.
Use with caution, as it can result in a large number of records.
Semi join
A semi join retrieves rows from the primary (left) dataset that have corresponding matches in the secondary (right) dataset.
This join type doesn’t actually include the data from the secondary dataset; it only checks if a match exists.
The semi join is useful for filtering the primary dataset to include only those records that have a matching entry in the secondary dataset.
The resulting dataset from a semi join will have the same number of rows as would be found in an inner join, but it will contain only the columns from the primary dataset. This means that the semi join effectively filters the primary dataset based on the matching criteria established with the secondary dataset.
Anti join
An anti join selects rows from the primary (left) dataset that do not have a corresponding match in the secondary (right) dataset.
This type of join is used to exclude rows from the primary dataset that have matching entries in the secondary dataset, essentially filtering the primary dataset for non-matching records.
The resulting dataset from an anti join will contain the number of rows from the primary dataset that do not have matches in the secondary dataset. For instance, if the primary dataset has 1536
rows and the semi join (representing matching rows) has 24
rows, the anti join will result in 1512
rows (1536 - 24
= 1512
). This effectively shows all the unique records in the primary dataset that aren’t found in the secondary dataset.