Clean Operations
The Clean operation is a crucial step in data preparation, aimed at improving the quality and trustworthiness of the data. It fixes mistakes, inconsistencies, and gaps in datasets so that the data is accurate, complete, and ready for analysis or further use.
Remove Duplicates
Duplicate records can lead to inaccuracies and affect the outcome of data analysis, underscoring the importance of having unique entries in datasets. The Remove Duplicates feature within the cleaning tools helps in locating and deleting redundant entries to ensure each record in the dataset is distinct.
Here is an example from the PC dataset, which contains 4 columns.
Applying Remove Duplicates to the USUBJID column deletes any duplicate records, leaving only the first instance of each subject in the dataset.
Similarly, applying Remove Duplicates to the VISIT column retains only the first occurrence of each unique visit, effectively keeping the earliest record for each type of visit.
To ensure the first occurrence of each visit is kept across all subjects, Remove Duplicates needs to be applied to both USUBJID and VISIT columns.
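The three behaviors above can be sketched with pandas, which mirrors how a Remove Duplicates step typically works under the hood. The column names and sample values below are illustrative assumptions, not the actual PC dataset:

```python
import pandas as pd

# Hypothetical PC-like dataset (column names and values assumed for illustration)
pc = pd.DataFrame({
    "USUBJID": ["S1", "S1", "S1", "S2", "S2"],
    "VISIT":   ["V1", "V1", "V2", "V1", "V2"],
    "PCTEST":  ["A", "A", "A", "A", "A"],
    "PCORRES": [1.0, 1.0, 2.0, 3.0, 4.0],
})

# Deduplicate on USUBJID alone: only the first row per subject survives
by_subject = pc.drop_duplicates(subset=["USUBJID"], keep="first")

# Deduplicate on VISIT alone: only the first row per unique visit survives
by_visit = pc.drop_duplicates(subset=["VISIT"], keep="first")

# Deduplicate on both columns: first occurrence of each visit within each subject
by_both = pc.drop_duplicates(subset=["USUBJID", "VISIT"], keep="first")

print(len(by_subject), len(by_visit), len(by_both))  # 2 2 4
```

Note how the choice of key columns changes which rows count as "duplicates": deduplicating on VISIT alone collapses records across subjects, while using both columns preserves one row per subject-visit combination.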
This functionality aids in maintaining a clean and reliable dataset by removing extra copies of the same records, which is vital for accurate analysis.
Impute
Missing values are a common issue in real-world datasets and can negatively impact analysis if not properly addressed. The "Impute Missing Values" feature offers a variety of tools to effectively manage and replace missing data with estimated or calculated values.
Various imputation methods are available for users to select, including:
- interp (Interpolation): Linearly estimates missing values based on neighboring data points.
- locf (Last Observation Carried Forward): Fills missing entries with the most recent non-missing value.
- nocb (Next Observation Carried Backward): Replaces missing values with the subsequent non-missing entry.
- replace: Directly replaces missing values with a specific value set by the user.
- srs (Simple Random Sample): Imputes missing values by randomly selecting from the observed data.
- substitute: Replaces missing values using a statistical measure (like mean or median) from the data.
These methods help ensure that the dataset is complete, with no missing values, facilitating more accurate and reliable analysis.
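Each of the listed methods corresponds to a standard operation on a column with gaps. A minimal pandas sketch of all six, using a toy series (the values and seed are assumptions for illustration, not tied to any specific dataset):

```python
import pandas as pd

# Toy series with one missing value between 5 and 10
s = pd.Series([5.0, None, 10.0])

interp = s.interpolate()          # interp: linear estimate from neighbors -> 7.5
locf = s.ffill()                  # locf: carry last observation forward -> 5.0
nocb = s.bfill()                  # nocb: carry next observation backward -> 10.0
replaced = s.fillna(0.0)          # replace: user-specified constant -> 0.0
substituted = s.fillna(s.mean())  # substitute: statistic of observed data -> 7.5

# srs: draw a replacement at random from the observed values (seeded)
srs_value = s.dropna().sample(n=1, random_state=42).iloc[0]
srs = s.fillna(srs_value)
```

The srs result depends on the seed; any of the observed values may be drawn, which is why reproducible pipelines should fix the random state.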

In the EX dataset example, grouping by USUBJID is recommended before imputation. For instance, if EXDOSE is missing at EXSEQ == 2, different imputation methods can be applied:
Examples of Imputation Methods
interp (Interpolation): Calculates missing values by averaging adjacent data points. Here, 7.5 is imputed for missing values in each subject group at EXSEQ == 2.

locf (Last Observation Carried Forward): Fills in missing data with the last available value. For instance, 5 is used to fill missing values at EXSEQ == 2.

nocb (Next Observation Carried Backward): Backfills missing data with the next available value. Here, 10 is used for missing entries at EXSEQ == 2.

srs (Simple Random Sample): Randomly selects observed values to replace missing data. The selected value depends on the set seed number; for example, 5 is chosen for value == 1.

substitute: Fills missing values using a chosen summary statistic such as minimum, maximum, mean, median, or mode.
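The EX example above can be sketched in pandas, grouping by USUBJID so that values from one subject never leak into another subject's gaps. The dose values and sequence layout are assumptions chosen to reproduce the figures quoted above (7.5, 5, 10):

```python
import pandas as pd

# Hypothetical EX-like dataset: EXDOSE is missing at EXSEQ == 2 for each subject
ex = pd.DataFrame({
    "USUBJID": ["S1", "S1", "S1", "S2", "S2", "S2"],
    "EXSEQ":   [1, 2, 3, 1, 2, 3],
    "EXDOSE":  [5.0, None, 10.0, 5.0, None, 10.0],
})

# Group by subject so imputation stays within each subject's own records
g = ex.groupby("USUBJID")["EXDOSE"]
ex["interp"] = g.transform(lambda s: s.interpolate())  # 7.5 at EXSEQ == 2
ex["locf"] = g.ffill()                                 # 5.0 at EXSEQ == 2
ex["nocb"] = g.bfill()                                 # 10.0 at EXSEQ == 2
```

Without the groupby, locf could carry the last dose of subject S1 into the first missing record of subject S2, which is why grouping by USUBJID before imputation is recommended.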
The impute functionality ensures data completeness, enhancing the dataset's suitability for subsequent analysis.