Clean Operations
The Clean operation is a crucial step in data preparation, aimed at improving data quality and trustworthiness. It fixes mistakes, inconsistencies, and gaps in datasets, ensuring the data is accurate, complete, and ready for analysis or further use.
Remove Duplicates
Duplicate records can lead to inaccuracies and skew the results of data analysis, which is why each entry in a dataset should be unique. The Remove Duplicates feature within the cleaning tools locates and deletes redundant entries, ensuring each record in the dataset is distinct.
Here is an example from the PC dataset, which contains 4 columns.
Using Remove Duplicates on the USUBJID column deletes any duplicate records, leaving only the first instance of each subject in the dataset.
Similarly, applying Remove Duplicates to the VISIT column retains only the first occurrence of each unique visit, effectively keeping the earliest record for each type of visit.
To keep the first occurrence of each visit for every subject, Remove Duplicates needs to be applied to both the USUBJID and VISIT columns.
This functionality aids in maintaining a clean and reliable dataset by removing extra copies of the same records, which is vital for accurate analysis.
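For readers who want to reproduce the same logic outside the tool, a minimal pandas sketch is shown below; the pc frame and its values are invented for illustration and are not taken from the real PC dataset.

```python
import pandas as pd

# Illustrative stand-in for the PC dataset (4 columns); values are invented.
pc = pd.DataFrame({
    "USUBJID":  ["S-001", "S-001", "S-002", "S-002"],
    "VISIT":    ["WEEK 1", "WEEK 1", "WEEK 1", "WEEK 2"],
    "PCTEST":   ["DRUG X"] * 4,
    "PCSTRESN": [1.2, 3.4, 0.9, 2.8],
})

# Keep only the first row per subject.
per_subject = pc.drop_duplicates(subset=["USUBJID"], keep="first")

# Keep only the first row per unique visit (across all subjects).
per_visit = pc.drop_duplicates(subset=["VISIT"], keep="first")

# Keep the first row per subject-visit pair.
per_subject_visit = pc.drop_duplicates(subset=["USUBJID", "VISIT"], keep="first")
```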
Impute
Missing values are a common issue in real-world datasets and can negatively impact analysis if not properly addressed. The "Impute Missing Values" feature offers a variety of tools to effectively manage and replace missing data with estimated or calculated values.
Various imputation methods are available for users to select, including:
- interp (Interpolation): Linearly estimates missing values based on neighboring data points.
- locf (Last Observation Carried Forward): Fills missing entries with the most recent non-missing value.
- nocb (Next Observation Carried Backward): Replaces missing values with the subsequent non-missing entry.
- replace: Directly replaces missing values with a specific value set by the user.
- srs (Simple Random Sample): Imputes missing values by randomly selecting from the observed data.
- substitute: Replaces missing values using a statistical measure (like mean or median) from the data.
These methods help ensure that the dataset is complete, with no missing values, facilitating more accurate and reliable analysis.
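As a rough sketch of what each method does, consider a toy series with a single gap; pandas stands in for the tool's internals here, and none of this is the tool's actual API.

```python
import numpy as np
import pandas as pd

s = pd.Series([5.0, np.nan, 10.0])

interp      = s.interpolate()      # interp: linear estimate -> 7.5
locf        = s.ffill()            # locf: carry 5.0 forward
nocb        = s.bfill()            # nocb: carry 10.0 backward
replaced    = s.fillna(0.0)        # replace: fixed, user-chosen value
rng         = np.random.default_rng(seed=1)
srs         = s.fillna(float(rng.choice(s.dropna().to_numpy())))  # srs: random observed value
substituted = s.fillna(s.mean())   # substitute: summary statistic (mean -> 7.5)
```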
In the EX dataset example, grouping by USUBJID is recommended before imputation. For instance, if EXDOSE is missing at EXSEQ == 2, different imputation methods can be applied:
Examples of Imputation Methods
- interp (Interpolation): Calculates missing values by averaging adjacent data points. Here, 7.5 is imputed for the missing value in each subject group at EXSEQ == 2.
- locf (Last Observation Carried Forward): Fills in missing data with the last available value. For instance, 5 is used to fill the missing value at EXSEQ == 2.
- nocb (Next Observation Carried Backward): Backfills missing data with the next available value. Here, 10 is used for the missing entry at EXSEQ == 2.
- srs (Simple Random Sample): Randomly selects observed values to replace missing data. The selected value depends on the set seed number; for example, 5 is chosen for seed value == 1.
- substitute: Fills missing values using a chosen summary statistic such as the minimum, maximum, mean, median, or mode.
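Tying the EX walkthrough together, here is a hedged pandas sketch of grouped imputation; the two-subject ex frame below is fabricated to be consistent with the values quoted above (5 before the gap at EXSEQ == 2, 10 after it).

```python
import numpy as np
import pandas as pd

# Fabricated EX-style data: each subject has EXDOSE missing at EXSEQ == 2.
ex = pd.DataFrame({
    "USUBJID": ["S-001"] * 3 + ["S-002"] * 3,
    "EXSEQ":   [1, 2, 3, 1, 2, 3],
    "EXDOSE":  [5.0, np.nan, 10.0, 5.0, np.nan, 10.0],
})

# Group by USUBJID first so imputed values never leak across subjects.
dose = ex.groupby("USUBJID")["EXDOSE"]

ex["interp"] = dose.transform(lambda s: s.interpolate())     # 7.5 at EXSEQ == 2
ex["locf"]   = dose.transform(lambda s: s.ffill())           # 5 at EXSEQ == 2
ex["nocb"]   = dose.transform(lambda s: s.bfill())           # 10 at EXSEQ == 2
ex["subst"]  = dose.transform(lambda s: s.fillna(s.mean()))  # per-subject mean
```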
The impute functionality ensures data completeness, enhancing the dataset's suitability for subsequent analysis.