Clean Operations

The Clean operation is a crucial step in data preparation, aimed at improving the quality and trustworthiness of a dataset. It fixes mistakes, inconsistencies, and gaps, making sure the data is accurate, complete, and ready for analysis or further use.

Remove Duplicates

Duplicate records can lead to inaccuracies and skew the outcome of data analysis, which is why every entry in a dataset should be unique. The Remove Duplicates feature within the cleaning tools locates and deletes redundant entries, ensuring that each record in the dataset is distinct.

Here is an example from the PC dataset, which contains 4 columns.

Applying Remove Duplicates to the USUBJID column deletes any duplicate records, leaving only the first instance of each subject in the dataset.
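A minimal pandas sketch of this behavior follows; USUBJID and VISIT come from the example, while the other two column names (PCTEST, PCORRES) and all values are hypothetical stand-ins for the PC dataset's four columns.

```python
import pandas as pd

# Hypothetical PC records: SUBJ-001 appears twice.
pc = pd.DataFrame({
    "USUBJID": ["SUBJ-001", "SUBJ-001", "SUBJ-002"],
    "VISIT":   ["Baseline", "Week 1", "Baseline"],
    "PCTEST":  ["Drug X", "Drug X", "Drug X"],
    "PCORRES": [1.2, 3.4, 0.9],
})

# Keep only the first row seen for each USUBJID.
deduped = pc.drop_duplicates(subset=["USUBJID"], keep="first")
print(deduped)  # SUBJ-001's second row is dropped
```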

Similarly, applying Remove Duplicates to the VISIT column retains only the first occurrence of each unique visit, effectively keeping the earliest record for each type of visit.

To keep the first occurrence of each visit for every subject, Remove Duplicates needs to be applied to both the USUBJID and VISIT columns together, as in the sketch below.
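A sketch of both variants, on the same hypothetical data as above (rebuilt here so the snippet runs on its own):

```python
import pandas as pd

# Hypothetical PC records; SUBJ-001 has a duplicated Baseline row.
pc = pd.DataFrame({
    "USUBJID": ["SUBJ-001", "SUBJ-001", "SUBJ-001", "SUBJ-002"],
    "VISIT":   ["Baseline", "Baseline", "Week 1", "Baseline"],
    "PCTEST":  ["Drug X"] * 4,
    "PCORRES": [1.2, 1.3, 3.4, 0.9],
})

# VISIT alone: first row of each unique visit across the whole dataset,
# so SUBJ-002's Baseline row is dropped along with the duplicate.
by_visit = pc.drop_duplicates(subset=["VISIT"], keep="first")

# USUBJID and VISIT together: first record of each visit per subject,
# so only SUBJ-001's duplicated Baseline row is dropped.
by_subject_visit = pc.drop_duplicates(subset=["USUBJID", "VISIT"], keep="first")
```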

This functionality aids in maintaining a clean and reliable dataset by removing extra copies of the same records, which is vital for accurate analysis.

Impute

Missing values are a common issue in real-world datasets and can negatively impact analysis if not properly addressed. The "Impute Missing Values" feature offers a variety of tools to effectively manage and replace missing data with estimated or calculated values.

Various imputation methods are available for users to select, including:

  • interp (Interpolation): Linearly estimates missing values based on neighboring data points.
  • locf (Last Observation Carried Forward): Fills missing entries with the most recent non-missing value.
  • nocb (Next Observation Carried Backward): Replaces missing values with the subsequent non-missing entry.
  • replace: Directly replaces missing values with a specific value set by the user.
  • srs (Simple Random Sample): Imputes missing values by randomly selecting from the observed data.
  • substitute: Replaces missing values using a statistical measure (like the mean or median) from the data.

These methods help ensure that the dataset is complete, with no missing values, facilitating more accurate and reliable analysis.

In the EX dataset example, grouping by USUBJID is recommended before imputation so that values are never carried across subjects. For instance, if EXDOSE is missing at EXSEQ == 2, different imputation methods can be applied; the worked examples below show the result of each.
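First, a minimal sketch of the setup, assuming hypothetical EXDOSE values of 5 and 10 around the missing entry (consistent with the worked numbers below):

```python
import numpy as np
import pandas as pd

# Hypothetical EX records for two subjects; EXDOSE is missing at EXSEQ == 2.
ex = pd.DataFrame({
    "USUBJID": ["SUBJ-001"] * 3 + ["SUBJ-002"] * 3,
    "EXSEQ":   [1, 2, 3, 1, 2, 3],
    "EXDOSE":  [5.0, np.nan, 10.0, 5.0, np.nan, 10.0],
})

# Grouping by USUBJID keeps each subject's records separate, so an
# imputed value is never carried across subject boundaries.
print(ex.groupby("USUBJID")["EXDOSE"].apply(list))
```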

Examples of Imputation Methods

  • interp (Interpolation): Linearly interpolates missing values between neighboring data points.

    Here, 7.5 (midway between the adjacent doses of 5 and 10) is imputed for the missing value in each subject group at EXSEQ == 2.

  • locf (Last Observation Carried Forward): Fills in missing data with the last available value.

    For instance, the preceding dose of 5 fills the missing value at EXSEQ == 2.

  • nocb (Next Observation Carried Backward): Backfills missing data with the next available value.

    Here, the following dose of 10 fills the missing entries at EXSEQ == 2.

  • srs (Simple Random Sample): Randomly selects observed values to replace missing data.

    The selected value depends on the seed that is set; for example, 5 is chosen when the seed is 1.

  • substitute: Fills missing values using a chosen summary statistic like minimum, maximum, mean, median, or mode.
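These methods can be reproduced with standard pandas operations; the mapping below is illustrative, not the tool's own API, and it rebuilds the hypothetical EX frame from above so the snippet is self-contained.

```python
import numpy as np
import pandas as pd

ex = pd.DataFrame({
    "USUBJID": ["SUBJ-001"] * 3 + ["SUBJ-002"] * 3,
    "EXSEQ":   [1, 2, 3, 1, 2, 3],
    "EXDOSE":  [5.0, np.nan, 10.0, 5.0, np.nan, 10.0],
})
g = ex.groupby("USUBJID")["EXDOSE"]

# interp: linear interpolation between 5 and 10 gives 7.5 at EXSEQ == 2.
ex["interp"] = g.transform(lambda s: s.interpolate())

# locf: carry the last observed value (5) forward.
ex["locf"] = g.transform(lambda s: s.ffill())

# nocb: carry the next observed value (10) backward.
ex["nocb"] = g.transform(lambda s: s.bfill())

# replace: fill with a fixed, user-chosen value (0 here, as an example).
ex["replace"] = ex["EXDOSE"].fillna(0)

# srs: draw replacements at random from the observed values; the result
# depends on the seed.
rng = np.random.default_rng(seed=1)
observed = ex["EXDOSE"].dropna().to_numpy()
draws = pd.Series(rng.choice(observed, size=len(ex)), index=ex.index)
ex["srs"] = ex["EXDOSE"].where(ex["EXDOSE"].notna(), draws)

# substitute: fill with a summary statistic of each group (the mean here;
# with observed doses of 5 and 10, the group mean is also 7.5).
ex["substitute"] = g.transform(lambda s: s.fillna(s.mean()))

print(ex)
```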

The impute functionality ensures data completeness, enhancing the dataset's suitability for subsequent analysis.