Core Operations

Core Operations form the foundation for manipulating and refining datasets to extract insights. They let users select columns, filter rows, derive new variables, group data logically, and operate on multiple columns at once. This section covers each of these operations in detail, illustrating them with practical examples based on the DAPA dataset (available in the examples).

Core Operations on Data Wrangler

  1. Select Columns
  2. Filter Rows
  3. Orderby
  4. Groupby
  5. Derive
  6. Convert
  7. Convert Units
  8. Infer Units
  9. Define Route
  10. Rename
  11. Sample Rows
  12. Slice Rows
  13. Relocate
  14. Count
  15. Ungroup

Original PC Dataset

Select Columns

Select specific columns from the dataset.

Functionality:

  • Choose columns based on their names or indices.
  • Select multiple columns at once or individually.

Example: Selecting the columns PCDTC (Sample Collection Time), PCSTRESN (Standard Result Numeric), and USUBJID (Unique Subject ID).
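
Outside the Data Wrangler UI, the same selection can be pictured in Julia. The sketch below is illustrative only, using DataFramesMeta.jl's @select on a small made-up DataFrame (the values are not from the DAPA dataset).

    using DataFrames, DataFramesMeta

    # Made-up miniature PC-style data, for illustration only
    pc = DataFrame(
        USUBJID  = ["DAPA01-001", "DAPA01-002"],
        PCDTC    = ["2021-01-01 08:00:00", "2021-01-01 08:05:00"],
        PCSTRESN = [0.05, 1.2],
        PCLLOQ   = [0.1, 0.1],
    )

    # Keep only the three columns from the example, in the requested order
    selected = @select(pc, :PCDTC, :PCSTRESN, :USUBJID)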

Note
  • All columns are selected by default.
  • To display columns in a specific order, select the checkboxes in the desired sequence.

Filter Rows

Filter rows based on specific conditions.

Functionality:

  • Apply filters using logical conditions (e.g., greater than, less than, equal to).
  • Filter rows based on categorical or numerical criteria.
  • Multiple conditions can be applied simultaneously using logical AND (&&) and logical OR (||).

Example 1: Filtering rows where PCSTRESN (Concentration) is less than or equal to 0.1. This is written as :PCSTRESN <= 0.1 in the filter box. Here 0.1 is the lower limit of quantification, taken from the PCLLOQ column.

Example 2: Filter rows where USUBJID is equal to "DAPA01-001". This is written as :USUBJID == "DAPA01-001" in the filter box.

Example 3: Apply both conditions from Examples 1 and 2. This is written as :PCSTRESN <= 0.1 && :USUBJID == "DAPA01-001" in the filter box. This keeps the rows where PCSTRESN is less than or equal to 0.1 and USUBJID is equal to "DAPA01-001".
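
The filter-box expressions above are ordinary Julia conditions. As an illustration only, they roughly correspond to row-wise subsetting with @rsubset from DataFramesMeta.jl, shown below on made-up data.

    using DataFrames, DataFramesMeta

    # Made-up data for illustration
    pc = DataFrame(
        USUBJID  = ["DAPA01-001", "DAPA01-001", "DAPA01-002"],
        PCSTRESN = [0.05, 0.8, 0.02],
    )

    # Example 1: concentrations at or below the lower limit of quantification
    below_lloq = @rsubset(pc, :PCSTRESN <= 0.1)

    # Example 3: both conditions combined with logical AND
    combined = @rsubset(pc, :PCSTRESN <= 0.1 && :USUBJID == "DAPA01-001")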

Note

The reduction in the number of rows after filtering is evident, as the resultant datasets are subsets of the original dataset.

Orderby

Order rows based on one or more columns.

Functionality:

  • Sort the dataset by the selected column(s) either in ascending or descending order.
  • The order of column selection matters, as the dataset will be sorted by the first column, then the second, and so on.

Example: Ordering data by USUBJID (Unique Subject ID) and ADTM (Analysis DateTime) in Ascending order. The process begins with the first subject, ordering their data points in ascending time order. This sequence is then repeated for each subsequent subject.
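
As a rough Julia analogue (illustrative only, on made-up data), this ordering corresponds to @orderby from DataFramesMeta.jl, sorting by subject first and by time within each subject.

    using DataFrames, DataFramesMeta, Dates

    # Made-up data for illustration
    pc = DataFrame(
        USUBJID = ["DAPA01-002", "DAPA01-001", "DAPA01-001"],
        ADTM    = DateTime.(["2021-01-02T08:00:00", "2021-01-01T10:00:00", "2021-01-01T08:00:00"]),
    )

    # Sort by subject, then by analysis time within each subject
    ordered = @orderby(pc, :USUBJID, :ADTM)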

Note

To change the order, the user can click on a pill and drag it to the desired position.

Groupby

Group rows based on unique values in one or more columns.

Functionality:

  • Group data by one or multiple columns.
  • Perform aggregate functions on grouped data (e.g., locf, nocb, count).

Example: Calculating the maxpcdate (maximum PCDTC) for each subject. First, the data is grouped by USUBJID.

Secondly, the maxseq column is derived to contain the maximum PCSEQ value for each group, using a column-wise evaluation approach with the code maximum(:PCSEQ).

Next, the maxpcdate is obtained for the last record of each subject (where PCSEQ == 64), with the remaining records filled with missing.

Finally, missing values are imputed using the next observation carried backward (NOCB) method, which is detailed in the Clean section.
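
The steps above can be pictured in plain Julia as the sketch below. This is not the tool's internal code: the data are made up, and the NOCB fill is written as a simple backward loop purely for illustration.

    using DataFrames, DataFramesMeta

    # Made-up data: two subjects with two records each
    pc = DataFrame(
        USUBJID = ["S1", "S1", "S2", "S2"],
        PCSEQ   = [1, 2, 1, 2],
        PCDTC   = ["2021-01-01", "2021-01-03", "2021-01-02", "2021-01-05"],
    )

    # Steps 1-2: group by subject and derive the per-group maximum PCSEQ
    pc = @transform(groupby(pc, :USUBJID), :maxseq = maximum(:PCSEQ))

    # Step 3: keep PCDTC only on the last record of each subject, missing elsewhere
    pc = @rtransform(pc, :maxpcdate = :PCSEQ == :maxseq ? :PCDTC : missing)

    # Step 4: next observation carried backward (NOCB), as a simple reverse fill
    for i in (nrow(pc) - 1):-1:1
        if ismissing(pc.maxpcdate[i])
            pc.maxpcdate[i] = pc.maxpcdate[i + 1]
        end
    end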

Note

The groupby operation applies only to the transformation invoked immediately after it. To apply grouping to multiple transformations, a separate groupby must precede each one. In the example above, grouping was used only to derive the maximum PCSEQ for each subject; maxpcdate was derived from the maxseq column and did not require a groupby.

Derive

Create new columns by performing calculations or transformations on existing columns.

Functionality:

  • Apply mathematical operations (e.g., addition, subtraction, multiplication, division).
  • Use inbuilt functions to transform data.
Note

Any inbuilt functions available in the Julia programming language can be used in the derive operation. The following table lists some common examples of the inbuilt functions that can be used in the derive operation.

Inbuilt Functions | Code Example | Description
ceil | ceil(:PCSTRESN) | Rounds the value up to the nearest integer.
clamp | clamp(:PCSTRESN, 50, 100) | Limits the values of the column to the interval defined by the two bounds (lower, upper).
floor | floor(:PCSTRESN) | Rounds the value down to the nearest integer.
lpad | lpad(:USUBJID, 11, '0') | Pads the value on the left with the given character ('0') until the string reaches the specified length (11).
lstrip | lstrip(:PCTEST, ['D', 'A']) | Returns the string with any leading whitespace removed. If chars (a character, or a vector or set of characters) is provided, those characters are removed from the left instead.
max | max(:PCSEQ, :PCTPTNUM) | Returns the row-wise maximum of the two columns.
min | min(:PCSEQ, :PCTPTNUM) | Returns the row-wise minimum of the two columns.
rand | rand(1:10) | Creates a new column of random integers in the range 1 to 10.
rpad | rpad(:USUBJID, 11, '0') | Pads the value on the right with the given character ('0') until the string reaches the specified length (11).
rstrip | rstrip(:PCTESTCD, ['D', 'A']) | Returns the string with any trailing whitespace removed. If chars is provided, those characters are removed from the right instead.
round | round(:PCSTRESN) | Rounds the value to the nearest integer.
strip | strip(:PCTESTCD, ['D', 'A']) | Returns the string with any leading and trailing whitespace removed. If chars is provided, those characters are removed from both ends instead.
string | string(:PCSEQ) | Converts the values of a column to strings.
  • Utilize functions compatible with the @rtransform macro from the DataFramesMeta.jl package, such as Date, Time, DateTime, SubString, split, Millisecond, Int, parse, div, among others.

  • There are two ways to create new variables in a dataset:

    • Row-wise: This method looks at one row at a time and uses the data in that row to calculate new values. These new values are then stored in the same row, but can use information from different columns within that row.

    • Column-wise: This method looks at one column at a time, either in isolation or by grouping similar entries together. It calculates new values based on the data in each column. These new values are then assigned to the appropriate rows within that column or a new column, potentially affecting multiple rows at once. A short sketch contrasting the two approaches follows this list.
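
The sketch below contrasts row-wise and column-wise derivation using DataFramesMeta.jl on made-up data; it illustrates the idea rather than the tool's implementation.

    using DataFrames, DataFramesMeta

    pc = DataFrame(PCSEQ = [1, 2, 3], PCTPTNUM = [-0.5, 2.0, 4.0])  # made-up values

    # Row-wise: each row is evaluated using only its own values
    rowwise = @rtransform(pc, :SUMCOL = :PCSEQ + :PCTPTNUM)

    # Column-wise: the whole column is passed to the function at once
    colwise = @transform(pc, :MAXSEQ = maximum(:PCSEQ))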

Sample of PC dataset with 5 columns.

By row

Example 1: Create a column from another column

Example 2: Create a column by taking a substring from another column

This takes the 7th and 8th characters from the VISIT values to create APERIOD. Here Analysis Period (APERIOD) is generated using APERIOD = SubString(:VISIT, 7, 8).

An alternative approach is to split the VISIT column values on whitespace, which results in four strings: "Period", "1", "Day", "1". The second string is then extracted and used for APERIOD. The code is written as APERIOD = split(:VISIT, " ")[2], as seen below.

Example 3: Generating Columns Based on Conditions Applied to Other Columns. The operators ? and : form what is known as the ternary operator, a conditional operator that takes three operands:

  • A condition followed by a question mark ?
  • An expression to evaluate if the condition is true, followed by a colon :
  • An expression to evaluate if the condition is false.

Here Nominal time relative to last dose (NRRLT) is generated using NRRLT = :PCTPTNUM < 0 ? 0 : :PCTPTNUM

This can be explained as follows: for each row in the dataset, the operation below is performed (a code sketch follows the list):

  • if :PCTPTNUM < 0, then NRRLT is assigned 0
  • else NRRLT is assigned :PCTPTNUM (the value is retained)
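
As an illustration only, the same row-wise conditional can be written with @rtransform from DataFramesMeta.jl on made-up data:

    using DataFrames, DataFramesMeta

    pc = DataFrame(PCTPTNUM = [-0.5, 0.0, 2.0])   # made-up values

    # Negative nominal times become 0; non-negative values are kept as-is
    pc = @rtransform(pc, :NRRLT = :PCTPTNUM < 0 ? 0 : :PCTPTNUM)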

Example 4: Convert column to float. APERIOD Float (APERIODF) is generated from the numeric string APERIOD using APERIODF = parse(Float64, :APERIOD).

Note

In Julia, parse and tryparse are essential for string data type conversion. Here’s how they work in general terms:

  • parse forcefully converts a value to a specified type and will error out if the conversion isn't possible.
  • tryparse attempts the conversion but does not error if it fails; in plain Julia it returns nothing, and in the derive operation failed conversions appear as missing.

General Examples

  1. Convert Numeric String to Float: When converting numeric strings, parse directly converts the values to the requested type.

    integer_value = "10"
    float_value = parse(Float64, integer_value)  # Converts to 10.0
  2. Convert Non-numeric String to Integer: When converting non-numeric strings, parse will throw an error.

    non_numeric_string = "ABC"
    non_numeric_value = parse(Int, non_numeric_string)  # throws an ArgumentError
  3. Convert Mixed Strings to Integer: When strings may be numeric or non-numeric, tryparse converts the numeric ones and returns nothing (surfaced as missing in the derive operation) for the rest.

    numeric_string = "1"
    non_numeric_string = "ABC"
    numeric_value = tryparse(Int, numeric_string)  # Converts "1" to 1
    non_numeric_value = tryparse(Int, non_numeric_string)  # Returns `nothing`

In these scenarios, parse enforces a strict conversion and fails when a value cannot be converted, while tryparse offers a lenient approach that attempts the conversion without raising runtime errors.

Example 5: Apply mathematical operations on columns. Nominal Time Relative to First Dose (NFRLT) is generated using NFRLT = :NRRLT + (24*7*(:APERIODF - 1)). Analysis Visit Number (AVISITN) is generated using AVISITN = 7 * (:APERIODF - 1) + 1.

Example 6: Convert columns to multiple types. AVISITN is converted to an Integer and then prefixed with a string to create Analysis Visit (AVISIT), formulated as AVISIT = string("Day ", Int(:AVISITN)).

Example 7: Convert string to DateTime. Deriving Analysis Date Time (ADTM) from PCDTC involves specifying the input date format as yyyy-mm-dd HH:MM:SS: ADTM = DateTime(:PCDTC, "yyyy-mm-dd HH:MM:SS").

Example 8: Extract Date from DateTime. Analysis Date (ADT) is derived from ADTM using ADT = Date(:ADTM).

Example 9: Extract Time from DateTime. Analysis Time (ATM) is derived from ADTM using ATM = Time(:ADTM).
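
Examples 7-9 use Julia's Dates standard library. A small standalone sketch (with a made-up timestamp) is shown below.

    using Dates

    pcdtc = "2021-01-01 08:30:00"                     # made-up PCDTC value

    adtm = DateTime(pcdtc, "yyyy-mm-dd HH:MM:SS")     # Example 7: string -> DateTime
    adt  = Date(adtm)                                 # Example 8: DateTime -> Date
    atm  = Time(adtm)                                 # Example 9: DateTime -> Time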

By column

Note

The By Column checkbox should be selected when the operation is performed on a column-by-column basis.

Example 1: Create new columns with a single string and a number

  • EVID (Event ID) - indicates whether the data records are dosing records or not (PC = 0, EX = 1).
  • VISITNUMU (VISITNUM units) - days are used here.

Example 2: Find the maximum value across the full dataset. The new column maxpcstresn is created and populated with the maximum PCSTRESN value across all subjects, using the code maximum(:PCSTRESN).

Example 3: Find the maximum value for each subject. To calculate the maximum PCSTRESN for each subject, grouping by USUBJID is required before deriving the column.
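
A rough Julia analogue of Examples 2 and 3 is sketched below on made-up data; grouping determines whether the maximum is taken over the whole column or within each subject.

    using DataFrames, DataFramesMeta

    pc = DataFrame(
        USUBJID  = ["S1", "S1", "S2"],
        PCSTRESN = [0.5, 1.2, 0.8],
    )

    # Example 2: maximum across the full dataset, repeated on every row
    pc = @transform(pc, :maxpcstresn = maximum(:PCSTRESN))

    # Example 3: maximum within each subject (grouping is required first)
    pc = @transform(groupby(pc, :USUBJID), :maxpcstresn_subj = maximum(:PCSTRESN))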

Note

Derive can be used to create new columns or to mutate existing columns by using an existing column name as the new column name, in which case the derived column replaces the old one.

The USUBJID column has been mutated here.

Convert

Convert all values in the selected column(s) to the chosen type.

Functionality:

  • When strict conversion is enabled, any values that cannot be converted to the chosen type will raise an error, and the operation will fail to run.
  • When strict conversion is disabled, any values that cannot be converted will result in missing values instead of errors.
  • The column type appears in parentheses in the column header, below the column name.

Example: Convert APERIOD (string column) to Integer
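
The exact implementation is not shown here, but the strict/non-strict behavior can be pictured in plain Julia with parse and tryparse, sketched below on a made-up string column.

    aperiod = ["1", "2", "x"]          # made-up string column; "x" is not convertible

    # Strict conversion: the operation fails if any value cannot be converted
    # parse.(Int, aperiod)             # would throw an ArgumentError because of "x"

    # Non-strict conversion: unconvertible values end up as missing
    parsed = tryparse.(Int, aperiod)                              # [1, 2, nothing]
    non_strict = [p === nothing ? missing : p for p in parsed]    # [1, 2, missing]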

Convert Units

Convert the selected column to the given unit.

Functionality:

  • When no entry is made in the "New column name" textbox, then the selected column is used as the output column for the unit conversion operation.
  • Selecting Unitless as the unit type will simply assign a unit to that column rather than performing a conversion between units. This is the default when the dataset column variables have no prior assigned units.
  • An assigned unit will appear in the column header next to the column type in parentheses. Example - (ng/mL)

Initial Dataset

Example 1: Assign a unit to PCSTRESN (Number column). Units (ng/mL) are assigned to PCSTRESN. If the old and new columns are the same, 'unitless' becomes the default option, indicating that units are being assigned to the existing column.

Example 2: Convert units of PCSTRESN (Number column with an assigned unit). A new column PCSTRESN [mg/mL] is created from PCSTRESN, which has the unit ng/mL. The process involves selecting ng/mL as the input unit and mg/mL as the output unit. Consequently, the unit mg/mL is assigned to the PCSTRESN [mg/mL] column.
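
The ng/mL to mg/mL conversion itself is a fixed scale factor (1 mg = 1,000,000 ng). The arithmetic is sketched below on made-up values; the tool handles the unit bookkeeping automatically.

    pcstresn_ng_per_ml = [250.0, 1200.0]              # made-up values in ng/mL
    pcstresn_mg_per_ml = pcstresn_ng_per_ml ./ 1e6    # [0.00025, 0.0012] in mg/mL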

Note

Unit conversion can only be applied to columns that have an assigned unit. If a conversion is attempted on a unitless column, the resultant column will remain the same as the original column.

Infer Units

Infer the units for all columns in the dataset based on the column name, the first row of the dataset, or a separate column that has units defined.

Functionality:

  • In a column header the unit type is denoted by parentheses or square brackets after the column name.
  • If the unit is defined in the first row then only the unit text, e.g. ng/mL is required.
  • Selecting Automatic first checks the column Header and then the First Row of the dataset for units. This is the default.
  • A user can instead explicitly select Header or First Row to infer units from the column name or from the first row of the dataset, respectively.
  • If units are found, they are inferred; otherwise, the column is retained as is.
  • If Columns is selected, the units are inferred from another column in the dataset that has U as the suffix for the selected columns. For example, if PCSTRESN is selected, the units will be inferred from the column PCSTRESNU.

Example 1: A column named PCSTRESN [ng/mL] or PCSTRESN (ng/mL) would be inferred to have a column name of PCSTRESN and unit of ng/mL.

By renaming PCSTRESN to PCSTRESN [ng/mL], the unit is assigned to the column, as it is inferred from the new name PCSTRESN [ng/mL], as seen below.
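
The tool's own parser is not shown here; as a hypothetical sketch, a unit can be recovered from a column name such as PCSTRESN [ng/mL] or PCSTRESN (ng/mL) with a regular expression (the helper name split_name_and_unit is made up).

    # Hypothetical helper: pull a trailing "[unit]" or "(unit)" out of a column name
    function split_name_and_unit(colname::AbstractString)
        m = match(r"^(.*?)\s*[\[\(]([^\]\)]+)[\]\)]\s*$", colname)
        return m === nothing ? (colname, nothing) : (m.captures[1], m.captures[2])
    end

    split_name_and_unit("PCSTRESN [ng/mL]")   # ("PCSTRESN", "ng/mL")
    split_name_and_unit("PCSTRESN (ng/mL)")   # ("PCSTRESN", "ng/mL")
    split_name_and_unit("USUBJID")            # ("USUBJID", nothing)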

Note

The Infer Units operation currently uses a pre-defined set of units. The table below provides all the supported units for the Infer Units operation.

Unit Type | Unit | Aliases
Mass | Kilogram (kg) |
Mass | Gram (g) |
Mass | Milligram (mg) |
Mass | Microgram (μg) | ug
Mass | Nanogram (ng) |
Time | Second (s) | sec, secs, second, seconds
Time | Minute (minute) | min, mins, minutes
Time | Hour (hr) | h, hour, hours
Time | Day (d) | day, days
Volume | Liter (L) |
Volume | Milliliter (mL) |
Concentration | ng/mL |
Concentration | μg/mL | ug/mL
Concentration | mg/mL |
Concentration | g/mL |
Concentration | ng/L |
Concentration | μg/L | ug/L
Concentration | mg/L |
Concentration | g/L |

Define Route

Provide a manually defined route column with the selected route type.

Route Abbreviation | Expansion | Description
ev | Extra Vascular | All routes of administration except those in which the drug is directly introduced into the bloodstream.
po | Per Os (Oral) | Administration to or by way of the mouth.
im | Intramuscular | Administration within a muscle.
sc | Subcutaneous | Administration beneath the skin; hypodermic (synonymous with the term SUBDERMAL).
oral | Oral | Administration to or by way of the mouth.
iv | Intravenous | Administration within or into a vein or veins.
inf | Infiltration | Administered in a diffuse pattern to multiple sites.

Example 1: Create column route as oral

Note
  • Using this method allows the addition of a single route to all subjects.
  • To add different routes for different groups, the derive function should be used (see the sketch below the note).
  • sc, po, im, oral are just aliases for ev. The actual route is ev in the internal representation.
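
The derive-based approach mentioned in the note can be sketched with a row-wise conditional; the EXROUTE column and its values below are made up for illustration.

    using DataFrames, DataFramesMeta

    # Made-up dosing information; EXROUTE is hypothetical
    ex = DataFrame(USUBJID = ["S1", "S2"], EXROUTE = ["ORAL", "INTRAVENOUS"])

    # Assign "iv" to intravenous records and "oral" to everything else
    ex = @rtransform(ex, :route = :EXROUTE == "INTRAVENOUS" ? "iv" : "oral")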

Rename

Change the name of the selected column.

Example 1: Rename PCSTRESN to PCSTRESN [ng/mL]

Sample Rows

  • Randomly sample the dataset for the desired number of rows from the original dataset.
  • A seed can be provided to ensure reproducibility.
  • When With Replacement is selected, the same row can be sampled more than once; otherwise, each row can appear at most once.

Example 1: Sample 10 rows (random number seed = 1, with replacement)
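
What seeded sampling with replacement amounts to can be sketched with StatsBase.sample on made-up data; the tool's own implementation may differ.

    using DataFrames, Random, StatsBase

    pc = DataFrame(PCSEQ = 1:20, PCSTRESN = rand(20))    # made-up 20-row dataset

    rng = MersenneTwister(1)                             # seed = 1 for reproducibility
    idx = sample(rng, 1:nrow(pc), 10; replace = true)    # 10 indices, with replacement
    sampled = pc[idx, :]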

Note

Different seeds produce different samples; however, reusing the same seed (e.g., seed = 1) will always reproduce the same sample.

Slice Rows

Select a range of rows to keep in the dataset.

Example 1: Select the first 5 rows (1 to 5)

Example 2: Select the last rows (1531 to 1536, where 1536 is the last row of PC)

Note

The start and end indices are inclusive.

Relocate

  • Move the selected column to the desired location in the dataset.
  • The column will be inserted before or after the selected column depending on the option selected.

Original Dataset

Example 1: Relocate after

Example 2: Relocate before

Count

  • Displays the count of rows in the dataset.
  • This functionality can be useful when you want to count the number of rows in a group.

Original Dataset

Example 1: Count without groupby

Example 2: Count after Groupby
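
A rough Julia analogue of both examples, on made-up data: nrow gives the total count, and combine with groupby gives a per-group count.

    using DataFrames

    pc = DataFrame(USUBJID = ["S1", "S1", "S2"], PCSTRESN = [0.5, 1.2, 0.8])

    total_rows = nrow(pc)                                         # Example 1: 3 rows

    per_subject = combine(groupby(pc, :USUBJID), nrow => :count)  # Example 2: rows per subject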

Ungroup

Ungroups the dataset after groupby operations have been performed.