Core Operations
Core Operations form the foundation for manipulating and refining datasets to extract valuable insights. These operations empower users to seamlessly navigate through their data, allowing for selective extraction, precise filtering, derivation of new variables, logical grouping, and simultaneous operations across multiple columns. In this section, we delve into each of these fundamental operations in detail, showcasing their functionalities through practical examples utilizing the DAPA (that is available in the examples) dataset, thus equipping users with the essential skills to harness the full potential of their data.
Core Operations on Data Wrangler
- Select Columns
- Filer Rows
- Orderby
- Groupby
- Derive
- Convert
- Convert Unit
- Infer Units
- Define Route
- Rename
- Sample Rows
- Slice
- Relocate
- Count
- Ungroup
Original PC Dataset

Select Columns
Select specific columns from the dataset.
Functionality:
- Choose columns based on their names or indices.
- Select multiple columns at once or individually.
Example: Selecting columns PCDTC (Sample collection Time), PCSTRESN (Standard result numeric) , and USUBJID (Unique Subject ID). 
- All columns are selected by default.
- To display columns in a specific order, select the checkboxes in the desired sequence.
Filter Rows
Filter rows based on specific conditions.
Functionality:
- Apply filters using logical conditions (e.g., greater than, less than, equal to)
- Filter rows based on categorical or numerical criteria.
- Multiple conditions can be applied simultaneously using logical AND (&&) and logical OR (||).
Example 1: Filtering rows where PCSTRESN (Concentration) is less than 0.1. This is written as :PCSTRESN <= 0.1 in the filter box.
0.1 is the lower limit of quantification in this case. Found from PCLLOQ column.
Example 2: Filter rows where USUBJID is equal to "DAPA01-001". This is written as :USUBJID == "DAPA01-001" in the filter box. 
Example 3: Apply both conditions from Example 1 and 2. This is written as :PCSTRESN <= 0.1 && :USUBJID == "DAPA01-001" in the filter box.
This will filter the rows where PCSTRESN is less than 0.1 and USUBJID is equal to "DAPA01-001".
The reduction in the number of rows after filtering is evident, as the resultant datasets are subsets of the original dataset.
Orderby
Order rows based on one or more columns.
Functionality:
- Sort the dataset by the selected column(s) either in ascending or descending order.
- The order of column selection matters, as the dataset will be sorted by the first column, then the second, and so on.
Example: Ordering data by USUBJID (Unique Subject ID) and ADTM (Analysis DateTime) in Ascending order.
The process begins with the first subject, ordering their data points in ascending time order. This sequence is then repeated for each subsequent subject.
Groupby
Group rows based on unique values in one or more columns.
Functionality: Group data by one or multiple columns. Perform aggregate functions on grouped data (eg. locf, nocb, count).
Example: Calculating the maxpcdate (maximum PCDTC) for each subject.
First, the data is grouped by USUBJID.
Secondly, the maxseq column is derived to contain the maximum PCSEQ value for each group, using a column-wise evaluation approach with the code maximum(:PCSEQ).
Next, the maxpcdate is obtained for the last record of each subject (where PCSEQ == 64), with the remaining records filled with missing.
Finally, missing values are imputed using the next observation carried backward (NOCB) method, which is detailed in the Clean section.
The groupby operation is only applied to the transformation invoked right after the grouping operation. If we want to apply the groupby operation to multiple transformations, we have to groupby each transformation separately. In the above example, the grouping operation was used primarily to derive the maximum PCSEQ for each subject. The maxpcdate was derived using the maxseq column and did not need a groupby operation.
Derive
Create new columns by performing calculations or transformations on existing columns.
Functionality:
- Apply mathematical operations (e.g., addition, subtraction, multiplication, division).
- Use inbuilt functions to transform data.
Any inbuilt functions available in the Julia programming language can be used in the derive operation. The following table lists some common examples of the inbuilt functions that can be used in the derive operation.
| Inbuilt Functions | Code Example | Description |
|---|---|---|
| ceil | ceil(:PCSTRESN) | rounds the float to the next greatest integer. |
| clamp | clamp(:PCSTRESN, 100, 50) | limits value of the column to a specified interval defined by two values. |
| floor | floor(:PCSTRESN) | rounds the float to the previous smallest integer. |
| lpad | lpad(:USUBJID, 11, '0') | pads extra characters("0") to the left of the value till the whole string reaches particular length (11) |
| lstrip | lstrip(:PCTEST, ['D', 'A']) | Returns string with any leading whitespace removed. If chars (a character, or vector or set of characters) is provided, instead remove characters contained fro the left. |
| max | max(:PCSEQ, :PCTPTNUM) | Returns the maximum value from both the columns. |
| min | min(:PCSEQ, :PCTPTNUM) | Returns the mainimum value from both the columns. |
| rand | rand(1:10) | Creates a new column with random numbers varying between the interval (1-10). |
| rpad | rpad(:USUBJID, 11, '0') | pads extra characters("0") to the right of the value till the whole string reaches particular length(11). |
| rstrip | rstrip(:PCTESTCD, ['D', 'A']) | Returns string with any trailing whitespace removed. If chars (a character, or vector or set of characters) is provided, instead remove characters contained from the right. |
| round | round(:PCSTRESN) | rounds the float to the next integer either greater or smaller. |
| strip | strip(:PCTESTCD, ['D', 'A']) | Returns string with any trailing and leading whitespace removed. If chars (a character, or vector or set of characters) is provided, instead remove characters contained from both right and left. |
| string | string(:PCSEQ) | convert values of a column to string. |
Utilize functions compatible with the
@rtransformmacro from the DataFramesMeta.jl package, such asDate,Time,DateTime,SubString,split,Millisecond,Int,parse,div, among others.There are two ways to create new variables in a dataset:
Row-wise: This method looks at one row at a time and uses the data in that row to calculate new values. These new values are then stored in the same row, but can use information from different columns within that row.
Column-wise: This method looks at one column at a time, either in isolation or by grouping similar entries together. It calculates new values based on the data in each column. These new values are then assigned to the appropriate rows within that column or a new column, potentially affecting multiple rows at once.
Sample of PC dataset with 5 columns.
By row
Example 1 : Create a column from another column 
Example 2 : Create a column by taking substring from another column 
This takes the 7th and 8th characters from VISIT values and creates APERIOD. Here Analysis Period (APERIOD) is generated using APERIOD = SubString(:VISIT, 7,8)
An alternate approach is to split the VISIT column values based on whitespace, that results in four strings: "Period", "1", "Day", "1". The second string is then extracted and used for APERIOD. The code is represented as APERIOD = split(:VISIT, " ")[2] as seen below. 
Example 3 : Generating Columns Based on Conditions Applied to Other Columns.
The operators ? and : are part of what is known as the ternary operator. The ternary operator is a conditional operator that takes three operands -
- A condition followed by a question mark
? - An expression to evaluate if the condition is true, followed by a colon
: - An expression to evaluate if the condition is false.
Here Nominal time relative to last dose (NRRLT) is generated using NRRLT = :PCTPTNUM < 0 ? 0 : :PCTPTNUM
This can be explained as, for each row in the dataset, the following operation is performed:
- if
:PCTPTNUM < 0, thenNRRLTis assigned with0 - else
:PCTPTNUM(the value is retained)
Example 4 : Convert column to float
APERIOD Float (APERIODF) is generated from numeric string APERIOD. APERIODF = parse(Float64, :APERIOD)
In Julia, parse and tryparse are essential for string data type conversion. Here’s how they work in general terms:
parseforcefully converts a value to a specified type and will error out if the conversion isn't possible.tryparseattempts to convert a value to a specified type but returnsmissingif the conversion fails, preventing errors.
General Examples
Convert Numeric String to Integer: When converting strings (integers in string format) to floats,
parsecan be used to directly convert the values, ensuring each integer is transformed into a float.integer_value = "10" float_value = parse(Float64, integer_value) # Converts to 10.010.0Convert Non-numeric String to Integer: When converting strings (non-numeric data) to floats,
parsewill throw an error.non_numeric_string = "ABC" non_numeric_value = tryparse(Int, non_numeric_string) # will throw an errorConvert Mixed String to Integer: In a scenario where strings might represent numeric or non-numeric data,
tryparsecan differentiate and convert accordingly, returningmissingfor non-numeric strings.numeric_string = "1" non_numeric_string = "ABC" numeric_value = tryparse(Int, numeric_string) # Converts "1" to 1 non_numeric_value = tryparse(Int, non_numeric_string) # Returns `missing`
In these scenarios, parse ensures a strict conversion process, demanding exact match in data types, while tryparse offers a lenient approach, allowing for safe conversion attempts without causing runtime errors.
Example 5 : Apply mathematical operations on columns
Nominal Time relative to first dose (NFRLT) is generated using NFRLT = :NRRLT + (24*7*(:APERIODF - 1))
Analysis Visit Number (AVISITN) is generated using AVISITN = 7 * (:APERIODF - 1) + 1
Example 6 : Convert columns to multiple types
AVISITN is converted to an Integer, and then it is prefixed with a string to create Analysis Visit (AVISIT), which is formulated as: AVISIT = string("Day ", Int(:AVISITN))
Examples 7 : Convert string to datetime
Deriving Analysis Date Time (ADTM) from PCDTC involves specifying the input date format as yyyy-mm-dd HH:MM:SS. ADTM = DateTime(:PCDTC,"yyyy-mm-dd HH:MM:SS")
Example 8 : Extract Date from DateTime
Deriving Analysis Date (ADT) from ADTM. ADT = Date(:ADTM)
Example 9 : Extract Time from DateTime
Deriving Analysis Time (ATM) from ADTM. ATM = Time(:ADTM)
By column
The By Column checkbox should be selected when the operation is performed on a column-by-column basis.
Example 1 : Create new columns with a single string and a number 
EVID(Event ID) - represents the whether the data records are dosing or not.PC = 0,EX = 1VISITNUMU(VISITNUM units) -daysare used here
Example 2 : Find maximum value across the full dataset
The new column maxpcstresn is created and populated with the maximum PCSTRESN value across all subjects, using the code maximum(:PCSTRESN).
Example 3 : Find maximum value for each Subject
To calculate the maximum PCSTRESN for each subject, grouping by USUBJID is required before deriving the column.
Derive can be used to create new columns or mutate existing columns by specifying the column name as the old column name. The new column will replace the old column.
The USUBJID column has been mutated here.
Convert
Convert all values in the selected column(s) to the chosen type.
Functionality:
- When strict conversion is enabled, any values that cannot be converted to the chosen type will raise an error, and the operation will fail to run.
- When strict conversion is disabled, any values that cannot be converted will result in missing values instead of errors.
- Note that the column type appears in parentheses next to the column name in the column header below the column name.
Example: Convert APERIOD (string column) to Integer 
Convert Units
Convert the selected column to the given unit.
Functionality:
- When no entry is made in the "New column name" textbox, then the selected column is used as the output column for the unit conversion operation.
- Selecting
Unitlessas the unit type will simply assign a unit to that column rather than performing a conversion between units. This is the default when the dataset column variables have no prior assigned units. - An assigned unit will appear in the column header next to the column type in parentheses. Example - (ng/mL)
Initial Dataset
Example 1: Assign unit to PCSTRESN (Number column)
Units (ng/mL) are assigned to PCSTRESN. If the old and new columns are the same, 'unitless' becomes the default option, indicating that units are being assigned to the existing column.
Example 2 : Convert units of PCSTRESN (Assigned unit Number column)
A new column PCSTRESN [mg/mL] is created from PCSTRESN, which has the unit ng/mL. The process involves selecting ng/mL as the input unit and mg/mL as the output unit. Consequently, the unit mg/mL is assigned to the PCSTRESN [mg/mL] column.
Unit conversion can only be applied to columns that have an assigned unit. If a conversion is attempted on a unitless column, the resultant column will remain the same as the original column.
Infer Units
Infer the units for all columns in the dataset based either on the content of the column name, the first row of the dataset or an entire column that has units defined.
Functionality:
- In a column header the unit type is denoted by parentheses or square brackets after the column name.
- If the unit is defined in the first row then only the unit text, e.g.
ng/mLis required. - Selecting
Automaticfirst checks the columnHeaderand then theFirst Rowof the dataset for units. This is default. - A user can instead explicitly select
HeaderorFirst Rowto infer units from the column name or the first row of the dataset respectively instead of theAutomaticoption. - If units are available, it would be inferred. Or else, the column will be retained as it is.
- If
Columnsis selected, the units are inferred from another column in the dataset that hasUas the suffix for the selected columns. For example, ifPCSTRESNis selected, the units will be inferred from the columnPCSTRESNU.
Example 1: A column named PCSTRESN [ng/mL] or PCSTRESN (ng/mL) would be inferred to have a column name of PCSTRESN and unit of ng/mL. 
By renaming PCSTRESN to PCSTRESN [ng/mL], units are assigned to the column, as it is inferred from its name PCSTRESN [ng/mL] as seen below. 
The Infer Units operation currently uses a pre-defined set of units. The table below provides all the supported units for the Infer Units operation.
| Unit Type | Unit | Aliases |
|---|---|---|
| Mass | Kilogram (kg) | |
Gram (g) | ||
Milligram (mg) | ||
Microgram (μg) | ug | |
Nanogram (ng) | ||
| Time | Second (s) | sec, secs, second, seconds |
Minute (minute) | min, mins, minutes | |
Hour (hr) | h, hour, hours | |
Day (d) | day, days | |
| Volume | Liter (L) | |
Milliliter (mL) | ||
| Concentration | ng/mL | |
μg/mL | ug/mL | |
mg/mL | ||
g/mL | ||
ng/L | ||
μg/L | ug/L | |
mg/L | ||
g/L |
Define Route
Provide a manually defined route column with the select route type.
| Route Abbreviation | Expansion | Description |
|---|---|---|
| ev | Extra Vascular | All routes of administration except those in which the drug is directly introduced into the bloodstream. |
| po | Per Os (Oral) | Administration to or by way of the mouth |
| im | Intramuscular | administration within a muscle |
| sc | Subcutaneous | administration beneath the skin; hypodermic (synonymous with the term SUBDERMAL) |
| oral | Oral | Administration to or by way of the mouth. |
| iv | Intravenous | administration within or into a vein or veins. |
| inf | Infiltration | administered in a diffuse pattern to multiple sites. |
Example 1: Create column route as oral 
- Using this method allows the addition of a single
routeto all subjects. - To add different routes for different groups, the
derivefunction should be used. sc,po,im,oralare just aliases forev. The actual route isevin the internal representation.
Rename
Change the name of the selected column.
Example 1: Rename PCSTRESN to PCSTRESN [ng/mL] 
Sample Rows
- Randomly sample the dataset for the desired number of rows from the original dataset.
- A seed can be provided to ensure reproducibility.
- By selecting
With Replacement, the same rows will be sampled multiple times and vice versa.
Example 1: Sample 10 rows (random number seed = 1, with replacement) 
Different seeds result in unique combinations. However, using e.g. seed = 1 will consistently produce the initial combination.
Slice Rows
Select a range of rows to keep in the dataset.
Example 1: Select the first 5 rows (1 to 5) 
Example 2: Select the last 5 rows (1531 to 1536 (last row of PC)) 
Relocate
- Move the selected column to the desired location in the dataset.
- The column will be inserted
beforeorafterthe selected column depending on the option selected.
Original Dataset
Example 1: Relocate after 
Example 2: Relocate before 
Count
- Displays the count of rows in the dataset.
- This functionality can be useful when you want to count the number of rows in a group.
Original Dataset
Example 1: Count without groupby 
Example 2: Count after Groupby 
Ungroup
Ungroups dataset after groupby operations are performed. 