Core Operations
Core Operations form the foundation for manipulating and refining datasets to extract valuable insights. These operations empower users to seamlessly navigate through their data, allowing for selective extraction, precise filtering, derivation of new variables, logical grouping, and simultaneous operations across multiple columns. In this section, we delve into each of these fundamental operations in detail, showcasing their functionalities through practical examples utilizing the DAPA (that is available in the examples) dataset, thus equipping users with the essential skills to harness the full potential of their data.
Core Operations on Data Wrangler
- Select Columns
- Filer Rows
- Orderby
- Groupby
- Derive
- Convert
- Convert Unit
- Infer Units
- Define Route
- Rename
- Sample Rows
- Slice
- Relocate
- Count
- Ungroup
Original PC Dataset
Select Columns
Select specific columns from the dataset.
Functionality:
- Choose columns based on their names or indices.
- Select multiple columns at once or individually.
Example: Selecting columns PCDTC
(Sample collection Time), PCSTRESN
(Standard result numeric) , and USUBJID
(Unique Subject ID).
- All columns are selected by default.
- To display columns in a specific order, select the checkboxes in the desired sequence.
Filter Rows
Filter rows based on specific conditions.
Functionality:
- Apply filters using logical conditions (e.g., greater than, less than, equal to)
- Filter rows based on categorical or numerical criteria.
- Multiple conditions can be applied simultaneously using logical AND (&&) and logical OR (||).
Example 1: Filtering rows where PCSTRESN
(Concentration) is less than 0.1. This is written as :PCSTRESN <= 0.1
in the filter box. 0.1 is the lower limit of quantification in this case. Found from
PCLLOQ
column.
Example 2: Filter rows where USUBJID
is equal to "DAPA01-001". This is written as :USUBJID == "DAPA01-001"
in the filter box.
Example 3: Apply both conditions from Example 1 and 2. This is written as :PCSTRESN <= 0.1 && :USUBJID == "DAPA01-001"
in the filter box. This will filter the rows where
PCSTRESN
is less than 0.1 and USUBJID
is equal to "DAPA01-001".
The reduction in the number of rows after filtering is evident, as the resultant datasets are subsets of the original dataset.
Orderby
Order rows based on one or more columns.
Functionality:
- Sort the dataset by the selected column(s) either in ascending or descending order.
- The order of column selection matters, as the dataset will be sorted by the first column, then the second, and so on.
Example: Ordering data by USUBJID
(Unique Subject ID) and ADTM
(Analysis DateTime) in Ascending order. The process begins with the first subject, ordering their data points in ascending time order. This sequence is then repeated for each subsequent subject.
To change the order, the user can click on a pill and drag it to the desired position.
Groupby
Group rows based on unique values in one or more columns.
Functionality: Group data by one or multiple columns. Perform aggregate functions on grouped data (eg. locf, nocb, count).
Example: Calculating the maxpcdate
(maximum PCDTC
) for each subject. First, the data is grouped by
USUBJID
.
Secondly, the
maxseq
column is derived to contain the maximum PCSEQ
value for each group, using a column-wise evaluation approach with the code maximum(:PCSEQ)
.
Next, the
maxpcdate
is obtained for the last record of each subject (where PCSEQ == 64
), with the remaining records filled with missing
.
Finally, missing values are imputed using the next observation carried backward (NOCB) method, which is detailed in the Clean section.
The groupby
operation is only applied to the transformation invoked right after the grouping operation. If we want to apply the groupby
operation to multiple transformations, we have to groupby
each transformation separately. In the above example, the grouping operation was used primarily to derive the maximum PCSEQ
for each subject. The maxpcdate
was derived using the maxseq
column and did not need a groupby
operation.
Derive
Create new columns by performing calculations or transformations on existing columns.
Functionality:
- Apply mathematical operations (e.g., addition, subtraction, multiplication, division).
- Use inbuilt functions to transform data.
Any inbuilt functions available in the Julia programming language can be used in the derive
operation. The following table lists some common examples of the inbuilt functions that can be used in the derive
operation.
Inbuilt Functions | Code Example | Description |
---|---|---|
ceil | ceil(:PCSTRESN) | rounds the float to the next greatest integer. |
clamp | clamp(:PCSTRESN, 100, 50) | limits value of the column to a specified interval defined by two values. |
floor | floor(:PCSTRESN) | rounds the float to the previous smallest integer. |
lpad | lpad(:USUBJID, 11, '0') | pads extra characters("0") to the left of the value till the whole string reaches particular length (11) |
lstrip | lstrip(:PCTEST, ['D', 'A']) | Returns string with any leading whitespace removed. If chars (a character, or vector or set of characters) is provided, instead remove characters contained fro the left. |
max | max(:PCSEQ, :PCTPTNUM) | Returns the maximum value from both the columns. |
min | min(:PCSEQ, :PCTPTNUM) | Returns the mainimum value from both the columns. |
rand | rand(1:10) | Creates a new column with random numbers varying between the interval (1-10). |
rpad | rpad(:USUBJID, 11, '0') | pads extra characters("0") to the right of the value till the whole string reaches particular length(11). |
rstrip | rstrip(:PCTESTCD, ['D', 'A']) | Returns string with any trailing whitespace removed. If chars (a character, or vector or set of characters) is provided, instead remove characters contained from the right. |
round | round(:PCSTRESN) | rounds the float to the next integer either greater or smaller. |
strip | strip(:PCTESTCD, ['D', 'A']) | Returns string with any trailing and leading whitespace removed. If chars (a character, or vector or set of characters) is provided, instead remove characters contained from both right and left. |
string | string(:PCSEQ) | convert values of a column to string. |
Utilize functions compatible with the
@rtransform
macro from the DataFramesMeta.jl package, such asDate
,Time
,DateTime
,SubString
,split
,Millisecond
,Int
,parse
,div
, among others.There are two ways to create new variables in a dataset:
Row-wise: This method looks at one row at a time and uses the data in that row to calculate new values. These new values are then stored in the same row, but can use information from different columns within that row.
Column-wise: This method looks at one column at a time, either in isolation or by grouping similar entries together. It calculates new values based on the data in each column. These new values are then assigned to the appropriate rows within that column or a new column, potentially affecting multiple rows at once.
Sample of
PC
dataset with 5 columns.
By row
Example 1 : Create a column from another column
Example 2 : Create a column by taking substring from another column
This takes the 7th and 8th characters from VISIT
values and creates APERIOD
. Here Analysis Period (APERIOD
) is generated using APERIOD = SubString(:VISIT, 7,8)
An alternate approach is to split the VISIT
column values based on whitespace, that results in four strings: "Period", "1", "Day", "1". The second string is then extracted and used for APERIOD
. The code is represented as APERIOD = split(:VISIT, " ")[2]
as seen below.
Example 3 : Generating Columns Based on Conditions Applied to Other Columns. The operators
?
and :
are part of what is known as the ternary operator. The ternary operator is a conditional operator that takes three operands -
- A condition followed by a question mark
?
- An expression to evaluate if the condition is true, followed by a colon
:
- An expression to evaluate if the condition is false.
Here Nominal time relative to last dose (NRRLT
) is generated using NRRLT = :PCTPTNUM < 0 ? 0 : :PCTPTNUM
This can be explained as, for each row in the dataset, the following operation is performed:
- if
:PCTPTNUM < 0
, thenNRRLT
is assigned with0
- else
:PCTPTNUM
(the value is retained)
Example 4 : Convert column to float APERIOD Float (
APERIODF
) is generated from numeric string APERIOD
. APERIODF = parse(Float64, :APERIOD)
In Julia, parse
and tryparse
are essential for string data type conversion. Here’s how they work in general terms:
parse
forcefully converts a value to a specified type and will error out if the conversion isn't possible.tryparse
attempts to convert a value to a specified type but returnsmissing
if the conversion fails, preventing errors.
General Examples
Convert Numeric String to Integer: When converting strings (integers in string format) to floats,
parse
can be used to directly convert the values, ensuring each integer is transformed into a float.integer_value = "10" float_value = parse(Float64, integer_value) # Converts to 10.0
10.0
Convert Non-numeric String to Integer: When converting strings (non-numeric data) to floats,
parse
will throw an error.non_numeric_string = "ABC" non_numeric_value = tryparse(Int, non_numeric_string) # will throw an error
Convert Mixed String to Integer: In a scenario where strings might represent numeric or non-numeric data,
tryparse
can differentiate and convert accordingly, returningmissing
for non-numeric strings.numeric_string = "1" non_numeric_string = "ABC" numeric_value = tryparse(Int, numeric_string) # Converts "1" to 1 non_numeric_value = tryparse(Int, non_numeric_string) # Returns `missing`
In these scenarios, parse
ensures a strict conversion process, demanding exact match in data types, while tryparse
offers a lenient approach, allowing for safe conversion attempts without causing runtime errors.
Example 5 : Apply mathematical operations on columns Nominal Time relative to first dose (
NFRLT
) is generated using NFRLT = :NRRLT + (24*7*(:APERIODF - 1))
Analysis Visit Number (
AVISITN
) is generated using AVISITN = 7 * (:APERIODF - 1) + 1
Example 6 : Convert columns to multiple types
AVISITN
is converted to an Integer, and then it is prefixed with a string to create Analysis Visit (AVISIT
), which is formulated as: AVISIT = string("Day ", Int(:AVISITN))
Examples 7 : Convert string to datetime Deriving Analysis Date Time (ADTM) from PCDTC involves specifying the input date format as
yyyy-mm-dd HH:MM:SS
. ADTM = DateTime(:PCDTC,"yyyy-mm-dd HH:MM:SS")
Example 8 : Extract Date from DateTime Deriving
Analysis Date (ADT)
from ADTM
. ADT = Date(:ADTM)
Example 9 : Extract Time from DateTime Deriving
Analysis Time (ATM)
from ADTM
. ATM = Time(:ADTM)
By column
The By Column checkbox should be selected when the operation is performed on a column-by-column basis.
Example 1 : Create new columns with a single string and a number
EVID
(Event ID) - represents the whether the data records are dosing or not.PC = 0
,EX = 1
VISITNUMU
(VISITNUM units) -days
are used here
Example 2 : Find maximum value across the full dataset The new column
maxpcstresn
is created and populated with the maximum PCSTRESN
value across all subjects, using the code maximum(:PCSTRESN)
.
Example 3 : Find maximum value for each Subject To calculate the maximum
PCSTRESN
for each subject, grouping by USUBJID
is required before deriving the column.
Derive can be used to create new columns or mutate existing columns by specifying the column name as the old column name. The new column will replace the old column.
The
USUBJID
column has been mutated here.
Convert
Convert all values in the selected column(s) to the chosen type.
Functionality:
- When strict conversion is enabled, any values that cannot be converted to the chosen type will raise an error, and the operation will fail to run.
- When strict conversion is disabled, any values that cannot be converted will result in missing values instead of errors.
- Note that the column type appears in parentheses next to the column name in the column header below the column name.
Example: Convert APERIOD
(string column) to Integer
Convert Units
Convert the selected column to the given unit.
Functionality:
- When no entry is made in the "New column name" textbox, then the selected column is used as the output column for the unit conversion operation.
- Selecting
Unitless
as the unit type will simply assign a unit to that column rather than performing a conversion between units. This is the default when the dataset column variables have no prior assigned units. - An assigned unit will appear in the column header next to the column type in parentheses. Example - (ng/mL)
Initial Dataset
Example 1: Assign unit to PCSTRESN
(Number column) Units (ng/mL) are assigned to
PCSTRESN
. If the old and new columns are the same, 'unitless' becomes the default option, indicating that units are being assigned to the existing column.
Example 2 : Convert units of PCSTRESN
(Assigned unit Number column) A new column
PCSTRESN [mg/mL]
is created from PCSTRESN
, which has the unit ng/mL
. The process involves selecting ng/mL
as the input unit and mg/mL
as the output unit. Consequently, the unit mg/mL
is assigned to the PCSTRESN [mg/mL]
column.
Unit conversion can only be applied to columns that have an assigned unit. If a conversion is attempted on a unitless column, the resultant column will remain the same as the original column.
Infer Units
Infer the units for all columns in the dataset based either on the content of the column name, the first row of the dataset or an entire column that has units defined.
Functionality:
- In a column header the unit type is denoted by parentheses or square brackets after the column name.
- If the unit is defined in the first row then only the unit text, e.g.
ng/mL
is required. - Selecting
Automatic
first checks the columnHeader
and then theFirst Row
of the dataset for units. This is default. - A user can instead explicitly select
Header
orFirst Row
to infer units from the column name or the first row of the dataset respectively instead of theAutomatic
option. - If units are available, it would be inferred. Or else, the column will be retained as it is.
- If
Columns
is selected, the units are inferred from another column in the dataset that hasU
as the suffix for the selected columns. For example, ifPCSTRESN
is selected, the units will be inferred from the columnPCSTRESNU
.
Example 1: A column named PCSTRESN [ng/mL]
or PCSTRESN (ng/mL)
would be inferred to have a column name of PCSTRESN
and unit of ng/mL
.
By renaming PCSTRESN
to PCSTRESN [ng/mL]
, units are assigned to the column, as it is inferred from its name PCSTRESN [ng/mL]
as seen below.
The Infer Units
operation currently uses a pre-defined set of units. The table below provides all the supported units for the Infer Units
operation.
Unit Type | Unit | Aliases |
---|---|---|
Mass | Kilogram (kg ) | |
Gram (g ) | ||
Milligram (mg ) | ||
Microgram (μg ) | ug | |
Nanogram (ng ) | ||
Time | Second (s ) | sec , secs , second , seconds |
Minute (minute ) | min , mins , minutes | |
Hour (hr ) | h , hour , hours | |
Day (d ) | day , days | |
Volume | Liter (L ) | |
Milliliter (mL ) | ||
Concentration | ng/mL | |
μg/mL | ug/mL | |
mg/mL | ||
g/mL | ||
ng/L | ||
μg/L | ug/L | |
mg/L | ||
g/L |
Define Route
Provide a manually defined route
column with the select route type.
Route Abbreviation | Expansion | Description |
---|---|---|
ev | Extra Vascular | All routes of administration except those in which the drug is directly introduced into the bloodstream. |
po | Per Os (Oral) | Administration to or by way of the mouth |
im | Intramuscular | administration within a muscle |
sc | Subcutaneous | administration beneath the skin; hypodermic (synonymous with the term SUBDERMAL) |
oral | Oral | Administration to or by way of the mouth. |
iv | Intravenous | administration within or into a vein or veins. |
inf | Infiltration | administered in a diffuse pattern to multiple sites. |
Example 1: Create column route
as oral
- Using this method allows the addition of a single
route
to all subjects. - To add different routes for different groups, the
derive
function should be used. sc
,po
,im
,oral
are just aliases forev
. The actual route isev
in the internal representation.
Rename
Change the name of the selected column.
Example 1: Rename PCSTRESN
to PCSTRESN [ng/mL]
Sample Rows
- Randomly sample the dataset for the desired number of rows from the original dataset.
- A seed can be provided to ensure reproducibility.
- By selecting
With Replacement
, the same rows will be sampled multiple times and vice versa.
Example 1: Sample 10 rows (random number seed = 1
, with replacement
)
Different seeds result in unique combinations. However, using e.g. seed = 1
will consistently produce the initial combination.
Slice Rows
Select a range of rows to keep in the dataset.
Example 1: Select the first 5 rows (1 to 5)
Example 2: Select the last 5 rows (1531 to 1536 (last row of PC
))
The start and end indices are inclusive.
Relocate
- Move the selected column to the desired location in the dataset.
- The column will be inserted
before
orafter
the selected column depending on the option selected.
Original Dataset
Example 1: Relocate after
Example 2: Relocate before
Count
- Displays the count of rows in the dataset.
- This functionality can be useful when you want to count the number of rows in a group.
Original Dataset
Example 1: Count without groupby
Example 2: Count after Groupby
Ungroup
Ungroups dataset after groupby operations are performed.