Data Upload
Welcome to the PumasCP data upload guide. Data upload is the starting point for analysis and report generation in PumasCP. Users can upload several kinds of data, including single-dose and multiple-dose studies. This section describes the upload process, the supported formats, and best practices for preparing data.
Prerequisites
Before uploading data, make sure the following prerequisites are met:
Data Preparation:
Prior to uploading, ensure that your data is formatted correctly and adheres to the supported data formats. Clean and preprocess your data to remove any inconsistencies, errors, or invalid entries that may affect the analysis process.
File Size Limitations:
PumasCP limits dataset size to 500 MB. Your organization's IT policies may impose a lower file size limit. Split large datasets into smaller files if necessary to stay within these limits.
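The size check and the splitting step can be done before upload with a short script. The following is a minimal sketch using only Python's standard library. The helper names `needs_split` and `split_csv` are hypothetical, and reading the 500 MB limit as decimal megabytes is an assumption, since the guide does not say how the limit is measured.

```python
import csv
import os

# Assumption: the 500 MB limit is counted in decimal megabytes.
MAX_BYTES = 500 * 1000 * 1000


def needs_split(path, limit=MAX_BYTES):
    """Return True if the file at `path` is larger than `limit` bytes."""
    return os.path.getsize(path) > limit


def split_csv(path, rows_per_part, out_prefix):
    """Split a CSV file into parts of at most `rows_per_part` data rows.

    The header row is repeated at the top of every part.
    Returns the list of part file paths that were written.
    """
    parts = []
    out = None
    with open(path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader, None)
        if header is None:  # empty input file: nothing to write
            return parts
        try:
            for i, row in enumerate(reader):
                if i % rows_per_part == 0:
                    # Start a new part file and write the header into it.
                    if out is not None:
                        out.close()
                    part_path = f"{out_prefix}_{len(parts) + 1}.csv"
                    out = open(part_path, "w", newline="")
                    writer = csv.writer(out)
                    writer.writerow(header)
                    parts.append(part_path)
                writer.writerow(row)
        finally:
            if out is not None:
                out.close()
    return parts
```

The split is by row count rather than by bytes, so `rows_per_part` should be chosen conservatively and each resulting part checked again with `needs_split`.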
Network Connectivity:
For the online version, ensure a stable internet connection to prevent interruptions during the data upload process.
Supported Data Formats
The PumasCP data upload feature supports various common formats, allowing easy dataset import for analysis and reporting. Users can rename datasets and add notes post-upload for convenience. Supported formats include:
File type | File extension | Notes |
---|---|---|
Comma-Separated Values | CSV | Widely used for tabular data; rows are lines, columns separated by commas. |
Microsoft Excel Open XML Spreadsheet | XLSX | Excel spreadsheet files, containing multiple sheets of tabular data. XLS is not supported. |
Tab-Separated Values | TSV | Similar to CSV, but uses tabs as delimiters. |
Stata Data File | DTA | Native format for Stata statistical software for storing datasets. |
SAS Binary Encoded Data | SAS7BDAT | Binary data format used by SAS software for storing datasets. |
SAS Transport File | XPT | Used for transporting datasets between different SAS systems or versions. |
SPSS Portable File | POR | Used by SPSS for storing datasets in a portable format. |
SPSS Data File | SAV | Format used by SPSS for storing datasets. |
Please note that there may be restrictions or limitations on file size or encoding for certain formats.
When uploading XLSX files, the first empty column marks the end of the data that is loaded into the application from the file, even if there are subsequent columns with data. Store individual tables in separate sheets rather than within the same sheet.
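To see which part of a sheet would survive this rule, the behaviour can be imitated on a sheet held in memory as a list of rows. This is an illustrative sketch only, not PumasCP's implementation. It assumes a column counts as empty when every cell in it, including the header cell, is `None` or an empty string; the guide does not define "empty" more precisely. The function names are hypothetical.

```python
def first_empty_column(rows):
    """Return the index of the first fully empty column, or None if there is none."""
    width = max((len(r) for r in rows), default=0)
    for col in range(width):
        # A cell beyond the end of a short row is treated as empty.
        if all(col >= len(r) or r[col] in (None, "") for r in rows):
            return col
    return None


def truncate_at_empty_column(rows):
    """Keep only the columns to the left of the first fully empty column."""
    cut = first_empty_column(rows)
    if cut is None:
        return [list(r) for r in rows]
    return [list(r[:cut]) for r in rows]


sheet = [
    ["id", "dose", "", "notes"],
    [1, 100, None, "first"],
    [2, 200, "", "second"],
]
print(truncate_at_empty_column(sheet))
```

In this example the `notes` column sits to the right of an empty column and is dropped, which is why side-by-side tables belong in separate sheets.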
Data Types and Structures
When uploading data to PumasCP, it's essential to understand the expected data types for each column or field in your dataset. The software supports various data types, including:
Data Type | Notes | Examples |
---|---|---|
Integer (Int) | Whole numbers without decimal points. | -1 , 10 , -5 , 0 |
Number (Floating Point) | Numbers with decimal points. | -4.0055 , 0.922 |
String | Sequences of characters, used for text data. | "hello" , "world" , "123" , "2019-08-08" |
Date | Date values in various formats. | 2019-08-08 , 08/08/2019 , 08-08-2019 |
If a column type has a '?' as a suffix, it means that there are some missing/empty values in the column along with values of the parent type, e.g. String?, Number?, and Integer?.
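The type names and the '?' suffix can be illustrated with a small inference routine over raw text cells. This is a sketch under stated assumptions rather than PumasCP's actual inference logic: the function `infer_column_type` is hypothetical, only the empty string is treated as missing, and Date detection is left out because the accepted date formats are not fully specified here.

```python
def infer_column_type(values):
    """Guess a type label for a column of raw text cells.

    Returns "Integer", "Number", or "String", with a "?" suffix
    when at least one cell is empty.
    """
    present = [v for v in values if v != ""]
    has_missing = len(present) < len(values)

    def all_parse(cast):
        # True if every non-empty cell can be converted with `cast`.
        try:
            for v in present:
                cast(v)
            return True
        except ValueError:
            return False

    if present and all_parse(int):
        base = "Integer"
    elif present and all_parse(float):
        base = "Number"
    else:
        base = "String"
    return base + ("?" if has_missing else "")


print(infer_column_type(["-1", "10", "0"]))         # Integer
print(infer_column_type(["-4.0055", "0.922", ""]))  # Number?
print(infer_column_type(["hello", ""]))             # String?
```

A single non-numeric cell is enough to make the whole column a String in this sketch, which is worth checking for during data preparation.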
Data Upload Workflow
Uploading a Dataset
Navigate to the study page and click on the Upload Dataset button. This action opens the file upload dialog.
In the dialog, users can either:
- Click on the file upload button to select a file from their disk.
- Drag and drop a file from their local disk into the dialog.
After selecting the file, a preview window appears. It displays the data on the left and upload options on the right panel.
Users can name the uploaded file and customize the upload settings based on the file type using the options in the right panel.
Users can only upload one data file at a time. To upload multiple files, repeat the process for each file.
Cancelling the Upload
- To interrupt the upload process, users can click on Cancel, close the dialog with the X, or press Esc on the keyboard.
Completing the Upload
- Clicking the Done button will upload the file to PumasCP, tagging it as original and timestamping it.
During upload, the original precision of numbers is maintained. The preview may show values shortened to 3 significant digits for readability, but full precision is used in subsequent computations.
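The difference between the displayed value and the stored value can be shown with Python's format specification. This is only an illustration: the helper `preview` is hypothetical, and it rounds to 3 significant digits, whereas the guide does not state whether the PumasCP preview rounds or strictly truncates.

```python
def preview(value, sig_digits=3):
    """Format a number with a limited number of significant digits for display."""
    return f"{value:.{sig_digits}g}"


stored = 12.3456          # value kept at full precision
shown = preview(stored)   # shortened text used only for display

print(shown)       # display string with 3 significant digits
print(stored * 2)  # computations still use the full-precision value
```

The display string is derived from the stored number and never replaces it, so later calculations are unaffected by the shortened preview.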
CSV Customisation Options
1. Data Starts from Row number
"Data starts from row number" is an option available when customising CSV files. Datasets often begin with header rows or metadata that describe the content of the data, such as variable names, units, or other annotations. This option lets users specify where the actual data begins within the dataset, so that PumasCP imports and analyses it accurately.
It is particularly useful when dealing with datasets that have a variable number of header rows or when the user wants to skip certain rows before the actual data starts. By providing this information, PumasCP can accurately read and parse the data, ignoring any header rows or metadata that precede it.
By default, this option is set to auto, but it can be changed to any positive number.
- 1 = The header is repeated as data. Any existing column names that are identified are brought down into the first row of the dataset, and the column names are auto-generated as Column1, Column2, ..., ColumnN. The row_length therefore increases by 1.
- 2 = Same as auto. The header is identified and used as column names.
- 3, 4, ..., n = (row_length - n + 2). For values above 2, use the number in this formula to find out how many rows are retained. The rows are retained from the end of the dataset, including the column headers.
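The row arithmetic above can be written as one small function. This is a sketch, assuming that row_length means the number of data rows obtained with the default auto setting; the function name `rows_after_start` is hypothetical. Note that the same formula also reproduces the cases 1 and 2.

```python
def rows_after_start(row_length, start_row):
    """Number of rows retained for a given "data starts from row number" value.

    Applies the formula row_length - n + 2 from the guide.
    """
    if start_row < 1:
        raise ValueError("start_row must be a positive number")
    return row_length - start_row + 2


# With 10 data rows under the default setting:
print(rows_after_start(10, 1))  # 11: the header is counted as a data row
print(rows_after_start(10, 2))  # 10: same as auto
print(rows_after_start(10, 5))  # 7
```

The sketch does not guard against start values larger than the dataset, because the guide does not describe what PumasCP does in that case.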
2. Specify values as missing
"Specify values as missing" is an option that lets users explicitly mark certain values in their dataset as missing or undefined. Real-world datasets often contain placeholder values that stand for missing data, for example because of data collection errors or incomplete records. Users can specify which values in their dataset should be treated as missing by assigning them a special missing designation. These values are then handled appropriately during data analysis and modeling tasks, which prevents them from skewing the results or causing errors in calculations.
The options for specifying missing/empty values in each type of column are listed below.
Options | Integer? (Integer + missing) | Number? (Float + missing) | String? (String + missing) |
---|---|---|---|
. | String | String | String |
Empty String | Integer? | Number? | String? |
NA | String | String | String |
na | String | String | String |
NULL | String | String | String |
null | String | String | String |
<LOQ | String | String | String |
<=LOQ | String | String | String |
>LOQ | String | String | String |
Broken Sample | String | String | String |
Missing Sample | String | String | String |
By selecting these options, you can cast the column type to a String or retain the existing one by using Empty String.
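Before choosing among these options, it can help to know which of the listed tokens actually occur in a file. The following is a pre-upload sketch using Python's standard library. It runs outside PumasCP and does not reproduce its behaviour; the function `count_missing_tokens` is hypothetical, and the token list is copied from the options table above.

```python
import csv
from collections import Counter

# Tokens taken from the options table; "" stands for Empty String.
MISSING_TOKENS = [
    ".", "", "NA", "na", "NULL", "null",
    "<LOQ", "<=LOQ", ">LOQ", "Broken Sample", "Missing Sample",
]


def count_missing_tokens(lines):
    """Count occurrences of each listed token per column.

    `lines` is any iterable of CSV text lines, such as an open file.
    Columns that contain none of the tokens are left out of the result.
    """
    reader = csv.DictReader(lines)
    counts = {name: Counter() for name in (reader.fieldnames or [])}
    for row in reader:
        for name, value in row.items():
            if name in counts and value in MISSING_TOKENS:
                counts[name][value] += 1
    return {name: dict(c) for name, c in counts.items() if c}


sample = ["id,conc,comment", "1,<LOQ,", "2,0.5,NA", "3,.,"]
print(count_missing_tokens(sample))
```

For a file on disk, pass the open file object, for example `count_missing_tokens(open("data.csv", newline=""))`, where the file name is a placeholder.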
3. Add Notes
During the file upload process, users can enter notes to provide additional context or information about the dataset. These notes can include details such as the source of the data, data collection methods, or any other relevant information that may be useful for other users or for future reference.