Statistics Corner: Data Management in Rcmdr
Corresponding Author: Kamal Kishore, Department of Biostatistics, Postgraduate Institute of Medical Education & Research, Chandigarh, India, Phone: +91 9591349768, e-mail: firstname.lastname@example.org
Data collection, analysis, and interpretation are integral components of health research, and most of the literature focuses on them. However, quality analysis requires quality data. Unfortunately, routine academic teaching, training, and research emphasize study designs and instruments for collecting good data before entry, and rarely the cleaning that must follow it. Many applied researchers who do not analyze data themselves believe that a computer analyzes data quickly, which is true. They therefore expect immediate results from statistical collaborators. However, data cleaning consumes a significant share of the time in any analysis, because data remain riddled with errors and inconsistencies despite adequate precautions. Data scientists know that data cleaning is a laborious and time-consuming process. Data analysis tools are always shiny and new; the data cleaning process is ancient and unchanging. There are general and software-specific guidelines to follow, as different software packages have different interfaces and capabilities. To exploit the vast potential of Rcmdr, the researcher needs to understand both the general approach and the Rcmdr-specific options for data entry, cleaning, and export. Statistical software is like a fancy gadget whose output quality depends on the quality of its input, that is, on data entry. Therefore, in this article, researchers will learn:
How to enter data in Rcmdr
How to clean and code data in Rcmdr
Which data formats Rcmdr can import, and how
Which data formats Rcmdr can export, and how
How to cite this article: Kishore K, Jaswal V. Statistics Corner: Data Management in Rcmdr. J Postgrad Med Edu Res 2022;56(2):102-105.
Source of support: Nil
Conflict of interest: None
Keywords: Data export, Data import, Data management, Rcmdr
Health researchers routinely collect and analyze data to make evidence-based decisions for improving healthcare. The evidence is only as good as the data behind it, which depends on data collection, entry, cleaning, and coding. Like a doctor who spends considerable time preparing before operating, data scientists spend significant time cleaning and coding data. However, many eager applied researchers rush the data cleaning process without realizing that data are prone to many errors, which may compromise the quality of evidence. The age-old adage of “garbage in, garbage out” (GIGO) has become “garbage in, flower out” (GIFO) in the current scenario, because computer hardware and software have grown exponentially from their infancy and now produce polished output even from vague input. Computers offer impressive-looking analysis results without understanding the subtleties of the data. Software will apply a t-test or ANOVA to categorical data, or calculate the mean of nominal codes and skewed data, without complaint; it is the human brain that understands and addresses the intricacies of data.
Many people routinely use spreadsheets to enter and clean data and to produce descriptive statistics (tables and graphs). Spreadsheets are flexible but not ideal for data analysis.1 Analysis software requires input in a specific form because it works algorithmically. In contrast, researchers entering and storing data in spreadsheets mostly follow a heuristic approach: they add project titles and subtitles, names, categorical text variables, and multiple entries in a single column. Generally, structured data are rectangular: each row represents an individual and each column a variable. Thus, the first step in data entry is to understand the general structure of the data, which the literature broadly divides into wide and long formats. Readers can find more detail about structured and unstructured data in the previous article in the series.2 Typically, freshly entered data are riddled with errors and inconsistencies that require cleaning before analysis. After data entry, analysis may appear to be the next logical step. However, a couple of intermediate steps require the researcher’s immediate attention first. Readers can also consult previous articles in the series to identify and implement data cleaning checks for each study phase.3
To begin with, save the original data file and work on a duplicate copy, so that the original data remain available for reference during data cleaning and analysis. Include serial numbers, since they allow the records to be sorted back into their original order. Many academicians collaborate on multiple projects, so naming datasheets and workbooks clearly helps to identify and track projects. Make a habit of uniform data entry: much software is case-sensitive, so intermixing words and codes (such as Male, MALE, male, M, or m) leads to errors and inconsistencies. Many researchers habitually keep titles and subtitles in the data sheet; this is suitable for record-keeping but not for analysis. Do not use blank rows, blank columns, special symbols, or units when entering data for analysis. Multiple responses in a single column turn it into a text (string) variable that analysis software cannot use directly. Table 1 displays the general rules for data cleaning and identification of errors; keep a record of them and use it regularly to refine data cleaning habits.
Table 1: General rules for data cleaning and identification of errors
|Term||Description|
|Data cleaning||A general process that detects, diagnoses, and edits data to make it as error-free as possible.|
|Coding||A process in which the categories of a categorical variable such as sex or disease are assigned numerals, e.g., male = 1, female = 2, or vice versa.|
|Labeling||A process in which variables are given short labels to save memory, e.g., “What is your date of birth?” is labeled as DOB.|
|Detecting outliers and inconsistencies||A process that uses numerical measures (such as frequencies and sorting) and graphical measures (such as histograms and box plots) to assess data quality.|
|Logical errors||Implausible entries, such as a male coded as pregnant, or a patient’s negative lab test recorded as positive.|
|Recoding||A process in which a continuous variable is recoded as categorical (BMI: High, Normal, Low), or in which categories with small frequencies are merged for inferential statistics (Religion: Hindu = 75, Muslim = 15, Christian = 10 recoded as Hindu = 75 and Others = 25).|
|Missing data||Data in which some observations are missing for one or more variables. The literature broadly divides missing data into three types: missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR).|
|Anonymization||A process that removes all identifiers, such as Aadhaar card and clinic numbers, before data are shared with stakeholders and analyzed.|
|Metadata||A data dictionary that explains the labels and codes used in the master data for analysis.|
|Transformation||A conversion of skewed data using logarithmic, square root, inverse, or other transformations to bring it closer to a normal distribution.|
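Several of the rules in Table 1 can be sketched in a few lines of R, the language that Rcmdr runs underneath its menus. The example below is only an illustration under stated assumptions: the data frame `patients`, its values, and the BMI cut-offs are invented for this sketch, and it uses base R functions rather than the commands Rcmdr itself would generate.

```r
# Hypothetical example data; every name and value here is invented for illustration.
patients <- data.frame(
  id  = 1:6,
  sex = c("Male", "MALE", "male", "F", "female", "m"),
  bmi = c(17.5, 22.0, 31.2, NA, 24.9, 27.3),
  stringsAsFactors = FALSE
)

# Uniform entry: trim spaces, lower-case everything, then map all spellings to one word.
sex_clean <- tolower(trimws(patients$sex))
sex_clean[sex_clean %in% c("m", "male")]   <- "male"
sex_clean[sex_clean %in% c("f", "female")] <- "female"

# Coding: store the categorical variable as a factor with fixed levels.
patients$sex <- factor(sex_clean, levels = c("male", "female"))

# Recoding: turn continuous BMI into categories (cut-offs are illustrative only).
patients$bmi_group <- cut(
  patients$bmi,
  breaks = c(-Inf, 18.5, 25, Inf),
  labels = c("Low", "Normal", "High"),
  right  = FALSE
)

# Missing data: count missing values in each variable.
missing_per_variable <- colSums(is.na(patients))

# Detecting inconsistencies: a frequency table exposes unexpected categories.
sex_counts <- table(patients$sex, useNA = "ifany")
print(sex_counts)
print(missing_per_variable)
```

Any spelling that is not listed in the mapping becomes a missing value when the factor is created, so the frequency table also acts as a check that no unexpected category slipped through.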
“To err is human”: despite appropriate measures, mistakes happen during data entry. Data editing is a continuous practice during and after data entry, and it is essential for refining and cleaning data. However, knowing the requirements of data cleaning is one thing, and implementing them is another. Many researchers routinely face data import, export, and cleaning challenges while learning to operate new software. Therefore, this article demonstrates the basics of data import, export, editing, and cleaning in Rcmdr. We discuss the material under two broad headings: data management and data editing. Data management covers data input, import, and export, whereas data editing covers recoding, relabeling, and calculating new variables. All the steps are demonstrated with figures so that readers can replicate them in practice.
DATA MANAGEMENT IN RCMDR
Globally, Microsoft Excel® (Microsoft Corp., Redmond, WA, USA) is one of the most popular programs for data entry, with a simple, intuitive, and easy-to-use interface. Thus, after describing direct data entry, we display the steps to import data from Excel and other formats into Rcmdr.
First, select the “Data” menu and click the “New data set” option to enter data. Name the dataset, and the data editor window will appear with one row and one column. To generate more rows and columns, press the Enter key (or click “Add row”) and the Tab key (or click “Add column”). The arrow keys help to move between cells while entering data. Figure 1 illustrates these steps. After data entry, the dataset remains available in the background; click the “View data set” and “Edit data set” buttons in Rcmdr to inspect and modify it.
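For readers who prefer the script window, a small dataset can also be typed in directly as R code. The following is a minimal sketch, assuming a hypothetical dataset named `study_data` with invented variables; it uses the base R function `data.frame()` and is not the command that the data editor itself issues.

```r
# Hypothetical dataset: one row per individual, one column per variable.
study_data <- data.frame(
  serial_no = 1:4,
  age       = c(34, 51, 28, 45),
  sex       = c("male", "female", "female", "male"),
  height_cm = c(172, 158, 165, 180)
)

str(study_data)    # shows each variable and its type
head(study_data)   # shows the first rows, similar to viewing the data in Rcmdr
```

Keeping one variable per column and one individual per row, as here, matches the rectangular structure that analysis software expects.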
A major strength of any data analysis software is the flexibility to import data for analysis. In Rcmdr, click on the “Data” menu and select the “Import data” option. Rcmdr can import text, Stata, Minitab, SPSS, and Excel files (Fig. 2A). The first step while importing is to name the dataset and tick the appropriate checkboxes. Many researchers use different features in different software, so the ability to export data files is another significant strength. The default data file format in Rcmdr is “.RData,” but Rcmdr also gives the flexibility to export data in “.csv” and “stata” formats (Fig. 2B). Another important feature of Rcmdr is access to the datasets that ship with R packages, through the “Data in packages” option. Researchers can also work with multiple datasets in Rcmdr. By default, Rcmdr displays the name of the active dataset at the top of the window (Fig. 2C). Any loaded dataset can be made the active dataset with a single click.
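The import and export steps described above can be approximated in plain R. The sketch below is an assumption-laden illustration, not a transcript of what Rcmdr runs: the dataset and the temporary file paths are hypothetical, and only the base R functions `write.csv()`, `read.csv()`, `save()`, and `load()` are used.

```r
# Hypothetical data; temporary files keep the example self-contained.
study_data <- data.frame(
  serial_no = 1:3,
  sex       = c("male", "female", "female"),
  height_cm = c(172, 158, 165)
)
csv_path <- tempfile(fileext = ".csv")

# Export: write the data as comma-separated text without row names.
write.csv(study_data, csv_path, row.names = FALSE)

# Import: read the file back; the first line supplies the variable names.
imported <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)

# Native R format: save() writes an .RData file and load() restores the object by name.
rdata_path <- tempfile(fileext = ".RData")
save(study_data, file = rdata_path)
rm(study_data)
load(rdata_path)   # study_data exists again after this call
```

A text format such as CSV is readable by almost any software but stores no type information, so numeric columns may come back as integers and factors as plain text; the native format preserves the object exactly.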
Despite adequate precautions, data editing is a perennial requirement of data analysis. Data labeling and coding facilities are vital pillars of good software, and most statistical packages offer rich and handy options to edit data. Rcmdr provides an extensive list of data editing options, which Figure 3 highlights. Applied researchers in general, and healthcare researchers in particular, face the daunting task of managing missing data, and Rcmdr gives the user various options to address it. At times, researchers need the flexibility of formulae to recode and calculate new variables, and Rcmdr provides extensive opportunities to do so. Table 2 lists helpful symbols and operators available in Rcmdr, which the user can combine to calculate both simple and advanced variables.4 Other functions that are valuable in routine practice, such as subset, sort, and merge, are also available in Rcmdr. It also warns the user when a variable label is duplicated. Many other beneficial features that make good software are likewise available in Rcmdr.
Table 2: Helpful symbols and operators available in Rcmdr
|No.||Operation||Symbol||Example|
|1.||Subtraction||-||Age - 2|
|2.||Addition||+||Age + 2|
|3.||Multiplication||*||Height * Weight|
|4.||Division||/||Height / Weight|
|5.||Natural log||log||log(Height)|
|9.||Equal to||==||A == Height|
|10.||Not equal to||!=||Sex != "Male"|
|11.||Greater than||>||Height > 150|
|12.||Greater than or equal to||>=||Height >= 150|
|13.||Less than||<||Height < 150|
|14.||Less than or equal to||<=||Height <= 150|
|15.||And||&||Height > 150 & Sex == "Male"|
|16.||Or|| | ||Height > 150 | Sex == "Male"|
|17.||Not||!||!(Height > 150 & Sex == "Male")|
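To show how these operators combine in practice, the sketch below computes a new variable and then applies the subset, sort, and merge operations mentioned in the text. It is an illustration only: the data frames `people` and `labs`, their variables, and their values are hypothetical, and the code uses base R functions rather than the exact commands produced by the Rcmdr menus.

```r
# Hypothetical data; names and values are invented for illustration.
people <- data.frame(
  id     = 1:5,
  Sex    = c("Male", "Female", "Male", "Female", "Male"),
  Height = c(172, 149, 150, 163, 181),   # cm
  Weight = c(70, 52, 61, 58, 90)         # kg
)

# Arithmetic operators: compute a new variable (BMI = kg divided by metres squared).
height_m   <- people$Height / 100
people$BMI <- people$Weight / (height_m * height_m)

# Relational and logical operators: build TRUE/FALSE conditions row by row.
tall_male     <- people$Height > 150 & people$Sex == "Male"
tall_or_male  <- people$Height > 150 | people$Sex == "Male"
not_tall_male <- !(people$Height > 150 & people$Sex == "Male")

# Subset: keep only the rows that satisfy a condition.
tall_males <- subset(people, Height > 150 & Sex == "Male")

# Sort: order the rows by a variable.
by_height <- people[order(people$Height), ]

# Merge: combine two datasets on a shared identifier.
labs   <- data.frame(id = c(1, 3, 5), hb = c(13.5, 14.1, 12.8))
merged <- merge(people, labs, by = "id")
```

Note that `>` is strict, so the individual with a height of exactly 150 is excluded by `Height > 150` but would be included by `Height >= 150`.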
Ease of use, data management, and robust analyses form the fundamental triad of good statistical software. In this article, we demonstrated a few data management capabilities of Rcmdr. Data cleaning, a vital component of data management, is an unglamorous, laborious, and time-intensive activity. Researchers can minimize it through standard coding protocols, training, and inbuilt checks, but they cannot eliminate it. Further, academic teaching, training, and research rarely emphasize data cleaning practices. A good researcher knows that perfectly clean data is wishful thinking, even after taking the necessary precautions. We hope that data cleaning, like data analysis, gets its due share in literature and learning. We also hope that our article will help and motivate researchers to take data cleaning more seriously. We found Rcmdr to be user-friendly software for entering and managing data, and we hope researchers actively adopt and use it in routine data management and analysis.
The authors would like to thank Dr Suchet Sachdev from the Department of Transfusion Medicine, Chandigarh, for his valuable time and input to improve the quality of the article.
© The Author(s). 2022 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by-nc/4.0/), which permits unrestricted use, distribution, and non-commercial reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.