Cleansing and preparation of data for statistical analysis: A step necessary in oral health sciences research

Document Type: Review Article


1 Assistant Professor, Department of MPH, School of Medicine, Shiraz University of Medical Sciences, Shiraz, Iran

2 Professor, Research Center for Modeling in Health, Institute of Futures Studies in Health, Kerman University of Medical Sciences, Kerman, Iran

3 Professor, Endodontology Research Center AND Oral and Dental Diseases Research Center AND Kerman Social Determinants on Oral Health Research Center, Kerman University of Medical Sciences, Kerman, Iran

4 Assistant Professor, Oral and Dental Diseases Research Center, Kerman University of Medical Sciences, Kerman, Iran


Many published articles still make no mention of data quality control, which may indicate that researchers attach too little importance to carrying out or reporting such processes. Yet quality control of data is one of the most important steps in a research project, and neglecting it can have a detrimental effect on study results. Drawing researchers' attention to data quality control is therefore a necessary step toward improving the quality of research studies and reports. In this review, we define the processes of cleansing and preparing data and describe their place in research protocols. We present an algorithm for cleansing and preparing data. We then introduce the most important potential errors in data through examples and demonstrate their effects on study results. For errors of different kinds, we describe the main causes, the techniques used to identify them, and the techniques used to prevent or rectify them. We subsequently address the procedures used to prepare data, introducing techniques for managing the assumptions of statistical models before analyses are carried out. Because statistical models that assume normality are widely used, we focus on this assumption and present techniques for identifying non-normally distributed data and methods for managing them. Cleansing and preparing data can substantially improve the quality and accuracy of research results. Researchers should be familiar with the reasons different kinds of data errors occur, the techniques used to identify them, and the methods used to prevent or rectify them, and should report the techniques they apply in their study reports.

