Data editing
Errors or inconsistencies can be introduced into the survey data either through field error or data entry error.
- Field errors are generally attributed to the physical processes of collecting data from households (in the case of a household survey) and can include enumerator error, controller error or, in some cases, misinformation from the household itself.
- Data entry errors are limited to the process of keying the data from the survey form into the data entry system.
The process of editing therefore can be described as having two components:
- Preventing data entry errors in order to assure that the data reported on the survey form is faithfully digitized regardless of any inherent inconsistencies due to field error (see Data Entry Practices)
- Identifying inconsistent data that may be due to field error after it has been faithfully digitized, and developing the rules to correct for these errors. This second level of editing identifies errors in structure, logic and reasonableness of response. It may involve a process of automatically correcting for the inconsistency through imputation.
The extent to which imputations are used depends mostly on the nature of the survey and is usually contingent on the volume of the data. For example, population censuses will generally impute inconsistent data using a series of logical rules designed to replace the inconsistent data automatically, whereas a sample survey may rely on the production of an error report and subsequently entail a physical check and return to the questionnaire prior to correction.
Most editing relies on the coding of editing specifications provided by survey statisticians to computer programmers in pseudo-code. Pseudo-code is a simple, structured way of writing out the conditions for identifying errors and inconsistencies in the data so that they can be conveyed to computer programmers. The computer programmer will in turn program the edit specifications in a data editing program such as CSPro or Blaise. These processing programs usually have their own scripting syntax and can issue error reports that identify the specific questionnaire and variable triggering the error, as well as impute or replace the erroneous value in the data set. Documentation and transparency in this process are extremely important.
Error types can be broadly categorized as being structural or logical.
Structural errors. Editing for structural errors includes integrity checks that assure the sample design is respected and complete. Structural edits should also verify the appropriateness of non-responses within records. This includes the proper coding of missing and non-applicable responses. If the data entry programs are properly designed (e.g. designed as system-controlled applications in CSPro), these kinds of errors can be controlled and minimized.
Structural errors also relate to the flow of the interview, frequently referred to as the path of the interview. Editing for structural errors requires familiarity with the questionnaire and knowledge of the path. The path accounts for the applicability of the response vis-à-vis the subject of the interview. For example, a particular response may require a skip of certain non-applicable questions down a path (specific education questions may be skipped if a person has not attended school). It is important to assure that non-applicable questions are coded as "not applicable" and that the structural path is preserved through a skip. Missing data, on the other hand, should be coded separately, because the question does apply to the subject but no response was recorded.
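The distinction between "not applicable" and "missing" along the path can be sketched in Python as below. This is only an illustration: the codes -8 and -9, the variable names and the function check_education_path are assumptions made for the example, not part of any particular questionnaire or editing package.

```python
# Hypothetical codes and variable names, chosen only for this sketch.
NOT_APPLICABLE = -8   # question is off the path for this person
MISSING = -9          # question applies, but no answer was recorded

EDUCATION_QUESTIONS = ["highest_grade", "school_type"]


def check_education_path(person):
    """Return a list of structural error messages for one person record."""
    errors = []
    attended = person.get("attended_school")
    for question in EDUCATION_QUESTIONS:
        value = person.get(question)
        if attended == "no" and value != NOT_APPLICABLE:
            # The skip should have been taken, so only the N/A code is valid.
            errors.append(f"{question}: expected not-applicable code after skip")
        elif attended == "yes" and value == NOT_APPLICABLE:
            # The question is on the path, so N/A is not a valid code here.
            errors.append(f"{question}: coded not applicable but question applies")
    return errors
```

A record for a person who attended school and has highest_grade coded as MISSING passes this check, because missing is a legitimate code for an applicable question; it is reported separately from path errors.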
Logical errors. Logical errors are errors for which there is a reported value, but the value is judged to be illogical. These can be problems with an invalid range of a response (such as the consumption of an implausible amount of food per day) or an inconsistent response, such as a 6-year-old being reported with a university degree. These problems of logic are identified using conditional statements in a computer program and generally require correction.
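The two kinds of check described above, a range check and a consistency check, might be written as conditional statements along the following lines. The thresholds, variable names and the function logical_errors are hypothetical and serve only to illustrate the idea; real edit specifications would come from the survey statistician.

```python
# Hypothetical thresholds; real limits are set in the edit specifications.
MAX_FOOD_KG_PER_DAY = 10.0
MIN_AGE_UNIVERSITY = 17


def logical_errors(record):
    """Return (questionnaire_id, variable, message) tuples for one record."""
    qid = record["questionnaire_id"]
    errors = []
    # Range check: the reported amount must fall inside the plausible range.
    if not 0 <= record["food_kg_per_day"] <= MAX_FOOD_KG_PER_DAY:
        errors.append((qid, "food_kg_per_day", "value outside plausible range"))
    # Consistency check: education level must be compatible with age.
    if record["education"] == "university" and record["age"] < MIN_AGE_UNIVERSITY:
        errors.append((qid, "education", "university degree inconsistent with age"))
    return errors
```

Each tuple names the questionnaire and the variable that triggered the error, which is the minimum an error report needs in order to send someone back to the paper form.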
Data editing should identify and sometimes correct errors in inconsistent data. The process of correcting data using programmed rules of logic is called imputation. Imputation should be used with care, and variables that are imputed should be identified (flagged) and documented. Imputations made to data should preserve the original data and the original responses. The original data should never be overwritten and should be made available under certain circumstances to researchers who may be evaluating the effects of the imputation and may determine that a different series of rules for replacing the data is necessary. Imputation practice in surveys generally differs from that in censuses: due to the volume of data collected in a census, automatic imputations may be more broadly used than in a survey.
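One way to keep the original response and flag the imputation is to store both alongside the edited value. The sketch below assumes a record held as a Python dictionary; the "_orig" and "_imputed" suffixes and the function impute_with_flag are naming conventions invented for this example.

```python
def impute_with_flag(record, variable, new_value):
    """Return an edited copy of the record; the input record is not modified."""
    edited = dict(record)
    # Keep the value exactly as reported so the imputation can be reviewed.
    edited[f"{variable}_orig"] = record[variable]
    # Flag the variable so analysts can identify imputed values.
    edited[f"{variable}_imputed"] = True
    edited[variable] = new_value
    return edited
```

Because the function returns a copy, the unedited record remains available, and a researcher who prefers a different rule can start again from the "_orig" value.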
Static vs. Dynamic Imputation
Identifying outliers and determining a valid threshold is usually a matter for a statistician. The replacement of the value of the outlier can be done according to static or dynamic techniques, namely cold deck and hot deck.
- Cold deck or static imputation replaces an outlying value with a fixed value such as the mean or median of the variable. The mean or median may be computed for a geographic area or for some other determining parameter (such as a particular age group or sex). The replacement value is always the same for the given parameter.
- Hot deck or dynamic imputation is a more versatile method of replacing values and requires more programming skill. With the hot deck technique the most recent valid value is kept in memory and is updated with each new valid value until an outlying value is identified. The outlier is then replaced with the valid value held in memory. This means that the replacement value continually changes within the established range, which avoids the bias that would be introduced by always resolving an outlier with the same replacement value.
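The two techniques, as described above, can be contrasted in a short Python sketch. The function names, the valid range and the starting value for the hot deck are assumptions made for illustration. For brevity the cold deck here uses the median of all valid values in the list; in practice it would typically be computed separately for each geographic area or group.

```python
from statistics import median


def is_valid(value, low, high):
    """A value is valid when it is present and inside the accepted range."""
    return value is not None and low <= value <= high


def cold_deck(values, low, high):
    """Static: every outlier receives the same value (median of valid values)."""
    valid = [v for v in values if is_valid(v, low, high)]
    replacement = median(valid)  # raises StatisticsError if nothing is valid
    return [v if is_valid(v, low, high) else replacement for v in values]


def hot_deck(values, low, high, initial):
    """Dynamic: each outlier receives the most recent valid value seen."""
    last_valid = initial  # assumed starting value, used if the first value is an outlier
    result = []
    for v in values:
        if is_valid(v, low, high):
            last_valid = v          # refresh the value held in memory
            result.append(v)
        else:
            result.append(last_valid)
    return result


ages = [30, 45, 999, 52, -5, 38]
print(cold_deck(ages, 0, 120))      # [30, 45, 41.5, 52, 41.5, 38]
print(hot_deck(ages, 0, 120, 40))   # [30, 45, 45, 52, 52, 38]
```

In the cold deck result both outliers receive 41.5, while in the hot deck result they receive 45 and 52, the valid values that immediately preceded them. The hot deck output therefore depends on the order of the records, which is why files are often sorted (for example geographically) before this kind of imputation is run.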
Missing Data
Missing data does not necessarily require imputation. On one level, missing data simply needs to be properly coded and identified as missing. If the level of non-response is judged to compromise the reliability of the data, the survey statistician should make a determination on the viability of the variable.
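To support that determination, the level of non-response for a variable can be computed among the cases to which the question applies. The sketch below assumes hypothetical codes of -8 for "not applicable" and -9 for "missing"; the function name missing_rate is likewise an assumption, and the acceptable threshold remains a decision for the survey statistician.

```python
def missing_rate(records, variable, not_applicable=-8, missing=-9):
    """Share of applicable cases for which the variable is coded as missing."""
    # Exclude cases that were correctly skipped along the path.
    applicable = [r[variable] for r in records if r[variable] != not_applicable]
    if not applicable:
        return 0.0
    return sum(1 for v in applicable if v == missing) / len(applicable)
```

Excluding "not applicable" cases from the denominator matters: counting them would understate non-response for questions that only a subset of respondents are asked.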
Data editing in practice
Practical guidelines and other materials related to statistical data editing are available from the UNECE website. These include materials related to data editing methods and techniques (1994 and 1997), impact of data editing on quality (2006), glossary of terms related to data editing (2000), and a large collection of proceedings and papers presented at various work sessions of the annual meetings of the Conference of European Statisticians.
Guidelines specific to census data can be found in Handbook on Population and Housing Census Editing, United Nations Statistics Division, 2001. This publication provides an overview of census and survey data editing methodology along with information on the use of various approaches to census editing. It also reviews the advantages and disadvantages of manual and computer-assisted editing. It details procedures and techniques for editing census data at various stages of the process. Technical considerations, particularly those pertinent to programming, are covered in annexes. Although the focus is primarily on editing for population and housing censuses, many of the concepts and techniques apply to survey operations as well.
- Editing data by imputation or other methods implies assumptions and transformation of data provided by respondents. It is good practice to always preserve an unedited copy of the dataset, to document in detail the data editing process, and to preserve the programs written to edit the data.
- In some cases, leaving some inconsistencies in the data is a better option than doing the "perfect cleaning". Data analysts are expected to run consistency and quality checks on the data, and will often prefer to implement their own editing methods instead of being provided with a suspiciously "perfect" dataset.
