This phase describes the cleaning of data records and their preparation for analysis. It comprises sub-processes that check, clean, and transform the collected data, and may be repeated several times.

The “Process” and “Analyze” phases can be iterative and parallel. Analysis can reveal a broader understanding of the data, which might make it apparent that additional processing is needed. Activities within the “Process” and “Analyze” phases may commence before the “Collect” phase is completed. This enables the compilation of provisional results where timeliness is an important concern for users, and increases the time available for analysis. The key difference between these phases is that “Process” concerns transformations of microdata, whereas “Analyze” concerns the further treatment of statistical aggregates.

This phase comprises eight sub-processes:

  • 5.1. Integrate data - This sub-process integrates data from one or more sources. The input data can be from a mixture of external or internal data sources and a variety of collection modes, including extracts of administrative data, resulting in a harmonized data set. Data integration typically includes matching and record linkage routines to link data from different sources, where those data refer to the same unit; and prioritizing, when two or more sources (with potentially different values) contain data for the same variable. Data integration may take place at any point in this phase, before or after any other sub-processes. There may also be several instances of data integration in any statistical business process. Following integration, and depending on data protection requirements, data may be anonymized, i.e., stripped of identifiers such as name and address, to help protect confidentiality.
  • 5.2. Classify and code - This sub-process classifies and codes the input data. For example, automatic or clerical coding routines may assign numeric codes to text responses according to a pre-determined classification scheme.
  • 5.3. Review, validate and edit - This sub-process applies to collected microdata. It looks at each record to identify and, where necessary, correct potential problems, errors, and discrepancies such as outliers, item non-response, and miscoding. Also referred to as input data validation, it may be run iteratively, validating data against predefined edit rules, usually in a set order. It may apply automatic edits, or raise alerts for manual inspection and correction of the data. Reviewing, validating, and editing can apply to unit records both from surveys and administrative sources, before and after integration. In certain cases, imputation (sub-process 5.4) may be used as a form of editing.
  • 5.4. Impute - Where data are missing or unreliable, estimates may be imputed, often using a rule-based approach. Specific steps typically include:
    • identification of potential errors and gaps;
    • selection of data to include or exclude from imputation routines;
    • imputation using one or more pre-defined methods, e.g., “hot-deck” or “cold-deck”;
    • writing imputed data back to the data set and flagging them as imputed; and
    • the production of metadata on the imputation process.
  • 5.5. Derive new variables and statistical units - This sub-process derives values for variables and statistical units that are not explicitly provided in the collection, but are needed to deliver the required outputs. It derives new variables by applying arithmetic formulae to one or more of the variables already present in the dataset. This may need to be iterative, as some derived variables may themselves be based on other derived variables. It is therefore important to ensure that variables are derived in the correct order. New statistical units may be derived by aggregating or splitting data for collection units, or by various other estimation methods. Examples include deriving households where the collection units are persons, or enterprises where the collection units are legal units.
  • 5.6. Calculate weights - This sub-process creates weights for unit data records according to the methodology in sub-process 2.5 (Design statistical processing methodology). These weights can be used to “gross-up” sample survey results to make them representative of the target population, or to adjust for non-response in total enumerations. For information and materials on sample calibration, see our related project page.

    ReGenesees (R Evolved Generalized Software for Sampling Estimates and Errors in Surveys) is a full-fledged, open-source R package developed and disseminated by the Italian Statistics Office. ReGenesees is a tool for design-based and model-assisted analysis of complex sample surveys. The package (and its graphical user interface package, ReGenesees.GUI) runs under Windows, Mac OS, Linux, and most Unix-like operating systems.
  • 5.7. Calculate aggregates - This sub-process creates aggregate data and population totals from microdata. It includes summing data for records sharing certain characteristics, determining measures of average and dispersion, and applying weights from sub-process 5.6 to sample survey data to derive population totals.
  • 5.8. Finalize data files - This sub-process compiles the results of the other sub-processes in this phase to produce a data file, usually of macrodata, which is used as the input to phase 6 (Analyze). Sometimes this may be an intermediate file rather than a final one, particularly for business processes where there are strong time pressures and a requirement to produce both preliminary and final estimates.
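
The matching and prioritization described in sub-process 5.1 can be sketched as follows. This is a minimal illustration only: the field names (`turnover`, `region`), the sample values, and the rule that survey values outrank administrative values are all hypothetical assumptions, not part of any standard.

```python
# Sketch of sub-process 5.1: link records from two sources on a shared
# unit identifier, and prioritize one source per variable on conflicts.
# All names, values, and the priority rule are illustrative assumptions.

def integrate(admin_records, survey_records, priority=("survey", "admin")):
    """Link records on unit id; on conflicts, keep the value from the
    higher-priority source; gaps fall through to the other source."""
    sources = {"admin": admin_records, "survey": survey_records}
    unit_ids = set(admin_records) | set(survey_records)
    integrated = {}
    for uid in unit_ids:
        merged = {}
        # Apply lower-priority sources first so higher-priority values overwrite.
        for name in reversed(priority):
            record = sources[name].get(uid, {})
            merged.update({k: v for k, v in record.items() if v is not None})
        integrated[uid] = merged
    return integrated

admin = {1: {"turnover": 100, "region": "N"}, 2: {"turnover": 250, "region": "S"}}
survey = {1: {"turnover": 120}, 3: {"turnover": 80, "region": "E"}}

result = integrate(admin, survey)
# The survey value wins for unit 1's turnover; the admin source fills in region.
```

A production record-linkage routine would of course also handle probabilistic matching of units without a common identifier, which this sketch omits.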
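
The steps listed under sub-process 5.4 can be illustrated with a simple sequential hot-deck routine: a missing value is replaced by the most recent observed value from a "donor" record in the same imputation class, and the result is flagged as imputed so metadata can be produced. The variable and class names are hypothetical.

```python
# Sketch of sub-process 5.4: sequential hot-deck imputation with flagging.
# Variable names, classes, and values are illustrative assumptions.

def hot_deck_impute(records, variable, class_var):
    """Impute missing `variable` values (None) within classes of
    `class_var`, flagging each value as imputed or observed."""
    last_donor = {}  # most recent observed value per imputation class
    for rec in records:
        cls = rec[class_var]
        if rec[variable] is None:
            if cls in last_donor:
                rec[variable] = last_donor[cls]
                rec[variable + "_imputed"] = True  # flag for imputation metadata
        else:
            last_donor[cls] = rec[variable]
            rec[variable + "_imputed"] = False
    return records

data = [
    {"region": "N", "income": 300},
    {"region": "N", "income": None},   # takes the donor value from the first record
    {"region": "S", "income": 150},
]
hot_deck_impute(data, "income", "region")
```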
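
Sub-process 5.5 notes that derivation may be iterative and order-dependent. The sketch below makes that concrete: derivations are held in an ordered list, so a formula defined later may safely use a variable derived earlier. The variable names and formulae are invented for illustration.

```python
# Sketch of sub-process 5.5: deriving new variables in the correct order,
# since a derived variable may itself depend on another derived variable.
# Variable names and formulae are illustrative assumptions.

record = {"price": 20.0, "quantity": 5}

# Each entry derives one variable; the list order encodes the dependency
# order, so later formulae may reference earlier derived results.
derivations = [
    ("revenue", lambda r: r["price"] * r["quantity"]),
    ("revenue_after_tax", lambda r: r["revenue"] * 0.8),  # uses a derived variable
]

for name, formula in derivations:
    record[name] = formula(record)
```

Reordering the two entries would fail with a `KeyError`, which is exactly the "correct order" concern the sub-process describes.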
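
Finally, the link between sub-processes 5.6 and 5.7 can be shown with the simplest design-weight scheme: each sampled record in stratum h receives weight N_h / n_h (population count over sample count), and a weighted sum then grosses the sample up to a population total. Stratum labels and counts are illustrative; real weighting would also adjust for non-response or use calibration, as in ReGenesees.

```python
# Sketch of sub-processes 5.6 and 5.7: simple stratified design weights
# (N_h / n_h) used to gross a sample total up to a population estimate.
# Strata, counts, and values are illustrative assumptions.
from collections import Counter

def calculate_weights(sample, population_counts):
    """5.6: attach weight = N_h / n_h to each record in stratum h."""
    n_h = Counter(rec["stratum"] for rec in sample)
    for rec in sample:
        rec["weight"] = population_counts[rec["stratum"]] / n_h[rec["stratum"]]
    return sample

def weighted_total(sample, variable):
    """5.7: aggregate weighted microdata into a population total."""
    return sum(rec["weight"] * rec[variable] for rec in sample)

sample = [
    {"stratum": "A", "employees": 10},
    {"stratum": "A", "employees": 30},
    {"stratum": "B", "employees": 5},
]
calculate_weights(sample, {"A": 100, "B": 50})  # 100 A-units, 50 B-units exist
total = weighted_total(sample, "employees")
# weights: A = 100/2 = 50, B = 50/1 = 50, so total = 50*(10+30) + 50*5 = 2250
```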