Microdata documentation
"From the archivist's and the end user's perspective, a "good" dataset is one that is easy to use. Its documentation is clear and easy to understand, the data contain no surprises, and users are able to access the dataset with relatively little start-up time." (Guide to Social Science Data Preparation and Archiving, ICPSR)
The data documentation, or metadata, helps the researcher to:
- Find the data they are interested in. Without names, abstracts, keywords and other important metadata elements, it might be difficult for a researcher to locate the datasets and variables that meet his or her research requirements. Any cataloguing and resource location system - be it manual or digital - is based on metadata.
- Understand what the data are measuring and how the data have been created. Without proper descriptions of the design of the survey and the methods used when collecting and processing the data, the risk is high that the user will misunderstand and even misuse them.
- Assess the quality of the data. Information about the data collection standards, as well as any deviations from the planned standards, is important knowledge for any researcher who wants to know whether the data are useful for his or her research project.
Traditionally, data producers and data archives produced text-based codebooks. Today's alternative to the text-based codebook is the XML-based codebook, produced according to international metadata standards such as the Data Documentation Initiative (DDI) and the Dublin Core. To facilitate the documentation of microdata, the IHSN distributes the Microdata Management Toolkit, and promotes the adoption of international good practices.
A good principle for data management in general, and metadata management in particular, is to capture data/metadata as early as possible and only once. Ideally data/metadata should be captured with as little human effort as possible, and as a direct result of events that imply the "birth" of the data/metadata (Bo Sundgren: Guidelines for the modeling of statistical data and metadata, United Nations, Geneva 1995).
What should be provided?
The information below was extracted from Good Practices in Data Documentation, UK Data Archive, University of Essex. See also the IHSN "Quick Reference Guide for Data Archivists".
There are three main types of material that constitute ideal documentation for a dataset:
1. Explanatory material
This represents the minimum of material to create and preserve, and can be described as the material required to ensure the long-term viability and functionality of a dataset. Full understanding of the dataset and its contents cannot be achieved without this material.
Information about the data collection methods
This section describes the data collection process, whether it is a survey, the collection of administrative information, or the transcription of a document source. It should describe the instruments used, the methods employed, and how these were developed. If applicable, details of the sampling design and sampling frames should be included. It is also extremely useful to include information on any monitoring process undertaken during the data collection as well as details of quality controls.
Information about the structure of the dataset
Key to this type of information is a detailed document describing the structure of the dataset and including information about relationships between individual files or records within the study. It should include, for example, key variables required for unique identification of subjects across files. It should also include the number of cases and variables in each file and the number of files in the dataset. For relational models, a diagram showing the structure and the relations between the records and elements of the dataset should be constructed.
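As an illustration only, the structural facts listed above (number of cases and variables per file, the key variables, and the link between files) can be produced by a short script rather than typed by hand. The sketch below assumes two small CSV files; the file names, the variable names hh_id and person_id, and the helper functions are hypothetical and are not part of any IHSN tool.

```python
import csv
import io

# Hypothetical sample files; in practice these would be read from disk.
HOUSEHOLDS_CSV = "hh_id,region\n1,North\n2,South\n"
PERSONS_CSV = "hh_id,person_id,age\n1,1,34\n1,2,31\n2,1,58\n"


def read_rows(text):
    """Parse CSV text and return (list of variable names, list of row dicts)."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    return list(reader.fieldnames or []), rows


def summarize_file(name, text, key_fields):
    """Count cases and variables and check that key_fields identify each row uniquely."""
    variables, rows = read_rows(text)
    keys = [tuple(row[k] for k in key_fields) for row in rows]
    return {
        "file": name,
        "cases": len(rows),
        "variables": len(variables),
        "key": list(key_fields),
        "key_is_unique": len(keys) == len(set(keys)),
    }


def orphan_keys(child_text, parent_text, link_field):
    """Return link values found in the child file but missing from the parent file."""
    _, child_rows = read_rows(child_text)
    _, parent_rows = read_rows(parent_text)
    parent_ids = {row[link_field] for row in parent_rows}
    return sorted({row[link_field] for row in child_rows} - parent_ids)


household_summary = summarize_file("households.csv", HOUSEHOLDS_CSV, ["hh_id"])
person_summary = summarize_file("persons.csv", PERSONS_CSV, ["hh_id", "person_id"])

for summary in (household_summary, person_summary):
    print(summary)
print("persons without a household:", orphan_keys(PERSONS_CSV, HOUSEHOLDS_CSV, "hh_id"))
```

The same summary could be pasted into the structure document, so that the stated counts and keys always match the files actually distributed.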
Technical information
This information relates to the technical framework and should include:
- the computer system used to generate the files;
- the software packages with which the files were created;
- the medium on which the data was stored, and
- a complete list of all data files present in the dataset.
Variables and values, coding and classification schemes
The documentation should contain a full list describing all variables (or fields) in the dataset, including a complete explanation and full details about the coding and classifications used for the information allocated to those fields. It is especially important to have blank and missing fields explained and accounted for. It is helpful to identify variables to which standard coding classifications apply, and to record the version of the classification scheme used - preferably with a bibliographic reference to that code.
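A minimal sketch of such a variable list, assuming a simple in-memory data dictionary, is shown below. The variable name, labels and codes are invented for illustration; the point is that missing-value codes are listed explicitly, and that the dictionary can be used to detect values in the data that the documentation does not explain.

```python
# Hypothetical data dictionary: one entry per variable, with substantive
# codes and missing-value codes documented separately.
DATA_DICTIONARY = {
    "sex": {
        "label": "Sex of respondent",
        "codes": {"1": "Male", "2": "Female"},
        "missing": {"9": "Not stated"},
    },
}


def undocumented_values(variable, values, dictionary=DATA_DICTIONARY):
    """Return the sorted values of a variable that the dictionary does not account for."""
    entry = dictionary[variable]
    known = set(entry["codes"]) | set(entry["missing"])
    return sorted(set(values) - known)


# Example column as it might appear in a data file; "3" is not documented.
observed = ["1", "2", "2", "9", "3"]
print(undocumented_values("sex", observed))
```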
Information about derived variables
Many data producers derive new variables from original data. This may be as simple as grouping raw age data (age in years) according to groups of years appropriate for the needs of the survey, or it may be much more complex and require the use of sophisticated algorithms. When grouped or derived variables are created, it is important that the logic for the grouping or derivation be clear. Simple grouping, such as for age, can be included within the data dictionary. More complex derivations require other means of recording the information. The best method of describing these is by using flow charts or accurate Boolean statements. It is recommended that sufficient supporting information be provided to allow an easy link between the core variables used and the resultant variables. We would also recommend that the computer algorithms used to create the derivations be saved together with information on the software.
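As an example of keeping the derivation logic explicit, the sketch below groups age in years into age bands. The bands and the function name are assumptions chosen for illustration; any real survey would document its own grouping. Storing the rules as data means the same table can be reproduced in the data dictionary and executed by the code.

```python
# Hypothetical grouping rules: (lower bound, upper bound, label), bounds inclusive.
# An upper bound of None means the band is open-ended.
AGE_GROUP_RULES = [
    (0, 14, "0-14"),
    (15, 64, "15-64"),
    (65, None, "65+"),
]


def derive_age_group(age):
    """Derive the age-group label from age in years; missing age stays missing."""
    if age is None:
        return None
    for low, high, label in AGE_GROUP_RULES:
        # Boolean statement of the rule: low <= age AND (no upper bound OR age <= high)
        if age >= low and (high is None or age <= high):
            return label
    raise ValueError(f"age {age!r} is not covered by any documented rule")


for age in (3, 14, 15, 64, 65, None):
    print(age, "->", derive_age_group(age))
```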
Weighting and grossing
Weighting and grossing variables need to be fully documented, explaining the construction of the variables with a clear indication of the circumstances in which they should be used. The latter is particularly important when different weights need to be applied for different purposes.
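A small worked example can make the intended use of a weight unambiguous. The sketch below assumes a hypothetical sample of three households with a grossing weight attached to each; the figures are invented. It shows how the weight is used to gross up to a population total and to compute a weighted mean, and how the result differs from the unweighted figure.

```python
def grossed_total(values, weights):
    """Population total estimated by multiplying each value by its grossing weight."""
    return sum(v * w for v, w in zip(values, weights))


def weighted_mean(values, weights):
    """Mean of values in which each case counts in proportion to its weight."""
    return grossed_total(values, weights) / sum(weights)


# Hypothetical sample: household size and the number of population
# households that each sampled household represents.
household_size = [2, 4, 6]
grossing_weight = [100, 50, 50]

print("estimated households:", sum(grossing_weight))
print("estimated persons:", grossed_total(household_size, grossing_weight))
print("weighted mean size:", weighted_mean(household_size, grossing_weight))
print("unweighted mean size:", sum(household_size) / len(household_size))
```

Here the weighted mean household size (3.5) differs from the unweighted mean (4.0), which is the kind of difference the documentation should prepare users for.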
Data source
Details about the source from which the data are derived should be included in some detail. For example, when the data source is made up of responses to survey questionnaires, each question should be carefully recorded in the documentation. Ideally, the text will include a reference to the generated variable(s). It is also useful to explain the conditions under which a question would be put to a respondent, including, if possible, the cases to which it applies and, ideally, a summary of response statistics.
Confidentiality and anonymization
It is important to note whether the data contain any confidential information on individuals, households, organizations or institutions. Whenever this is the case, it is recommended to record it together with any agreement on how the data may be used - for example, an agreement with survey respondents. Issues of confidentiality may restrict the analyses to be undertaken or the results to be published, particularly if the data are to be made available for secondary use. If the data were anonymized to prevent the identification of subjects, it is wise to record the anonymization procedure and its impact on the data. Such modifications may restrict subsequent analysis, so an indication of them is useful.
2. Contextual information
This provides users with material about the context in which the data was collected, and how it was put to use. This type of information adds richness and depth to the documentation, and enables the secondary user to fully understand the background and processes behind the data collection exercise. This also forms a vital historical record for future researchers.
Description of the originating project
Details should be provided about the history of the project, or about the process that gave rise to the dataset. This should offer information on the intellectual and substantive framework. For example, the description could cover topics such as:
- why the data collection was felt necessary;
- the aims and objectives of the project;
- who or what was being studied;
- the geographic and temporal coverage;
- publications or policy developments it contributed to or that arose as a response, and
- any other relevant information.
Provenance of the dataset
This information relates to aspects such as the history of the data collection process, changes and developments that occurred in the data themselves and the methodology, or any adjustments made. The following can be provided as well:
- details of data errors;
- problems encountered in the process of data collection, data entry, data checking and cleaning;
- conversion to a different software or operating system;
- bibliographic references to reports or publications that stem from the study, and
- any other useful information on the life-cycle of the dataset.
Serial and time-series datasets, new editions
For repeated cross-section, panel or time-series datasets, it is extremely helpful to obtain additional information describing, for example, changes in the question text, variable labelling or sampling procedures.
3. Cataloguing material
This material serves two purposes. First, it serves as a bibliographic record of the dataset. This allows the dataset to be properly acknowledged and cited in publications, and the material also acts as a formal record for long-term preservation purposes. Second, it is the basic instrument used for resource discovery, allowing the dataset to be uniquely identified within the collection and providing the information secondary users need to judge whether the study is useful for their purpose.
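As a rough sketch of what a machine-readable cataloguing record can look like, the example below writes a few Dublin Core elements (title, creator, date, identifier, description) to XML using Python's standard library. The study title, producer, identifier and the enclosing record element are hypothetical; a real archive would follow the element set and wrapper required by its own cataloguing system.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_record(fields):
    """Build an XML record holding one Dublin Core element per (name, value) pair."""
    root = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields:
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return root


# Hypothetical study used only to illustrate the record.
record = build_record([
    ("title", "Example Household Survey 2020"),
    ("creator", "Example National Statistics Office"),
    ("date", "2020"),
    ("identifier", "EXAMPLE-HHS-2020-v1"),
    ("description", "Illustrative cataloguing record for a household survey."),
])

xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```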

