Data archiving

Rationale

Survey and census microdata are valuable and irreplaceable assets for government departments and academic researchers, and they should be managed in a way that encourages their widest possible use and re-use. At the same time, protecting respondents is a paramount concern of data collectors.

Robust procedures are thus necessary to reassure stakeholders that the microdata will be disseminated and used in an optimum manner, for the benefit of users and everyone in society, whether now or in the future. Establishing and implementing such procedures require expertise and resources. It is therefore recommended that national or international microdata archives be established to:

  • Promote the acquisition, documentation, dissemination and preservation of microdata essential for the production of national statistics, for research and for instruction in the social sciences,
  • Promote the effective use of existing survey and census data,
  • Ensure the continued viability and usability of microdata now and in the future,
  • Provide equitable access to these data within the framework of the national legislation in the interest of all citizens, by protecting confidentiality and following international recommendations and good practices.

Roles of a data archive

The roles of a data archive include the acquisition, preservation, documentation, cataloguing, anonymization, and dissemination of microdata.

Microdata are generated from various data collection activities: surveys, censuses, and administrative recording systems. The broader the scope, the greater the usefulness of the collection will be. Nevertheless, data archives can be overwhelmed by vast amounts of data if they do not define a clear acquisition policy describing their scope, resources and mandate. The data acquisition plan must define a set of criteria that characterize the value of the data (e.g. geographic breadth, sample size, uniqueness of research question) as well as the cost of archiving them (e.g. amount and quality of documentation, type of media). When a data archive is established by a national statistical agency, the priority will be to archive the data collected by the agency itself. However, national statistical agencies can play an important role by expanding the scope of their data archive to official sources such as line ministries, and even to non-official sources such as data collected by academic centers, international or non-governmental agencies, or the private sector. To acquire datasets from other data producers, data archives use deposit forms and procedures.


Documentation of micro-datasets is often the last step in data collection activities. As a result, documentation is often of uneven or poor quality. As part of the data management plan, national data archives should present comprehensive information on the processes and methods used to produce microdata. This, in turn, will foster greater awareness, improve the usability and understanding of the data, and enhance their functionality. Among other things, the documentation should describe the data collection arrangements and should include items such as sample design, questionnaires, coding instructions and classifications, editing, validation, methodologies, reason for and method of data collection, data quality, confidentiality, anonymization procedures, and any other relevant materials. Statistical resources should be documented according to international standards and best practices such as the Data Documentation Initiative (DDI) and the Dublin Core metadata specifications.
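As an illustration of standards-based documentation, the following minimal sketch writes a small Dublin Core record as XML with Python's standard library. The element names and namespace come from the Dublin Core element set; the wrapper element `metadata`, the helper `build_dc_record`, and the survey details are hypothetical. A full DDI codebook is far richer and is not shown here.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_dc_record(fields):
    """Return an XML string with one dc:* element per value in `fields`.

    `fields` maps a Dublin Core element name (e.g. "title") to a string
    or to a list of strings for repeatable elements.
    """
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, values in fields.items():
        if isinstance(values, str):
            values = [values]
        for value in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical survey used only for illustration.
record = build_dc_record({
    "title": "Household Budget Survey 2010 (hypothetical)",
    "creator": "National Statistics Office (hypothetical)",
    "date": "2010",
    "coverage": "National",
    "description": "Sample design, questionnaires and editing rules "
                   "are described in the attached technical report.",
    "rights": "Licensed access for research purposes only",
})
print(record)
```

Because the record is plain XML, the same file can later feed a catalogue or be exchanged with other archives without manual re-entry.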

The IHSN has developed tools and guidelines on microdata documentation. See the Documentation section of our website.

Cataloguing procedures must be put in place to enable users to identify and access information relevant to their needs for every resource. The information should be made available in a comprehensive catalogue and should be kept up to date. It should be easily accessible and should contain information about the title, content, geographic context, timeliness, availability and accessibility of each resource. Access should be enhanced through the provision of comprehensive indexes, and availability should be increased via web-based search engines. Documenting data using international XML metadata standards such as the DDI and Dublin Core will make cataloguing considerably easier and more powerful.
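To make the idea of a searchable catalogue concrete, here is a minimal sketch of a keyword index over catalogue records. The record identifiers, field names and the functions `build_index` and `search` are hypothetical; a production catalogue would normally rely on a dedicated search engine and on metadata harvested from DDI or Dublin Core files.

```python
from collections import defaultdict


def build_index(catalogue, fields=("title", "content", "geography")):
    """Map each lower-cased word to the set of record ids that contain it."""
    index = defaultdict(set)
    for record_id, record in catalogue.items():
        for field in fields:
            for word in record.get(field, "").lower().split():
                index[word.strip(".,;:()")].add(record_id)
    return index


def search(index, query):
    """Return the ids of records that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical catalogue entries used only for illustration.
catalogue = {
    "HBS-2010": {"title": "Household Budget Survey 2010",
                 "content": "income, expenditure, consumption",
                 "geography": "national"},
    "LFS-2012": {"title": "Labour Force Survey 2012",
                 "content": "employment, unemployment, wages",
                 "geography": "national, urban"},
    "CENSUS-2011": {"title": "Population and Housing Census 2011",
                    "content": "demography, housing, education",
                    "geography": "national"},
}

index = build_index(catalogue)
print(sorted(search(index, "survey national")))  # ['HBS-2010', 'LFS-2012']
```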

The IHSN has developed tools and guidelines on survey cataloguing. See the Cataloguing section of our website.

There is a fundamental tension at the core of every statistical agency's mission. Statistical agencies are charged with collecting and disseminating high-quality data to inform national policy and enable statistical research. They are also charged with protecting the confidentiality of survey respondents - not only because of legal and ethical mandates, but also because public trust is widely perceived to be an important contributor to data quality and response rates. Protecting confidentiality requires some form of data anonymization so that individual respondents cannot be identified. In sum, while statistical agencies go to great lengths to collect high-quality data, they cannot protect those same data without compromising their quality to some degree. Greater confidentiality protection very often means less valuable data. National data archives must develop and implement a set of anonymization procedures as part of their data dissemination plan, and these procedures must be publicly available.
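The trade-off between protection and detail can be illustrated with a small sketch. It uses k-anonymity, one common disclosure-risk measure that the text above does not name, and is shown here purely as an assumption: a file is k-anonymous if every combination of quasi-identifier values is shared by at least k records. The records, the choice of quasi-identifiers and the helper names are hypothetical.

```python
from collections import Counter


def smallest_group_size(records, quasi_identifiers):
    """Size of the smallest group sharing the same quasi-identifier values."""
    if not records:
        return 0
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(counts.values())


def risky_records(records, quasi_identifiers, k):
    """Records whose quasi-identifier combination occurs fewer than k times."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return [r for r in records
            if counts[tuple(r[q] for q in quasi_identifiers)] < k]


def coarsen_age(record, width=10):
    """Replace the exact age with a band such as '30-39' (less detail)."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}


# Hypothetical respondents used only for illustration.
QUASI = ("age", "sex", "district")
records = [
    {"age": 34, "sex": "F", "district": "North"},
    {"age": 36, "sex": "F", "district": "North"},
    {"age": 31, "sex": "F", "district": "North"},
    {"age": 52, "sex": "M", "district": "South"},
    {"age": 58, "sex": "M", "district": "South"},
]

print(smallest_group_size(records, QUASI))    # 1: every respondent is unique
coarsened = [coarsen_age(r) for r in records]
print(smallest_group_size(coarsened, QUASI))  # 2: safer, but exact ages are lost
```

Banding the age variable raises the smallest group size from 1 to 2, at the cost of analyses that need exact ages. Real anonymization procedures combine several such methods and assess the resulting information loss.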

The IHSN is developing tools and guidelines on microdata anonymization, and maintains a list of key references and links. See the Data Anonymization section of our website.

Access to microdata is not a right. National data archives provide data to bona fide users only, and only for statistical and research purposes. Access to potentially identifying data requires authorization and is only allowed when the national data archive is satisfied the data will be used exclusively for justifiable research and the information cannot be reasonably obtained elsewhere. A national data archive must develop a formal, transparent microdata dissemination policy.

The IHSN has developed tools and guidelines on microdata dissemination, covering the technical and policy aspects of the issue. See the Data Dissemination section of our website.

Micro-datasets can be damaged or lost because of human error, because of technical problems that lead, for example, to the corruption of data files, or because of disasters such as fire or flood. New technologies can also render old data unreadable, because of either hardware or software advances. A national data archive is responsible for developing a data management plan that will include standard procedures for ensuring the physical security of its resources together with associated backup arrangements for minimizing the impact of adverse events.
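One routine safeguard against silent file corruption is a checksum manifest, sketched below under stated assumptions. The text above does not prescribe this technique; the functions `build_manifest` and `verify` and the file names are hypothetical. The archive records a SHA-256 checksum for each file at deposit time and later recomputes the checksums to detect files that were altered or lost.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file's path (relative to `folder`) to its checksum."""
    folder = Path(folder)
    return {p.relative_to(folder).as_posix(): sha256_of(p)
            for p in sorted(folder.rglob("*")) if p.is_file()}


def verify(folder, manifest):
    """Return the names of files that are missing or whose content changed."""
    current = build_manifest(folder)
    return sorted(name for name, checksum in manifest.items()
                  if current.get(name) != checksum)


if __name__ == "__main__":
    # Demonstration on a temporary folder with hypothetical file names.
    with tempfile.TemporaryDirectory() as tmp:
        data_file = Path(tmp) / "survey_2010.csv"
        data_file.write_text("id,age\n1,34\n")
        manifest = build_manifest(tmp)      # stored at deposit time
        print(verify(tmp, manifest))        # [] while files are intact
        data_file.write_text("id,age\n1,99\n")
        print(verify(tmp, manifest))        # ['survey_2010.csv']
```

A manifest only detects damage; recovery still depends on the backup arrangements described above, ideally with copies kept at a separate location.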

The IHSN has developed tools and guidelines on data and metadata preservation. See the Preservation section of our website.

Data archives in practice: two models

In industrialized countries, data archives are typically specialized data centers attached to universities. These centers do not collect data but operate as “trusted repositories” for data producers. They usually have a high level of expertise and adequate infrastructure, comply with international standards and best practices for documentation, have formal dissemination and preservation policies and procedures, and provide support to users.

Increasingly, in particular in developing countries, data producing agencies such as national statistics offices are developing in-house data archiving services. But for many of them, data archiving is not seen as a key role, expertise is lacking, infrastructure remains inadequate, access policies and procedures are ad hoc, and few if any resources are available to provide support to users.

The situation is, however, changing, and data producers in developing countries are increasingly adopting international standards and good practices of data archiving. These agencies do not need to have all the features of advanced data centers, but data archiving is part of their mandate. Documentation and preservation of the data are a must, even if microdata are not disseminated, because good documentation of past surveys contributes to improving the quality of future surveys. Good practices and international standards are relatively easy to implement.

Standards

Various international standards have been developed for some of the specialized functions of a data archive.

  • The OAIS (ISO 14721) provides a reference model for the organization of a “digital repository” (not specific to statistical data).
  • Various metadata standards have been developed for the documentation and cataloguing of microdata and related materials, in particular the Data Documentation Initiative (DDI) and the Dublin Core Metadata Initiative (DCMI).

Recommended websites

  • Selected data archive websites
  • International initiatives
  • International metadata standards