Long-term preservation of data and metadata

The IHSN has contracted the Inter-university Consortium for Political and Social Research (ICPSR, University of Michigan) to develop guidelines for statistical agencies interested in establishing good preservation practices.

Working Paper No 003

Principles and Good Practice for Preserving Data

This document provides basic guidance for managers in statistical agencies who are responsible for preserving data using the principles and good practice defined by the digital preservation community. The guidance defines the rationale for preserving data and the principles and standards of good practice as applied to data preservation; documents the development of a digital preservation policy; and uses digital archive audit principles to suggest good practice for data.

Download: IHSN-WP003.pdf size: 1.3MB

See also the on-line tutorial available at the ICPSR website, from which much of the content below was taken.

What do we mean by data preservation?

Microdata preservation refers to the management of digital data and related metadata over time to guarantee their long-term usability. It requires establishing and implementing a preservation policy and procedures to ensure that data and all related metadata are protected against:

  • Hardware or software obsolescence
  • Media failure, and
  • Other physical threats.

Unlike the preservation of information on paper, the preservation of digital information demands constant attention. In most developing countries, statistical agencies and other data producers pay insufficient attention to the issue, and few have formal preservation policies and satisfactory practices. Common issues include:

  • Loss of data and metadata
  • Data available, but in unreadable formats or on unreadable media
  • Data available, but undocumented
  • Documentation only available in hard copy
  • Multiple versions of datasets available, with no “versioning” information

The solution is to establish formal preservation policies and procedures (no ad hoc action!) to:

  • Back up data regularly, and store data in different locations
  • Ensure suitable data storage
  • Refresh media periodically (copy digital information from one medium to another)
  • Migrate data periodically (convert data from one technology to another, whether hardware or software)
  • Enforce security and controlled access to the data
  • Develop a disaster recovery plan
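
As an illustration of the "refresh media" step above, the following is a minimal Python sketch of copying a file to a new storage location and confirming that the copy is bit-for-bit identical by comparing checksums. The function names (`sha256_of`, `refresh_copy`) and the choice of SHA-256 are assumptions made for this example; they are not prescribed by the IHSN or ICPSR guidance.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh_copy(source: Path, target_dir: Path) -> str:
    """Copy `source` into `target_dir` and verify the copy.

    Hypothetical helper: raises OSError if the checksum of the copy
    differs from the checksum of the original, otherwise returns the
    verified digest so it can be recorded alongside the data.
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    expected = sha256_of(source)
    shutil.copy2(source, target)  # copy2 also preserves timestamps
    actual = sha256_of(target)
    if actual != expected:
        raise OSError(f"Checksum mismatch after copying {source} to {target}")
    return actual
```

Storing the returned digest with the dataset's metadata would allow later integrity checks to detect silent corruption on any copy.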

Why is it so important?

Reasons to preserve your survey and census data and metadata (and their readability) include:

  • Allow users in the near and distant future to exploit them
  • Allow replication of data collection and analysis
  • Build time series of data
  • Build institutional memory
  • Satisfy a legal obligation

What are the main issues?

    Problem 1 - Hardware obsolescence

Storage media are rapidly superseded by smaller, denser, faster media. The device needed to read an “old” medium may no longer be manufactured.

    Problem 2 - Software obsolescence

A file format may be superseded by newer versions and no longer be supported. Various factors contribute to software obsolescence:

  • New computing hardware opens the door to new and improved software. Software upgrades fail to support legacy files, leading to software and file format obsolescence
  • Software supporting the format fails in the marketplace or is bought by a competitor and withdrawn
  • The format is superseded by another
  • Take-up of the format is low, or industry fails to create compatible software
  • The format is no longer compatible with the current environment

The most vulnerable files are those in proprietary formats with closed specifications (example: SAS). Files in proprietary formats with open specifications carry a lower risk because the specification has been publicly released, allowing others to produce software that can read them (example: PDF). The least vulnerable files are those in non-proprietary formats with open specifications. In terms of guaranteed long-term availability, published specifications produced by international standards bodies are the safest (examples: XML, ASCII, JPEG).
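
The three risk tiers described above could be applied in a simple inventory of an agency's holdings. The following is a minimal Python sketch under stated assumptions: the mapping from file extensions to tiers is illustrative only (it covers just the examples named in the text), the tier labels "high", "medium" and "low" are invented for this example, and a file extension alone is not a reliable way to identify a format.

```python
from collections import Counter
from pathlib import PurePath

# Illustrative mapping only, based on the examples in the text.
# A real inventory would need a fuller, locally maintained list.
RISK_BY_EXTENSION = {
    ".sas7bdat": "high",   # proprietary, closed specification (SAS)
    ".pdf": "medium",      # proprietary origin, open specification (PDF)
    ".xml": "low",         # open, standards-body specification (XML)
    ".txt": "low",         # plain text, assumed here to be ASCII
    ".jpg": "low",         # open, standards-body specification (JPEG)
    ".jpeg": "low",
}


def format_risk(filename: str) -> str:
    """Return the assumed obsolescence risk tier for a file name."""
    suffix = PurePath(filename).suffix.lower()
    return RISK_BY_EXTENSION.get(suffix, "unknown")


def audit(filenames) -> Counter:
    """Count how many files fall into each risk tier."""
    return Counter(format_risk(name) for name in filenames)
```

Files reported as "high" or "unknown" would be the first candidates for review and, where appropriate, migration to a more durable format.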

    Problem 3 - Physical threats

Physical damage can occur to hardware and media due to:

  • Material instability
  • Improper storage environment (temperature, humidity, light, dust)
  • Overuse (mainly for physical contact media)
  • Natural disaster (fire, flood, earthquake)
  • Infrastructure failure (plumbing, electrical, climate control)
  • Inadequate hardware maintenance
  • Human error (including improper handling)
  • Sabotage (theft, vandalism)

References

Two key documents have emerged from the digital preservation community.

See also: