
Microdata anonymization

Statistical agencies and other data producers are increasingly publishing microdata obtained from sample surveys, censuses, and administrative data collection systems. The dissemination of microdata is made necessary by a high demand from the research community, a push for transparency, and sometimes by legal or contractual obligations. This must be done in such a way that the confidentiality of the information provided by respondents is preserved.

In this section we present:

Links to available tools are also provided, as well as a compilation of practices.

Anonymization is typically required for the production of public use files and, to a lesser extent, for generating licensed files. But anonymization is only one of several ways to minimize the risk of disclosure when distributing microdata. Legal and organizational measures contribute to this endeavor as well. For datasets provided to selected bona fide users, the legal agreement may provide a higher level of security than anonymization alone (see the section on formulating a data dissemination policy).

This guide, Introduction to Statistical Disclosure Control (SDC), discusses common SDC methods for microdata obtained from sample surveys, censuses and administrative sources.

Download

A practice guide has also been produced, which relies on the open source R package sdcMicro.

Releasing data safely is necessary to protect the integrity of the statistical system: agencies must honor their commitment to respondents to protect their identity. Agencies rarely share with one another, in substantial detail, their knowledge and experience of SDC and of the processes for creating safe data. This makes it difficult for agencies new to the process to implement solutions. To fill this gap in experience and knowledge, we evaluated a broad suite of SDC methods on a range of survey microdata covering important development topics related to health, labor, education, poverty and inequality. The data we used had all previously been treated by their producers to make them safe for release. It was therefore neither possible nor our goal to pass judgment on the safety of these data, many of which are in the public domain. The focus was rather on measuring the effects that various methods would have on the risk-utility trade-off for microdata produced to measure common development indicators. We used the experience from this large-scale experimentation to inform our discussion of the processes and methods in this guide.
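
To make the risk-utility trade-off concrete, the following is a minimal sketch in Python and is not taken from the guide or from any SDC package. It assumes a hypothetical toy dataset in which age, region and sex are the quasi-identifiers, uses the number of records violating k-anonymity as a simple risk measure, and uses the number of distinct age values retained as a crude stand-in for utility. The function names violations and recode_age are illustrative only.

```python
from collections import Counter

# Hypothetical toy records; (age, region, sex) are assumed quasi-identifiers.
records = [
    (23, "North", "F"),
    (27, "North", "F"),
    (34, "South", "M"),
    (36, "South", "M"),
    (38, "South", "M"),
    (61, "North", "M"),
]


def violations(rows, k):
    """Count records whose quasi-identifier combination occurs fewer than k times."""
    counts = Counter(rows)
    return sum(1 for row in rows if counts[row] < k)


def recode_age(row, width=10):
    """Global recoding: replace exact age with the lower bound of its age band."""
    age, region, sex = row
    return (age // width * width, region, sex)


recoded = [recode_age(r) for r in records]

# Risk: records that are not k-anonymous for k = 2, before and after recoding.
risk_before = violations(records, k=2)
risk_after = violations(recoded, k=2)

# Crude utility proxy (an assumption of this sketch): distinct age values kept.
distinct_before = len({r[0] for r in records})
distinct_after = len({r[0] for r in recoded})

print("records violating 2-anonymity:", risk_before, "->", risk_after)
print("distinct age values:", distinct_before, "->", distinct_after)
```

Recoding age into bands lowers the number of records at risk but also discards detail in the age variable. Choosing an acceptable balance between the two is the kind of decision the guide discusses.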

Download (8 MB)