Health Information Identification and De-Identification Toolkit (HIIDIT)

We are conducting research on appropriate architectures for patient identification systems. Our goals are to develop a toolkit of techniques and implementations that will allow us to build a handful of demonstrable and testable systems of this kind, to develop and study methods for de-identifying patient data without destroying its utility for research, and to evaluate our results against existing and proposed desiderata for the usability and protection of health data. This research project is funded by the National Library of Medicine under grant R01 LM06587-01. Isaac Kohane is the principal investigator, with a subcontract to the Clinical Decision Making Group at MIT.

Many trends make patient identification one of the higher and more controversial priorities in the implementation of health information systems. The desire to better understand the long-term health status of patients when addressing their immediate needs, to study the effectiveness of different patterns of care, to investigate the long-term outcomes of proposed interventions through clinical research studies, and to optimize the system of healthcare delivery all create the need for coherent, comprehensive longitudinal records about the care of individual patients. Because most data are collected in disparate, unintegrated ways, however, it is vitally important to be able to identify the same individual’s data even when they come from different institutions and were collected by different means and at different times. The most convenient way to address this problem would be to associate with each individual a unique, permanent identification number that would be used universally in every database that collected information about that individual. Under such a scheme, every database in the country could, at least in principle, be joined on this common key to produce a complete database about everyone. This ability is, of course, both the advantage and the defect of this simple scheme. Although it makes collation of data relatively easy, it also makes it too easy to violate our national traditions of privacy and patients’ expectations of confidentiality of their health data. Nevertheless, many contemporary proposals envision just such a system, based on adoption of the Social Security Number (SSN) as the unique identifier.

Fortunately, the past two decades have seen remarkable advances in the development of cryptographic techniques that can assure the confidential communication of data and the reliable identification of individuals and institutions through public-key cryptographic systems and related digital signatures. Our aim in this proposal is to develop a set of tools that will allow the creation of a broad range of patient identification systems. These systems differ in the trade-offs that they make among competing desiderata for an identification system, including dimensions such as who controls the creation and dissemination of identifiers, the extent to which the same identifier can be used for multiple purposes, the source of trust that certifies the identity of a patient or institution, the degree to which the identifier itself is kept secret, and the complexity of the resulting system of identification. Our designs and implementations are based on a recently proposed "Simple Distributed Security Infrastructure" (SDSI) that provides a small, powerful set of security capabilities in terms of the underlying cryptographic techniques.
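One of the design dimensions listed above, the extent to which the same identifier can be used for multiple purposes, can be illustrated with a small sketch. The code below is not SDSI and is not part of the toolkit; it substitutes a keyed hash (HMAC-SHA256 from the Python standard library) for the public-key machinery, and the function name, secret, and purpose labels are all hypothetical. It assumes a single issuer holds a secret and derives a different identifier for each purpose, so that databases built for different purposes cannot be joined on a common key without the issuer's cooperation.

```python
import hashlib
import hmac


def purpose_identifier(issuer_secret: bytes, patient_id: str, purpose: str) -> str:
    """Derive a purpose-specific pseudonymous identifier (illustrative only).

    Assumption: one issuer holds `issuer_secret` and controls who may
    obtain identifiers. This is a keyed-hash stand-in, not SDSI.
    """
    # Length-prefix the first field so ("ab", "c") and ("a", "bc")
    # can never produce the same message.
    message = f"{len(patient_id)}:{patient_id}|{purpose}".encode("utf-8")
    return hmac.new(issuer_secret, message, hashlib.sha256).hexdigest()


# Hypothetical secret and labels, for demonstration only.
secret = b"demo-secret-held-by-the-issuer"
clinic_id = purpose_identifier(secret, "patient-0001", "clinical-care")
trial_id = purpose_identifier(secret, "patient-0001", "research-trial")

# The same patient gets unrelated-looking identifiers in the two settings,
# but each identifier is stable within its own setting.
print(clinic_id != trial_id)
print(clinic_id == purpose_identifier(secret, "patient-0001", "clinical-care"))
```

A single universal identifier such as the SSN sits at the opposite end of this dimension: it is trivially linkable across every database that stores it. Other dimensions, such as who certifies identity, are not captured by this sketch.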

Once comprehensive data sets are collected, it is also critical to be able to de-identify ("scrub") those data so that researchers and managers interested in studying aspects of the care process that do not require linkage to specific individuals can do so with real data, while minimizing the exposure of the subject individuals. Our work in HIIDIT extends our previous work in this area to develop adaptive methods that can de-identify patient records that combine tabular and textual data, and to study appropriate methods of formalizing and testing the adequacy of de-identification methods.
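To make the idea of scrubbing a record that mixes tabular and textual data concrete, here is a minimal sketch. It is not the Scrub system's algorithm and it is not adaptive; it only assumes that the name stored in the coded fields can be used to find the same name in the narrative note, and that phone numbers follow one fixed pattern. The field names, the `scrub_record` function, and the sample record are hypothetical.

```python
import re

# Assumed phone format for this sketch: NNN-NNN-NNNN.
PHONE_PATTERN = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def scrub_record(record: dict, pseudonyms: dict) -> dict:
    """Return a copy of `record` with the patient's name replaced by a
    pseudonym in both the coded fields and the free-text note."""
    replacements = {
        record["first_name"]: pseudonyms["first_name"],
        record["last_name"]: pseudonyms["last_name"],
    }
    note = record["note"]
    for real, fake in replacements.items():
        # Whole-word, case-insensitive match, so "Doe" does not hit "Doering".
        note = re.sub(rf"\b{re.escape(real)}\b", fake, note, flags=re.IGNORECASE)
    note = PHONE_PATTERN.sub("[PHONE]", note)

    scrubbed = dict(record)      # leave the original record untouched
    scrubbed.update(pseudonyms)  # overwrite the coded name fields
    scrubbed["note"] = note
    return scrubbed


record = {
    "first_name": "Jane",
    "last_name": "Doe",
    "diagnosis": "asthma",
    "note": "Jane Doe reports wheezing. Mrs. Doe can be reached at 555-010-0199.",
}
scrubbed = scrub_record(record, {"first_name": "Alex", "last_name": "Rivers"})
print(scrubbed["note"])
```

A fixed rule set like this misses nicknames, misspellings, and names of relatives mentioned in the note, which is the kind of gap that adaptive methods are meant to close.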

The results of our work on HIIDIT will inform policy makers and information system architects of the range of possible solutions to the tasks of patient identification and de-identification. They will also provide a reference set of tools for implementors who are developing health information systems.

The research and evaluation plan is summarized in the following specific aims:

1. Develop the Health Information Identification and De-Identification Toolkit (HIIDIT), a toolkit that provides a range of solutions for patient identification and de-identification to meet various national and patient objectives in healthcare access, delivery, and research.

This goal involves formalizing the problem space of potential solutions to the identification and de-identification tasks. For the identification task we will map this problem space to different applications of the SDSI cryptographic system. For the de-identification task we will build upon earlier work on the SCRUB system of Latanya Sweeney.

2. Apply HIIDIT to two different tasks: retrieving individual patient data for clinical care within a multi-institutional healthcare system and retrieving aggregate data for a multi-institutional clinical research trial.

The flexibility in the use of HIIDIT will be demonstrated by applying it to these two tasks. For each task we will implement a centralized, conventional information management scheme and a decentralized, patient-controlled scheme. Existing information systems will be used as the test-bed for the various HIIDIT implementations. The technical performance of each HIIDIT implementation will be evaluated.

3. Evaluate the application of HIIDIT in terms of protection of confidentiality within the context of national healthcare delivery and research goals.

The task domain of HIIDIT is in part a social one: identifying patients for a variety of purposes, not all directly related to patient care, while meeting societal standards of patient confidentiality. Therefore, we will evaluate the degree to which the various instances of HIIDIT implementations meet national standards, particularly those recommended by the Committee on Maintaining Privacy and Security in Health Care Applications of the National Information Infrastructure of the National Research Council.

4. Develop new methods to de-identify data in databases that include both coded fields and narrative text, and develop formal criteria for evaluating the degree of success of de-identification methods.

De-identifying data about individuals requires obscuring the relationship between data about the individual and his or her identity. This can typically be accomplished by replacing names, addresses, etc. with pseudonyms, false addresses, etc., in both coded data and narrative text. We propose to extend existing techniques to learn adaptively how to find and replace such identifying information more successfully. However, even data with obvious identifiers removed may still contain enough unique characteristics that they can be successfully matched against other data about known individuals. We plan to develop a formal model of this possibility of matching and, based on this, to create consistent testing methods for de-identification techniques.
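The matching risk described above can be illustrated with a simple counting measure. The sketch below is not the formal model proposed here; it only assumes that an outside party knows a few non-name attributes about an individual (ZIP code, birth year, and sex in this hypothetical example) and asks how many released records share each combination of those attributes. A combination that occurs only once points to exactly one record. The function name, field names, and sample data are invented for illustration.

```python
from collections import Counter


def matching_risk(records, linking_fields):
    """Count how many records share each combination of `linking_fields`.

    Returns the size of the smallest group (0 for an empty data set) and
    the list of combinations that occur exactly once.
    """
    keys = [tuple(r[f] for f in linking_fields) for r in records]
    counts = Counter(keys)
    smallest = min(counts.values(), default=0)
    unique = [key for key, n in counts.items() if n == 1]
    return smallest, unique


# Hypothetical released data: names removed, other fields intact.
released = [
    {"zip": "02139", "birth_year": 1950, "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_year": 1950, "sex": "F", "diagnosis": "diabetes"},
    {"zip": "02139", "birth_year": 1962, "sex": "M", "diagnosis": "hypertension"},
    {"zip": "02142", "birth_year": 1971, "sex": "F", "diagnosis": "migraine"},
]

smallest, unique = matching_risk(released, ["zip", "birth_year", "sex"])
print(smallest)  # size of the smallest group
print(unique)    # combinations that single out one record
```

In this sample, two of the four records are singled out by their combination of ZIP code, birth year, and sex, even though no name appears anywhere. A consistent test for a de-identification technique could report measures of this kind before and after scrubbing.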

The HIIDIT proposal is narrowly defined to include only issues of identification and de-identification. It does not encompass the much larger agenda of creating a Master Patient Index (MPI). Several other groups are already beginning to address the task of an MPI, notably the MPI workshops run by Los Alamos National Laboratory. HIIDIT is positioned to contribute to various MPI efforts as well as to other applications of identifier systems, such as multi-center research studies and population-wide DNA data banks, where there are different trade-offs among security, privacy, and access.

Last updated 5/98 ISK