De-identification

While a person can usually be readily identified from a picture taken directly of them, the task of identifying them on the basis of limited data is harder, yet sometimes possible.

De-identification is the process used to prevent someone's personal identity from being revealed. For example, data produced during human subject research might be de-identified to preserve the privacy of research participants. Biological data may be de-identified in order to comply with the HIPAA regulations that protect patient privacy. [1]

When applied to metadata or general data about identification, the process is also known as data anonymization. Common strategies include deleting or masking personal identifiers, such as personal name, and suppressing or generalizing quasi-identifiers, such as date of birth. The reverse process of using de-identified data to identify individuals is known as data re-identification. Successful re-identifications [2] [3] [4] [5] cast doubt on de-identification's effectiveness. A systematic review of fourteen distinct re-identification attacks found "a high re-identification rate […] dominated by small-scale studies on data that was not de-identified according to existing standards". [6]

De-identification is adopted as one of the main approaches toward data privacy protection. [7] It is commonly used in fields of communications, multimedia, biometrics, big data, cloud computing, data mining, internet, social networks, and audio–video surveillance. [8]

Examples

In designing surveys

When surveys are conducted, such as a census, they collect information about a specific group of people. To encourage participation and to protect the privacy of survey respondents, researchers attempt to design the survey so that no participant's individual responses can be matched with any published data. [9]

Before using information

When an online shopping website wants to learn its users' preferences and shopping habits, it may retrieve customer data from its database and analyze them. The personal data include personal identifiers that were collected directly when customers created their accounts. The website needs to de-identify the data before the records are analyzed, to avoid violating its customers' privacy.
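As a minimal sketch of this pre-handling step (the column names and record are hypothetical), the direct identifiers can be dropped before the records reach analysts:

```python
# Hypothetical set of direct-identifier columns collected at account creation.
DIRECT_IDENTIFIERS = {"name", "email", "phone"}

def strip_identifiers(row: dict) -> dict:
    """Drop direct identifiers, keeping only the fields needed for analysis."""
    return {k: v for k, v in row.items() if k not in DIRECT_IDENTIFIERS}

customer = {"name": "Jane Doe", "email": "jane@example.com",
            "phone": "555-0100", "category": "books", "spend": 42.50}

print(strip_identifiers(customer))  # only the shopping-habit fields remain
```

Note that this removes only direct identifiers; as discussed below, remaining quasi-identifiers may still require generalization.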

Anonymization

Anonymization refers to irreversibly severing a data set from the identity of the data contributor in a study to prevent any future re-identification, even by the study organizers under any condition. [10] [11] De-identification may also include preserving identifying information which can only be re-linked by a trusted party in certain situations. [10] [11] [12] There is a debate in the technology community on whether data that can be re-linked, even by a trusted party, should ever be considered de-identified. [13]

Techniques

Common strategies of de-identification are masking personal identifiers and generalizing quasi-identifiers. Pseudonymization is the main technique used to mask personal identifiers from data records, and k-anonymization is usually adopted for generalizing quasi-identifiers.

Pseudonymization

Pseudonymization is performed by replacing real names with a temporary ID. It deletes or masks personal identifiers to make individuals unidentifiable. This method makes it possible to track an individual's record over time, even as the record is updated. However, it cannot prevent the individual from being identified if some specific combination of attributes in the data record indirectly identifies the individual. [14]
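A minimal sketch of one common implementation (the key, names, and record fields are hypothetical): a keyed hash derives a stable pseudonym, so the same person maps to the same ID across record updates without the real name appearing in the data.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be stored separately from the data.
SECRET_KEY = b"replace-with-a-securely-stored-key"

def pseudonymize(name: str) -> str:
    """Replace a real name with a stable, non-reversible pseudonym."""
    digest = hmac.new(SECRET_KEY, name.encode("utf-8"), hashlib.sha256).hexdigest()
    return "ID-" + digest[:12]

records = [
    {"name": "Alice Smith", "visit": 1, "diagnosis": "flu"},
    {"name": "Alice Smith", "visit": 2, "diagnosis": "recovered"},
]
masked = [{**r, "name": pseudonymize(r["name"])} for r in records]

# Both visits share one pseudonym, so the record can still be tracked over time.
assert masked[0]["name"] == masked[1]["name"]
```

Because the other attributes (here, visit and diagnosis) remain intact, a distinctive combination of them can still single the person out, which is the weakness the text describes.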

k-anonymization

k-anonymization defines attributes that indirectly point to an individual's identity as quasi-identifiers (QIs), and transforms the data so that at least k individuals share each combination of QI values. [14] QI values are handled following specific standards. For example, k-anonymization replaces some original values in the records with new range values and keeps other values unchanged. The new combinations of QI values prevent individuals from being identified while avoiding the destruction of the data records.
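A simplified sketch of this generalization step (the records, QI choices, and ranges are hypothetical): ages are replaced with ten-year ranges and ZIP codes truncated, then each QI combination is checked to ensure it covers at least k records.

```python
from collections import Counter

def generalize(record: dict) -> dict:
    """Generalize quasi-identifiers: age -> 10-year range, ZIP -> 3-digit prefix."""
    decade = record["age"] // 10 * 10
    return {
        "age": f"{decade}-{decade + 9}",
        "zip": record["zip"][:3] + "**",
        "condition": record["condition"],  # sensitive value, kept unchanged
    }

def is_k_anonymous(rows: list, quasi_ids: tuple, k: int) -> bool:
    """Every combination of QI values must appear in at least k rows."""
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in rows)
    return all(c >= k for c in counts.values())

data = [
    {"age": 23, "zip": "47677", "condition": "flu"},
    {"age": 27, "zip": "47602", "condition": "asthma"},
    {"age": 21, "zip": "47678", "condition": "flu"},
]
generalized = [generalize(r) for r in data]

# All three records now share the QI combination ("20-29", "476**"),
# so the table is 3-anonymous on (age, zip).
assert is_k_anonymous(generalized, ("age", "zip"), 3)
```

The sensitive attribute is left intact, illustrating how generalization avoids destroying the records' analytic value.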

Applications

Research into de-identification is driven mostly by the need to protect health information. [15] Some libraries have adopted methods used in the healthcare industry to preserve their readers' privacy. [15]

In big data, de-identification is widely adopted by individuals and organizations. [8] With the development of social media, e-commerce, and big data, de-identification is sometimes required, and often used, to protect data privacy when users' personal data are collected by companies or third-party organizations that analyze them for their own purposes.

In smart cities, de-identification may be required to protect the privacy of residents, workers and visitors. Without strict regulation, de-identification may be difficult because sensors can still collect information without consent. [16]

Limits

When a person participates in genetics research, the donation of a biological specimen often results in the creation of a large amount of personalized data. Such data are uniquely difficult to de-identify. [17]

Anonymization of genetic data is particularly difficult because of the huge amount of genotypic information in biospecimens, [17] the ties that specimens often have to medical history, [18] and the advent of modern bioinformatics tools for data mining. [18] There have been demonstrations that data for individuals in aggregate collections of genotypic data sets can be tied to the identities of the specimen donors. [19]

Some researchers have suggested that it is not reasonable to ever promise participants in genetics research that they can retain their anonymity, but instead, such participants should be taught the limits of using coded identifiers in a de-identification process. [11]

De-identification laws in the United States of America

In May 2014, the United States President's Council of Advisors on Science and Technology found de-identification "somewhat useful as an added safeguard" but not "a useful basis for policy" as "it is not robust against near‐term future re‐identification methods". [20]

The HIPAA Privacy Rule provides mechanisms for using and disclosing health data responsibly without the need for patient consent. These mechanisms center on two HIPAA de-identification standards – Safe Harbor and the Expert Determination Method. Safe harbor relies on the removal of specific patient identifiers (e.g. name, phone number, email address, etc.), while the Expert Determination Method requires knowledge and experience with generally accepted statistical and scientific principles and methods to render information not individually identifiable. [21]

Safe harbor

The safe harbor method uses a list approach to de-identification and has two requirements:

  1. The removal or generalization of 18 elements from the data.
  2. That the Covered Entity or Business Associate does not have actual knowledge that the residual information in the data could be used, alone or in combination with other information, to identify an individual.

Safe Harbor is a highly prescriptive approach to de-identification. Under this method, all dates must be generalized to the year and ZIP codes reduced to three digits. The same approach is used on the data regardless of the context. Even if the information is to be shared with a trusted researcher who wishes to analyze the data for seasonal variations in acute respiratory cases and thus requires the month of hospital admission, this information cannot be provided; only the year of admission would be retained.
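Two of the prescriptive rules above can be sketched as follows (a simplified illustration, not a complete Safe Harbor implementation; the set of low-population ZIP prefixes shown is hypothetical):

```python
from datetime import date

# Illustrative stand-in for the ZIP3 areas with populations small enough
# that Safe Harbor requires the prefix itself to be suppressed.
RESTRICTED_ZIP3 = {"036", "059", "102"}

def generalize_date(d: date) -> str:
    """All date elements are reduced to the year alone."""
    return str(d.year)

def truncate_zip(zip_code: str) -> str:
    """ZIP codes keep only the first three digits; sparse areas become 000."""
    prefix = zip_code[:3]
    return "000" if prefix in RESTRICTED_ZIP3 else prefix

print(generalize_date(date(2019, 7, 4)))  # "2019" -- the admission month is lost
print(truncate_zip("47677"))              # "476"
```

The loss of the admission month in the first call is exactly the context-blind behavior the text describes.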

Expert Determination

Expert Determination takes a risk-based approach to de-identification that applies current standards and best practices from research to determine the likelihood that a person could be identified from their protected health information. This method requires that a person with appropriate knowledge of, and experience with, generally accepted statistical and scientific principles and methods render the information not individually identifiable. It requires:

  1. That the risk is very small that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information; and
  2. That the methods and results of the analysis that justify such a determination are documented.
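One widely used risk measure an expert might compute (a simplified sketch under a "prosecutor" attack model, not a HIPAA-mandated formula; the rows and quasi-identifiers are hypothetical) treats each combination of quasi-identifier values as an equivalence class and takes the worst case over classes:

```python
from collections import Counter

def max_reidentification_risk(rows: list, quasi_ids: tuple) -> float:
    """Worst-case re-identification risk: 1 / smallest equivalence class.

    A record in a class of n indistinguishable records has at most a
    1/n chance of being singled out by an attacker who knows the QIs.
    """
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in rows)
    return 1.0 / min(counts.values())

rows = [
    {"age": "20-29", "zip": "476"},
    {"age": "20-29", "zip": "476"},
    {"age": "30-39", "zip": "476"},
    {"age": "30-39", "zip": "476"},
]
print(max_reidentification_risk(rows, ("age", "zip")))  # 0.5
```

An expert would compare such a measure against a "very small" risk threshold and document the analysis, as the two requirements above specify.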

Research on decedents

The key law governing research on electronic health record data is the HIPAA Privacy Rule. This law allows the use of deceased subjects' electronic health records for research (HIPAA Privacy Rule, section 164.512(i)(1)(iii)). [22]

See also

Medical privacy
Health Insurance Portability and Accountability Act
Medical record
Personal data
Personal health record
Pseudonymization
Protected health information
Biobank ethics
Infoveillance
Privacy for research participants
Data masking
Quasi-identifier
Health information on the Internet
Data anonymization
Latanya Sweeney
k-anonymity
Data re-identification
Genetic privacy
DNA encryption
Health data

References

  1. Office for Civil Rights (OCR) (2012-09-07). "Methods for De-identification of PHI". HHS.gov. Retrieved 2020-11-08.
  2. Sweeney, L. (2000). "Simple Demographics Often Identify People Uniquely". Data Privacy Working Paper. 3.
  3. de Montjoye, Y.-A. (2013). "Unique in the crowd: The privacy bounds of human mobility". Scientific Reports. 3: 1376. Bibcode:2013NatSR...3E1376D. doi:10.1038/srep01376. PMC 3607247. PMID 23524645.
  4. de Montjoye, Y.-A.; Radaelli, L.; Singh, V. K.; Pentland, A. S. (29 January 2015). "Unique in the shopping mall: On the reidentifiability of credit card metadata". Science. 347 (6221): 536–539. Bibcode:2015Sci...347..536D. doi:10.1126/science.1256297. hdl:1721.1/96321. PMID 25635097.
  5. Narayanan, A.; Shmatikov, V. (2006). "How to Break Anonymity of the Netflix Prize Dataset". arXiv:cs/0610105.
  6. El Emam, Khaled (2011). "A Systematic Review of Re-Identification Attacks on Health Data". PLOS ONE. 6 (12): e28071. Bibcode:2011PLoSO...628071E. doi:10.1371/journal.pone.0028071. PMC 3229505. PMID 22164229.
  7. Garfinkel, Simson. De-Identification of Personal Information. OCLC 933741839.
  8. Ribaric, Slobodan; Ariyaeeinia, Aladdin; Pavesic, Nikola (September 2016). "De-identification for privacy protection in multimedia content: A survey". Signal Processing: Image Communication. 47: 131–151. doi:10.1016/j.image.2016.05.020. hdl:2299/19652.
  9. Bhaskaran, Vivek (2023-06-08). "Survey Research: Definition, Examples and Methods". QuestionPro. Retrieved 2023-12-17.
  10. Godard, B. A.; Schmidtke, J. R.; Cassiman, J. J.; Aymé, S. G. N. (2003). "Data storage and DNA banking for biomedical research: Informed consent, confidentiality, quality issues, ownership, return of benefits. A professional perspective". European Journal of Human Genetics. 11: S88–S122. doi:10.1038/sj.ejhg.5201114. PMID 14718939.
  11. Fullerton, S. M.; Anderson, N. R.; Guzauskas, G.; Freeman, D.; Fryer-Edwards, K. (2010). "Meeting the Governance Challenges of Next-Generation Biorepository Research". Science Translational Medicine. 2 (15): 15cm3. doi:10.1126/scitranslmed.3000361. PMC 3038212. PMID 20371468.
  12. McMurry, AJ; Gilbert, CA; Reis, BY; Chueh, HC; Kohane, IS; Mandl, KD (2007). "A self-scaling, distributed information architecture for public health, research, and clinical care". J Am Med Inform Assoc. 14 (4): 527–533. doi:10.1197/jamia.M2371. PMC 2244902. PMID 17460129.
  13. "Data de-identification". The Abdul Latif Jameel Poverty Action Lab (J-PAL). Retrieved 2023-12-17.
  14. Ito, Koichi; Kogure, Jun; Shimoyama, Takeshi; Tsuda, Hiroshi (2016). "De-identification and Encryption Technologies to Protect Personal Information" (PDF). Fujitsu Scientific and Technical Journal. 52 (3): 28–36.
  15. Nicholson, S.; Smith, C. A. (2005). "Using lessons from health care to protect the privacy of library users: Guidelines for the de-identification of library data based on HIPAA" (PDF). Proceedings of the American Society for Information Science and Technology. 42. doi:10.1002/meet.1450420106.
  16. Coop, Alex. "Sidewalk Labs decision to offload tough decisions on privacy to third party is wrong, says its former consultant". IT World Canada. Retrieved 27 June 2019.
  17. McGuire, A. L.; Gibbs, R. A. (2006). "Genetics: No Longer De-Identified". Science. 312 (5772): 370–371. doi:10.1126/science.1125339. PMID 16627725.
  18. Thorisson, G. A.; Muilu, J.; Brookes, A. J. (2009). "Genotype–phenotype databases: Challenges and solutions for the post-genomic era". Nature Reviews Genetics. 10 (1): 9–18. doi:10.1038/nrg2483. hdl:2381/4584. PMID 19065136.
  19. Homer, N.; Szelinger, S.; Redman, M.; Duggan, D.; Tembe, W.; Muehling, J.; Pearson, J. V.; Stephan, D. A.; Nelson, S. F.; Craig, D. W. (2008). "Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays". PLOS Genetics. 4 (8): e1000167. doi:10.1371/journal.pgen.1000167. PMC 2516199. PMID 18769715.
  20. PCAST (May 2014). "Report to the President – Big Data and Privacy: A Technological Perspective" (PDF). Office of Science and Technology Policy. Retrieved 28 March 2016 – via National Archives.
  21. "De-Identification 201". Privacy Analytics. 2015.
  22. 45 CFR 164.512.