De-identification is the process used to prevent someone's personal identity from being revealed. For example, data produced during human subject research might be de-identified to preserve the privacy of research participants. Biological data may be de-identified in order to comply with HIPAA regulations that define and stipulate patient privacy laws. [1]
When applied to metadata or general data about identification, the process is also known as data anonymization. Common strategies include deleting or masking personal identifiers, such as personal name, and suppressing or generalizing quasi-identifiers, such as date of birth. The reverse process of using de-identified data to identify individuals is known as data re-identification. Successful re-identifications [2] [3] [4] [5] cast doubt on de-identification's effectiveness. A systematic review of fourteen distinct re-identification attacks found "a high re-identification rate […] dominated by small-scale studies on data that was not de-identified according to existing standards". [6]
De-identification is adopted as one of the main approaches toward data privacy protection. [7] It is commonly used in fields of communications, multimedia, biometrics, big data, cloud computing, data mining, internet, social networks, and audio–video surveillance. [8]
When surveys such as a census are conducted, they collect information about a specific group of people. To encourage participation and to protect the privacy of respondents, researchers attempt to design the survey so that no participant's individual responses can be matched with any published data. [9]
When an online shopping website wants to understand its users' preferences and shopping habits, it may retrieve customers' data from its database and analyze them. These data include personal identifiers that were collected directly when customers created their accounts. The website needs to pre-process the data with de-identification techniques before analyzing the records to avoid violating its customers' privacy.
Anonymization refers to irreversibly severing a data set from the identity of the data contributor in a study to prevent any future re-identification, even by the study organizers under any condition. [10] [11] De-identification may also include preserving identifying information which can only be re-linked by a trusted party in certain situations. [10] [11] [12] There is a debate in the technology community on whether data that can be re-linked, even by a trusted party, should ever be considered de-identified. [13]
Common strategies of de-identification are masking personal identifiers and generalizing quasi-identifiers. Pseudonymization is the main technique used to mask personal identifiers from data records, and k-anonymization is usually adopted for generalizing quasi-identifiers.
Pseudonymization is performed by replacing real names with a temporary ID. It deletes or masks personal identifiers so that individuals are not directly identified. This method makes it possible to track an individual's record over time, even as the record is updated. However, it cannot prevent the individual from being identified if specific combinations of attributes in the data record indirectly identify the individual. [14]
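The replacement of names by temporary IDs can be illustrated with a minimal Python sketch. The function name `pseudonymize`, the field names, the `P0001`-style ID format and the sample records are all hypothetical choices made for illustration; they are not taken from any standard or library.

```python
import itertools

def pseudonymize(records, id_field="name"):
    """Replace the identifier field with a stable artificial ID (illustrative only)."""
    mapping = {}                  # real identifier -> pseudonym; must be kept secret
    counter = itertools.count(1)
    output = []
    for record in records:
        real = record[id_field]
        if real not in mapping:
            # Hypothetical ID format chosen for this example
            mapping[real] = f"P{next(counter):04d}"
        # Copy every field except the direct identifier
        masked = {k: v for k, v in record.items() if k != id_field}
        masked["pseudonym"] = mapping[real]
        output.append(masked)
    return output, mapping

# Made-up sample records
records = [
    {"name": "Alice Example", "birth_year": 1980, "visit": "2020-01-05"},
    {"name": "Bob Example", "birth_year": 1975, "visit": "2020-02-11"},
    {"name": "Alice Example", "birth_year": 1980, "visit": "2021-03-09"},
]
masked, mapping = pseudonymize(records)
```

Because the same person always receives the same pseudonym, the first and third records remain linkable over time. The quasi-identifier `birth_year` is left untouched, which is why pseudonymization alone does not rule out indirect identification.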
k-anonymization defines attributes that indirectly point to an individual's identity as quasi-identifiers (QIs) and processes the data so that at least k individuals share the same combination of QI values. [14] QI values are handled according to specific standards. For example, k-anonymization replaces some original values in the records with ranges and keeps other values unchanged. The new combinations of QI values prevent individuals from being identified while avoiding the destruction of the data records.
Research into de-identification is driven mostly by the need to protect health information. [15] Some libraries have adopted methods used in the healthcare industry to preserve their readers' privacy. [15]
In big data, de-identification is widely adopted by individuals and organizations. [8] With the development of social media, e-commerce, and big data, de-identification is sometimes required and often used for data privacy when users' personal data are collected by companies or third-party organizations that analyze them for their own purposes.
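A rough sketch of this idea in Python follows. It assumes, purely for illustration, that birth year and ZIP code are the quasi-identifiers, that birth years are generalized to decades, and that ZIP codes are truncated to three digits; the helper names `generalize` and `smallest_group` and the sample data are hypothetical. Real k-anonymization tools search for a suitable generalization rather than applying a fixed one.

```python
from collections import Counter

def generalize(record):
    """Replace exact QI values with coarser ranges; keep other values unchanged."""
    decade = record["birth_year"] // 10 * 10
    return {
        "birth_year": f"{decade}-{decade + 9}",   # e.g. 1981 -> "1980-1989"
        "zip": record["zip"][:3] + "**",          # e.g. "12345" -> "123**"
        "diagnosis": record["diagnosis"],         # non-QI value left as-is
    }

def smallest_group(records, quasi_identifiers):
    """Return the size of the smallest group sharing one QI combination (the k)."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

QI = ("birth_year", "zip")   # assumed quasi-identifiers for this example
raw = [
    {"birth_year": 1981, "zip": "12345", "diagnosis": "flu"},
    {"birth_year": 1984, "zip": "12377", "diagnosis": "asthma"},
    {"birth_year": 1989, "zip": "12301", "diagnosis": "flu"},
    {"birth_year": 1972, "zip": "54321", "diagnosis": "diabetes"},
    {"birth_year": 1975, "zip": "54399", "diagnosis": "flu"},
]
generalized = [generalize(r) for r in raw]
k_before = smallest_group(raw, QI)          # every record is unique
k_after = smallest_group(generalized, QI)   # records now fall into shared groups
```

In the raw data every QI combination is unique, so k is 1. After generalization the five records fall into two groups of three and two, so the data set is 2-anonymous with respect to the chosen quasi-identifiers.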
In smart cities, de-identification may be required to protect the privacy of residents, workers and visitors. Without strict regulation, de-identification may be difficult because sensors can still collect information without consent. [16]
PHI can be present in various data formats, and each format needs specific techniques and tools to de-identify it:
Whenever a person participates in genetics research, the donation of a biological specimen often results in the creation of a large amount of personalized data. Such data is uniquely difficult to de-identify. [18]
Anonymization of genetic data is particularly difficult because of the huge amount of genotypic information in biospecimens, [18] the ties that specimens often have to medical history, [19] and the advent of modern bioinformatics tools for data mining. [19] There have been demonstrations that data for individuals in aggregate collections of genotypic data sets can be tied to the identities of the specimen donors. [20]
Some researchers have suggested that it is not reasonable to ever promise participants in genetics research that they can retain their anonymity, but instead, such participants should be taught the limits of using coded identifiers in a de-identification process. [11]
In May 2014, the United States President's Council of Advisors on Science and Technology found de-identification "somewhat useful as an added safeguard" but not "a useful basis for policy" as "it is not robust against near‐term future re‐identification methods". [21]
The HIPAA Privacy Rule provides mechanisms for using and disclosing health data responsibly without the need for patient consent. These mechanisms center on two HIPAA de-identification standards – Safe Harbor and the Expert Determination Method. Safe harbor relies on the removal of specific patient identifiers (e.g. name, phone number, email address, etc.), while the Expert Determination Method requires knowledge and experience with generally accepted statistical and scientific principles and methods to render information not individually identifiable. [22]
The safe harbor method uses a list approach to de-identification and has two requirements:
Expert Determination takes a risk-based approach to de-identification that applies current standards and best practices from the research to determine the likelihood that a person could be identified from their protected health information. This method requires that a person with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods render the information not individually identifiable. It requires:
The key law governing research on electronic health record data is the HIPAA Privacy Rule. This rule allows the use of electronic health records of deceased subjects for research (HIPAA Privacy Rule, section 164.512(i)(1)(iii)). [23]
Medical privacy, or health privacy, is the practice of maintaining the security and confidentiality of patient records. It involves both the conversational discretion of health care providers and the security of medical records. The terms can also refer to the physical privacy of patients from other patients and providers while in a medical facility, and to modesty in medical settings. Modern concerns include the degree of disclosure to insurance companies, employers, and other third parties. The advent of electronic medical records (EMR) and patient care management systems (PCMS) has raised new concerns about privacy, balanced with efforts to reduce duplication of services and medical errors.
The Health Insurance Portability and Accountability Act of 1996 is a United States Act of Congress enacted by the 104th United States Congress and signed into law by President Bill Clinton on August 21, 1996. It aimed to alter the transfer of healthcare information, stipulated the guidelines by which personally identifiable information maintained by the healthcare and healthcare insurance industries should be protected from fraud and theft, and addressed some limitations on healthcare insurance coverage. It generally prohibits healthcare providers and businesses called covered entities from disclosing protected information to anyone other than a patient and the patient's authorized representatives without their consent. The bill does not restrict patients from receiving information about themselves. Furthermore, it does not prohibit patients from voluntarily sharing their health information however they choose, nor does it require confidentiality where a patient discloses medical information to family members, friends or other individuals not employees of a covered entity.
The terms medical record, health record and medical chart are used somewhat interchangeably to describe the systematic documentation of a single patient's medical history and care across time within one particular health care provider's jurisdiction. A medical record includes a variety of types of "notes" entered over time by healthcare professionals, recording observations and administration of drugs and therapies, orders for the administration of drugs and therapies, test results, X-rays, reports, etc. The maintenance of complete and accurate medical records is a requirement of health care providers and is generally enforced as a licensing or certification prerequisite.
A personal health record (PHR) is a health record where health data and other information related to the care of a patient is maintained by the patient. This stands in contrast to the more widely used electronic medical record, which is operated by institutions and contains data entered by clinicians to support insurance claims. The intention of a PHR is to provide a complete and accurate summary of an individual's medical history which is accessible online. The health data on a PHR might include patient-reported outcome data, lab results, and data from devices such as wireless electronic weighing scales or from a smartphone.
Genetic discrimination occurs when people treat others differently because they have or are perceived to have a gene mutation(s) that causes or increases the risk of an inherited disorder. It may also refer to any and all discrimination based on the genotype of a person rather than their individual merits, including that related to race, although the latter would be more appropriately included under racial discrimination. Some legal scholars have argued for a more precise and broader definition of genetic discrimination: "Genetic discrimination should be defined as when an individual is subjected to negative treatment, not as a result of the individual's physical manifestation of disease or disability, but solely because of the individual's genetic composition." Genetic Discrimination is considered to have its foundations in genetic determinism and genetic essentialism, and is based on the concept of genism, i.e. distinctive human characteristics and capacities are determined by genes.
Pseudonymization is a data management and de-identification procedure by which personally identifiable information fields within a data record are replaced by one or more artificial identifiers, or pseudonyms. A single pseudonym for each replaced field or collection of replaced fields makes the data record less identifiable while remaining suitable for data analysis and data processing.
Protected health information (PHI) under U.S. law is any information about health status, provision of health care, or payment for health care that is created or collected by a Covered Entity, and can be linked to a specific individual. This is interpreted rather broadly and includes any part of a patient's medical record or payment history.
Personal genomics or consumer genetics is the branch of genomics concerned with the sequencing, analysis and interpretation of the genome of an individual. The genotyping stage employs different techniques, including single-nucleotide polymorphism (SNP) analysis chips, or partial or full genome sequencing. Once the genotypes are known, the individual's variations can be compared with the published literature to determine likelihood of trait expression, ancestry inference and disease risk.
Biobank ethics refers to the ethics pertaining to all aspects of biobanks. The issues examined in the field of biobank ethics are special cases of clinical research ethics.
Privacy for research participants is a concept in research ethics which states that a person in human subject research has a right to privacy when participating in research. Typical scenarios include, for example, a surveyor doing social research who interviews a participant, or a medical researcher in a clinical trial who asks a participant for a blood sample to see if there is a relationship between something which can be measured in blood and a person's health. In both cases, the ideal outcome is that any participant can join the study and neither the researcher nor the study design nor the publication of the study results would ever identify any participant in the study. Thus, the privacy rights of these individuals can be preserved.
In electronic health records (EHRs), data masking, or controlled access, is the process of concealing patient health data from certain healthcare providers. Patients have the right to request the masking of their personal information, making it inaccessible to any physician, or a particular physician, unless a specific reason is provided. Data masking is also performed by healthcare agencies to restrict the amount of information that can be accessed by external bodies such as researchers, health insurance agencies and unauthorised individuals. It is a method used to protect patients’ sensitive information so that privacy and confidentiality are less of a concern. Techniques used to alter information within a patient's EHR include data encryption, obfuscation, hashing, exclusion and perturbation.
Quasi-identifiers are pieces of information that are not of themselves unique identifiers, but are sufficiently well correlated with an entity that they can be combined with other quasi-identifiers to create a unique identifier.
Health information on the Internet refers to all health-related information communicated through or available on the Internet.
Data anonymization is a type of information sanitization whose intent is privacy protection. It is the process of removing personally identifiable information from data sets, so that the people whom the data describe remain anonymous.
k-anonymity is a property possessed by certain anonymized data. The term k-anonymity was first introduced by Pierangela Samarati and Latanya Sweeney in a paper published in 1998, although the concept dates to a 1986 paper by Tore Dalenius.
Data re-identification or de-anonymization is the practice of matching anonymous data with publicly available information, or auxiliary data, in order to discover the person the data belong to. This is a concern because companies with privacy policies, health care providers, and financial institutions may release the data they collect after the data has gone through the de-identification process.
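The matching step can be sketched as a simple join on shared attributes. The following Python example is illustrative only: the function name `link`, the choice of birth date, ZIP code and sex as matching keys, and all records are made up for the example and do not describe any particular published attack.

```python
def link(deidentified, auxiliary, keys):
    """Match de-identified rows to named auxiliary records on shared attributes."""
    # Index the auxiliary (publicly available) data by the shared attributes
    index = {}
    for person in auxiliary:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])
    matches = {}
    for i, row in enumerate(deidentified):
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:      # only an unambiguous match re-identifies
            matches[i] = candidates[0]
    return matches

KEYS = ("birth_date", "zip", "sex")   # assumed shared attributes
deidentified = [
    {"birth_date": "1980-04-02", "zip": "12345", "sex": "F", "diagnosis": "asthma"},
    {"birth_date": "1975-09-30", "zip": "54321", "sex": "M", "diagnosis": "flu"},
]
auxiliary = [
    {"name": "Alice Example", "birth_date": "1980-04-02", "zip": "12345", "sex": "F"},
    {"name": "Bob Example", "birth_date": "1975-09-30", "zip": "54321", "sex": "M"},
    {"name": "Carl Example", "birth_date": "1975-09-30", "zip": "54321", "sex": "M"},
]
matches = link(deidentified, auxiliary, KEYS)
```

Here the first record is re-identified because exactly one named person shares its attribute combination, while the second record stays ambiguous because two people share it.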
Genetic privacy involves the concept of personal privacy concerning the storing, repurposing, provision to third parties, and displaying of information pertaining to one's genetic information. This concept also encompasses privacy regarding the ability to identify specific individuals by their genetic sequence, and the potential to gain information on specific characteristics about that person via portions of their genetic information, such as their propensity for specific diseases or their immediate or distant ancestry.
DNA encryption is the process of hiding or obscuring genetic information by a computational method in order to improve genetic privacy in DNA sequencing processes. The human genome is complex and long, but it is possible to interpret important, and identifying, information from smaller variabilities, rather than reading the entire genome. A whole human genome is a string of 3.2 billion base-paired nucleotides, the building blocks of life, but between individuals the genetic variation differs only by 0.5%, an important 0.5% that accounts for all of human diversity, the pathology of different diseases, and ancestral story. Emerging strategies incorporate different methods, such as randomization algorithms and cryptographic approaches, to de-identify the genetic sequence from the individual, and fundamentally, isolate only the necessary information while protecting the rest of the genome from unnecessary inquiry. The priority now is to ascertain which methods are robust, and how policy should ensure the ongoing protection of genetic privacy.
Health data is any data "related to health conditions, reproductive outcomes, causes of death, and quality of life" for an individual or population. Health data includes clinical metrics along with environmental, socioeconomic, and behavioral information pertinent to health and wellness. A plurality of health data are collected and used when individuals interact with health care systems. This data, collected by health care providers, typically includes a record of services received, conditions of those services, and clinical outcomes or information concerning those services. Historically, most health data has been sourced from this framework. The advent of eHealth and advances in health information technology, however, have expanded the collection and use of health data—but have also engendered new security, privacy, and ethical concerns. The increasing collection and use of health data by patients is a major component of digital health.
Biological data refers to a compound or information derived from living organisms and their products. A medicinal compound made from living organisms, such as a serum or a vaccine, could be characterized as biological data. Biological data is highly complex when compared with other forms of data. There are many forms of biological data, including text, sequence data, protein structure, genomic data and amino acids, and links among others.