Quasi-identifier

Quasi-identifiers are pieces of information that are not in themselves unique identifiers, but are sufficiently well correlated with an entity that they can be combined with other quasi-identifiers to create a unique identifier. [1]

Quasi-identifiers can thus, when combined, become personally identifying information. This process is called re-identification. As an example, Latanya Sweeney has shown that even though neither gender, birth date, nor postal code uniquely identifies an individual, the combination of all three is sufficient to identify 87% of individuals in the United States. [2]
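To make the linkage concrete, here is a minimal sketch of such a re-identification join in Python. Both tables, every field name, and every record are invented for illustration; they are not drawn from Sweeney's study.

```python
# Toy linkage attack: join an "anonymized" table to a public one on
# shared quasi-identifiers. All names, values, and records are invented.

anonymized_medical = [
    {"zip": "02138", "birth_date": "1945-07-31", "sex": "F",
     "diagnosis": "hypertension"},
    {"zip": "02139", "birth_date": "1962-01-15", "sex": "M",
     "diagnosis": "asthma"},
]

public_voter_roll = [
    {"name": "Jane Doe", "zip": "02138", "birth_date": "1945-07-31",
     "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")

def link(released, public, keys=QUASI_IDENTIFIERS):
    """Yield (released_record, public_record) pairs that agree on every key."""
    index = {}
    for rec in public:
        index.setdefault(tuple(rec[k] for k in keys), []).append(rec)
    for rec in released:
        for match in index.get(tuple(rec[k] for k in keys), []):
            yield rec, match

for medical, voter in link(anonymized_medical, public_voter_roll):
    print(f"{voter['name']} -> {medical['diagnosis']}")
```

Any released record whose quasi-identifier combination also appears in the public table is matched back to a name, even though the released table contains no direct identifiers.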

The term was introduced by Tore Dalenius in 1986. [3] Since then, quasi-identifiers have been the basis of several attacks on released data. For instance, Sweeney linked health records to publicly available information to locate the hospital records of the then governor of Massachusetts using uniquely identifying quasi-identifiers, [4] [5] and Sweeney, Abu, and Winn used public voter records to re-identify participants in the Personal Genome Project. [6] Additionally, Arvind Narayanan and Vitaly Shmatikov drew on quasi-identifiers to formulate statistical conditions under which the data released by Netflix could be de-anonymized. [7]

Motwani and Xu warn about the potential privacy breaches enabled by the publication of large volumes of government and business data containing quasi-identifiers. [8]

Related Research Articles

Information privacy is the relationship between the collection and dissemination of data, technology, the public expectation of privacy, and the legal and political issues surrounding them. It is also known as data privacy or data protection.

Personal data, also known as personal information or personally identifiable information (PII), is any information related to an identifiable person.

The Computers, Freedom and Privacy Conference is an annual academic conference held in the United States or Canada about the intersection of computer technology, freedom, and privacy issues. The conference was founded in 1991, and since at least 1999, it has been organized under the aegis of the Association for Computing Machinery. It was originally sponsored by CPSR.

Anonymous web browsing refers to using the World Wide Web in a way that hides a user's personally identifiable information from the websites visited.

Pseudonymization is a data management and de-identification procedure by which personally identifiable information fields within a data record are replaced by one or more artificial identifiers, or pseudonyms. A single pseudonym for each replaced field or collection of replaced fields makes the data record less identifiable while remaining suitable for data analysis and data processing.
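A minimal sketch of the idea, assuming an invented record layout: a keyed hash (HMAC) maps each identifier to a stable pseudonym, so the same person receives the same pseudonym across records, keeping the data usable for analysis while removing the direct identifier.

```python
# Sketch of pseudonymization: replace a direct identifier with a stable,
# keyed pseudonym. Record layout and key are hypothetical.
import hmac
import hashlib

SECRET_KEY = b"held-by-the-data-custodian"  # never released with the data

def pseudonymize(value: str) -> str:
    """Deterministically map a value to a pseudonym: same input, same output."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:12]

record = {"name": "Jane Doe", "visit": "2013-02-01", "diagnosis": "hypertension"}
released = {**record, "name": pseudonymize(record["name"])}
print(released)  # 'name' is now a stable 12-hex-character pseudonym
```

Note that pseudonymized records can still contain quasi-identifiers, so they remain exposed to linkage attacks of the kind described above.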

The Netflix Prize was an open competition for the best collaborative filtering algorithm to predict user ratings for films, based on previous ratings without any other information about the users or films, i.e. without the users or the films being identified except by numbers assigned for the contest.

Protected health information (PHI) under the U.S. law is any information about health status, provision of health care, or payment for health care that is created or collected by a Covered Entity, and can be linked to a specific individual. This is interpreted rather broadly and includes any part of a patient's medical record or payment history.

Evercookie

Evercookie is a JavaScript application programming interface (API) that identifies and reproduces intentionally deleted cookies in a client's browser storage. It was created by Samy Kamkar in 2010 to demonstrate the tracking made possible by websites that respawn deleted cookies. Websites that have adopted this mechanism can identify users even if they attempt to delete previously stored cookies.

Privacy for research participants is a concept in research ethics which states that a person in human subject research has a right to privacy when participating in research. Typical scenarios include, for example, a surveyor doing social research who conducts an interview with a participant, or a medical researcher in a clinical trial who asks for a blood sample from a participant to see if there is a relationship between something which can be measured in blood and a person's health. In both cases, the ideal outcome is that any participant can join the study and neither the researcher, nor the study design, nor the publication of the study results will ever identify any participant. Thus, the privacy rights of these individuals can be preserved.

De-identification

De-identification is the process used to prevent someone's personal identity from being revealed. For example, data produced during human subject research might be de-identified to preserve the privacy of research participants. Biological data may be de-identified in order to comply with HIPAA regulations that define and stipulate patient privacy laws.

Data anonymization is a type of information sanitization whose intent is privacy protection. It is the process of removing personally identifiable information from data sets, so that the people whom the data describe remain anonymous.

Latanya Sweeney

Latanya Arvette Sweeney is the Daniel Paul Professor of the Practice of Government and Technology at the Harvard Kennedy School and in the Harvard Faculty of Arts and Sciences at Harvard University. She is the director of the Public Interest Tech Lab, founded in 2021 with a $3 million grant from the Ford Foundation. She founded the Data Privacy Lab, part of the Public Interest Tech Lab, and she is the Faculty Dean in Currier House at Harvard.

The Datafly algorithm is an algorithm for providing anonymity in medical data. The algorithm was developed by Latanya Arvette Sweeney in 1997–98. Anonymization is achieved by automatically generalizing, substituting, inserting, and removing information as appropriate without losing many of the details found within the data. The method can be used on the fly in role-based security within an institution, and in batch mode for exporting data from an institution. Organizations release and receive medical data with all explicit identifiers, such as name, removed, in the erroneous belief that patient confidentiality is maintained because the resulting data look anonymous. However, the remaining data can be used to re-identify individuals by linking or matching the data to other databases or by looking at unique characteristics found in the fields and records of the database itself.
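The following is a simplified sketch of the greedy generalization loop at the heart of this approach, not Sweeney's published algorithm: it repeatedly generalizes the quasi-identifier with the most distinct values until every combination occurs at least k times. The truncation-based generalization and the sample rows are illustrative assumptions; the real Datafly also substitutes values and suppresses outliers.

```python
# Simplified Datafly-style generalization loop (illustrative, not the
# published algorithm). Values are generalized by masking trailing
# characters, e.g. ZIP 02138 -> 0213* -> 021**.
from collections import Counter

def generalize(value: str) -> str:
    """Replace one more trailing character with '*'."""
    kept = value.rstrip("*")
    if not kept:
        return value  # already fully generalized
    return kept[:-1] + "*" * (len(value) - len(kept) + 1)

def datafly_sketch(rows, quasi_ids, k):
    """Generalize until every quasi-identifier combination occurs >= k times."""
    assert len(rows) >= k, "fewer than k rows can never be k-anonymous"
    rows = [dict(r) for r in rows]
    while True:
        freq = Counter(tuple(r[q] for q in quasi_ids) for r in rows)
        if all(count >= k for count in freq.values()):
            return rows
        # Greedy step: generalize the attribute with the most distinct values.
        target = max(quasi_ids, key=lambda q: len({r[q] for r in rows}))
        for r in rows:
            r[target] = generalize(r[target])

rows = [
    {"zip": "02138", "sex": "F"}, {"zip": "02139", "sex": "F"},
    {"zip": "02141", "sex": "M"}, {"zip": "02142", "sex": "M"},
]
print(datafly_sketch(rows, ["zip", "sex"], k=2))  # ZIPs become 0213*/0214*
```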

k-anonymity is a property possessed by certain anonymized data. The concept of k-anonymity was first introduced by Latanya Sweeney and Pierangela Samarati in a paper published in 1998 as an attempt to solve the problem: "Given person-specific field-structured data, produce a release of the data with scientific guarantees that the individuals who are the subjects of the data cannot be re-identified while the data remain practically useful." A release of data is said to have the k-anonymity property if the information for each person contained in the release cannot be distinguished from at least k − 1 other individuals whose information also appears in the release.
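Expressed as code, the property is a simple frequency condition over the quasi-identifier columns. The field names and records below are hypothetical.

```python
# Check the k-anonymity property: every combination of quasi-identifier
# values must appear in at least k records of the release.
from collections import Counter

def is_k_anonymous(rows, quasi_ids, k):
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in rows)
    return all(count >= k for count in counts.values())

release = [
    {"zip": "0213*", "age": "60-69", "diagnosis": "hypertension"},
    {"zip": "0213*", "age": "60-69", "diagnosis": "diabetes"},
    {"zip": "0214*", "age": "40-49", "diagnosis": "asthma"},
    {"zip": "0214*", "age": "40-49", "diagnosis": "influenza"},
]
print(is_k_anonymous(release, ["zip", "age"], k=2))  # True: each group has 2
```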

Arvind Narayanan

Arvind Narayanan is a computer scientist and an associate professor at Princeton University. Narayanan is recognized for his research in the de-anonymization of data.

Data re-identification or de-anonymization is the practice of matching anonymous data with publicly available information, or auxiliary data, in order to discover the individual to whom the data belong. This is a concern because companies with privacy policies, health care providers, and financial institutions may release the data they collect after the data has gone through the de-identification process.

Genetic privacy involves the concept of personal privacy concerning the storing, repurposing, provision to third parties, and displaying of information pertaining to one's genetic information. This concept also encompasses privacy regarding the ability to identify specific individuals by their genetic sequence, and the potential to gain information on specific characteristics about that person via portions of their genetic information, such as their propensity for specific diseases or their immediate or distant ancestry.

DNA encryption is the process of hiding or obfuscating genetic information by a computational method in order to improve genetic privacy in DNA sequencing processes. The human genome is complex and long, but it is very possible to interpret important, and identifying, information from smaller variabilities rather than reading the entire genome. A whole human genome is a string of 3.2 billion paired nucleotide bases, the building blocks of life, but individuals' genomes differ by only about 0.5%; that 0.5% accounts for all of human diversity, the pathology of different diseases, and ancestral history. Emerging strategies incorporate different methods, such as randomization algorithms and cryptographic approaches, to de-identify the genetic sequence from the individual and, fundamentally, to isolate only the necessary information while protecting the rest of the genome from unnecessary inquiry. The priority now is to ascertain which methods are robust, and how policy should ensure the ongoing protection of genetic privacy.

Yaniv Erlich is an Israeli-American scientist. He is an Associate Professor of Computer Science at Columbia University and the Chief Science Officer of MyHeritage. Erlich's work combines computer science and genomics.

Spatial cloaking is a privacy mechanism that is used to satisfy specific privacy requirements by blurring users' exact locations into cloaked regions. This technique is usually integrated into applications in various environments to minimize the disclosure of private information when users request a location-based service. Since the database server does not receive accurate location information, it sends back a set of candidate results that contains the satisfying answer. General privacy requirements include K-anonymity, maximum area, and minimum area.
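A toy sketch of the blurring step, assuming a fixed grid whose cell size is chosen by the client: the exact coordinates never leave the device, and the server sees only the enclosing cell. Real systems typically grow the region adaptively until requirements such as K-anonymity or a minimum area are met; the coordinates and cell size here are illustrative.

```python
# Toy spatial cloaking: snap an exact latitude/longitude to the grid
# cell that contains it, and report only the cell to the server.
import math

def cloak(lat: float, lon: float, cell_deg: float = 0.05):
    """Return the south-west corner and size of the point's grid cell."""
    return (math.floor(lat / cell_deg) * cell_deg,
            math.floor(lon / cell_deg) * cell_deg,
            cell_deg)

# The exact point stays on the client; the server sees only the cell.
print(cloak(42.3736, -71.1097))  # approximately (42.35, -71.15, 0.05)
```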

References

  1. "Glossary of Statistical Terms: Quasi-identifier". OECD. November 10, 2005. Retrieved 29 September 2013.
  2. Sweeney, Latanya. Simple demographics often identify people uniquely. Carnegie Mellon University, 2000. http://dataprivacylab.org/projects/identifiability/paper1.pdf
  3. Dalenius, Tore. Finding a Needle In a Haystack or Identifying Anonymous Census Records. Journal of Official Statistics, Vol.2, No.3, 1986. pp. 329–336. http://www.jos.nu/Articles/abstract.asp?article=23329 Archived 2017-08-08 at the Wayback Machine
  4. Anderson, Nate. Anonymized data really isn’t—and here’s why not. Ars Technica, 2009. https://arstechnica.com/tech-policy/2009/09/your-secrets-live-online-in-databases-of-ruin/
  5. Barth-Jones, Daniel C. The 're-identification' of Governor William Weld's medical information: a critical re-examination of health data identification risks and privacy protections, then and now (June 4, 2012).
  6. Sweeney, Latanya, Akua Abu, and Julia Winn. "Identifying participants in the personal genome project by name." Available at SSRN 2257732 (2013).
  7. Narayanan, Arvind and Shmatikov, Vitaly. Robust De-anonymization of Large Sparse Datasets. The University of Texas at Austin, 2008. https://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf
  8. Rajeev Motwani and Ying Xu (2008). Efficient Algorithms for Masking and Finding Quasi-Identifiers (PDF). Proceedings of SDM’08 International Workshop on Practical Privacy-Preserving Data Mining.