K-anonymity

Last updated

k-anonymity is a property possessed by certain anonymized data. The term k-anonymity was first introduced by Pierangela Samarati and Latanya Sweeney in a paper published in 1998, [1] although the concept dates to a 1986 paper by Tore Dalenius. [2]

Contents

k-anonymity is an attempt to solve the problem "Given person-specific field-structured data, produce a release of the data with scientific guarantees that the individuals who are the subjects of the data cannot be re-identified while the data remain practically useful." [3] [4] [5] A release of data is said to have the k-anonymity property if the information for each person contained in the release cannot be distinguished from at least individuals whose information also appear in the release. The guarantees provided by k-anonymity are aspirational, not mathematical.

Methods for k-anonymization

To use k-anonymity to process a dataset so that it can be released with privacy protection, a data scientist must first examine the dataset and decide whether each attribute (column) is an identifier (identifying), a non-identifier (not-identifying), or a quasi-identifier (somewhat identifying). Identifiers such as names are suppressed, non-identifying values are allowed to remain, and the quasi-identifiers need to be processed so that every distinct combination of quasi-identifiers designates at least k records.

In the example table below presents a fictional nonanonymized database consisting of the patient records for a fictitious hospital. The Name column is an identifier, Age, Gender, State of domicile, and Religion are quasi-identifiers, and Disease is a non-identifying sensitive value. But what about Height and Weight? Are they also non-identifying sensitive values, or are they quasi-identifiers?

Patients treated in the study on April 30
NameAgeGenderHeightWeightState of domicileReligionDisease
Ramsha30Female165 cm72 kg Tamil Nadu Hindu Cancer
Yadu24Female162 cm70 kg Kerala Hindu Viral infection
Salima28Female170 cm68 kgTamil Nadu Muslim Tuberculosis
Sunny27Male170 cm75 kg Karnataka Parsi No illness
Joan24Female165 cm71 kgKerala Christian Heart-related
Bahuksana23Male160 cm69 kgKarnataka Buddhist Tuberculosis
Rambha19Male167 cm85 kgKeralaHindu Cancer
Kishor29Male180 cm81 kgKarnatakaHinduHeart-related
Johnson17Male175 cm79 kgKeralaChristianHeart-related
John19Male169 cm82 kgKeralaChristian Viral infection

There are 6 attributes and 10 records in this data. There are two common methods for achieving k-anonymity for some value of k:

  1. Suppression. In this method, certain values of the attributes are replaced by an asterisk "*". All or some values of a column may be replaced by "*". In the anonymized table below, we have replaced all the values in the Name attribute and all the values in the Religion attribute with a "*".
  2. Generalization. In this method, individual values of attributes are replaced with a broader category. For example, the value "19" of the attribute Age may be replaced by "≤ 20", the value "23" by "20 < Age ≤ 30", etc.

The next table shows the anonymized database.

Patients treated in the study on April 30
NameAgeGenderHeightWeightState of domicileReligionDisease
*20 < Age ≤ 30Female165 cm72 kgTamil Nadu*Cancer
*20 < Age ≤ 30Female162 cm70 kgKerala*Viral infection
*20 < Age ≤ 30Female170 cm68 kgTamil Nadu*Tuberculosis
*20 < Age ≤ 30Male170 cm75 kgKarnataka*No illness
*20 < Age ≤ 30Female165 cm71 kgKerala*Heart-related
*20 < Age ≤ 30Male160 cm69 kgKarnataka*Tuberculosis
*Age ≤ 20Male167 cm85 kgKerala*Cancer
*20 < Age ≤ 30Male180 cm81 kgKarnataka*Heart-related
*Age ≤ 20Male175 cm79 kgKerala*Heart-related
*Age ≤ 20Male169 cm82 kgKerala*Viral infection

This data has 2-anonymity with respect to the attributes Age, Gender and State of domicile, since for any combination of these attributes found in any row of the table there are always at least 2 rows with those exact attributes. The attributes available to an adversary are called quasi-identifiers. Each quasi-identifier tuple occurs in at least k records for a dataset with k-anonymity. [6]

Critiques of k-anonymity

The following example demonstrates a failing with k-anonymity: there may exist other data records that can be linked on the variables that are allegedly non-identifying. For instance, suppose an attacker is able to obtain the log from the person who was taking vital signs as part of the study and learns that Kishor was at the hospital on April 30 and is 180 cm tall. This information can be used to link with the "anonymized" database (which may have been published on the Internet) and learn that Kishor has a heart-related disease. An attacker who knows that Kishor visited the hospital on April 30 may be able to infer this simply knowing that Kishor is 180 cm height, roughly 80–82 kg, and comes from Karnataka.

The root of this problem is the core problem with k-anonymity: there is no way to mathematically, unambiguously determine whether an attribute is an identifier, a quasi-identifier, or a non-identifying sensitive value. In fact, all values are potentially identifying, depending on their prevalence in the population and on auxiliary data that the attacker may have. Other privacy mechanisms such as differential privacy do not share this problem.

Although k-anonymity safeguards against identity revelations, it does not shield against the disclosure of specific attributes. This becomes problematic when attackers possess background knowledge. Additionally, the absence of diversity in sensitive domains may result in the exposure of personal information. In such scenarios, opting for ℓ-Diversity might offer a more robust privacy safeguard.

Meyerson and Williams (2004) demonstrated that optimal k-anonymity is an NP-hard problem, however heuristic methods such as k-Optimize as given by Bayardo and Agrawal (2005) often yield effective results. [7] [8] A practical approximation algorithm that enables solving the k-anonymization problem with an approximation guarantee of was presented by Kenig and Tassa. [9]

Attacks

While k-anonymity is a relatively simple-to-implement approach for de-identifying a dataset prior to public release, it is susceptible to many attacks. When background knowledge is available to an attacker, such attacks become even more effective. Such attacks include:

Because k-anonymization does not include any randomization, attackers can make reliable, unambiguous inferences about data sets that may harm individuals. For example, if the 19-year-old John from Kerala is known to be in the database above, then it can be reliably said that he has either cancer, a heart-related disease, or a viral infection.

K-anonymization is not a good method to anonymize high-dimensional datasets. [11]

It has also been shown that k-anonymity can skew the results of a data set if it disproportionately suppresses and generalizes data points with unrepresentative characteristics. [12] The suppression and generalization algorithms used to k-anonymize datasets can be altered, however, so that they do not have such a skewing effect. [13]

See also

Related Research Articles

<span class="mw-page-title-main">Steiner tree problem</span> On short connecting networks with added vertices

In combinatorial mathematics, the Steiner tree problem, or minimum Steiner tree problem, named after Jakob Steiner, is an umbrella term for a class of problems in combinatorial optimization. While Steiner tree problems may be formulated in a number of settings, they all require an optimal interconnect for a given set of objects and a predefined objective function. One well-known variant, which is often used synonymously with the term Steiner tree problem, is the Steiner tree problem in graphs. Given an undirected graph with non-negative edge weights and a subset of vertices, usually referred to as terminals, the Steiner tree problem in graphs requires a tree of minimum weight that contains all terminals and minimizes the total weight of its edges. Further well-known variants are the Euclidean Steiner tree problem and the rectilinear minimum Steiner tree problem.

A recommender system, or a recommendation system, is a subclass of information filtering system that provides suggestions for items that are most pertinent to a particular user. Recommender systems are particularly useful when an individual needs to choose an item from a potentially overwhelming number of items that a service may offer.

Limited-memory BFGS is an optimization algorithm in the family of quasi-Newton methods that approximates the Broyden–Fletcher–Goldfarb–Shanno algorithm (BFGS) using a limited amount of computer memory. It is a popular algorithm for parameter estimation in machine learning. The algorithm's target problem is to minimize over unconstrained values of the real-vector where is a differentiable scalar function.

Nearest neighbor search (NNS), as a form of proximity search, is the optimization problem of finding the point in a given set that is closest to a given point. Closeness is typically expressed in terms of a dissimilarity function: the less similar the objects, the larger the function values.

Data masking or data obfuscation is the process of modifying sensitive data in such a way that it is of no or little value to unauthorized intruders while still being usable by software or authorized personnel. Data masking can also be referred as anonymization, or tokenization, depending on different context.

Clustering is the problem of partitioning data points into groups based on their similarity. Correlation clustering provides a method for clustering a set of objects into the optimum number of clusters without specifying that number in advance.

In data mining, k-means++ is an algorithm for choosing the initial values for the k-means clustering algorithm. It was proposed in 2007 by David Arthur and Sergei Vassilvitskii, as an approximation algorithm for the NP-hard k-means problem—a way of avoiding the sometimes poor clusterings found by the standard k-means algorithm. It is similar to the first of three seeding methods proposed, in independent work, in 2006 by Rafail Ostrovsky, Yuval Rabani, Leonard Schulman and Chaitanya Swamy.

Differential privacy (DP) is an approach for providing privacy while sharing information about a group of individuals, by describing the patterns within the group while withholding information about specific individuals. This is done by making arbitrary small changes to individual data that do not change the statistics of interest. Thus the data cannot be used to infer much about any individual.

Privacy for research participants is a concept in research ethics which states that a person in human subject research has a right to privacy when participating in research. Some typical scenarios this would apply to include, or example, a surveyor doing social research conducts an interview with a participant, or a medical researcher in a clinical trial asks for a blood sample from a participant to see if there is a relationship between something which can be measured in blood and a person's health. In both cases, the ideal outcome is that any participant can join the study and neither the researcher nor the study design nor the publication of the study results would ever identify any participant in the study. Thus, the privacy rights of these individuals can be preserved.

<span class="mw-page-title-main">De-identification</span> Preventing personal identity from being revealed

De-identification is the process used to prevent someone's personal identity from being revealed. For example, data produced during human subject research might be de-identified to preserve the privacy of research participants. Biological data may be de-identified in order to comply with HIPAA regulations that define and stipulate patient privacy laws.

Quasi-identifiers are pieces of information that are not of themselves unique identifiers, but are sufficiently well correlated with an entity that they can be combined with other quasi-identifiers to create a unique identifier.

Data anonymization is a type of information sanitization whose intent is privacy protection. It is the process of removing personally identifiable information from data sets, so that the people whom the data describe remain anonymous.

<span class="mw-page-title-main">Latanya Sweeney</span> Computer scientist

Latanya Arvette Sweeney is an American computer scientist. She is the Daniel Paul Professor of the Practice of Government and Technology at the Harvard Kennedy School and in the Harvard Faculty of Arts and Sciences at Harvard University. She is the founder and director of the Public Interest Tech Lab, founded in 2021 with a $3 million grant from the Ford Foundation as well as the Data Privacy Lab. She is the current Faculty Dean in Currier House at Harvard.

Datafly algorithm is an algorithm for providing anonymity in medical data. The algorithm was developed by Latanya Arvette Sweeney in 1997−98. Anonymization is achieved by automatically generalizing, substituting, inserting, and removing information as appropriate without losing many of the details found within the data. The method can be used on-the-fly in role-based security within an institution, and in batch mode for exporting data from an institution. Organizations release and receive medical data with all explicit identifiers—such as name—removed, in the erroneous belief that patient confidentiality is maintained because the resulting data look anonymous. However the remaining data can be used to re-identify individuals by linking or matching the data to other databases or by looking at unique characteristics found in the fields and records of the database itself.

l-diversity, also written as -diversity, is a form of group based anonymization that is used to preserve privacy in data sets by reducing the granularity of a data representation. This reduction is a trade off that results in some loss of effectiveness of data management or mining algorithms in order to gain some privacy. The l-diversity model is an extension of the k-anonymity model which reduces the granularity of data representation using techniques including generalization and suppression such that any given record maps onto at least k-1 other records in the data. The l-diversity model handles some of the weaknesses in the k-anonymity model where protected identities to the level of k-individuals is not equivalent to protecting the corresponding sensitive values that were generalized or suppressed, especially when the sensitive values within a group exhibit homogeneity. The l-diversity model adds the promotion of intra-group diversity for sensitive values in the anonymization mechanism.

t-closeness is a further refinement of l-diversity group based anonymization that is used to preserve privacy in data sets by reducing the granularity of a data representation. This reduction is a trade off that results in some loss of effectiveness of data management or data mining algorithms in order to gain some privacy. The t-closeness model extends the l-diversity model by treating the values of an attribute distinctly by taking into account the distribution of data values for that attribute.

MAC address anonymization performs a one-way function on a MAC address so that the result may be used in tracking systems for reporting and the general public, while making it nearly impossible to obtain the original MAC address from the result. The idea is that this process allows companies like Google, Apple and CrowdVision - which track users movements via computer hardware to simultaneously preserve the identities of the people they are tracking, as well as the hardware itself.

Data re-identification or de-anonymization is the practice of matching anonymous data with publicly available information, or auxiliary data, in order to discover the person the data belong to. This is a concern because companies with privacy policies, health care providers, and financial institutions may release the data they collect after the data has gone through the de-identification process.

Spatial cloaking is a privacy mechanism that is used to satisfy specific privacy requirements by blurring users’ exact locations into cloaked regions. This technique is usually integrated into applications in various environments to minimize the disclosure of private information when users request location-based service. Since the database server does not receive the accurate location information, a set including the satisfying solution would be sent back to the user. General privacy requirements include K-anonymity, maximum area, and minimum area.

Egalitarian item allocation, also called max-min item allocation is a fair item allocation problem, in which the fairness criterion follows the egalitarian rule. The goal is to maximize the minimum value of an agent. That is, among all possible allocations, the goal is to find an allocation in which the smallest value of an agent is as large as possible. In case there are two or more allocations with the same smallest value, then the goal is to select, from among these allocations, the one in which the second-smallest value is as large as possible, and so on. Therefore, an egalitarian item allocation is sometimes called a leximin item allocation.

References

  1. Samarati, Pierangela; Sweeney, Latanya (1998). "Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression" (PDF). Harvard Data Privacy Lab. Retrieved April 12, 2017.
  2. Tore Dalenius, "Finding a Needle In a Haystack", Journal of Official Statistics, Vol. 2, No. 3, 1986, pp. 326–336.
  3. Samarati, Pierangela (November 2001). "Protecting Respondents' Identities in Microdata Release" (PDF). IEEE Transactions on Knowledge and Data Engineering. 13 (6): 1010–1027. doi:10.1109/69.971193. S2CID   561716.
  4. Sweeney, Latanya. "Database Security: k-anonymity" . Retrieved 19 January 2014.
  5. Sweeney, Latanya (2002). "k-anonymity: a model for protecting privacy" (PDF). International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems. 10 (5): 557–570. doi:10.1142/S0218488502001648. S2CID   361794.
  6. Narayanan, Arvind; Shmatikov, Vitaly. "Robust De-anonymization of Large Sparse Datasets" (PDF).
  7. Roberto J. Bayardo; Rakesh Agrawal (2005). "Data Privacy through Optimal k-Anonymization". 21st International Conference on Data Engineering (ICDE'05) (PDF). pp. 217–228. doi:10.1109/ICDE.2005.42. ISBN   978-0-7695-2285-2. ISSN   1084-4627. S2CID   17044848. Data de-identification reconciles the demand for release of data for research purposes and the demand for privacy from individuals. This paper proposes and evaluates an optimization algorithm for the powerful de-identification procedure known as k-anonymization. A k-anonymized dataset has the property that each record is indistinguishable from at least k  1 others. Even simple restrictions of optimized k-anonymity are NP-hard, leading to significant computational challenges. We present a new approach to exploring the space of possible anonymizations that tames the combinatorics of the problem, and develop data-management strategies to reduce reliance on expensive operations such as sorting. Through experiments on real census data, we show the resulting algorithm can find optimal k-anonymizations under two representative cost measures and a wide range of k. We also show that the algorithm can produce good anonymizations in circumstances where the input data or input parameters preclude finding an optimal solution in reasonable time. Finally, we use the algorithm to explore the effects of different coding approaches and problem variations on anonymization quality and performance. To our knowledge, this is the first result demonstrating optimal k-anonymization of a nontrivial dataset under a general model of the problem.
  8. Adam Meyerson; Ryan Williams (2004). "On the complexity of optimal K-anonymity". Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems (PDF). New York, NY: ACM. pp. 223–228. doi:10.1145/1055558.1055591. ISBN   978-1581138580. S2CID   6798963. The technique of k-anonymization has been proposed in the literature as an alternative way to release public information, while ensuring both data privacy and data integrity. We prove that two general versions of optimal k-anonymization of relations are NP-hard, including the suppression version which amounts to choosing a minimum number of entries to delete from the relation. We also present a polynomial time algorithm for optimal k-anonymity that achieves an approximation ratio independent of the size of the database, when k is constant. In particular, it is a O(k log k)-approximation where the constant in the big-O is no more than 4. However, the runtime of the algorithm is exponential in k. A slightly more clever algorithm removes this condition, but is a O(k log m)-approximation, where m is the degree of the relation. We believe this algorithm could potentially be quite fast in practice.
  9. Kenig, Batya; Tassa, Tamir (2012). "A practical approximation algorithm for optimal k-anonymity". Data Mining and Knowledge Discovery. 25: 134–168. doi:10.1007/s10618-011-0235-9. S2CID   14158546.
  10. Attacks on Deidentification's Defenses, Aloni Cohen, USENIX Security 2022, Distinguished Paper Award Winner. https://www.usenix.org/conference/usenixsecurity22/presentation/cohen
  11. Aggarwal, Charu C. (2005). "On k-Anonymity and the Curse of Dimensionality". VLDB '05 Proceedings of the 31st International Conference on Very large Data Bases. Trondheim, Norway. CiteSeerX   10.1.1.60.3155 . ISBN   1-59593-154-6.
  12. Angiuli, Olivia; Joe Blitzstein; Jim Waldo. "How to De-Identify Your Data". ACM Queue. ACM.
  13. Angiuli, Olivia; Jim Waldo (June 2016). "Statistical Tradeoffs between Generalization and Suppression in the De-identification of Large-Scale Data Sets". 2016 IEEE 40th Annual Computer Software and Applications Conference (COMPSAC). pp. 589–593. doi:10.1109/COMPSAC.2016.198. ISBN   978-1-4673-8845-0. S2CID   17716908.