Data anonymization

Data anonymization is a type of information sanitization whose intent is privacy protection. It is the process of removing personally identifiable information from data sets, so that the people whom the data describe remain anonymous.

Overview

Data anonymization has been defined as a "process by which personal data is altered in such a way that a data subject can no longer be identified directly or indirectly, either by the data controller alone or in collaboration with any other party." [1] Data anonymization may enable the transfer of information across a boundary, such as between two departments within an agency or between two agencies, while reducing the risk of unintended disclosure, and in certain environments in a manner that enables evaluation and analytics post-anonymization.

In the context of medical data, anonymized data refers to data from which the patient cannot be identified by the recipient of the information. The name, address, and full postcode must be removed, together with any other information which, in conjunction with other data held by or disclosed to the recipient, could identify the patient. [2]

There is always a risk that anonymized data will not stay anonymous over time. Previously anonymous data sets have been de-anonymized by pairing them with other data, by clever inference techniques, or by sheer computing power, leaving the data subjects no longer anonymous.

De-anonymization is the reverse process, in which anonymous data is cross-referenced with other data sources to re-identify the anonymous data source. [3] Generalization and perturbation are the two popular anonymization approaches for relational data. [4] The process of obscuring data while retaining the ability to re-identify it later is also called pseudonymization, and it is one way companies can store data in a HIPAA-compliant manner.
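
For illustration, here is a minimal Python sketch of reversible pseudonymization, assuming a toy record with hypothetical "name" and "ssn" fields: direct identifiers are swapped for random tokens, and the token-to-identity map is held separately so that authorized re-identification remains possible. This is a sketch of the general idea, not any particular compliance scheme.

```python
import secrets

# Token -> original value; in practice this map would be stored separately
# from the data, under access control.
pseudonym_map = {}

def pseudonymize(record, fields=("name", "ssn")):
    """Replace each listed identifier with a random token."""
    out = dict(record)
    for field in fields:
        if field in out:
            token = secrets.token_hex(8)
            pseudonym_map[token] = out[field]
            out[field] = token
    return out

def re_identify(token):
    """Recover the original value; only possible for holders of the map."""
    return pseudonym_map.get(token)

record = {"name": "Alice Example", "ssn": "123-45-6789", "diagnosis": "J45"}
safe = pseudonymize(record)
print(safe)                       # identifiers replaced by random tokens
print(re_identify(safe["name"]))  # -> "Alice Example"
```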

However, according to the Article 29 Data Protection Working Party, the reference to anonymisation in Recital 26 of Directive 95/46/EC "signifies that to anonymise any data, the data must be stripped of sufficient elements such that the data subject can no longer be identified. More precisely, that data must be processed in such a way that it can no longer be used to identify a natural person by using 'all the means likely reasonably to be used' by either the controller or a third party. An important factor is that the processing must be irreversible. The Directive does not clarify how such a de-identification process should or could be performed. The focus is on the outcome: that data should be such as not to allow the data subject to be identified via 'all' 'likely' and 'reasonable' means. Reference is made to codes of conduct as a tool to set out possible anonymisation mechanisms as well as retention in a form in which identification of the data subject is 'no longer possible'." [5]

There are five types of data anonymization operations: generalization, suppression, anatomization, permutation, and perturbation. [6]
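As a rough illustration, the following Python sketch applies three of these operations (generalization, suppression, and perturbation) to a toy record. The field names and generalization rules are invented for the example; anatomization and permutation, which split or shuffle whole tables rather than individual values, are omitted for brevity.

```python
import random

record = {"age": 34, "zip": "02139", "disease": "flu"}

def generalize(rec):
    """Generalization: replace precise values with coarser categories."""
    out = dict(rec)
    decade = (rec["age"] // 10) * 10
    out["age"] = f"{decade}-{decade + 9}"   # 34 -> "30-39"
    out["zip"] = rec["zip"][:3] + "**"      # "02139" -> "021**"
    return out

def suppress(rec, fields=("zip",)):
    """Suppression: drop the listed fields entirely."""
    return {k: v for k, v in rec.items() if k not in fields}

def perturb(rec, noise=3):
    """Perturbation: add random noise to numeric values."""
    out = dict(rec)
    out["age"] = rec["age"] + random.randint(-noise, noise)
    return out

print(generalize(record))
print(suppress(record))
print(perturb(record))
```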

GDPR requirements

The European Union's General Data Protection Regulation (GDPR) requires that stored data on people in the EU undergo either anonymization or a pseudonymization process. [7] GDPR Recital (26) establishes a very high bar for what constitutes anonymous data, thereby exempting the data from the requirements of the GDPR, namely "…information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable." The European Data Protection Supervisor (EDPS) and the Spanish Agencia Española de Protección de Datos (AEPD) have issued joint guidance on the requirements for anonymity and exemption from GDPR requirements. According to the EDPS and AEPD, no one, including the data controller, should be able to re-identify data subjects in a properly anonymized dataset. [8] Research by data scientists at Imperial College London and UCLouvain in Belgium, [9] as well as a ruling by Judge Michal Agmon-Gonen of the Tel Aviv District Court, [10] highlight the shortcomings of "anonymisation" in today's big data world. On this view, anonymisation reflects an outdated approach to data protection, developed when data processing was limited to isolated (siloed) applications, before the popularity of big data processing involving the widespread sharing and combining of data. [11]

Related Research Articles

Anonymity describes situations where the acting person's identity is unknown. Some writers have argued that namelessness, though technically correct, does not capture what is more centrally at stake in contexts of anonymity. The important idea here is that a person be non-identifiable, unreachable, or untrackable. Anonymity is seen as a technique, or a way of realizing, certain other values, such as privacy or liberty. Over the past few years, anonymity tools used on the dark web by criminals and malicious users have drastically altered the ability of law enforcement to use conventional surveillance techniques.

Information privacy is the relationship between the collection and dissemination of data, technology, the public expectation of privacy, contextual information norms, and the legal and political issues surrounding them. It is also known as data privacy or data protection.

An anonymous P2P communication system is a peer-to-peer distributed application in which the nodes or participants, which are used to share resources, are anonymous or pseudonymous. Anonymity of participants is usually achieved by special routing overlay networks that hide the physical location of each node from other participants.

The Data Protection Act 1998 (DPA) was an Act of Parliament of the United Kingdom designed to protect personal data stored on computers or in an organised paper filing system. It enacted provisions from the European Union (EU) Data Protection Directive 1995 on the protection, processing, and movement of data.

Personal data, also known as personal information or personally identifiable information (PII), is any information related to an identifiable person.

An anonymous post is an entry on a textboard, anonymous bulletin board system, or other discussion forum such as an Internet forum, made without a screen name or, more commonly, using a non-identifiable pseudonym. Some online forums, such as Slashdot, do not allow such posts, requiring users to register either under their real name or under a pseudonym. Others, like JuicyCampus, AutoAdmit, 2channel, and other Futaba-based imageboards, thrive on anonymity. Users of 4chan, in particular, interact in an anonymous and ephemeral environment that facilitates the rapid generation of new trends.

Pseudonymization is a data management and de-identification procedure by which personally identifiable information fields within a data record are replaced by one or more artificial identifiers, or pseudonyms. A single pseudonym for each replaced field or collection of replaced fields makes the data record less identifiable while remaining suitable for data analysis and data processing.
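As an illustration, the sketch below derives deterministic pseudonyms with a keyed hash (HMAC-SHA-256), one of the techniques discussed in the AEPD guidance cited above [8]; the key and e-mail address are placeholders, not recommended values. Because the same input always yields the same pseudonym, records about one person remain linkable for analysis, while reversing the mapping requires the secret key.

```python
import hmac
import hashlib

# Placeholder: in practice the key must be generated securely and stored
# separately from the pseudonymized data.
SECRET_KEY = b"replace-with-a-securely-stored-key"

def pseudonym(identifier: str) -> str:
    """Derive a stable pseudonym from an identifier using HMAC-SHA-256."""
    digest = hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256)
    return digest.hexdigest()[:16]

# The same person maps to the same pseudonym wherever the key is shared,
# so longitudinal analysis still works on the pseudonymized records.
print(pseudonym("alice@example.com"))
print(pseudonym("alice@example.com"))  # identical output
```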

Privacy-enhancing technologies (PET) are technologies that embody fundamental data protection principles by minimizing personal data use, maximizing data security, and empowering individuals. PETs allow online users to protect the privacy of their personally identifiable information (PII), which is often provided to and handled by services or applications. PETs use techniques to minimize an information system's possession of personal data without losing functionality. Generally speaking, PETs can be categorized as hard and soft privacy technologies.

The German Bundesdatenschutzgesetz (BDSG) is a federal data protection act that, together with the data protection acts of the German federated states and other area-specific regulations, governs the exposure of personal data that is manually processed or stored in IT systems.

Privacy by design is an approach to systems engineering initially developed by Ann Cavoukian and formalized in a joint report on privacy-enhancing technologies by a joint team of the Information and Privacy Commissioner of Ontario (Canada), the Dutch Data Protection Authority, and the Netherlands Organisation for Applied Scientific Research in 1995. The privacy by design framework was published in 2009 and adopted by the International Assembly of Privacy Commissioners and Data Protection Authorities in 2010. Privacy by design calls for privacy to be taken into account throughout the whole engineering process. The concept is an example of value sensitive design, i.e., taking human values into account in a well-defined manner throughout the process.

De-identification is the process used to prevent someone's personal identity from being revealed. For example, data produced during human subject research might be de-identified to preserve the privacy of research participants. Biological data may be de-identified in order to comply with HIPAA regulations that define and stipulate patient privacy laws.

The General Data Protection Regulation is a European Union regulation on information privacy in the European Union (EU) and the European Economic Area (EEA). The GDPR is an important component of EU privacy law and human rights law, in particular Article 8(1) of the Charter of Fundamental Rights of the European Union. It also governs the transfer of personal data outside the EU and EEA. The GDPR's goals are to enhance individuals' control and rights over their personal information and to simplify the regulations for international business. It supersedes the Data Protection Directive 95/46/EC and, among other things, simplifies the terminology.

In electronic health records (EHRs), data masking, or controlled access, is the process of concealing patient health data from certain healthcare providers. Patients have the right to request the masking of their personal information, making it inaccessible to any physician, or to a particular physician, unless a specific reason is provided. Data masking is also performed by healthcare agencies to restrict the amount of information that can be accessed by external bodies such as researchers, health insurance agencies, and unauthorised individuals. It is a method used to protect patients' sensitive information so that privacy and confidentiality are less of a concern. Techniques used to alter information within a patient's EHR include data encryption, obfuscation, hashing, exclusion, and perturbation.
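
A minimal sketch of such controlled access might look like the following Python fragment, where the roles and the fields masked for each role are purely illustrative and not drawn from any specific EHR system:

```python
# Which fields each role is NOT allowed to see (illustrative only).
MASKED_FIELDS_BY_ROLE = {
    "treating_physician": set(),                      # full view
    "researcher": {"name", "address"},                # identity masked
    "insurer": {"name", "address", "mental_health"},  # patient-requested masking
}

def masked_view(record, role):
    """Return a copy of the record with this role's hidden fields masked."""
    hidden = MASKED_FIELDS_BY_ROLE.get(role, set(record))  # unknown role sees nothing
    return {k: ("***" if k in hidden else v) for k, v in record.items()}

ehr = {"name": "A. Patient", "address": "1 Main St",
       "mental_health": "...", "allergies": "penicillin"}
print(masked_view(ehr, "researcher"))
```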

Quasi-identifiers are pieces of information that are not of themselves unique identifiers, but are sufficiently well correlated with an entity that they can be combined with other quasi-identifiers to create a unique identifier.
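
A toy Python example makes the point: no single field below is unique on its own, but the combination of ZIP code, birth date, and sex can single out an individual record. The data are invented for illustration.

```python
from collections import Counter

records = [
    {"zip": "02139", "birth": "1980-05-01", "sex": "F", "diagnosis": "flu"},
    {"zip": "02139", "birth": "1980-05-01", "sex": "M", "diagnosis": "asthma"},
    {"zip": "02139", "birth": "1991-11-23", "sex": "F", "diagnosis": "flu"},
]

quasi = ("zip", "birth", "sex")
# Count how many records share each combination of quasi-identifier values.
counts = Counter(tuple(r[q] for q in quasi) for r in records)

for combo, n in counts.items():
    status = "UNIQUE - effectively an identifier" if n == 1 else f"shared by {n} records"
    print(combo, "->", status)
```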

k-anonymity is a property possessed by certain anonymized data. The term k-anonymity was first introduced by Pierangela Samarati and Latanya Sweeney in a paper published in 1998, although the concept dates to a 1986 paper by Tore Dalenius.
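
Under the usual definition, a table is k-anonymous with respect to a set of quasi-identifiers if every combination of their values appears in at least k rows. The following sketch computes the k a toy table actually achieves; the table and the choice of quasi-identifiers are illustrative.

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Return the largest k for which the table is k-anonymous."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(counts.values())  # the rarest quasi-identifier group sets k

rows = [
    {"age": "30-39", "zip": "021**", "disease": "flu"},
    {"age": "30-39", "zip": "021**", "disease": "asthma"},
    {"age": "40-49", "zip": "021**", "disease": "flu"},
    {"age": "40-49", "zip": "021**", "disease": "diabetes"},
]

print(k_anonymity(rows, ("age", "zip")))  # -> 2: the table is 2-anonymous
```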

Privacy engineering is an emerging field of engineering which aims to provide methodologies, tools, and techniques to ensure systems provide acceptable levels of privacy.

Data re-identification or de-anonymization is the practice of matching anonymous data with publicly available information, or auxiliary data, in order to discover the person the data belong to. This is a concern because companies with privacy policies, health care providers, and financial institutions may release the data they collect after the data has gone through the de-identification process.
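
A minimal sketch of such a linking attack, with invented datasets, joins a released "anonymized" table to a public one on shared quasi-identifiers:

```python
# A released medical table with names removed but quasi-identifiers kept.
released = [
    {"zip": "02139", "birth": "1980-05-01", "sex": "F", "diagnosis": "flu"},
]
# An auxiliary public dataset (e.g., a voter roll) with the same attributes.
public = [
    {"name": "Alice Example", "zip": "02139", "birth": "1980-05-01", "sex": "F"},
    {"name": "Bob Example", "zip": "02139", "birth": "1975-02-14", "sex": "M"},
]

quasi = ("zip", "birth", "sex")
# Index the public data by quasi-identifier combination.
index = {tuple(p[q] for q in quasi): p["name"] for p in public}

for rec in released:
    name = index.get(tuple(rec[q] for q in quasi))
    if name:
        print(f"{name} re-identified with diagnosis {rec['diagnosis']}")
```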

Spatial cloaking is a privacy mechanism that is used to satisfy specific privacy requirements by blurring users' exact locations into cloaked regions. This technique is usually integrated into applications in various environments to minimize the disclosure of private information when users request a location-based service. Since the database server does not receive accurate location information, it returns a set of answers that includes the one matching the user's true location. General privacy requirements include k-anonymity, maximum area, and minimum area.
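
One simple way to realize spatial cloaking is to snap the exact coordinate to a coarse grid cell and send only the cell to the server; the sketch below assumes an arbitrary 0.01-degree cell size (roughly 1 km at mid-latitudes).

```python
def cloak(lat, lon, cell=0.01):
    """Return the bounding box of the grid cell containing the point."""
    lat0 = (lat // cell) * cell
    lon0 = (lon // cell) * cell
    return (lat0, lon0, lat0 + cell, lon0 + cell)

# The service sees only the cloaked region, never the exact coordinate,
# and answers the query for the whole box.
print(cloak(42.3601, -71.0589))
```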

Non-Personal Data (NPD) is electronic data that does not contain any information that can be used to identify a natural person. It can either be data that never contained personal information, or data that once contained personal data which was subsequently pseudonymized or anonymized. NPD is part of the overall data governance strategy of a region or country. While personal data is covered by data protection legislation such as the GDPR, other kinds of data fall under the scope of NPD regulation.

The Personal Information Protection Law of the People's Republic of China ("PIPL") protects personal information rights and interests, standardizes personal information handling activities, and promotes the rational use of personal information. It also addresses the transfer of personal data outside of China.

References

  1. ISO 25237:2017 Health informatics -- Pseudonymization. ISO. 2017. p. 7.
  2. "Data anonymization". The Free Medical Dictionary. Retrieved 17 January 2014.
  3. "De-anonymization". Whatis.com. Retrieved 17 January 2014.
  4. Bin Zhou; Jian Pei; WoShun Luk (December 2008). "A brief survey on anonymization techniques for privacy preserving publishing of social network data" (PDF). ACM SIGKDD Explorations Newsletter. 10 (2): 12–22. doi:10.1145/1540276.1540279. S2CID 609178.
  5. "Opinion 05/2014 on Anonymisation Techniques" (PDF). EU Commission. 10 April 2014. Retrieved 31 December 2023.
  6. Eyupoglu, Can; Aydin, Muhammed; Zaim, Abdul; Sertbas, Ahmet (2018-05-17). "An Efficient Big Data Anonymization Algorithm Based on Chaos and Perturbation Techniques". Entropy. 20 (5): 373. Bibcode:2018Entrp..20..373E. doi:10.3390/e20050373. ISSN 1099-4300. PMC 7512893. PMID 33265463. Text was copied from this source, which is available under a Creative Commons Attribution 4.0 International License.
  7. Skiera, Bernd (2022). The impact of the GDPR on the online advertising market. Klaus Miller, Yuxi Jin, Lennart Kraft, René Laub, Julia Schmitt. Frankfurt am Main. ISBN 978-3-9824173-0-1. OCLC 1303894344.
  8. "Introduction to the Hash Function as a Personal Data Pseudonymisation Technique" (PDF). Spanish Data Protection Authority. October 2019. Retrieved 31 December 2023.
  9. Kolata, Gina (23 July 2019). "Your Data Were 'Anonymized'? These Scientists Can Still Identify You". The New York Times.
  10. "Attm (TA) 28857-06-17 Nursing Companies Association v. Ministry of Defense" (in Yiddish). Pearl Cohen. 2019. Retrieved 31 December 2023.
  11. Solomon, S. (31 January 2019). "Data is up for grabs under outdated Israeli privacy law, think tank says". The Times of Israel . Retrieved 31 December 2023.