Statistical disclosure control (SDC), also known as statistical disclosure limitation (SDL) or disclosure avoidance, is a technique used in data-driven research to ensure no person or organization is identifiable from the results of an analysis of survey or administrative data, or in the release of microdata. The purpose of SDC is to protect the confidentiality of the respondents and subjects of the research. [1]
SDC usually refers to 'output SDC': ensuring that, for example, a published table or graph does not disclose confidential information about respondents. SDC can also describe protection methods applied to the data: for example, removing names and addresses, limiting extreme values, or swapping problematic observations. This is sometimes referred to as 'input SDC', but is more commonly called anonymization, de-identification, or microdata protection.
Textbooks (e.g. [2]) typically cover input SDC and tabular data protection (but not other parts of output SDC). This is because these two problems are of direct interest to the statistical agencies that supported the development of the field. [3] For analytical environments, output rules developed for statistical agencies were generally used until data managers began arguing for specific output SDC for research. [4]
This page focuses on output SDC.
Many kinds of social, economic and health research use potentially sensitive data as a basis for their research, such as survey or Census data, tax records, health records, educational information, etc. Such information is usually given in confidence, and, in the case of administrative data, not always for the purpose of research.
Researchers are not usually interested in information about one single person or business; they are looking for trends among larger groups of people. [5] However, the data they use is, in the first place, linked to individual people and businesses, and SDC ensures that these cannot be identified from published data, no matter how detailed or broad. [6]
It is possible that at the end of data analysis, the researcher somehow singles out one person or business through their research. For example, a researcher may identify the exceptionally good or bad service in a geriatric department within a hospital in a remote area, where only one hospital provides such care. In that case, the data analysis 'discloses' the identity of the hospital, even if the dataset used for analysis was properly anonymised or de-identified.
Statistical disclosure control will identify this disclosure risk and ensure the results of the analysis are altered to protect confidentiality. [7] It requires a balance between protecting confidentiality and ensuring the results of the data analysis are still useful for statistical research. [8]
Output SDC relies upon having a set of rules that can be followed by an output checker; for example, that a frequency table must have a minimum number of observations, or that survival tables should be right-censored for extreme values. The value and drawbacks of rules for frequency and magnitude tables have been discussed extensively since the late 20th century. However, as awareness grows that rules are also needed for other types of analysis, a more structured approach is required.
Some statistical outputs, such as frequency tables, have a high level of inherent risk: differencing, low numbers, class disclosure. They therefore need to be checked before release, ideally by someone with some understanding of the data, to ensure that there is no meaningful risk on release. These are referred to as 'unsafe statistics'. However, there are some statistics, such as the coefficients from modelling, that have no meaningful risk and therefore can be released with no further checks. These are called 'safe statistics'. By separating statistics into 'safe' and 'unsafe', output checks can be concentrated on the latter, improving both security and efficiency. [4]
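The differencing risk mentioned above can be illustrated with a minimal sketch. The table contents, the function names and the threshold of five are all hypothetical and chosen only for illustration. Two frequency tables that each look acceptable on their own can, when subtracted, imply counts for a very small group.

```python
def difference_tables(table_a, table_b):
    """Cell-wise difference between two frequency tables over the same categories."""
    return {cell: table_a[cell] - table_b.get(cell, 0) for cell in table_a}


def small_cells(table, threshold=5):
    """Return the non-zero cells whose count is below the (assumed) threshold."""
    return {cell: n for cell, n in table.items() if 0 < n < threshold}


# Hypothetical outputs: the same table produced with and without one small subgroup.
all_staff = {"low pay": 40, "medium pay": 55, "high pay": 12}
staff_excluding_managers = {"low pay": 40, "medium pay": 54, "high pay": 11}

# Neither table has a small cell on its own...
assert small_cells(all_staff) == {}
assert small_cells(staff_excluding_managers) == {}

# ...but subtracting one from the other reveals the pay bands of the two managers.
implied = difference_tables(all_staff, staff_excluding_managers)
print(small_cells(implied))
```

This is why a checker needs to consider related outputs together rather than each table in isolation.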
This is less important for official statistics, where 'unsafe' statistics such as counts, means, medians and simple indexes dominate the outputs. However, for research output this is important, as a great deal of research output (particularly estimates and test statistics) is inherently 'safe'.
The safe/unsafe model is useful but limited, because it has only two categories; within those categories, guidelines for SDC largely consist of long lists of statistics and how to handle them. In 2023, the SACRO project (https://dareuk.org.uk/driver-project-sacro/) undertook to review the whole field and see whether a more useful classification scheme could be introduced. The result is the 'statistical barn' (or 'statbarn') concept.
A statbarn is a classification of statistics for disclosure control purposes, where all of the statistics in that class share the same characteristics as far as disclosure control is concerned:
As of March 2024, 14 statbarns have been identified, with 12 described for output checkers: [9]
These cover almost all statistics. They also cover most graph forms, where the graph can be converted into the appropriate statbarn (for example, a pie chart is another form of frequency table). The SACRO manual provides guidance on what to look out for, and the rules to be followed for checking.
There are two main approaches to output SDC: principles-based and rules-based. [10] In principles-based systems, disclosure control attempts to uphold a specific set of fundamental principles—for example, "no person should be identifiable in released microdata". [11] Rules-based systems, in contrast, are evidenced by a specific set of rules that a person performing disclosure control follows (for example, "any frequency must be based on at least five observations"), after which the data are presumed to be safe to release. In general, official statistics are rules-based; research environments are more likely to be principles-based.
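A rules-based check of the kind quoted above ("any frequency must be based on at least five observations") can be sketched as follows. This is an illustrative assumption rather than any agency's actual procedure: the function name, the example table, and the choice to ignore empty cells are all hypothetical.

```python
def check_frequency_table(table, min_count=5):
    """Apply a hard minimum-count rule to a frequency table.

    Returns (passes, failing_cells). Empty cells are ignored here, which is
    an assumption; some regimes treat zeros as a risk in their own right.
    """
    failing = {cell: n for cell, n in table.items() if 0 < n < min_count}
    return (len(failing) == 0, failing)


# Hypothetical frequency table of respondents by age band.
table = {"18-29": 42, "30-49": 67, "50-69": 31, "70+": 3}

passes, failing = check_frequency_table(table)
print(passes, failing)
```

Under a rules-based regime the output would simply be blocked until the failing cell is dealt with; a principles-based regime would treat the same result as a prompt for discussion.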
In research environments, the choice of output-checking regime can have significant operational implications. [12]
In rules-based SDC, a rigid set of rules is used to determine whether or not the results of data analysis can be released. The rules are applied consistently, which makes it obvious what kinds of output are acceptable. Rules-based systems are good for ensuring consistency across time, across data sources, and across production teams, which makes them appealing for statistical agencies. [12] Rules-based systems also work well for remote job servers such as microdata.no or Lissy.
However, because the rules are inflexible, either disclosive information may still slip through, or the rules are over-restrictive and allow only results that are too broad for useful analysis to be published. [10] In practice, research environments running rules-based systems may have to introduce flexibility through 'ad hoc' systems. [12]
The Northern Ireland Statistics and Research Agency (NISRA) uses a rules-based approach to releasing statistics and research results. [13]
In principles-based SDC, both the researcher and the output checker are trained in SDC. They receive a set of rules, which are rules of thumb rather than hard rules as in rules-based SDC. This means that in principle, any output may be approved or refused. The rules of thumb are a starting point for the researcher. A researcher may request outputs which breach the rules of thumb as long as (1) they are non-disclosive, (2) they are important, and (3) this is an exceptional request. [14] It is up to the researcher to prove that any 'unsafe' outputs are non-disclosive, but the checker has the final say. Since there are no hard rules, this requires knowledge of disclosure risks and judgment from both the researcher and the checker. It requires training and an understanding of statistics and data analysis, [10] although it has been argued [12] that this can be used to make the process more efficient than a rules-based model.
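The decision flow described above can be sketched in a few lines. This is only an illustration of the logic in the text, not the procedure of any particular research environment; the class, field and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExceptionRequest:
    """A researcher's case for releasing an output that breaches a rule of thumb."""
    non_disclosive: bool  # the researcher has shown the output is non-disclosive
    important: bool       # the output matters for the research
    exceptional: bool     # this is not a routine request


def review_output(breaches_rule_of_thumb: bool,
                  request: Optional[ExceptionRequest],
                  checker_approves: bool) -> bool:
    """Return True if the output is released.

    An output is eligible if it meets the rules of thumb, or if an exception
    request satisfies all three conditions. The checker has the final say in
    either case, so any output may still be refused.
    """
    if breaches_rule_of_thumb:
        eligible = (request is not None
                    and request.non_disclosive
                    and request.important
                    and request.exceptional)
    else:
        eligible = True
    return eligible and checker_approves


# A breaching output with a complete justification, accepted by the checker.
case = ExceptionRequest(non_disclosive=True, important=True, exceptional=True)
print(review_output(True, case, checker_approves=True))
```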
In the UK all major secure research environments in social science and public health, with the exception of Northern Ireland, are principles-based. This includes the UK Data Service's Secure Data Service, [15] the Office for National Statistics' Secure Research Service, the Scottish Safe Havens, Secure Anonymized Information Linkage (SAIL) and OpenSAFELY.
Many contemporary statistical disclosure control techniques, such as generalization and cell suppression, have been shown to be vulnerable to attack by a hypothetical data intruder. For example, Cox showed in 2009 that complementary cell suppression typically leads to "over-protected" solutions because of the need to suppress both primary and complementary cells, and even then can lead to the compromise of sensitive data when exact intervals are reported. [16]
Many of the rules are arbitrary and reflect data owners' unwillingness to be different, rather than solid evidence. For example, Ritchie [17] demonstrated that the choice of a minimum threshold is more about an organisation's wish to be in line with others than any statistical rationale.
A more substantive criticism is that the theoretical models used to explore control measures are not appropriate as guides for practical action. [18] Hafner et al. provide a practical example of how a change in perspective can generate substantially different results. [3]
Artificial intelligence and machine learning models present different risks for output checking. [19] The GRAIMATTER project (https://dareuk.org.uk/sprint-exemplar-project-graimatter/) provided some initial guidance and automatic tools. These were extended and simplified as part of the SACRO project (see below), and more guidelines for data services staff were added. This is still a quickly evolving area. The SDC-REBOOT community network (https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=SDC-REBOOT) is currently co-ordinating the ongoing development of the tools and guidance.
Output checking is generally labour-intensive, as it requires analysts who can understand what they are looking at and make a judgement about whether to release an output. There is therefore considerable interest in automated checking. A Eurostat-commissioned report [20] explored the options for output checking, which largely come down to two options:
tau-Argus and sdcTable are fully-automated open-source EoPR tools for tabular data protection (frequency and magnitude tables). They are designed to work with multiple tables. Metadata needs to be set up describing the output(s), and the control parameters. They provide the output checkers with extensive information on potential problems, including secondary disclosure across tables. They can also carry out correction measures, from suppression and simple rounding to secondary suppression and controlled tabular rounding. They do not deal with non-tabular outputs.
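To make the correction measures concrete, the sketch below shows primary suppression and simple rounding on a single table row, and why secondary suppression is needed when totals are published. It is a plain-Python illustration and does not use or reproduce the tau-Argus or sdcTable interfaces; the function names, the threshold and the rounding base are assumptions.

```python
def primary_suppress(cells, threshold=5):
    """Replace non-zero counts below the threshold with None (suppressed)."""
    return {k: (None if 0 < v < threshold else v) for k, v in cells.items()}


def round_to_base(cells, base=5):
    """Simple rounding: round each count to the nearest multiple of base."""
    return {k: base * ((v + base // 2) // base) for k, v in cells.items()}


def recover_single_suppressed(cells, total):
    """If exactly one cell is suppressed and the total is published,
    the hidden value follows by subtraction."""
    missing = [k for k, v in cells.items() if v is None]
    if len(missing) != 1:
        return None
    known = sum(v for v in cells.values() if v is not None)
    return {missing[0]: total - known}


# Hypothetical row of a frequency table with its published total.
row = {"A": 20, "B": 3, "C": 17}
row_total = sum(row.values())

suppressed = primary_suppress(row)
print(suppressed)
# Primary suppression alone is undone by the published total,
# which is the motivation for secondary suppression.
print(recover_single_suppressed(suppressed, row_total))
print(round_to_base(row))
```

Secondary suppression and controlled tabular rounding address this by choosing additional cells to protect so that totals and other tables cannot be used to recover the hidden values; that optimisation step is not attempted here.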
Because of the need to rewrite the metadata for each table, these tools are poorly suited for research use. However, in official statistics, where the same tables are being repeatedly generated and where secondary differencing is considered a significant problem, the investment in setting up the tools can be very cost-effective.
The software for both is open source: tau-Argus is available on GitHub (https://github.com/sdcTools/tauargus) and sdcTable on CRAN (https://cran.r-project.org/web/packages/sdcTable/).
SACRO (Semi-Automated Checking of Research Outputs) is a WPR tool. Its predecessor, ACRO, was commissioned by Eurostat in 2020 as a proof-of-concept to show that a general-purpose output checking tool could be developed. [21] In 2023 the UK Medical Research Council commissioned a generalised version (SACRO) which would work with multiple languages (as of 2024: Stata, R and Python) and provide a more user-friendly interface. [22] SACRO directly implements the statbarns model and is principles-based; hence, it is 'semi-automatic', as it allows users to request exceptions and output checkers to override the automated recommendations. All UK social science secure facilities, and most UK public health secure facilities, are planning to adopt it.
The software is available on GitHub at https://github.com/AI-SDC, which also contains links to the original ACRO and tools for assessing AI models.