Barnardisation

Barnardisation is a method of statistical disclosure control for tables of counts. It involves adding +1, 0 or -1 to some or all of the internal non-zero cells in a table in a pseudo-random fashion. For an overall adjustment probability p, each internal cell has 1 added with probability p/2, is left as is with probability 1-p, and has 1 subtracted with probability p/2. The table totals are then calculated as the sum of the post-adjustment internal counts. [1] [2]
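The basic procedure can be illustrated with a minimal Python sketch. It is not taken from any census office implementation: the function names `barnardise` and `margins`, the representation of a table as a list of rows, and the use of Python's `random` module are all assumptions made for illustration.

```python
import random


def barnardise(table, p, rng=None):
    """Return a perturbed copy of a table of internal counts (list of rows).

    Each non-zero cell has 1 added with probability p/2, 1 subtracted with
    probability p/2, and is left unchanged with probability 1 - p.
    Zero cells are left as they are.
    """
    rng = rng or random.Random()
    perturbed = []
    for row in table:
        new_row = []
        for count in row:
            if count == 0:
                new_row.append(0)  # only non-zero cells are adjusted
                continue
            u = rng.random()
            if u < p / 2:
                delta = 1
            elif u < p:
                delta = -1
            else:
                delta = 0
            new_row.append(count + delta)
        perturbed.append(new_row)
    return perturbed


def margins(table):
    """Recompute row totals, column totals and the grand total from the cells."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return row_totals, col_totals, sum(row_totals)
```

Calling `margins(barnardise(table, p))` gives totals that are consistent with the published internal cells, because they are summed after the adjustment rather than perturbed separately.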

Etymology

The technique of Barnardisation appears to have been named after Professor George Alfred Barnard (1915–2002), a Professor of Mathematics at the University of Essex. Barnard, at that time President of the Royal Statistical Society, was one of three Fellows appointed by the Council of the Royal Statistical Society to help provide a government-commissioned review of data security for the 1971 UK Census. [3] The resulting report questioned whether rounding small numbers to the nearest five was the best approach to preserving respondent confidentiality. [3] :para 3.3.8 The formal government response to the report noted that an additional safeguard of small random adjustments had been introduced for the 1971 Census, the suggestion for which it explicitly attributed to Professor Barnard, [3] :para 4.20 and footnote as did a New Scientist article dated July 1973. [4] Muddying the waters slightly, a 1973 paper in the Journal of the Royal Statistical Society discussing this new safeguard reported that "after much discussion, a variant of a procedure suggested in Canada was adopted." [5] :p.520 Presumably Professor Barnard was involved in these discussions, and was the inventor of the variant. In any case, no evidence can be found of any such safeguard being applied in Canada, with Statistics Canada seeming to stick instead to the use of random rounding of all counts to the nearest 0 or 5. [6] :p.13

Despite originating from Prof Barnard, in documentation surrounding the 1971 Census the method of adjustment now known as Barnardisation was simply described as a 'procedure'; [5] an 'adjustment of values'; [7] a 'special procedure'; [1] a 'process of random error injection'; [8] or a 'modification' or 'adjustment'. [9] [10]

The earliest use of the term 'Barnardisation' found in print so far dates to an Office of Population Censuses and Surveys working paper written by Hakim in 1979, where the term is mentioned without citation, and without ascribing it to Prof G A Barnard. [11] At the time, however, Hakim's coinage of this term appears to have been either widely overlooked or widely ignored, at least in print, as demonstrated by the range of later publications already cited above that describe the method in other words.

The term 'Barnardisation' does not appear to have re-emerged in print until the 1995 publication of Stan Openshaw's Census Users' Handbook, [12] where it is used by two separate chapter authors and by the index compiler. However, by at least the late 1980s the term was already in widespread conversational usage during UK academic conferences and meetings. [13] More recently the term 'Barnardisation' has also become firmly ensconced in the lexicon of reports produced by official UK statistical agencies and others. [2] [14]

Operational details

As originally conceived and implemented in the 1971 UK Census, Barnardisation had the added characteristic of pairing tables from separate areas, and applying equal and opposite adjustments to the two areas. For example, if a given table cell in Area A had its value increased by 1, then in paired Area B the equivalent table cell would have its value reduced by 1 (subject to not making the value negative). The purpose of this pairing was to cancel out, as much as possible, the amount of noise introduced via the Barnardisation process at a more aggregate level. [1]
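The pairing idea can be sketched as follows. This is an illustrative reconstruction, not the 1971 procedure itself: the function names are hypothetical, each area's table is flattened to a list of counts, and the treatment of zero cells is an assumption. Here a cell in Area B receives the opposite adjustment only when it is non-zero, which is one simple way of guaranteeing that no value becomes negative.

```python
import random


def draw_adjustment(p, rng):
    """Draw +1 or -1 with probability p/2 each, otherwise 0."""
    u = rng.random()
    if u < p / 2:
        return 1
    if u < p:
        return -1
    return 0


def barnardise_pair(area_a, area_b, p, rng=None):
    """Apply equal and opposite adjustments to matching cells of two areas.

    area_a and area_b are equal-length lists of counts for the same table
    layout. Assumption: zero cells are never adjusted in either area.
    """
    rng = rng or random.Random()
    new_a, new_b = [], []
    for a, b in zip(area_a, area_b):
        delta = draw_adjustment(p, rng) if a > 0 else 0
        new_a.append(a + delta)
        # Opposite adjustment in the paired area, skipped for zero cells
        # so that the result can never be negative.
        new_b.append(b - delta if b > 0 else b)
    return new_a, new_b
```

Wherever both cells are non-zero, the sum of the two areas is unchanged, which is the sense in which the noise cancels out at a more aggregate level.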

For the 1991 UK Census the pairing of areas prior to the application of Barnardisation was dropped; and for the more detailed Local Base Statistics, its scope was extended to include adjustments of -2, -1, 0, +1 or +2, achieved by applying the -1, 0 or +1 adjustment twice. [10]
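To see what applying the adjustment twice implies, the distribution of the net adjustment can be worked out exactly. The sketch below assumes that the two passes are independent and use the same probability p, and it ignores any special handling of cells that reach zero after the first pass; neither point is specified in the sources cited here. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def single_pass(p):
    """Distribution of one adjustment: +1 and -1 with p/2 each, 0 with 1 - p."""
    return {1: p / 2, 0: 1 - p, -1: p / 2}


def double_pass(p):
    """Distribution of the net adjustment after two independent passes."""
    one = single_pass(p)
    net = {}
    for (d1, q1), (d2, q2) in product(one.items(), repeat=2):
        net[d1 + d2] = net.get(d1 + d2, 0) + q1 * q2
    return net


if __name__ == "__main__":
    for delta, prob in sorted(double_pass(Fraction(1, 5)).items()):
        print(f"{delta:+d}: {prob}")
```

Under these assumptions the net adjustment is ±2 with probability (p/2)² each, ±1 with probability p(1-p) each, and 0 with probability (1-p)² + p²/2, so adjustments of ±2 are much rarer than adjustments of ±1 when p is small.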

In the United Kingdom, Barnardisation became increasingly employed by public agencies in order to enable them to provide information for statistical purposes without infringing the information privacy rights of the individuals to whom the information relates (e.g. [2] [15] ). In some cases this has involved further modifications to the Barnardisation procedure. For example, as implemented by the Common Services Agency, adjustments of -1, 0 or +1 were only applied to counts of 1 to 4, whilst counts of 0, instead of being left unchanged, were adjusted by the addition of 0 or +1. [15] :para 16
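The modified rule described for the Common Services Agency can be sketched in the same style. The source gives the possible adjustments but not their probabilities, so both parameters below are assumptions: `p` is the overall adjustment probability for counts of 1 to 4, and `p_zero` is a hypothetical probability of adding 1 to a count of 0. The function name is also hypothetical, and counts of 5 or more are assumed to be left unchanged.

```python
import random


def barnardise_small_counts(counts, p, p_zero, rng=None):
    """Perturb a list of counts using the small-count variant.

    - counts of 1 to 4: +1 or -1 with probability p/2 each, else unchanged
    - counts of 0: +1 with probability p_zero, else unchanged (assumed rate)
    - counts of 5 or more: left unchanged (assumption)
    """
    rng = rng or random.Random()
    perturbed = []
    for count in counts:
        if count == 0:
            delta = 1 if rng.random() < p_zero else 0
        elif 1 <= count <= 4:
            u = rng.random()
            if u < p / 2:
                delta = 1
            elif u < p:
                delta = -1
            else:
                delta = 0
        else:
            delta = 0
        perturbed.append(count + delta)
    return perturbed
```

Allowing zeros to become ones means that a published 1 no longer implies a true count of at least 1, which is the main difference from the basic procedure.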

Pros and Cons

A review of Statistical Disclosure Control methods in the run-up to the 2011 UK Census [14] identified the following list of pros and cons of Barnardisation from the point of view of the data provider:

Advantages

Disadvantages

From a user's point of view, another advantage of Barnardisation is that it has been shown to have a smaller impact on typical user analyses than the following Statistical Disclosure Control measures: random rounding to base 5, as used by Statistics Canada; random rounding to base 3, as used by Statistics New Zealand; and Small Cell Adjustment, as used at various points in time by the Office for National Statistics and the Australian Bureau of Statistics. [16]

Efficacy reappraised

Since the late 1990s concerns over the efficacy of Barnardisation in protecting confidentiality have increased to the point where it is now no longer recommended as a 'go to' tool, but rather as a technique only to be used in special circumstances. This change in attitudes appears to centre on the relatively high probability that Barnardisation will leave a small count (in particular a 1) unadjusted [2] [15] and, secondarily, on the dangers of reverse engineering the original value if sufficient overlapping Barnardised tables are released. [14] For these and other reasons UK Censuses from 2001 onwards have abandoned the use of Barnardisation. See Spicer for a good review of the alternatives to Barnardisation adopted in 2001, 2011 and 2021, and the rationale for them. [17]

The question of whether barnardisation may fall short of the complete anonymisation of data, and the status of barnardised data under the complex provisions of the Data Protection Act 1998, were considered by the Scottish Information Commissioner. Some aspects of an initial decision by the Commissioner were overturned on appeal to the House of Lords, and the Commissioner was invited to revisit his original decision. The Commissioner's final decision ruled that barnardisation provided insufficient disclosure protection for rare events (in this case, Childhood Leukaemia), reversing in part his original decision: "the barnardised data, by themselves, can lead to identification, and [...] the effect of barnardisation on the actual figures, at least as deployed by the CSA, does not have the effect of concealing or disguising the data which he [the Commissioner] had originally considered that it would." [15] :para 20 However, in his written decision the Commissioner offered no statistical justification for this assertion. Instead the Commissioner's decision centred mainly around addressing points of law relating to the nature of the original and barnardised data, and how this related to legal definitions of (sensitive) personal data.

References

  1. Newman, Dennis (1978). Techniques for ensuring the confidentiality of census information in Great Britain (Occasional Paper 4 ed.). Census Division, OPCS.
  2. ONS (2006). Review of the dissemination of health statistics: confidentiality guidance (PDF) (Working Paper 3: Risk Management ed.). Office for National Statistics.
  3. Moore, P G (1973). "Security of the Census of Population". Journal of the Royal Statistical Society. Series A (General). 136 (4): 583–596.
  4. New Scientist (1973). "Census data not so secret". New Scientist (19th July): 142.
  5. Jones, H. J. M.; Lawson, H. B.; Newman, D. (1973). "Population census: recent British developments in methodology". Journal of the Royal Statistical Society. Series A (General). 136 (4): 505–538. Retrieved 16 May 2022.
  6. Statistics Canada (1974). 1971 Census of Canada : population : vol. I - part 1 (PDF) (Introduction to volume I (part 1) ed.). Ottawa: Statistics Canada. Retrieved 16 May 2022.
  7. Rhind, D W (1975). Geographical analysis and mapping of the 1971 UK Census data, Working Paper 3. Dept of Geography, University of Durham: Census Research Unit.
  8. Hakim, Catherine (1978). Census confidentiality, microdata and census analysis (Occasional Paper 3 ed.). Census Division, OPCS.
  9. J. C. Dewdney (1983). "Censuses past and present". In Rhind, D W (ed.). A Census User’s Handbook. London: Methuen. pp. 1–16.
  10. Marsh (1993). "Privacy, confidentiality and anonymity in the 1991 Census". In Dale, A; Marsh, C (eds.). The 1991 Census User’s Guide. London: HMSO. pp. 129–154. ISBN 0-11-691527-7.
  11. Hakim, Catherine (1979). "Census confidentiality in Britain". In Bulmer, M (ed.). Censuses, Surveys and Privacy. London: Palgrave. pp. 132–157.
  12. Openshaw, Stan (1995). Census Users' Handbook. Cambridge: Pearson. ISBN 1-899761-06-3.
  13. Williamson, Paul (2022). "Personal communication". Dept. of Geography and Planning, University of Liverpool.
  14. SDC UKCDMAC Subgroup. "Statistical Disclosure Control (SDC) methods short-listed for 2011 UK Census tabular outputs, Paper 1" (PDF). Office for National Statistics. Retrieved 16 May 2022.
  15. Scottish Information Commissioner (2010). "Decision 021/2005 Mr Michael Collie and the Common Services Agency for the Scottish Health Service: Childhood leukaemia statistics in Dumfries and Galloway" (PDF). Retrieved 16 May 2022.
  16. Williamson, Paul (2007). "The impact of cell adjustment on the analysis of aggregate census data". Environment and Planning A. 39: 1058–1078. doi:10.1068/a38142.
  17. Spicer, K. EAP125 on Statistical disclosure control (SDC) for Census 2021. Titchfield: Office for National Statistics. Retrieved 16 May 2022.