Barnardisation

Barnardisation is a method of statistical disclosure control for tables of counts. It adds +1, 0 or -1 to some or all of the internal non-zero cells in a table in a pseudo-random fashion: each internal cell is increased by 1 with probability p/2, decreased by 1 with probability p/2, and left unchanged with probability 1-p. The table totals are then calculated as the sum of the post-adjustment internal counts. [1] [2]
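The cell-level adjustment described above can be sketched in Python. This is an illustrative implementation only: the function name, parameterisation and random-number source are not taken from any official census system.

```python
import random

def barnardise(cells, p, rng=None):
    """Barnardise a list of internal table counts (illustrative sketch).

    Each non-zero cell is increased by 1 with probability p/2,
    decreased by 1 with probability p/2, and left unchanged with
    probability 1 - p. Zero cells are left untouched.
    """
    rng = rng or random.Random()
    out = []
    for count in cells:
        if count == 0:
            out.append(0)  # zeros are not adjusted in the basic method
            continue
        u = rng.random()
        if u < p / 2:
            out.append(count + 1)
        elif u < p:
            out.append(count - 1)
        else:
            out.append(count)
    return out

cells = [3, 0, 1, 7, 2]
noisy = barnardise(cells, p=0.4, rng=random.Random(42))
total = sum(noisy)  # the published table total is the sum of the adjusted cells
```

Note that each cell moves by at most 1, so the table total can drift by at most the number of adjusted cells; the pairing scheme described under "Operational details" was designed to limit exactly this drift.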

Etymology

The technique of Barnardisation appears to have been named after Professor George Alfred Barnard (1915–2002), a Professor of Mathematics at the University of Essex. Barnard, at that time President of the Royal Statistical Society, was one of three Fellows appointed by the Council of the Royal Statistical Society to help provide a government-commissioned review of data security for the 1971 UK Census. [3] The resulting report questioned whether rounding small numbers to the nearest five was the best approach to preserving respondent confidentiality. [3] :para 3.3.8 The formal government response to the report noted that an additional safeguard of small random adjustments had been introduced for the 1971 Census, the suggestion for which it explicitly attributed to Professor Barnard, [3] :para 4.20 and footnote as did a New Scientist article dated July 1973. [4] Muddying the waters slightly, a 1973 paper in the Journal of the Royal Statistical Society discussing this new safeguard reported that "after much discussion, a variant of a procedure suggested in Canada was adopted". [5] :p.520 Presumably Professor Barnard was involved in these discussions and was the inventor of the variant. In any case, no evidence can be found of any such safeguard being applied in Canada, with Statistics Canada seeming instead to stick to the use of random rounding of all counts to the nearest 0 or 5. [6] :p.13

Despite originating from Prof Barnard, in documentation surrounding the 1971 Census the method of adjustment now known as Barnardisation was simply described as a 'procedure'; [5] an 'adjustment of values'; [7] a 'special procedure'; [1] a 'process of random error injection'; [8] or a 'modification' or 'adjustment'. [9] [10]

The earliest use of the term 'Barnardisation' found in print so far dates to an Office for Population Censuses and Surveys working paper written by Hakim in 1979, where the term is mentioned without citation, and without ascribing it to Prof G A Barnard. [11] But, at the time, Hakim's coinage of this term appears to have been either widely overlooked or widely ignored, at least in print, as demonstrated by the wide range of later publications already cited above.

The term 'Barnardisation' does not appear to have reemerged in print until the 1995 publication of Stan Openshaw's Census Users' Handbook, [12] where it is used by two separate chapter authors and by the index compiler. However, by at least the late 1980s the term was already in widespread conversational usage during UK academic conferences and meetings. [13] More recently the term 'Barnardisation' has also become firmly ensconced in the lexicon of official reports produced by official UK statistical agencies and others. [2] [14]

Operational details

As originally conceived and implemented in the 1971 UK Census, Barnardisation had the added characteristic of pairing tables from separate areas, and applying equal and opposite adjustments to the two areas. For example, if a given table cell in Area A had its value increased by 1, then in paired Area B the equivalent table cell would have its value reduced by 1 (subject to not making the value negative). The purpose of this pairing was to cancel out, as much as possible, the amount of noise introduced via the Barnardisation process at a more aggregate level. [1]
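A minimal sketch of this pairing follows. The exact 1971 handling of edge cases is not documented here, so two assumptions are made: only non-zero cells in the first area of the pair trigger an adjustment, and the opposite adjustment in the paired area is floored at zero.

```python
import random

def barnardise_paired(area_a, area_b, p, rng):
    """1971-style paired Barnardisation (illustrative sketch).

    Equal and opposite adjustments are applied to equivalent cells in
    two paired areas, so that the injected noise largely cancels out
    at more aggregate levels.
    """
    out_a, out_b = [], []
    for a, b in zip(area_a, area_b):
        if a == 0:  # assumption: only non-zero internal cells are adjusted
            out_a.append(a)
            out_b.append(b)
            continue
        u = rng.random()
        delta = 1 if u < p / 2 else (-1 if u < p else 0)
        out_a.append(a + delta)
        out_b.append(max(b - delta, 0))  # opposite sign, not allowed to go negative
    return out_a, out_b
```

Whenever the floor at zero does not bite, each +1 in Area A is exactly offset by a -1 in Area B, so the combined total of the two areas is preserved.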

For the 1991 UK Census the pairing of areas prior to the application of Barnardisation was dropped; and, for the more detailed Local Base Statistics, the range of adjustments was extended to -2, -1, 0, +1 or +2, achieved by applying the +1, 0 or -1 adjustment twice. [10]
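Assuming that "applying the adjustment twice" means two independent draws of the same +1/0/-1 step (the source does not specify further), the 1991 Local Base Statistics variant can be sketched as:

```python
import random

def step(count, p, rng):
    """One +1 / 0 / -1 draw: +1 with probability p/2, -1 with p/2."""
    u = rng.random()
    return count + 1 if u < p / 2 else (count - 1 if u < p else count)

def barnardise_lbs(count, p, rng):
    """1991 Local Base Statistics variant (illustrative sketch):
    the +1/0/-1 adjustment is applied twice, so the net change to a
    cell ranges over -2, -1, 0, +1, +2."""
    return step(step(count, p, rng), p, rng)
```

With two independent draws, the extreme adjustments of -2 and +2 each occur with probability (p/2)^2, so they are much rarer than the single-step adjustments.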

In the United Kingdom, Barnardisation became increasingly employed by public agencies to enable them to provide information for statistical purposes without infringing the information privacy rights of the individuals to whom the information relates (e.g. [2] [15] ). In some cases this has involved further modifications to the Barnardisation procedure. For example, as implemented by the Common Services Agency, adjustments of -1, 0 or +1 were only applied to counts of 1 to 4, whilst counts of 0, instead of being left unchanged, were adjusted by the addition of 0 or +1. [15] :para 16
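The CSA variant might be sketched as follows. Two details are not stated in the source and are assumptions here: the probability with which a zero becomes a 1 (taken as p/2 for illustration), and the treatment of counts of 5 or more (left unchanged here).

```python
import random

def barnardise_csa(count, p, rng):
    """CSA-style Barnardisation (illustrative sketch).

    Counts of 1 to 4 receive the usual -1/0/+1 adjustment; counts of 0
    are adjusted by 0 or +1 rather than being left unchanged.
    """
    u = rng.random()
    if 1 <= count <= 4:
        if u < p / 2:
            return count + 1
        if u < p:
            return count - 1
        return count
    if count == 0:
        # the probability of the +1 is not stated in the source; p/2 assumed
        return 1 if u < p / 2 else 0
    return count  # assumption: counts >= 5 left unchanged
```

Perturbing zeros upward means a published 0 no longer guarantees a true 0, and a published 1 may conceal a true 0, which addresses the disclosure risk posed by empty cells.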

Pros and cons

A review of Statistical Disclosure Control methods in the run-up to the 2011 UK Census [14] identified the following pros and cons of Barnardisation from the point of view of the data provider:

Advantages

Disadvantages

From a user point of view, another advantage of Barnardisation is that it has been shown to have a smaller impact on typical user analyses than the following Statistical Disclosure Control measures: random rounding to base 5, as used by Statistics Canada; random rounding to base 3, as used by Statistics New Zealand; and Small Cell Adjustment, as used at various points in time by the Office for National Statistics and the Australian Bureau of Statistics. [16]

Efficacy reappraised

Since the late 1990s, concerns over the efficacy of Barnardisation in protecting confidentiality have increased to the point where it is no longer recommended as a 'go to' tool, but rather as a technique only to be used in special circumstances. This change in attitudes appears to centre on the relatively high probability that Barnardisation will leave a small count (in particular a 1) unadjusted [2] [15] and, secondarily, on the danger of reverse-engineering the original value if sufficiently many overlapping Barnardised tables are released. [14] For these and other reasons, UK Censuses from 2001 onwards have abandoned the use of Barnardisation. See Spicer for a good review of the alternatives to Barnardisation adopted in 2001, 2011 and 2021, and the rationale for them. [17]

The question of whether barnardisation may fall short of the complete anonymisation of data, and the status of barnardised data under the complex provisions of the Data Protection Act 1998, were considered by the Scottish Information Commissioner. Some aspects of an initial decision by the Commissioner were overturned on appeal to the House of Lords, and the Commissioner was invited to revisit his original decision. The Commissioner's final decision ruled that barnardisation provided insufficient disclosure protection for rare events (in this case, Childhood Leukaemia), reversing in part his original decision: "the barnardised data, by themselves, can lead to identification, and [...] the effect of barnardisation on the actual figures, at least as deployed by the CSA, does not have the effect of concealing or disguising the data which he [the Commissioner] had originally considered that it would." [15] :para 20 However, in his written decision the Commissioner offered no statistical justification for this assertion. Instead the Commissioner's decision centred mainly around addressing points of law relating to the nature of the original and barnardised data, and how this related to legal definitions of (sensitive) personal data.

References

  1. Newman, Dennis (1978). Techniques for ensuring the confidentiality of census information in Great Britain (Occasional Paper 4 ed.). Census Division, OPCS.
  2. ONS (2006). Review of the dissemination of health statistics: confidentiality guidance (PDF) (Working Paper 3: Risk Management ed.). Office for National Statistics.
  3. Moore, P G (1973). "Security of the Census of Population". Journal of the Royal Statistical Society. Series A (General). 136 (4): 583–596. doi:10.2307/2344751. JSTOR 2344751.
  4. New Scientist (1973). "Census data not so secret". New Scientist (19 July): 142.
  5. Jones, H. J. M.; Lawson, H. B.; Newman, D. (1973). "Population census: recent British developments in methodology". Journal of the Royal Statistical Society. Series A (General). 136 (4): 505–538. doi:10.2307/2344749. JSTOR 2344749. S2CID 133740484. Retrieved 16 May 2022.
  6. Statistics Canada (1974). 1971 Census of Canada: population: vol. I – part 1 (PDF) (Introduction to volume I (part 1) ed.). Ottawa: Statistics Canada. Retrieved 16 May 2022.
  7. Rhind, D W (1975). Geographical analysis and mapping of the 1971 UK Census data, Working Paper 3. Dept of Geography, University of Durham: Census Research Unit.
  8. Hakim, Catherine (1978). Census confidentiality, microdata and census analysis (Occasional Paper 3 ed.). Census Division, OPCS.
  9. Dewdney, J. C. (1983). "Censuses past and present". In Rhind, D W (ed.). A Census User's Handbook. London: Methuen. pp. 1–16.
  10. Marsh, C (1993). "Privacy, confidentiality and anonymity in the 1991 Census". In Dale, A; Marsh, C (eds.). The 1991 Census User's Guide. London: HMSO. pp. 129–154. ISBN 0-11-691527-7.
  11. Hakim, Catherine (1979). "Census confidentiality in Britain". In Bulmer, M (ed.). Censuses, Surveys and Privacy. London: Palgrave. pp. 132–157. doi:10.1007/978-1-349-16184-3_10. ISBN 978-0-333-26223-8.
  12. Openshaw, Stan (1995). Census Users' Handbook. Cambridge: Pearson. ISBN 1-899761-06-3.
  13. Williamson, Paul (2022). "Personal communication". Dept. of Geography and Planning, University of Liverpool.
  14. SDC UKCDMAC Subgroup. "Statistical Disclosure Control (SDC) methods short-listed for 2011 UK Census tabular outputs, Paper 1" (PDF). Office for National Statistics. Retrieved 16 May 2022.
  15. Scottish Information Commissioner (2010). "Decision 021/2005 Mr Michael Collie and the Common Services Agency for the Scottish Health Service – Childhood leukaemia statistics in Dumfries and Galloway" (PDF). Retrieved 16 May 2022.
  16. Williamson, Paul (2007). "The impact of cell adjustment on the analysis of aggregate census data". Environment and Planning A. 39 (5): 1058–1078. doi:10.1068/a38142. S2CID 154653446.
  17. Spicer, K. EAP125 on Statistical disclosure control (SDC) for Census 2021. Titchfield: Office for National Statistics. Retrieved 16 May 2022. [date missing]