Statistical database


A statistical database is a database used for statistical analysis purposes. It is an OLAP (online analytical processing) system rather than an OLTP (online transaction processing) system. Both modern decision-support databases and classical statistical databases are often closer to the relational model than to the multidimensional model commonly used in OLAP systems today.


Statistical databases typically contain parameter data and the measured data for these parameters. Parameter data consists of the values taken by the varying conditions in an experiment (e.g., temperature, time); the measured data (or variables) are the measurements taken in the experiment under these varying conditions.
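
As a minimal illustration (the experiment, field names, and values here are hypothetical, not from the article), parameter columns hold the varying conditions and the measured column holds the value observed under each combination of them:

```python
# Hypothetical example: parameter data (temperature, time) are the varying
# conditions of an experiment; the measured datum (yield_pct) is the value
# observed under each combination of conditions.
observations = [
    {"temperature": 20.0, "time": 10, "yield_pct": 47.2},
    {"temperature": 20.0, "time": 20, "yield_pct": 51.8},
    {"temperature": 30.0, "time": 10, "yield_pct": 55.1},
    {"temperature": 30.0, "time": 20, "yield_pct": 58.9},
]

# Parameter data: the set of values each varying condition takes on.
parameters = {
    "temperature": sorted({row["temperature"] for row in observations}),
    "time": sorted({row["time"] for row in observations}),
}
print(parameters)  # {'temperature': [20.0, 30.0], 'time': [10, 20]}
```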

Many statistical databases are sparse, with many null or zero values. It is not uncommon for a statistical database to be 40% to 50% sparse. There are two options for dealing with the sparseness: (1) keep the null values and use compression techniques to store them compactly, or (2) remove the entries that contain only null values.
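
The two options can be sketched as follows (a hypothetical, simplified example; real systems would use columnar compression or a dedicated sparse storage format):

```python
# Hypothetical sparse measurements: most parameter combinations have no value.
cells = {
    ("t=20", "pH=7"): 47.2,
    ("t=20", "pH=9"): None,
    ("t=30", "pH=7"): None,
    ("t=30", "pH=9"): 58.9,
}

# Option 1: keep the nulls but store them compactly, e.g. a presence bitmap
# plus the list of non-null values (a very simple compression scheme).
presence = [value is not None for value in cells.values()]
non_null_values = [value for value in cells.values() if value is not None]

# Option 2: drop the entries that hold only nulls and keep populated cells.
compacted = {key: value for key, value in cells.items() if value is not None}

print(sum(presence), "of", len(presence), "cells populated")
print(compacted)
```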

Statistical databases often incorporate support for advanced statistical analysis techniques, such as correlations, which go beyond SQL. They also pose unique security concerns, which were the focus of much research, particularly in the late 1970s and early to mid-1980s.

Privacy in statistical databases

In a statistical database, it is often desired to allow query access only to aggregate data, not individual records. Securing such a database is a difficult problem, since intelligent users can use a combination of aggregate queries to derive information about a single individual.
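
A minimal sketch of such a differencing attack (the table, names, and query function here are hypothetical): two individually harmless aggregate queries can be subtracted to recover one person's confidential value.

```python
# Hypothetical table and query interface: only aggregate (SUM) queries are allowed.
salaries = {"alice": 52000, "bob": 61000, "carol": 58000, "dave": 70000}

def sum_query(predicate):
    """Aggregate-only access: return the SUM of salaries for matching names."""
    return sum(salary for name, salary in salaries.items() if predicate(name))

# Two individually harmless aggregate queries...
total_all = sum_query(lambda name: True)
total_without_dave = sum_query(lambda name: name != "dave")

# ...whose difference reveals one individual's confidential value.
print(total_all - total_without_dave)  # 70000, dave's salary
```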

Some common approaches are restricting the minimum allowable size of a query set, perturbing the stored data or the query answers with random noise, answering queries over random samples of the data, and auditing query histories to detect combinations of queries that would compromise an individual.

For many years, research in this area was stalled, and it was thought in 1980 that, to quote:

The conclusion is that statistical databases are almost always subject to compromise. Severe restrictions on allowable query set sizes will render the database useless as a source of statistical information but will not secure the confidential records. [1]

But in 2006, Cynthia Dwork defined the field of differential privacy, building on work that started appearing in 2003. While showing that some semantic security goals, related to the earlier work of Tore Dalenius, are impossible to achieve, this work identified new techniques for limiting the increased privacy risk that results from including an individual's private data in a statistical database. These techniques make it possible in many cases to provide very accurate statistics from the database while still ensuring high levels of privacy. [2] [3]
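
As an illustration of the idea (a minimal sketch of the Laplace mechanism for a counting query, not the authors' exact construction), calibrated random noise keeps the aggregate useful while masking any single individual's contribution:

```python
import math
import random

def laplace_noise(scale):
    """Sample Laplace(0, scale) noise via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(records, predicate, epsilon):
    """Epsilon-differentially private count: adding or removing one record changes
    the true count by at most 1, so Laplace noise of scale 1/epsilon suffices."""
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon)

ages = [34, 29, 41, 52, 47, 38, 61, 45]
print(dp_count(ages, lambda age: age >= 40, epsilon=0.5))  # noisy value near the true count of 5
```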

Related Research Articles

Data warehouse: Centralized storage of knowledge

In computing, a data warehouse, also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis and is considered a core component of business intelligence. Data warehouses are central repositories of integrated data from one or more disparate sources. They store current and historical data in a single place, which is used for creating reports. This is beneficial for companies, as it enables them to interrogate their data, draw insights from it, and make decisions.

In computing, online analytical processing, or OLAP, is an approach to quickly answer multi-dimensional analytical (MDA) queries. OLAP is part of the broader category of business intelligence, which also encompasses relational databases, report writing and data mining. Typical applications of OLAP include business reporting for sales, marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting and similar areas, with new applications emerging, such as agriculture.

Dorothy E. Denning: American information security researcher

Dorothy Elizabeth Denning is an American information security researcher known for lattice-based access control (LBAC), intrusion detection systems (IDS), and other cybersecurity innovations. She has published four books and over 200 articles. Inducted into the National Cyber Security Hall of Fame in 2012, she is now Emeritus Distinguished Professor of Defense Analysis at the Naval Postgraduate School.

Essbase is a multidimensional database management system (MDBMS) that provides a platform upon which to build analytic applications. Essbase began as a product from Arbor Software, which merged with Hyperion Software in 1998. Oracle Corporation acquired Hyperion Solutions Corporation in 2007. Until late 2005 IBM also marketed an OEM version of Essbase as DB2 OLAP Server.

Null (SQL): Marker used in SQL databases to indicate a value does not exist

In SQL, null or NULL is a special marker used to indicate that a data value does not exist in the database. Introduced by the creator of the relational database model, E. F. Codd, SQL null serves to fulfil the requirement that all true relational database management systems (RDBMS) support a representation of "missing information and inapplicable information". Codd also introduced the use of the lowercase Greek omega (ω) symbol to represent null in database theory. In SQL, NULL is a reserved word used to identify this marker.

Online transaction processing (OLTP) is a type of database system used in transaction-oriented applications, such as many operational systems. "Online" refers to the fact that such systems are expected to respond to user requests and process them in real time. The term is contrasted with online analytical processing (OLAP), which instead focuses on data analysis.

Query optimization is a feature of many relational database management systems and other databases such as NoSQL and graph databases. The query optimizer attempts to determine the most efficient way to execute a given query by considering the possible query plans.

An entity–attribute–value model (EAV) is a data model optimized for the space-efficient storage of sparse—or ad-hoc—property or data values, intended for situations where runtime usage patterns are arbitrary, subject to user variation, or otherwise unforeseeable using a fixed design. The use-case targets applications which offer a large or rich system of defined property types, which are in turn appropriate to a wide set of entities, but where typically only a small, specific selection of these are instantiated for a given entity. Therefore, this type of data model relates to the mathematical notion of a sparse matrix. EAV is also known as object–attribute–value model, vertical database model, and open schema.
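
A tiny hypothetical illustration of the EAV layout: each row stores one (entity, attribute, value) triple, so only the attributes that actually apply to a given entity are ever stored.

```python
# Hypothetical EAV ("vertical") storage: one (entity, attribute, value) row per fact,
# so attributes that do not apply to an entity are simply never stored.
eav_rows = [
    ("patient-1", "blood_pressure", "120/80"),
    ("patient-1", "allergy", "penicillin"),
    ("patient-2", "blood_pressure", "135/85"),
    # patient-2 has no recorded allergy, so no row exists for that attribute.
]

# Pivot back to a conventional row-per-entity view on demand.
entities = {}
for entity, attribute, value in eav_rows:
    entities.setdefault(entity, {})[attribute] = value

print(entities["patient-2"].get("allergy"))  # None: never instantiated, never stored
```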

Data orientation refers to how tabular data is represented in a linear memory model, whether on disk or in memory. The two most common representations are column-oriented and row-oriented.

Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. It is a non-parametric, density-based clustering algorithm: given a set of points in some space, it groups together points that are closely packed and marks as outliers points that lie alone in low-density regions. DBSCAN is one of the most common, and most commonly cited, clustering algorithms.

An inference attack is a data mining technique performed by analyzing data in order to illegitimately gain knowledge about a subject or database. A subject's sensitive information is considered leaked if an adversary can infer its real value with high confidence. This is an example of breached information security. An inference attack occurs when a user is able to infer more sensitive information about a database from trivial information, without directly accessing it. The object of an inference attack is to piece together information at one security level to determine a fact that should be protected at a higher security level.

Cynthia Dwork: American computer scientist

Cynthia Dwork is an American computer scientist best known for her contributions to cryptography, distributed computing, and algorithmic fairness. She is one of the inventors of differential privacy and proof-of-work.

Differential privacy: Methods of safely sharing general data

Differential privacy (DP) is a mathematically rigorous framework for releasing statistical information about datasets while protecting the privacy of individual data subjects. It enables a data holder to share aggregate patterns of the group while limiting information that is leaked about specific individuals. This is done by injecting carefully calibrated noise into statistical computations such that the utility of the statistic is preserved while provably limiting what can be inferred about any individual in the dataset.

The term in-memory processing is used for two different things:

  1. In computer science, in-memory processing (PIM) is a computer architecture in which data operations are available directly on the data memory, rather than having to be transferred to CPU registers first. This may improve the power usage and performance of moving data between the processor and the main memory.
  2. In software engineering, in-memory processing is a software architecture where a database is kept entirely in random-access memory (RAM) or flash memory so that usual accesses, in particular read or query operations, do not require access to disk storage. This may allow faster data operations such as "joins", and faster reporting and decision-making in business.

A sensor network query processor (SNQP), also called a sensorDB, is a user-friendly interface for programming and running applications; it translates instructions from a high-level declarative programming language into low-level instructions understood by the operating system. The basic idea of an SNQP is to add a layer that models the wireless sensor network (WSN) as a distributed database searchable by a query language similar to SQL.

A reconstruction attack is any method for partially reconstructing a private dataset from public aggregate information. Typically, the dataset contains sensitive information about individuals, whose privacy needs to be protected. The attacker has no or only partial access to the dataset, but has access to public aggregate statistics about the datasets, which could be exact or distorted, for example by adding noise. If the public statistics are not sufficiently distorted, the attacker is able to accurately reconstruct a large portion of the original private data. Reconstruction attacks are relevant to the analysis of private data, as they show that, in order to preserve even a very weak notion of individual privacy, any published statistics need to be sufficiently distorted. This phenomenon was called the Fundamental Law of Information Recovery by Dwork and Roth, and formulated as "overly accurate answers to too many questions will destroy privacy in a spectacular way."

Gautam Das (computer scientist): Indian computer scientist

Gautam Das is a computer scientist in the field of databases research. He is an ACM Fellow and IEEE Fellow.

Click tracking is the collection of user click behavior or user navigational behavior in order to derive insights and fingerprint users. Click behavior is commonly tracked using server logs, which encompass click paths and clicked URLs. Such a log is often presented in a standard format including information like the hostname, date, and username. However, as technology develops, new software allows for in-depth analysis of user click behavior using hypervideo tools. Given that the internet can be considered a risky environment, research strives to understand why users click certain links and not others. Research has also explored the user experience of privacy, for example by anonymizing users' personally identifiable information and improving how data collection consent forms are written and structured.

Apache Pinot: Open-source distributed data store

Apache Pinot is a column-oriented, open-source, distributed data store written in Java. Pinot is designed to execute OLAP queries with low latency. It is suited to contexts where fast analytics, such as aggregations, are needed on immutable data, possibly with real-time data ingestion. The name Pinot comes from the Pinot grape vines that are pressed into liquid that is used to produce a variety of different wines. The founders of the database chose the name as a metaphor for analyzing vast quantities of data from a variety of different file formats or streaming data sources.

References

  1. Dorothy E. Denning, Peter J. Denning, and Mayer D. Schwartz, "The Tracker: A Threat to Statistical Database Security", ACM Transactions on Database Systems (TODS), vol. 4, no. 1 (March 1979), pp. 76–96. doi:10.1145/320064.320069.
  2. Hilton, Michael. "Differential Privacy: A Historical Survey" (PDF). S2CID 16861132. Archived from the original (PDF) on 2017-03-01.
  3. Dwork, Cynthia (2008). "Differential Privacy: A Survey of Results". In Agrawal, Manindra; Du, Dingzhu; Duan, Zhenhua; Li, Angsheng (eds.), Theory and Applications of Models of Computation, Lecture Notes in Computer Science, vol. 4978, Springer Berlin Heidelberg, pp. 1–19. doi:10.1007/978-3-540-79228-4_1. ISBN 9783540792277.

Further reading

An important series of conferences in this field:

Some key papers in this field:

  1. Dorothy E. Denning, "Secure statistical databases with random sample queries", ACM Transactions on Database Systems (TODS), vol. 5, no. 3 (September 1980), pp. 291–315. doi:10.1145/320613.320616.
  2. Wiebren de Jonge, "Compromising statistical databases responding to queries about means", ACM Transactions on Database Systems, vol. 8, no. 1 (March 1983), pp. 60–80. doi:10.1145/319830.319834.
  3. Dorothy E. Denning and Jan Schlörer, "A fast procedure for finding a tracker in a statistical database", ACM Transactions on Database Systems, vol. 5, no. 1 (March 1980), pp. 88–102. doi:10.1145/320128.320138.
  4. A. Shoshani, "Statistical Databases: Characteristics, Problems, and some Solutions", in Proceedings of the 8th International Conference on Very Large Data Bases (VLDB), San Francisco, CA, USA, 1982, pp. 208–222.