VACUUM

VACUUM [1][2][3][4][5] is a set of normative guidance principles for achieving the quality of structured training and test datasets in data science and machine learning. The garbage in, garbage out principle motivates attention to data quality but does not prescribe a specific solution. Unlike the majority of ad-hoc data quality assessment metrics often used by practitioners, [6] VACUUM specifies qualitative principles for data quality management and serves as a basis for defining more detailed quantitative metrics of data quality. [7]

VACUUM is an acronym that stands for Valid, Accurate, Consistent, Uniform, Unified Model. [2]
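
As an illustration only, not part of any normative VACUUM specification, the qualitative principles can be mapped onto quantitative checks over a tabular dataset. The Python sketch below uses the pandas library; the dataset, column names, and rules are invented for the example, and the accurate and unified dimensions are omitted because they require an external ground truth or a second data source to compare against.

    import pandas as pd

    # Hypothetical dataset; the values are chosen to trip the checks below.
    df = pd.DataFrame({
        "age":     [34, 29, -1, 41, 29],            # -1 violates the domain rule
        "country": ["US", "us", "DE", "US", "us"],  # mixed casing is non-uniform
    })

    report = {
        # Valid: values conform to domain rules (age must be non-negative).
        "valid": float((df["age"] >= 0).mean()),
        # Consistent: simplified here to the absence of duplicate records.
        "consistent": float(1.0 - df.duplicated().mean()),
        # Uniform: one canonical encoding per categorical value.
        "uniform": float((df["country"] == df["country"].str.upper()).mean()),
    }
    print(report)  # each score in [0, 1]; 1.0 means the check fully passes

Each score is a fraction of rows passing the corresponding check, a deliberately simple stand-in for the more detailed quantitative metrics that the principles are meant to anchor.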

Related Research Articles

<span class="mw-page-title-main">PostgreSQL</span> Free and open-source object relational database management system

PostgreSQL, also known as Postgres, is a free and open-source relational database management system (RDBMS) emphasizing extensibility and SQL compliance. PostgreSQL features transactions with atomicity, consistency, isolation, durability (ACID) properties, automatically updatable views, materialized views, triggers, foreign keys, and stored procedures. It is supported on all major operating systems, including Windows, Linux, macOS, FreeBSD, and OpenBSD, and handles a range of workloads from single machines to data warehouses, data lakes, and web services with many concurrent users.
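
As a brief illustration of the ACID transaction behaviour mentioned above, the following sketch uses the psycopg2 driver for Python; the connection string and the accounts table are hypothetical.

    import psycopg2

    # Hypothetical connection parameters.
    conn = psycopg2.connect("dbname=example user=example")
    try:
        with conn:  # commits on success, rolls back on any exception
            with conn.cursor() as cur:
                cur.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1")
                cur.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 2")
        # Both updates were applied atomically, or neither was (the A in ACID).
    finally:
        conn.close()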

Structured Query Language (SQL) is a domain-specific language used to manage data, especially in a relational database management system (RDBMS). It is particularly useful in handling structured data, i.e., data incorporating relations among entities and variables.
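
A minimal example of SQL managing related entities, here via Python's built-in sqlite3 module (the schema is invented for illustration):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY,
                             customer_id INTEGER REFERENCES customers(id),
                             total REAL);
        INSERT INTO customers VALUES (1, 'Ada');
        INSERT INTO orders VALUES (10, 1, 25.0);
    """)
    # The join expresses the relation between the two entities.
    for row in conn.execute("""SELECT c.name, o.total
                               FROM customers c JOIN orders o
                               ON o.customer_id = c.id"""):
        print(row)  # ('Ada', 25.0)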

<span class="mw-page-title-main">Accuracy and precision</span> Characterization of measurement error

Accuracy and precision are two measures of observational error. Accuracy is how close a given set of measurements are to their true value. Precision is how close the measurements are to each other.
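
A small numeric illustration with invented values: given measurements of a quantity whose true value is known, accuracy can be quantified as the closeness of the mean to the truth and precision as the spread of the measurements.

    from statistics import mean, stdev

    true_value = 10.0
    measurements = [10.8, 10.9, 11.0, 10.9, 10.8]  # tightly clustered, but offset

    accuracy_error = abs(mean(measurements) - true_value)  # systematic offset: 0.88
    precision_spread = stdev(measurements)                 # random spread: ~0.08

    # High precision (small spread) does not imply high accuracy (small offset).
    print(accuracy_error, precision_spread)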

<span class="mw-page-title-main">Digital elevation model</span> 3D computer-generated imagery and measurements of terrain

A digital elevation model (DEM) or digital surface model (DSM) is a 3D computer graphics representation of elevation data to represent terrain or overlaying objects, commonly of a planet, moon, or asteroid. A "global DEM" refers to a discrete global grid. DEMs are often used in geographic information systems (GIS), and are the most common basis for digitally produced relief maps. A digital terrain model (DTM) represents specifically the ground surface, while DEM and DSM may represent tree-top canopy or building roofs.

A temporal database stores data relating to time instances. It offers temporal data types and stores information relating to past, present and future time. Temporal databases can be uni-temporal, bi-temporal or tri-temporal.

In the context of software engineering, software quality refers to two related but distinct notions: functional quality, which reflects how well the software complies with or conforms to a given design based on functional requirements or specifications, and structural quality, which refers to how it meets non-functional requirements that support the delivery of the functional requirements, such as robustness or maintainability.

An XML database is a data persistence software system that allows data to be specified, and sometimes stored, in XML format. This data can be queried, transformed, exported and returned to a calling system. XML databases are a flavor of document-oriented databases which are in turn a category of NoSQL database.
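
A minimal sketch of querying data specified in XML, using Python's standard xml.etree.ElementTree module (the document contents are invented):

    import xml.etree.ElementTree as ET

    doc = ET.fromstring("""
    <catalog>
      <book id="b1"><title>Data Quality</title><year>2002</year></book>
      <book id="b2"><title>Temporal Data</title><year>1999</year></book>
    </catalog>
    """)

    # Query: titles of books published after 2000.
    for book in doc.findall("book"):
        if int(book.findtext("year")) > 2000:
            print(book.findtext("title"))  # Data Quality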

Comparison tables of relational database management systems set out general and technical information for a number of such systems. Unless otherwise specified, such comparisons are based on the stable versions of the systems without any add-ons, extensions or external programs.

Data quality refers to the state of qualitative or quantitative pieces of information. There are many definitions of data quality, but data is generally considered high quality if it is "fit for [its] intended uses in operations, decision making and planning". Moreover, data is deemed of high quality if it correctly represents the real-world construct to which it refers. Furthermore, apart from these definitions, as the number of data sources increases, the question of internal data consistency becomes significant, regardless of fitness for use for any particular external purpose. People's views on data quality can often be in disagreement, even when discussing the same set of data used for the same purpose. When this is the case, data governance is used to form agreed upon definitions and standards for data quality. In such cases, data cleansing, including standardization, may be required in order to ensure data quality.

<span class="mw-page-title-main">Null (SQL)</span> Marker used in SQL databases to indicate a value does not exist

In SQL, null or NULL is a special marker used to indicate that a data value does not exist in the database. Introduced by the creator of the relational database model, E. F. Codd, SQL null serves to fulfil the requirement that all true relational database management systems (RDBMS) support a representation of "missing information and inapplicable information". Codd also introduced the use of the lowercase Greek omega (ω) symbol to represent null in database theory. In SQL, NULL is a reserved word used to identify this marker.
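
The practical consequence is SQL's three-valued logic: comparing anything to NULL yields unknown rather than true, so NULL must be tested with IS NULL. A short demonstration via Python's sqlite3 module (the table is invented):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE people (name TEXT, phone TEXT);
        INSERT INTO people VALUES ('Ada', '555-0100'), ('Bob', NULL);
    """)

    # '= NULL' never matches: the comparison evaluates to unknown, not true.
    print(conn.execute("SELECT name FROM people WHERE phone = NULL").fetchall())   # []
    # 'IS NULL' is the correct test for the missing-value marker.
    print(conn.execute("SELECT name FROM people WHERE phone IS NULL").fetchall())  # [('Bob',)]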

Multi-master replication is a method of database replication which allows data to be stored by a group of computers, and updated by any member of the group. All members are responsive to client data queries. The multi-master replication system is responsible for propagating the data modifications made by each member to the rest of the group and resolving any conflicts that might arise between concurrent changes made by different members.
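
A toy sketch of the conflict-resolution step only (the replication protocol itself is elided): two masters accept concurrent writes to the same key, and a last-writer-wins rule, one common but lossy policy, decides which version propagates. All names and data are invented.

    # Each replica records key -> (value, timestamp, node_id); ties break on node_id.
    def merge(local, remote):
        """Last-writer-wins merge of two replicas' version maps."""
        merged = dict(local)
        for key, version in remote.items():
            if key not in merged or version[1:] > merged[key][1:]:
                merged[key] = version
        return merged

    a = {"profile:42": ("alice@old.example", 100, "node-a")}
    b = {"profile:42": ("alice@new.example", 105, "node-b")}

    # Both masters converge on the same state after exchanging updates.
    assert merge(a, b) == merge(b, a)
    print(merge(a, b))  # the newer write (timestamp 105) wins on both replicas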

Video quality is a characteristic of a video passed through a video transmission or processing system that describes perceived video degradation. Video processing systems may introduce some amount of distortion or artifacts in the video signal that negatively impact the user's perception of the system. For many stakeholders in video production and distribution, ensuring video quality is an important task.

A time series database is a software system that is optimized for storing and serving time series through associated pairs of time(s) and value(s). In some fields, time series may be called profiles, curves, traces or trends. Several early time series databases are associated with industrial applications which could efficiently store measured values from sensory equipment, but they are now used in support of a much wider range of applications. In many cases, the repositories of time-series data will utilize compression algorithms to manage the data efficiently. Although it is possible to store time-series data in many different database types, the design of these systems with time as a key index is distinctly different from relational databases, which reduce discrete relationships through referential models.
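
A sketch of the two ideas named above, time as the primary index and compression of regularly sampled data, with invented values; production systems use far more sophisticated encodings such as delta-of-delta timestamps.

    # A regularly sampled series: (unix_timestamp, value) pairs ordered by time.
    samples = [(1700000000, 21.5), (1700000010, 21.5), (1700000020, 21.7)]

    def delta_encode(series):
        """Keep the first point verbatim, then store only neighbour differences."""
        first = series[0]
        deltas = []
        prev_t, prev_v = first
        for t, v in series[1:]:
            deltas.append((t - prev_t, round(v - prev_v, 6)))  # small, compressible
            prev_t, prev_v = t, v
        return first, deltas

    head, deltas = delta_encode(samples)
    print(head, deltas)  # (1700000000, 21.5) [(10, 0.0), (10, 0.2)]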

The acronyms BAPP and BAMP refer to a set of open-source software programs commonly used together to run dynamic websites or servers. This set forms a solution stack and an open-source web platform.

<span class="mw-page-title-main">GQM</span>

GQM, an initialism for goal, question, metric, is an established goal-oriented approach to software metrics used to improve and measure software quality.
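
A compact illustration of the goal-question-metric decomposition as a plain data structure (the contents are invented):

    # One GQM tree: a goal is refined into questions, each answered by metrics.
    gqm = {
        "goal": "Improve the reliability of the release pipeline",
        "questions": [
            {"question": "How often do builds fail?",
             "metrics": ["build failure rate", "mean time between failures"]},
            {"question": "How quickly are failures repaired?",
             "metrics": ["mean time to repair"]},
        ],
    }

    for q in gqm["questions"]:
        print(q["question"], "->", ", ".join(q["metrics"]))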

Bitemporal modeling is a specific case of temporal database information modeling designed to handle historical data along two different timelines. This makes it possible to rewind the information to "as it actually was" in combination with "as it was recorded" at some point in time. To make this possible, information cannot be discarded even when it is erroneous. In financial reporting, for example, it is often desirable to be able to recreate an old report both as it actually looked at the time of creation and as it should have looked given corrections made to the data after its creation.
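
A minimal sketch of the two timelines: each record carries a valid-time interval (when the fact held in the world) and a transaction-time interval (when the database believed it). Corrections append rows rather than overwrite, so both "as it actually was" and "as it was recorded" stay answerable. All data are invented, and years stand in for full timestamps.

    # Row: (value, valid_from, valid_to, recorded_from, recorded_to); None = open-ended.
    salary_history = [
        # Initially recorded (mistakenly) as 50000 from 2020 onwards...
        ("50000", 2020, None, 2020, 2021),
        # ...superseded in 2021 by a correction: it was 55000 all along.
        ("55000", 2020, None, 2021, None),
    ]

    def as_of(rows, valid_year, recorded_year):
        """Value true at valid_year according to what was on record at recorded_year."""
        for value, vf, vt, rf, rt in rows:
            if (vf <= valid_year and (vt is None or valid_year < vt)
                    and rf <= recorded_year and (rt is None or recorded_year < rt)):
                return value

    print(as_of(salary_history, 2020, 2020))  # '50000': the report as it looked then
    print(as_of(salary_history, 2020, 2022))  # '55000': as it should have looked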

A cloud database is a database that typically runs on a cloud computing platform and access to the database is provided as-a-service. There are two common deployment models: users can run databases on the cloud independently, using a virtual machine image, or they can purchase access to a database service, maintained by a cloud database provider. Of the databases available on the cloud, some are SQL-based and some use a NoSQL data model.

<span class="mw-page-title-main">Apache SINGA</span> Open-source machine learning library

Apache SINGA is an Apache top-level project for developing an open source machine learning library. It provides a flexible architecture for scalable distributed training, is extensible to run over a wide range of hardware, and has a focus on health-care applications.

TimescaleDB is an open-source time series database developed by Timescale Inc. It is written in C and extends PostgreSQL. TimescaleDB is a relational database and supports standard SQL queries. Additional SQL functions and table structures provide support for time series data oriented towards storage, performance, and analysis facilities for data-at-scale.
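
A brief sketch of the extension model: ordinary PostgreSQL DDL plus Timescale's create_hypertable function to make the table time-partitioned. The connection details and schema are invented; consult the TimescaleDB documentation for the current API.

    import psycopg2

    conn = psycopg2.connect("dbname=metrics user=example")  # hypothetical DSN
    with conn, conn.cursor() as cur:
        # An ordinary PostgreSQL table with a timestamp column...
        cur.execute("""
            CREATE TABLE conditions (
                time        TIMESTAMPTZ NOT NULL,
                device_id   TEXT,
                temperature DOUBLE PRECISION
            )
        """)
        # ...turned into a time-partitioned hypertable by the extension.
        cur.execute("SELECT create_hypertable('conditions', 'time')")
    conn.close()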

<span class="mw-page-title-main">YugabyteDB</span> Transactional distributed SQL database

YugabyteDB is a high-performance transactional distributed SQL database for cloud-native applications, developed by Yugabyte.

References

  1. https://books.google.com/books?id=XPBbEAAAQBAJ&q=VACUUM
  2. "The VACUUM Model: Valid, Accurate, Consistent, Uniform, and Unified". archive.is.
  3. "VACUUM". www.enterprisedb.com. Retrieved 2021-04-27.
  4. Jim Nasby (2015), All the Dirt on VACUUM, PGCon - PostgreSQL Conference for Users and Developers, Andrea Ross, retrieved 2021-04-27
  5. "An Overview of VACUUM Processing in PostgreSQL". Severalnines. 2019-11-22. Retrieved 2021-04-27.
  6. Pipino, Leo L.; Lee, Yang W.; Wang, Richard Y. (2002-04-01). "Data quality assessment". Communications of the ACM. 45 (4): 211–218. doi:10.1145/505248.506010. ISSN   0001-0782. S2CID   426050.
  7. Wang, R.Y.; Storey, V.C.; Firth, C.P. (August 1995). "A framework for analysis of data quality research". IEEE Transactions on Knowledge and Data Engineering. 7 (4): 623–640. doi:10.1109/69.404034.