VACUUM

Last updated

VACUUM [1] [2] [3] [4] is a set of normative guidance principles for achieving training and test dataset quality for structured datasets in data science and machine learning. The garbage-in, garbage out principle motivates a solution to the problem of data quality but does not offer a specific solution. Unlike the majority of the ad-hoc data quality assessment metrics often used by practitioners [5] VACUUM specifies qualitative principles for data quality management and serves as a basis for defining more detailed quantitative metrics of data quality. [6]

VACUUM is an acronym that stands for:

Related Research Articles

<span class="mw-page-title-main">PostgreSQL</span> Free and open-source relational database management system

PostgreSQL, also known as Postgres, is a free and open-source relational database management system (RDBMS) emphasizing extensibility and SQL compliance. It was originally named POSTGRES, referring to its origins as a successor to the Ingres database developed at the University of California, Berkeley. In 1996, the project was renamed to PostgreSQL to reflect its support for SQL. After a review in 2007, the development team decided to keep the name PostgreSQL and the alias Postgres.

Structured Query Language, abbreviated as SQL, is a domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS), or for stream processing in a relational data stream management system (RDSMS). It is particularly useful in handling structured data, i.e. data incorporating relations among entities and variables.

The impact factor (IF) or journal impact factor (JIF) of an academic journal is a scientometric index calculated by Clarivate that reflects the yearly mean number of citations of articles published in the last two years in a given journal, as indexed by Clarivate's Web of Science.

In the context of software engineering, software quality refers to two related but distinct notions:

An XML database is a data persistence software system that allows data to be specified, and sometimes stored, in XML format. This data can be queried, transformed, exported and returned to a calling system. XML databases are a flavor of document-oriented databases which are in turn a category of NoSQL database.

The following tables compare general and technical information for a number of relational database management systems. Please see the individual products' articles for further information. Unless otherwise specified in footnotes, comparisons are based on the stable versions without any add-ons, extensions or external programs.

In computing, a solution stack or software stack is a set of software subsystems or components needed to create a complete platform such that no additional software is needed to support applications. Applications are said to "run on" or "run on top of" the resulting platform.

The structural similarityindex measure (SSIM) is a method for predicting the perceived quality of digital television and cinematic pictures, as well as other kinds of digital images and videos. SSIM is used for measuring the similarity between two images. The SSIM index is a full reference metric; in other words, the measurement or prediction of image quality is based on an initial uncompressed or distortion-free image as reference.

<span class="mw-page-title-main">EnterpriseDB</span> American software company

EnterpriseDB (EDB), a privately held company based in Massachusetts, provides software and services based on the open-source database PostgreSQL, and is one of the largest contributors to Postgres. EDB develops and integrates performance, security, and manageability enhancements into Postgres to support enterprise-class workloads. EDB has also developed database compatibility for Oracle to facilitate the migration of workloads from Oracle to EDB Postgres and to support the operation of many Oracle workloads on EDB Postgres.

In SQL, a window function or analytic function is a function which uses values from one or multiple rows to return a value for each row. Window functions have an OVER clause; any function without an OVER clause is not a window function, but rather an aggregate or single-row (scalar) function.

In computer science, region-based memory management is a type of memory management in which each allocated object is assigned to a region. A region, also called a zone, arena, area, or memory context, is a collection of allocated objects that can be efficiently reallocated or deallocated all at once. Like stack allocation, regions facilitate allocation and deallocation of memory with low overhead; but they are more flexible, allowing objects to live longer than the stack frame in which they were allocated. In typical implementations, all objects in a region are allocated in a single contiguous range of memory addresses, similarly to how stack frames are typically allocated.

GIS Live DVD is a type of the thematic Live CD containing GIS/RS applications and related tutorials, and sample data sets. The general sense of a GIS Live DVD is to demonstrate the power of FLOSS GIS and encourage users to start on FLOSS GIS. However, a disc can be used for GIS data processing and training, too. A disc usually includes some selected Linux-based or Wine (software)-enabled Windows applications for GIS and Remote Sensing use. Using this disc the end users can execute GIS functions to get experience in free and open source software solutions or solve some simple business operations. The set-up and the operating behaviour of the applications can also be studied prior to building real FLOSS GIS-based systems. Recently a LiveDVD image is stored and booted from USB.

Amazon Relational Database Service is a distributed relational database service by Amazon Web Services (AWS). It is a web service running "in the cloud" designed to simplify the setup, operation, and scaling of a relational database for use in applications. Administration processes like patching the database software, backing up databases and enabling point-in-time recovery are managed automatically. Scaling storage and compute resources can be performed by a single API call to the AWS control plane on-demand. AWS does not offer an SSH connection to the underlying virtual machine as part of the managed service.

A cloud database is a database that typically runs on a cloud computing platform and access to the database is provided as-a-service. There are two common deployment models: users can run databases on the cloud independently, using a virtual machine image, or they can purchase access to a database service, maintained by a cloud database provider. Of the databases available on the cloud, some are SQL-based and some use a NoSQL data model.

<span class="mw-page-title-main">Apache Drill</span> Open-source software framework

Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Built chiefly by contributions from developers from MapR, Drill is inspired by Google's Dremel system, also productized as BigQuery. Drill is an Apache top-level project. Tom Shiran is the founder of the Apache Drill Project. It was designated an Apache Software Foundation top-level project in December 2016.

GeoSPARQL is a standard for representation and querying of geospatial linked data for the Semantic Web from the Open Geospatial Consortium (OGC). The definition of a small ontology based on well-understood OGC standards is intended to provide a standardized exchange basis for geospatial RDF data which can support both qualitative and quantitative spatial reasoning and querying with the SPARQL database query language.

Data blending is a process whereby big data from multiple sources are merged into a single data warehouse or data set. It concerns not merely the merging of different file formats or disparate sources of data but also different varieties of data. Data blending allows business analysts to cope with the expansion of data that they need to make critical business decisions based on good quality business intelligence.

Video Multimethod Assessment Fusion (VMAF) is an objective full-reference video quality metric developed by Netflix in cooperation with the University of Southern California, The IPI/LS2N lab Nantes Université, and the Laboratory for Image and Video Engineering (LIVE) at The University of Texas at Austin. It predicts subjective video quality based on a reference and distorted video sequence. The metric can be used to evaluate the quality of different video codecs, encoders, encoding settings, or transmission variants.

TimescaleDB is an open-source time series database developed by Timescale Inc. It is written in C and extends PostgreSQL. TimescaleDB supports standard SQL queries and is a relational database. Additional SQL functions and table structures provide support for time series data oriented towards storage, performance, and analysis facilities for data-at-scale. Performance characteristics have been compared to InfluxDB. Time-based data partitioning provides for improved query execution and performance when used for time oriented applications. More granular partition definition is achieved through the use of user defined attributes.

<span class="mw-page-title-main">YugabyteDB</span> Transactional distributed SQL database

YugabyteDB is a high-performance transactional distributed SQL database for cloud-native applications, developed by Yugabyte.

References

  1. https://books.google.com/books?id=XPBbEAAAQBAJ&q=VACUUM
  2. "VACUUM". www.enterprisedb.com. Retrieved 2021-04-27.
  3. Jim Nasby (2015), All the Dirt on VACUUM, PGCon - PostgreSQL Conference for Users and Developers, Andrea Ross, retrieved 2021-04-27
  4. "An Overview of VACUUM Processing in PostgreSQL". Severalnines. 2019-11-22. Retrieved 2021-04-27.
  5. Pipino, Leo L.; Lee, Yang W.; Wang, Richard Y. (2002-04-01). "Data quality assessment". Communications of the ACM. 45 (4): 211–218. doi:10.1145/505248.506010. ISSN   0001-0782. S2CID   426050.
  6. Wang, R.Y.; Storey, V.C.; Firth, C.P. (August 1995). "A framework for analysis of data quality research". IEEE Transactions on Knowledge and Data Engineering. 7 (4): 623–640. doi:10.1109/69.404034.