Skyline operator

Last updated

The skyline operator is the subject of an optimization problem and computes the Pareto optimum on tuples with multiple dimensions.

Contents

This operator is an extension to SQL proposed by Börzsönyi et al. [1] to filter results from a database to keep only those objects that are not worse in multiple dimensions than any other. The name skyline comes from the view on Manhattan from the Hudson River, where those buildings can be seen that are not hidden by any other. A building is visible if it is not dominated by a building that is taller or closer to the river (two dimensions, distance to the river minimized, height maximized). Another application of the skyline operator involves selecting a hotel for a holiday. The user wants the hotel to be both cheap and close to the beach. However, hotels that are close to the beach may also be expensive. In this case, the skyline operator would only present those hotels that are not worse than any other hotel in both price and distance to the beach.

Formal specification

The skyline operator returns tuples that are not dominated by any other tuple. A tuple dominates another if it is at least as good in all dimensions and better in at least one dimension. Formally, we can think of each tuple as a vector . dominates (written: ) if is at least as good as in every dimension, and superior in at least one: [2]

Dominance () can be defined as any strict partial ordering, for example greater (with and ) or less (with and ).

Assuming two dimensions and defining dominance in both dimensions as greater, we can compute the skyline in SQL-92 as follows:

WITHtuples(id,i,j)as(values(1,1,1),(1,2,1),(1,1,2))SELECT*FROMtuplest1WHERENOTEXISTS(-- which is not dominated bySELECT*FROMtuplest2-- a tuple that isWHEREt2.i>=t1.iandt2.j>=t1.j-- at least as good in all dimensionsAND(t2.i>t1.iort2.j>t1.j)-- and better in at least one dimension);

Proposed syntax

As an extension to SQL, Börzsönyi et al. [1] proposed the following syntax for the skyline operator:

SELECT...FROM...WHERE...GROUPBY...HAVING...SKYLINEOF[DISTINCT]d1[MIN|MAX|DIFF],...,dm[MIN|MAX|DIFF]ORDERBY...

where d1, ... dm denote the dimensions of the skyline and MIN, MAX and DIFF specify whether the value in that dimension should be minimised, maximised or simply be different.

Without an SQL extension, the SQL query requires an antijoin with not exists:

SELECT...FROM(...)qWHERENOTEXISTS(SELECT*FROM(...)pWHEREp.d1[<=|>=]q.d1AND...ANDp.dm[<=|>=]q.dmAND(p.d1[<|>]q.d1OR...ORp.dm[<|>]q.dm))

Implementation

The skyline operator can be implemented directly in SQL using current SQL constructs, but this has been shown to be very slow in disk-based database systems. [1] Other algorithms have been proposed that make use of divide and conquer, indices, [1] MapReduce [3] and general-purpose computing on graphics cards. [4] Skyline queries on data streams (i.e. continuous skyline queries) have been studied in the context of parallel query processing on multicores, owing to their wide diffusion in real-time decision making problems and data streaming analytics. [5]

See also

Related Research Articles

The relational model (RM) is an approach to managing data using a structure and language consistent with first-order predicate logic, first described in 1969 by English computer scientist Edgar F. Codd, where all data is represented in terms of tuples, grouped into relations. A database organized in terms of the relational model is a relational database.

In database theory, relational algebra is a theory that uses algebraic structures for modeling data, and defining queries on it with a well founded semantics. The theory was introduced by Edgar F. Codd.

Online analytical processing, or OLAP, is an approach to answer multi-dimensional analytical (MDA) queries swiftly in computing. OLAP is part of the broader category of business intelligence, which also encompasses relational databases, report writing and data mining. Typical applications of OLAP include business reporting for sales, marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting and similar areas, with new applications emerging, such as agriculture.

The SQL SELECT statement returns a result set of records, from one or more tables.

Descriptive complexity is a branch of computational complexity theory and of finite model theory that characterizes complexity classes by the type of logic needed to express the languages in them. For example, PH, the union of all complexity classes in the polynomial hierarchy, is precisely the class of languages expressible by statements of second-order logic. This connection between complexity and the logic of finite structures allows results to be transferred easily from one area to the other, facilitating new proof methods and providing additional evidence that the main complexity classes are somehow "natural" and not tied to the specific abstract machines used to define them.

In mathematical economics, the Arrow–Debreu model is a theoretical general equilibrium model. It posits that under certain economic assumptions there must be a set of prices such that aggregate supplies will equal aggregate demands for every commodity in the economy.

<span class="mw-page-title-main">Position (geometry)</span> Vector representing the position of a point with respect to a fixed origin

In geometry, a position or position vector, also known as location vector or radius vector, is a Euclidean vector that represents a point P in space. Its length represents the distance in relation to an arbitrary reference origin O and its direction represents the angular orientation with respect to given reference axes. Usually denoted x, r, or s, it corresponds to the straight line segment from O to P. In other words, it is the displacement or translation that maps the origin to P:

In computer science, an interval tree is a tree data structure to hold intervals. Specifically, it allows one to efficiently find all intervals that overlap with any given interval or point. It is often used for windowing queries, for instance, to find all roads on a computerized map inside a rectangular viewport, or to find all visible elements inside a three-dimensional scene. A similar data structure is the segment tree.

Finite model theory is a subarea of model theory. Model theory is the branch of logic which deals with the relation between a formal language (syntax) and its interpretations (semantics). Finite model theory is a restriction of model theory to interpretations on finite structures, which have a finite universe.

Multidimensional Expressions (MDX) is a query language for online analytical processing (OLAP) using a database management system. Much like SQL, it is a query language for OLAP cubes. It is also a calculation language, with syntax similar to spreadsheet formulae.

Semidefinite programming (SDP) is a subfield of mathematical programming concerned with the optimization of a linear objective function over the intersection of the cone of positive semidefinite matrices with an affine space, i.e., a spectrahedron.

Gadfly is a relational database management system written in Python. Gadfly is a collection of Python modules that provides relational database functionality entirely implemented in Python. It supports a subset of the standard RDBMS Structured Query Language (SQL).

In computer science, locality-sensitive hashing (LSH) is a fuzzy hashing technique that hashes similar input items into the same "buckets" with high probability. Since similar items end up in the same buckets, this technique can be used for data clustering and nearest neighbor search. It differs from conventional hashing techniques in that hash collisions are maximized, not minimized. Alternatively, the technique can be seen as a way to reduce the dimensionality of high-dimensional data; high-dimensional input items can be reduced to low-dimensional versions while preserving relative distances between items.

Tagsistant is a semantic file system for the Linux kernel, written in C and based on FUSE. Unlike traditional file systems that use hierarchies of directories to locate objects, Tagsistant introduces the concept of tags.

In database theory, a conjunctive query is a restricted form of first-order queries using the logical conjunction operator. Many first-order queries can be written as conjunctive queries. In particular, a large part of queries issued on relational databases can be expressed in this way. Conjunctive queries also have a number of desirable theoretical properties that larger classes of queries do not share.

In cooperative game theory and social choice theory, the Nakamura number measures the degree of rationality of preference aggregation rules, such as voting rules. It is an indicator of the extent to which an aggregation rule can yield well-defined choices.

In database theory and systems, a monotonic query is one that does not lose any tuples it previously made output, with the addition of new tuples in the database. Formally, a query q over a schema R is monotonic if and only if for every two instances I, J of R, (q must be a monotonic function).

In economics, the Debreu's theorems are preference representation theorems—statements about the representation of a preference ordering by a real-valued utility function. The theorems were proved by Gerard Debreu during the 1950s.

In mathematical logic, fixed-point logics are extensions of classical predicate logic that have been introduced to express recursion. Their development has been motivated by descriptive complexity theory and their relationship to database query languages, in particular to Datalog.

The PH-tree is a tree data structure used for spatial indexing of multi-dimensional data (keys) such as geographical coordinates, points, feature vectors, rectangles or bounding boxes. The PH-tree is space partitioning index with a structure similar to that of a quadtree or octree. However, unlike quadtrees, it uses a splitting policy based on tries and similar to Crit bit trees that is based on the bit-representation of the keys. The bit-based splitting policy, when combined with the use of different internal representations for nodes, provides scalability with high-dimensional data. The bit-representation splitting policy also imposes a maximum depth, thus avoiding degenerated trees and the need for rebalancing.

References

  1. 1 2 3 4 Borzsonyi, Stephan; Kossmann, Donald; Stocker, Konrad (2001). "The Skyline operator". Proceedings 17th International Conference on Data Engineering. pp. 421–430. doi:10.1109/ICDE.2001.914855. ISBN   0-7695-1001-9. S2CID   5812098.
  2. Maximilian E. Schüle, Alex Kulikov, Alfons Kemper, Thomas Neumann (2020). "ARTful Skyline Computation for In-Memory Database Systems". New Trends in Databases and Information Systems - ADBIS 2020 Short Papers, Lyon, France, August 25-27, 2020, Proceedings. pp. 3–12. doi:10.1007/978-3-030-54623-6_1. ISBN   978-3-030-54622-9.{{cite book}}: CS1 maint: multiple names: authors list (link)
  3. Mullesgaard, Kasper; Pedersen, Jens Laurits; Lu, Hua; Zhou, Yongluan (2014). "Efficient Skyline Computation in MapReduce" (PDF). Proc. 17th International Conference on Extending Database Technology (EDBT): 37–48.
  4. Bøgh, Kenneth S; Assent, Ira; Magnani, Matteo (2013). "Efficient GPU-based skyline computation". Proceedings of the Ninth International Workshop on Data Management on New Hardware. pp. 5:1–5:6. doi:10.1145/2485278.2485283. ISBN   9781450321969. S2CID   13195757.
  5. De Matteis, Tiziano; Di Girolamo, Salvatore; Mencagli, Gabriele (25 August 2016). "Continuous skyline queries on multicore architectures". Concurrency and Computation: Practice and Experience. 28 (12): 3503–3522. doi:10.1002/cpe.3866. S2CID   6562372.