Arthur Zimek

Nationality: German
Alma mater: Ludwig-Maximilians-Universität München
Scientific career
Fields: outlier detection, correlation clustering
Institutions: University of Southern Denmark, University of Alberta, Ludwig-Maximilians-Universität München
Doctoral advisor: Hans-Peter Kriegel

Arthur Zimek is a professor in data mining, data science and machine learning at the University of Southern Denmark in Odense, Denmark.

He graduated from the Ludwig Maximilian University of Munich in Munich, Germany, where he worked with Prof. Hans-Peter Kriegel. [1] His dissertation on "Correlation Clustering" was awarded the "SIGKDD Doctoral Dissertation Award 2009 Runner-up" [2] by the Association for Computing Machinery.

He is well known [3] for his work on outlier detection, [4] [5] density-based clustering, [6] correlation clustering, [7] [8] and the curse of dimensionality. [9] [10]

He is one of the founders and core developers of the open-source ELKI data mining framework. [11] [12]

Related Research Articles

Outlier: Observation far apart from others in statistics and data science

In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement, may indicate novel data, or may be the result of experimental error; outliers of the last kind are sometimes excluded from the data set. An outlier can indicate an exciting possibility, but can also cause serious problems in statistical analyses.

Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension. Working in high-dimensional spaces can be undesirable for many reasons; raw data are often sparse as a consequence of the curse of dimensionality, and analyzing the data is usually computationally intractable. Dimensionality reduction is common in fields that deal with large numbers of observations and/or large numbers of variables, such as signal processing, speech recognition, neuroinformatics, and bioinformatics.

Cluster analysis: Grouping a set of objects by similarity

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups (clusters). It is a main task of exploratory data analysis, and a common technique for statistical data analysis, used in many fields, including pattern recognition, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning.

The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience. The expression was coined by Richard E. Bellman when considering problems in dynamic programming. The curse generally refers to issues that arise when the number of datapoints is small relative to the intrinsic dimension of the data.
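One of the phenomena usually grouped under this name is distance concentration: as the dimension grows, the nearest and farthest points from a query tend to lie at almost the same distance. The following is a minimal sketch of that effect, assuming uniformly random data in the unit cube; the function name `relative_contrast` and its parameters are illustrative and not taken from any cited work.

```python
import math
import random

def relative_contrast(dim, n_points=100, seed=0):
    """Return (d_max - d_min) / d_min for distances from a random query point.

    Data and query are drawn uniformly from the unit cube [0, 1]^dim
    (an assumption made purely for illustration).
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

# Compare the contrast in a low-dimensional and a high-dimensional space.
for dim in (2, 500):
    print(dim, relative_contrast(dim))
```

A small contrast means that "nearest" and "farthest" neighbors are hard to tell apart, which is one reason distance-based methods can degrade in high-dimensional data.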

R-tree: Data structures used in spatial indexing

R-trees are tree data structures used for spatial access methods, i.e., for indexing multi-dimensional information such as geographical coordinates, rectangles or polygons. The R-tree was proposed by Antonin Guttman in 1984 and has found significant use in both theoretical and applied contexts. A common real-world usage for an R-tree might be to store spatial objects such as restaurant locations or the polygons that typical maps are made of: streets, buildings, outlines of lakes, coastlines, etc. and then find answers quickly to queries such as "Find all museums within 2 km of my current location", "retrieve all road segments within 2 km of my location" or "find the nearest gas station". The R-tree can also accelerate nearest neighbor search for various distance metrics, including great-circle distance.

Parallel coordinates: Chart displaying multivariate data

Parallel coordinates are a common way of visualizing and analyzing high-dimensional datasets.

In statistics, the k-nearest neighbors algorithm (k-NN) is a non-parametric supervised learning method first developed by Evelyn Fix and Joseph Hodges in 1951, and later expanded by Thomas Cover. It is used for classification and regression. In both cases, the input consists of the k closest training examples in a data set. The output depends on whether k-NN is used for classification or regression.
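As a minimal sketch of the classification case, the snippet below predicts a label by majority vote among the k closest training examples. Euclidean distance, the function name `knn_classify`, and the toy data are assumptions chosen for illustration.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest neighbors.

    `train` is a list of (point, label) pairs; points are equal-length tuples.
    """
    # Keep the k training examples closest to the query (Euclidean distance).
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Count the labels of those neighbors and return the most frequent one.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups labeled "a" and "b".
train = [
    ((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"),
    ((5.0, 5.0), "b"), ((5.1, 4.9), "b"), ((4.8, 5.2), "b"),
]
print(knn_classify(train, (0.3, 0.3)))
print(knn_classify(train, (4.5, 4.5)))
```

For regression, the same neighbor search would typically be followed by averaging the neighbors' values instead of voting.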

In data analysis, anomaly detection is generally understood to be the identification of rare items, events or observations which deviate significantly from the majority of the data and do not conform to a well defined notion of normal behaviour. Such examples may arouse suspicions of being generated by a different mechanism, or appear inconsistent with the remainder of that set of data.

DBSCAN: Density-based data clustering algorithm

Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. It is a non-parametric, density-based clustering algorithm: given a set of points in some space, it groups together points that are closely packed, marking as outliers points that lie alone in low-density regions. DBSCAN is one of the most common, and most commonly cited, clustering algorithms.
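The grouping idea can be sketched in a few lines. This is a simplified illustration, not a reference implementation: it uses a brute-force neighbor search rather than a spatial index, and the parameter names `eps` (neighborhood radius) and `min_pts` (minimum neighborhood size for a core point) are assumed here for readability.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def region_query(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(i)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1  # point i is a core point: start a new cluster
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(j)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is also core: keep expanding
    return labels

# Hypothetical toy data: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))
```

In this toy data set the two squares form separate clusters and the isolated point is labeled as noise.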

Clustering is the problem of partitioning data points into groups based on their similarity. Correlation clustering provides a method for clustering a set of objects into the optimum number of clusters without specifying that number in advance.

OPTICS algorithm: Algorithm for finding density based clusters in spatial data

Ordering points to identify the clustering structure (OPTICS) is an algorithm for finding density-based clusters in spatial data. It was presented by Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel and Jörg Sander. Its basic idea is similar to DBSCAN, but it addresses one of DBSCAN's major weaknesses: the problem of detecting meaningful clusters in data of varying density. To do so, the points of the database are (linearly) ordered such that spatially closest points become neighbors in the ordering. Additionally, a special distance is stored for each point that represents the density that must be accepted for a cluster so that both points belong to the same cluster. This is represented as a dendrogram.

Clustering high-dimensional data is the cluster analysis of data with anywhere from a few dozen to many thousands of dimensions. Such high-dimensional spaces of data are often encountered in areas such as medicine, where DNA microarray technology can produce many measurements at once, and the clustering of text documents, where, if a word-frequency vector is used, the number of dimensions equals the size of the vocabulary.

ELKI: Data mining framework

ELKI is a data mining software framework developed for use in research and teaching. It was originally developed at the database systems research unit of Professor Hans-Peter Kriegel at the Ludwig Maximilian University of Munich, Germany, and is now continued at the Technical University of Dortmund, Germany. It aims at allowing the development and evaluation of advanced data mining algorithms and their interaction with database index structures.

Local outlier factor

In anomaly detection, the local outlier factor (LOF) is an algorithm proposed by Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng and Jörg Sander in 2000 for finding anomalous data points by measuring the local deviation of a given data point with respect to its neighbours.

Hans-Peter Kriegel is a German computer scientist and professor at the Ludwig Maximilian University of Munich, where he leads the Database Systems Group in the Department of Computer Science. He was previously a professor at the University of Würzburg and the University of Bremen, after his habilitation at the Technical University of Dortmund and his doctorate from Karlsruhe Institute of Technology.

AMiner is a free online service used to index, search, and mine big scientific data.

Massive Online Analysis (MOA) is a free open-source software project specific for data stream mining with concept drift. It is written in Java and developed at the University of Waikato, New Zealand.

In mathematics and statistics, random projection is a technique used to reduce the dimensionality of a set of points which lie in Euclidean space. According to theoretical results, random projection preserves distances well, but empirical results are sparse. Random projection methods have been applied to many natural language tasks under the name random indexing.
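A minimal sketch of one common variant, Gaussian random projection, is shown below. The function name `random_projection`, the choice of Gaussian entries scaled by 1/sqrt(k), and the synthetic data are assumptions made for illustration; other random matrix constructions exist.

```python
import math
import random

def random_projection(points, k, seed=0):
    """Project d-dimensional points to k dimensions with a random Gaussian matrix."""
    rng = random.Random(seed)
    d = len(points[0])
    # Entries are N(0, 1) scaled by 1/sqrt(k), so squared lengths are
    # preserved in expectation.
    matrix = [
        [rng.gauss(0.0, 1.0) / math.sqrt(k) for _ in range(d)]
        for _ in range(k)
    ]
    # Each projected coordinate is the dot product of a matrix row and the point.
    return [[sum(r * x for r, x in zip(row, p)) for row in matrix] for p in points]

# Hypothetical data: four random points in 500 dimensions, reduced to 100.
data_rng = random.Random(1)
data = [[data_rng.random() for _ in range(500)] for _ in range(4)]
low = random_projection(data, k=100)

# Ratio of a pairwise distance after and before projection; values near 1
# indicate that the distance was approximately preserved.
ratio = math.dist(low[0], low[1]) / math.dist(data[0], data[1])
print(round(ratio, 2))
```

The ratio varies with the random seed and with k; larger target dimensions generally give ratios closer to 1.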

Author name disambiguation

Author name disambiguation is a type of disambiguation and record linkage applied to the names of individual people. The process could, for example, distinguish individuals with the name "John Smith".

Gautam Das (computer scientist): Indian computer scientist

Gautam Das is a computer scientist in the field of databases research. He is an ACM Fellow and IEEE Fellow.

References

  1. SIGKDD News. "SIGKDD Awards: 2015 SIGKDD Innovation Award: Hans-Peter Kriegel". www.kdd.org. Retrieved 2017-05-29. "with his team members Peer Kroeger, Erich Schubert and Arthur Zimek"
  2. "SIGKDD Doctoral Dissertation Award". ACM SIGKDD. Archived from the original on 2010-11-29. Retrieved 2010-05-30.
  3. E.g. Aggarwal, Charu C. (2016-12-10). Outlier analysis. Springer. pp. 49pp. ISBN 9783319475783. OCLC 967215852.
  4. Kriegel, Hans-Peter; Schubert, Matthias; Zimek, Arthur (2008). "Angle-based outlier detection in high-dimensional data". Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. KDD '08. New York, NY, USA: ACM. pp. 444–452. CiteSeerX 10.1.1.329.7579. doi:10.1145/1401890.1401946. ISBN 9781605581934. S2CID 3072058.
  5. Kriegel, Hans-Peter; Kröger, Peer; Schubert, Erich; Zimek, Arthur (2009). "LoOP". Proceedings of the 18th ACM conference on Information and knowledge management. CIKM '09. New York, NY, USA: ACM. pp. 1649–1652. doi:10.1145/1645953.1646195. ISBN 9781605585123. S2CID 14401236.
  6. Kriegel, Hans-Peter; Kröger, Peer; Sander, Jörg; Zimek, Arthur (2011-04-05). "Density-based clustering". Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 1 (3): 231–240. doi:10.1002/widm.30. S2CID 36920706.
  7. Böhm, Christian; Kailing, Karin; Kröger, Peer; Zimek, Arthur (2004). "Computing Clusters of Correlation Connected objects". Proceedings of the 2004 ACM SIGMOD international conference on Management of data. SIGMOD '04. New York, NY, USA: ACM. pp. 455–466. CiteSeerX 10.1.1.5.1279. doi:10.1145/1007568.1007620. ISBN 978-1581138597. S2CID 6411037.
  8. Achtert, E.; Böhm, C.; David, J.; Kröger, P.; Zimek, A. (2008-04-24). Proceedings of the 2008 SIAM International Conference on Data Mining. Proceedings. Society for Industrial and Applied Mathematics. pp. 763–774. doi:10.1137/1.9781611972788.69. ISBN 9780898716542.
  9. Zimek, Arthur; Schubert, Erich; Kriegel, Hans-Peter (2012-08-27). "A survey on unsupervised outlier detection in high-dimensional numerical data". Statistical Analysis and Data Mining. 5 (5): 5. doi:10.1002/sam.11161. S2CID 6724536.
  10. Houle, Michael E.; Kriegel, Hans-Peter; Kröger, Peer; Schubert, Erich; Zimek, Arthur (2010-06-30). "Can Shared-Neighbor Distances Defeat the Curse of Dimensionality?". Scientific and Statistical Database Management. Lecture Notes in Computer Science. Vol. 6187. Springer, Berlin, Heidelberg. pp. 482–500. CiteSeerX 10.1.1.378.3285. doi:10.1007/978-3-642-13818-8_34. ISBN 978-3-642-13817-1.
  11. Achtert, Elke; Kriegel, Hans-Peter; Zimek, Arthur (2008-07-09). "ELKI: A Software System for Evaluation of Subspace Clustering Algorithms". Scientific and Statistical Database Management. Lecture Notes in Computer Science. Vol. 5069. Springer, Berlin, Heidelberg. pp. 580–585. CiteSeerX 10.1.1.144.3263. doi:10.1007/978-3-540-69497-7_41. ISBN 978-3-540-69476-2.
  12. "The ELKI Team". elki-project.github.io. Retrieved 2017-05-29.