Peter Rousseeuw

Peter J. Rousseeuw
Peter Rousseeuw in 2022
Born: 13 October 1956 (age 67), Wilrijk, Belgium
Nationality: Belgian
Education: Vrije Universiteit Brussel; ETH Zurich
Scientific career
Fields: Statistics
Institutions: Delft University of Technology; University of Fribourg; University of Antwerp; Renaissance Technologies; KU Leuven
Thesis: New Infinitesimal Methods in Robust Statistics (1981)
Doctoral advisors: Frank Hampel; Jean Haezendonck
Doctoral students: Mia Hubert

Peter J. Rousseeuw (born 13 October 1956) is a Belgian statistician known for his work on robust statistics and cluster analysis. He obtained his PhD in 1981 at the Vrije Universiteit Brussel, following research carried out at ETH Zurich, which led to a book on influence functions. [1] He was later a professor at the Delft University of Technology, The Netherlands, at the University of Fribourg, Switzerland, and at the University of Antwerp, Belgium, and then a senior researcher at Renaissance Technologies. He subsequently returned to Belgium as a professor at KU Leuven [2] [3] until becoming emeritus in 2022. His former PhD students include Annick Leroy, Hendrik Lopuhaä, Geert Molenberghs, Christophe Croux, Mia Hubert, Stefan Van Aelst, Tim Verdonck and Jakob Raymaekers. [4]

Research

Rousseeuw has constructed and published many useful techniques. [3] [5] [6] He proposed the Least Trimmed Squares method [7] [8] [9] and S-estimators [10] for robust regression, which can resist outliers in the data.

He also introduced the Minimum Volume Ellipsoid and Minimum Covariance Determinant methods [11] [12] for robust scatter matrices. This work led to his book Robust Regression and Outlier Detection with Annick Leroy.

With Leonard Kaufman he coined the term medoid when proposing the k-medoids method [13] [14] for cluster analysis, also known as Partitioning Around Medoids (PAM). His silhouette display [15] shows the result of a cluster analysis, and the corresponding silhouette coefficient is often used to select the number of clusters. The work on cluster analysis led to a book titled Finding Groups in Data. [16] Rousseeuw was the original developer of the R package cluster along with Mia Hubert and Anja Struyf. [17]

The Rousseeuw–Croux scale estimator [18] is an efficient alternative to the median absolute deviation (see robust measures of scale).

With Ida Ruts and John Tukey he introduced the bagplot, [19] a bivariate generalization of the boxplot.

His more recent work has focused on concepts and algorithms for statistical depth functions in the settings of multivariate, regression [20] and functional data, and on robust principal component analysis. [21] His current research is on visualization of classification [22] [23] and cellwise outliers. [24] [25]

Recognition

Rousseeuw was elected Member of the International Statistical Institute (1991), Fellow of the Institute of Mathematical Statistics (1993), and Fellow of the American Statistical Association (1994). His 1984 paper on robust regression [7] has been reprinted in Breakthroughs in Statistics, [26] which collected and annotated the 60 most influential papers in statistics from 1850 to 1990. He became an ISI highly cited researcher in 2003, and was awarded the Jack Youden Prize (2018) and the Frank Wilcoxon Prize (2021). In 2024, he received the Gottfried E. Noether Distinguished Scholar Award of the American Statistical Association.

Creation of the Rousseeuw Prize for Statistics

From 2016 onward Peter Rousseeuw worked on creating a new biennial prize, sponsored by him. [27] The goal of the prize is to recognize outstanding statistical innovations with impact on society, and to promote awareness of the important role and intellectual content of statistics and its profound impact on human endeavors. The award amount is 1 million US dollars, similar to the Nobel Prize in other fields. The first award in 2022 went to the topic of Causal Inference in Medicine and Public Health. It was presented by His Majesty King Philippe of Belgium to the laureates James Robins, Andrea Rotnitzky, Thomas Richardson, Miguel Hernán and Eric Tchetgen Tchetgen.

Related Research Articles

Interquartile range: Measure of statistical dispersion

In descriptive statistics, the interquartile range (IQR) is a measure of statistical dispersion, which is the spread of the data. The IQR may also be called the midspread, middle 50%, fourth spread, or H‑spread. It is defined as the difference between the 75th and 25th percentiles of the data. To calculate the IQR, the data set is divided into quartiles, or four rank-ordered even parts via linear interpolation. These quartiles are denoted by Q1 (also called the lower quartile), Q2 (the median), and Q3 (also called the upper quartile). The lower quartile corresponds with the 25th percentile and the upper quartile corresponds with the 75th percentile, so IQR = Q3 − Q1.
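The definition above can be illustrated with a small sketch. It assumes one common linear-interpolation convention (fractional position `p * (n - 1)` in the sorted sample); other quantile conventions exist and give slightly different values. The helper names `quantile_linear` and `iqr` are hypothetical.

```python
import math

def quantile_linear(sorted_data, p):
    """p-th quantile (0 <= p <= 1), interpolating linearly between order statistics.

    Assumption: the fractional position is p * (n - 1); this is only one of
    several conventions in use.
    """
    pos = p * (len(sorted_data) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_data[lo] * (1 - frac) + sorted_data[hi] * frac

def iqr(data):
    """Interquartile range: Q3 - Q1."""
    s = sorted(data)
    return quantile_linear(s, 0.75) - quantile_linear(s, 0.25)

sample = [7, 1, 3, 9, 5, 2, 8, 4, 6]
contaminated = [7, 1, 3, 900, 5, 2, 8, 4, 6]   # largest value replaced by an outlier

print(iqr(sample), iqr(contaminated))
```

In this example the IQR is unchanged when the largest value is replaced by an extreme one, because only the middle half of the ordered sample enters the calculation.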

Median: Middle quantile of a data set or probability distribution

The median of a set of numbers is the value separating the higher half from the lower half of a data sample, a population, or a probability distribution. For a data set, it may be thought of as the "middle" value. The basic feature of the median in describing data compared to the mean is that it is not skewed by a small proportion of extremely large or small values, and therefore provides a better representation of the center. Median income, for example, may be a better way to describe the center of the income distribution because increases in the largest incomes alone have no effect on the median. For this reason, the median is of central importance in robust statistics.

Outlier: Observation far apart from others in statistics and data science

In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement, an indication of novel data, or the result of experimental error; the latter are sometimes excluded from the data set. An outlier can indicate an exciting possibility, but can also cause serious problems in statistical analyses.

Chemometrics is the science of extracting information from chemical systems by data-driven means. Chemometrics is inherently interdisciplinary, using methods frequently employed in core data-analytic disciplines such as multivariate statistics, applied mathematics, and computer science, in order to address problems in chemistry, biochemistry, medicine, biology and chemical engineering. In this way, it mirrors other interdisciplinary fields, such as psychometrics and econometrics.

Ridge regression is a method of estimating the coefficients of multiple-regression models in scenarios where the independent variables are highly correlated. It has been used in many fields including econometrics, chemistry, and engineering. Also known as Tikhonov regularization, named for Andrey Tikhonov, it is a method of regularization of ill-posed problems. It is particularly useful to mitigate the problem of multicollinearity in linear regression, which commonly occurs in models with large numbers of parameters. In general, the method provides improved efficiency in parameter estimation problems in exchange for a tolerable amount of bias.

In robust statistics, robust regression seeks to overcome some limitations of traditional regression analysis. A regression analysis models the relationship between one or more independent variables and a dependent variable. Standard types of regression, such as ordinary least squares, have favourable properties if their underlying assumptions are true, but can give misleading results otherwise. Robust regression methods are designed to limit the effect that violations of assumptions by the underlying data-generating process have on regression estimates.

Robust statistics are statistics that maintain their properties even if the underlying distributional assumptions are incorrect. Robust statistical methods have been developed for many common problems, such as estimating location, scale, and regression parameters. One motivation is to produce statistical methods that are not unduly affected by outliers. Another motivation is to provide methods with good performance when there are small departures from a parametric distribution. For example, robust methods work well for mixtures of two normal distributions with different standard deviations; under this model, non-robust methods like a t-test work poorly.

The k-medoids problem is a clustering problem similar to k-means. The name was coined by Leonard Kaufman and Peter J. Rousseeuw with their PAM algorithm. Both the k-means and k-medoids algorithms are partitional and attempt to minimize the distance between points labeled to be in a cluster and a point designated as the center of that cluster. In contrast to the k-means algorithm, k-medoids chooses actual data points as centers, and thereby allows for greater interpretability of the cluster centers than in k-means, where the center of a cluster is not necessarily one of the input data points. Furthermore, k-medoids can be used with arbitrary dissimilarity measures, whereas k-means generally requires Euclidean distance for efficient solutions. Because k-medoids minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances, it is more robust to noise and outliers than k-means.
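To make the k-medoids objective concrete, here is a minimal sketch that solves the problem by exhaustive search on a tiny data set. This is not the PAM algorithm of Kaufman and Rousseeuw. It simply tries every set of k data points as medoids and keeps the set with the smallest total dissimilarity, which is only feasible for very small inputs. The function name and the Manhattan dissimilarity are illustrative assumptions.

```python
from itertools import combinations

def k_medoids_exhaustive(points, k, dissim):
    """Exact k-medoids by brute force (illustrative only, NOT PAM).

    Returns (medoids, labels, cost), where labels[i] is the index in `points`
    of the medoid assigned to points[i].
    """
    best_cost, best_medoids = float("inf"), None
    for candidate in combinations(range(len(points)), k):
        # Each point contributes its dissimilarity to the nearest candidate medoid.
        cost = sum(min(dissim(p, points[m]) for m in candidate) for p in points)
        if cost < best_cost:
            best_cost, best_medoids = cost, candidate
    labels = [min(best_medoids, key=lambda m: dissim(p, points[m])) for p in points]
    return [points[m] for m in best_medoids], labels, best_cost

def manhattan(a, b):
    """Any dissimilarity works here; Manhattan distance is just an example."""
    return sum(abs(x - y) for x, y in zip(a, b))

points = [(1, 1), (2, 1), (3, 1), (10, 5), (11, 5), (12, 5)]
medoids, labels, cost = k_medoids_exhaustive(points, 2, manhattan)
print(medoids, labels, cost)
```

The returned medoids are actual rows of the input, which is the interpretability property described above, and the dissimilarity function can be swapped without touching the search.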

In statistics, the median absolute deviation (MAD) is a robust measure of the variability of a univariate sample of quantitative data. It can also refer to the population parameter that is estimated by the MAD calculated from a sample.
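A minimal sketch of the sample MAD, computed as the median of absolute deviations from the sample median. No consistency factor is applied here; the helper name `mad` is hypothetical.

```python
from statistics import median, pstdev

def mad(data):
    """Median absolute deviation: median of |x - median(data)|."""
    center = median(data)
    return median([abs(x - center) for x in data])

clean = [1, 2, 3, 4, 5]
contaminated = [1, 2, 3, 4, 100]   # one gross outlier

print(mad(clean), mad(contaminated))
print(pstdev(clean), pstdev(contaminated))
```

In this example the MAD stays at 1 after contamination, while the standard deviation grows by more than an order of magnitude.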

Silhouette refers to a method of interpretation and validation of consistency within clusters of data. The technique provides a succinct graphical representation of how well each object has been classified. It was proposed by Belgian statistician Peter Rousseeuw in 1987.
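The silhouette value of an object compares a, its mean dissimilarity to the other members of its own cluster, with b, the smallest mean dissimilarity to any other cluster: s = (b - a) / max(a, b). The sketch below computes these values for a given labeling. It assumes at least two clusters, sets s to 0 for singleton clusters, and uses hypothetical function names.

```python
def silhouette_values(points, labels, dissim):
    """Silhouette value s(i) for every point; assumes at least two clusters."""
    n = len(points)
    values = []
    for i in range(n):
        # Accumulate dissimilarities from point i to every cluster, excluding i itself.
        sums, counts = {}, {}
        for j in range(n):
            if j == i:
                continue
            c = labels[j]
            sums[c] = sums.get(c, 0.0) + dissim(points[i], points[j])
            counts[c] = counts.get(c, 0) + 1
        own = labels[i]
        if own not in counts:          # singleton cluster: s(i) is set to 0
            values.append(0.0)
            continue
        a = sums[own] / counts[own]
        b = min(sums[c] / counts[c] for c in counts if c != own)
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values

def abs_diff(x, y):
    return abs(x - y)

points = [0, 1, 10, 11]
good = silhouette_values(points, [0, 0, 1, 1], abs_diff)
bad = silhouette_values(points, [0, 1, 0, 1], abs_diff)
print(sum(good) / len(good), sum(bad) / len(bad))
```

Averaging the values gives a silhouette coefficient for the whole clustering. In this example the natural grouping scores close to 1 and the scrambled labeling scores below 0.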

The Nelson–Aalen estimator is a non-parametric estimator of the cumulative hazard rate function in case of censored data or incomplete data. It is used in survival theory, reliability engineering and life insurance to estimate the cumulative number of expected events. An "event" can be the failure of a non-repairable component, the death of a human being, or any occurrence for which the experimental unit remains in the "failed" state from the point at which it changed onward. The estimator is given by H(t) = Σ d_i / n_i, where the sum runs over the event times t_i ≤ t, d_i is the number of events at t_i, and n_i is the number of individuals at risk just prior to t_i.
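As a rough illustration, the cumulative hazard can be accumulated over the distinct event times, each contributing the number of events at that time divided by the number of units still at risk. The sketch assumes right-censored data given as durations plus an event indicator; the function and argument names are hypothetical.

```python
def nelson_aalen(durations, observed):
    """Return [(t_i, H(t_i))] at each distinct event time.

    durations[i] is the follow-up time of unit i; observed[i] is True if the
    event occurred at that time and False if the unit was censored.
    """
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    cumulative, curve = 0.0, []
    for t in event_times:
        d = sum(1 for u, e in zip(durations, observed) if e and u == t)  # events at t
        n = sum(1 for u in durations if u >= t)                          # at risk just before t
        cumulative += d / n
        curve.append((t, cumulative))
    return curve

durations = [1, 2, 2, 3, 4]
observed = [True, True, False, True, False]
curve = nelson_aalen(durations, observed)
print(curve)
```

Censored units never add to the numerator, but they remain in the risk set until their censoring time.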

In statistics, robust measures of scale are methods that quantify the statistical dispersion in a sample of numerical data while resisting outliers. The most common such robust statistics are the interquartile range (IQR) and the median absolute deviation (MAD). These are contrasted with conventional or non-robust measures of scale, such as sample standard deviation, which are greatly influenced by outliers.

Least trimmed squares (LTS), or least trimmed sum of squares, is a robust statistical method that fits a function to a set of data whilst not being unduly affected by the presence of outliers. It is one of a number of methods for robust regression.
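A minimal sketch of the idea for a straight-line fit: LTS minimizes the sum of the h smallest squared residuals, which is equivalent to finding the subset of h observations whose ordinary least-squares fit has the smallest residual sum of squares. The code below does this by brute force over all subsets, so it is only usable on tiny data sets; practical implementations rely on faster approximate algorithms. The function names and the choice of h are illustrative assumptions, and the x values are assumed to be distinct.

```python
from itertools import combinations

def ols_line(xs, ys):
    """Ordinary least-squares slope and intercept (assumes xs are not all equal)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def lts_line_exhaustive(xs, ys, h):
    """Exact LTS line by trying every h-subset (illustrative, exponential cost)."""
    best = None
    for subset in combinations(range(len(xs)), h):
        sx = [xs[i] for i in subset]
        sy = [ys[i] for i in subset]
        slope, intercept = ols_line(sx, sy)
        ssr = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(sx, sy))
        if best is None or ssr < best[0]:
            best = (ssr, slope, intercept)
    return best[1], best[2]

xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [1, 3, 5, 7, 9, 11, 40, -20]      # y = 2x + 1 with two gross outliers
lts_slope, lts_intercept = lts_line_exhaustive(xs, ys, h=6)
ols_slope, ols_intercept = ols_line(xs, ys)
print(lts_slope, lts_intercept, ols_slope)
```

With h = 6 the two outlying points are trimmed and the fit recovers the underlying line, whereas the ordinary least-squares slope on all eight points is pulled far from 2.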

Theil–Sen estimator: Statistical method for fitting a line

In non-parametric statistics, the Theil–Sen estimator is a method for robustly fitting a line to sample points in the plane by choosing the median of the slopes of all lines through pairs of points. It has also been called Sen's slope estimator, slope selection, the single median method, the Kendall robust line-fit method, and the Kendall–Theil robust line. It is named after Henri Theil and Pranab K. Sen, who published papers on this method in 1950 and 1968 respectively, and after Maurice Kendall because of its relation to the Kendall tau rank correlation coefficient.
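The median-of-pairwise-slopes rule can be sketched in a few lines. The slope follows the description above. The intercept, taken here as the median of y - slope * x, is one common choice and is an assumption of this sketch, as is the function name.

```python
from itertools import combinations
from statistics import median

def theil_sen(xs, ys):
    """Slope = median of slopes over all pairs of points with distinct x."""
    slopes = [
        (ys[j] - ys[i]) / (xs[j] - xs[i])
        for i, j in combinations(range(len(xs)), 2)
        if xs[i] != xs[j]
    ]
    slope = median(slopes)
    # Assumption: intercept chosen as the median of the offsets y - slope * x.
    intercept = median([y - slope * x for x, y in zip(xs, ys)])
    return slope, intercept

xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [1, 3, 5, 7, 9, 11, 40, -20]      # y = 2x + 1 with two gross outliers
slope, intercept = theil_sen(xs, ys)
print(slope, intercept)
```

Evaluating all pairs costs quadratic time, which is fine for a sketch but not for large samples.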

Pranab Kumar Sen was an Indian-American statistician who was a professor of statistics and the Cary C. Boshamer Professor of Biostatistics at the University of North Carolina at Chapel Hill.

Influential observation: Observation that would cause a large change if deleted

In statistics, an influential observation is an observation for a statistical calculation whose deletion from the dataset would noticeably change the result of the calculation. In particular, in regression analysis an influential observation is one whose deletion has a large effect on the parameter estimates.

In robust statistics, repeated median regression, also known as the repeated median estimator, is a robust linear regression algorithm. The estimator has a breakdown point of 50%. Although it is equivariant under scaling, or under linear transformations of either its explanatory variable or its response variable, it is not equivariant under affine transformations that combine both variables. It can be calculated in O(n^2) time by brute force, in O(n log^2 n) time using more sophisticated techniques, or in O(n log n) randomized expected time. It may also be calculated using an on-line algorithm with O(n) update time.
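A brute-force sketch of the repeated median line, assuming the usual definition: for each point take the median of the slopes to all other points, then take the median of those per-point medians. The intercept is taken as the median of y - slope * x, which is an assumption of this sketch. The function name is hypothetical, and the x values are assumed to be distinct.

```python
from statistics import median

def repeated_median(xs, ys):
    """Brute-force repeated median line (assumes all x values are distinct)."""
    n = len(xs)
    per_point = []
    for i in range(n):
        # Inner median: slopes from point i to every other point.
        slopes_i = [(ys[j] - ys[i]) / (xs[j] - xs[i]) for j in range(n) if j != i]
        per_point.append(median(slopes_i))
    slope = median(per_point)           # outer median over all points
    intercept = median([y - slope * x for x, y in zip(xs, ys)])
    return slope, intercept

xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [1, 3, 5, 7, 9, 11, 40, -20]      # y = 2x + 1 with two gross outliers
slope, intercept = repeated_median(xs, ys)
print(slope, intercept)
```

The nested medians are what make the estimate resistant: a single bad point can distort at most its own inner median and one slope in each of the others.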

Colin Lingwood Mallows was an English statistician who worked in the United States from 1960. He was known for Mallows's Cp, a regression model diagnostic procedure widely used in regression analysis, and for the Fowlkes–Mallows index, a popular clustering validation criterion.

Mia Hubert is a Belgian mathematical statistician known for her research on topics in robust statistics including medoid-based clustering, regression depth, the medcouple for robustly measuring skewness, box plots for skewed data, and robust principal component analysis, and for her implementations of robust statistical algorithms in the R statistical software system, MATLAB, and S-PLUS. She is a professor in the statistics and data science section of the department of mathematics at KU Leuven.

Robust Regression and Outlier Detection is a book on robust statistics, particularly focusing on the breakdown point of methods for robust regression. It was written by Peter Rousseeuw and Annick M. Leroy, and published in 1987 by Wiley.

References

  1. Hampel, Frank; Ronchetti, Elvezio; Rousseeuw, Peter J.; Stahel, Werner (1986). Robust statistics: the approach based on influence functions. New York: Wiley. doi:10.1002/9781118186435. ISBN 978-0-471-73577-9.
  2. "KU Leuven who's who - Peter Rousseeuw". KU Leuven. Retrieved 21 December 2015.
  3. "ROBUST@Leuven – Departement Wiskunde KU Leuven". KU Leuven. Retrieved 21 December 2015.
  4. "Peter Rousseeuw". The Mathematics Genealogy Project.
  5. "Peter Rousseeuw". Google Scholar. Retrieved 21 December 2015.
  6. "Peter Rousseeuw". ResearchGate. Retrieved 6 November 2022.
  7. Rousseeuw, Peter J. (1984). "Least Median of Squares Regression". Journal of the American Statistical Association. 79 (388): 871–880. CiteSeerX 10.1.1.464.928. doi:10.1080/01621459.1984.10477105.
  8. Rousseeuw, Peter J.; Van Driessen, Katrien (2006). "Computing LTS Regression for Large Data Sets". Data Mining and Knowledge Discovery. 12 (1): 29–45. doi:10.1007/s10618-005-0024-4. S2CID 207113006.
  9. Rousseeuw, Peter J.; Leroy, Annick M. (1987). Robust Regression and Outlier Detection (3. print. ed.). New York: Wiley. doi:10.1002/0471725382. ISBN 978-0-471-85233-9.
  10. Rousseeuw, P.; Yohai, V. (1984). "Robust Regression by Means of S-Estimators". Robust and Nonlinear Time Series Analysis. Lecture Notes in Statistics. Vol. 26. pp. 256–272. doi:10.1007/978-1-4615-7821-5_15. ISBN 978-0-387-96102-6.
  11. Rousseeuw, Peter J.; van Zomeren, Bert C. (1990). "Unmasking Multivariate Outliers and Leverage Points". Journal of the American Statistical Association. 85 (411): 633–639. doi:10.1080/01621459.1990.10474920.
  12. Rousseeuw, Peter J.; Van Driessen, Katrien (1999). "A Fast Algorithm for the Minimum Covariance Determinant Estimator". Technometrics. 41 (3): 212–223. doi:10.1080/00401706.1999.10485670.
  13. Kaufman, L.; Rousseeuw, P.J. (1987). "Clustering by means of Medoids". Statistical Data Analysis Based on the L1–Norm and Related Methods, edited by Y. Dodge, North-Holland: 405–416.
  14. Kaufman, Leonard; Rousseeuw, Peter J. (1990). Finding groups in data: an introduction to cluster analysis. New York: Wiley. doi:10.1002/9780470316801. ISBN 978-0-471-87876-6.
  15. Rousseeuw, Peter J. (1987). "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis". Journal of Computational and Applied Mathematics. 20: 53–65. doi:10.1016/0377-0427(87)90125-7.
  16. Kaufman, Leonard; Rousseeuw, Peter J. (1990). Finding groups in data: an introduction to cluster analysis. New York: Wiley. doi:10.1002/9780470316801. ISBN 978-0-471-87876-6.
  17. cluster: "Finding Groups in Data": Cluster Analysis Extended Rousseeuw et al., 2021-04-17, retrieved 2021-05-27
  18. Rousseeuw, Peter J.; Croux, Christophe (1993). "Alternatives to the Median Absolute Deviation". Journal of the American Statistical Association. 88 (424): 1273. doi:10.2307/2291267. JSTOR 2291267.
  19. Rousseeuw, Peter J.; Ruts, Ida; Tukey, John W. (1999). "The bagplot: a bivariate boxplot". The American Statistician. 53 (4): 382–387. doi:10.1080/00031305.1999.10474494.
  20. Rousseeuw, Peter J.; Hubert, Mia (1999). "Regression Depth". Journal of the American Statistical Association. 94 (446): 388. doi:10.2307/2670155. JSTOR 2670155.
  21. Hubert, Mia; Rousseeuw, Peter J; Vanden Branden, Karlien (2005). "ROBPCA: A New Approach to Robust Principal Component Analysis". Technometrics. 47 (1): 64–79. doi:10.1198/004017004000000563. S2CID 5071469.
  22. Raymaekers, Jakob; Rousseeuw, Peter J.; Hubert, Mia (2022). "Class Maps for Visualizing Classification Results". Technometrics. 64 (2): 151–165. arXiv:2007.14495. doi:10.1080/00401706.2021.1927849. eISSN 1537-2723. ISSN 0040-1706.
  23. Raymaekers, Jakob; Rousseeuw, Peter J. (4 April 2022). "Silhouettes and Quasi Residual Plots for Neural Nets and Tree-based Classifiers". Journal of Computational and Graphical Statistics. 31 (4): 1332–1343. arXiv:2106.08814. doi:10.1080/10618600.2022.2050249. eISSN 1537-2715. ISSN 1061-8600.
  24. Rousseeuw, Peter J.; Van Den Bossche, Wannes (2018). "Detecting Deviating Data Cells". Technometrics. 60 (2): 135–145. arXiv:1601.07251. doi:10.1080/00401706.2017.1340909. eISSN 1537-2723. ISSN 0040-1706.
  25. Raymaekers, Jakob; Rousseeuw, Peter J. (2021). "Fast Robust Correlation for High-Dimensional Data". Technometrics. 63 (2): 184–198. arXiv:1712.05151. doi:10.1080/00401706.2019.1677270. eISSN 1537-2723. ISSN 0040-1706.
  26. Kotz, Samuel; Johnson, Norman (1992). Breakthroughs in Statistics. Vol. III. New York: Springer. doi:10.1007/978-1-4612-0667-5. ISBN 978-0-387-94988-8.
  27. "The Rousseeuw Prize for Statistics". Rousseeuw Prize. Retrieved 1 November 2022.