| Larry Rafsky | |
| --- | --- |
| Born | Lawrence C. Rafsky, Philadelphia, Pennsylvania, United States |
| Alma mater | Princeton University (A.B.), Yale University (Ph.D.) |
| Occupation(s) | Data scientist, entrepreneur |
| Known for | Friedman–Rafsky test, FAME |
| Relatives | Robert Rafsky (brother) |
| Awards | Theory and Methods Award, American Statistical Association; Lifetime Achievement Award nominee, Software and Information Industry Association |
Lawrence C. Rafsky (Larry Rafsky) is an American data scientist, inventor, and entrepreneur. Rafsky created search algorithms and methodologies for the financial and news information industries. He is co-inventor of the Friedman–Rafsky test, a multivariate two-sample test commonly used to check goodness of fit, for example to the multivariate normal distribution. [1] Rafsky founded and became chief scientist of Acquire Media, a news and information syndication company, now a subsidiary of Moody's. [2]
Rafsky invented the Friedman–Rafsky test, along with Jerome H. Friedman; it is now a fundamental procedure in multivariate data analysis. Used as a multivariate normality test, it checks a given data set for goodness of fit to the multivariate normal distribution. The null hypothesis is that the data set is a sample from the normal distribution, so a sufficiently small p-value indicates non-normal data. In 1981, Rafsky outlined this algorithm in a study published in the Journal of the American Statistical Association. [3] Rafsky and Friedman extended their test in a 1983 publication in collaboration with Stanford University and Gemnet Software. [1] The two showed that interpoint-distance-based graphs can be used to define measures of association that extend Kendall's notion of a generalized correlation coefficient. For this research, Rafsky was granted the Theory and Methods Award from the American Statistical Association [2] in 1981. Rafsky also holds ten US patents focused on the syndication, delivery, and aggregation of news content. [4] [5] [6] [7] He has a scholar h-index of 15, a g-index of 37, and an Erdős number of 3.
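A short sketch may make the mechanics concrete. The Python code below implements the two-sample form of the statistic under stated assumptions: Euclidean interpoint distances, a single minimal spanning tree over the pooled sample, and a permutation null. The function names `friedman_rafsky_statistic` and `friedman_rafsky_pvalue` are invented for illustration; this is not drawn from any published implementation. To use it as a normality check, one can compare the observed data with a reference sample drawn from the fitted multivariate normal.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def friedman_rafsky_statistic(X, Y):
    """Number of edges in the pooled-sample minimal spanning tree whose
    endpoints come from different samples (the multivariate 'runs' count)."""
    pooled = np.vstack([X, Y])
    labels = np.r_[np.zeros(len(X)), np.ones(len(Y))]
    dists = squareform(pdist(pooled))            # pairwise Euclidean distances
    mst = minimum_spanning_tree(dists).tocoo()   # N - 1 edges over pooled points
    return int(np.sum(labels[mst.row] != labels[mst.col]))

def friedman_rafsky_pvalue(X, Y, n_perm=999, seed=None):
    """Permutation p-value: under the null that X and Y share a distribution,
    cross-sample edges are plentiful, so a small count is evidence against it."""
    rng = np.random.default_rng(seed)
    observed = friedman_rafsky_statistic(X, Y)
    pooled = np.vstack([X, Y])
    m = len(X)
    hits = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))       # relabel points at random
        stat = friedman_rafsky_statistic(pooled[idx[:m]], pooled[idx[m:]])
        hits += stat <= observed
    return (hits + 1) / (n_perm + 1)

# Usage: a mean shift between two trivariate samples should yield a small p-value.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
Y = rng.normal(loc=1.0, size=(40, 3))
print(friedman_rafsky_pvalue(X, Y, n_perm=199, seed=1))
```

For intuition about the baseline: under the permutation null with sample sizes m and n (N = m + n), each of the tree's N − 1 edges joins points from different samples with probability 2mn / (N(N − 1)), so the expected cross-sample count is 2mn / N; observed counts far below this baseline are what the test flags.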
In 1982, Rafsky founded Gemnet, creator of the FAME (Forecasting Analysis and Modeling Environment) time-series database.[ citation needed ] Rafsky initially established Gemnet's operations in Ann Arbor, Michigan; the company later moved following Citicorp's acquisition. The first version of the software was delivered to Harris Bank in 1983, with a focus on a time-series-oriented database engine and a 4GL scripting language. [8] In 1984, the company was bought by Citicorp, which led a series of new developments of the FAME product until selling the unit to the private equity firm Warburg Pincus in 1994. [8]
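FAME's engine and 4GL are proprietary, so the sketch below is purely conceptual: a toy Python illustration of the central idea of a time-series-oriented store, in which each named series carries its own frequency metadata. All names here (`TimeSeriesStore`, `Series`, `RATE_SERIES`) are hypothetical and bear no relation to FAME's actual interfaces.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Series:
    """A named time series carrying its own frequency metadata."""
    frequency: str                               # e.g. "daily", "monthly"
    points: dict = field(default_factory=dict)   # date -> float

class TimeSeriesStore:
    """Toy frequency-aware store; purely illustrative, not FAME."""
    def __init__(self):
        self._series = {}

    def create(self, name, frequency):
        self._series[name] = Series(frequency)

    def set(self, name, when, value):
        self._series[name].points[when] = value

    def window(self, name, start, end):
        """Observations in [start, end], ordered by date."""
        pts = self._series[name].points
        return sorted((d, v) for d, v in pts.items() if start <= d <= end)

# Usage: store two daily observations and read a month-long window back.
store = TimeSeriesStore()
store.create("RATE_SERIES", "daily")             # hypothetical series name
store.set("RATE_SERIES", date(1983, 1, 3), 10.5)
store.set("RATE_SERIES", date(1983, 1, 4), 10.6)
print(store.window("RATE_SERIES", date(1983, 1, 1), date(1983, 1, 31)))
```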
In the 1990s, Rafsky held research and management positions at Bell Labs, Citicorp, IDD Information Services, and ADP. At IDD, a division of Dow Jones focused on the relationship of financial news and information to securities pricing, Rafsky worked with future Baidu founder and chief executive officer Robin Li, who helped pioneer early search engine algorithms. [9] Rafsky also partnered with ADP and Townsend-Greenspan, the consulting firm of US economist Alan Greenspan.[ citation needed ] In 1998, Rafsky founded Gari Software, which he later sold to Wavephore Labs. [10] He subsequently formed Acquire Media, a digital content syndication company, in 2001. He retired from full-time work at Acquire Media in 2018 and currently serves the firm part-time, doing research in machine learning, natural language processing, and statistical modeling of the business news ecosystem. [11] He also serves as Director of Research for my4, an organization dedicated to investment analytics focused on ESG (environmental, social, and corporate governance).
Rafsky was born in Philadelphia. He is the brother of Robert Rafsky, an author and AIDS activist known for his televised confrontation with Bill Clinton during the 1992 presidential election. [12] [13] Having spent most of his life in New Jersey, he now resides in Jupiter, Florida.
Rafsky received the Theory and Methods Award from the American Statistical Association for his work on the development of the Friedman–Rafsky test. He was also a finalist for the Lifetime Achievement Award from the Software and Information Industry Association (SIIA).