Lawrence C. Rafsky

Last updated
Larry Rafsky
Born
Lawrence C. Rafsky

Alma mater Princeton University (A.B.), Yale University (P.h.D.)
Occupation(s)Data Scientist, Entrepreneur
Known forFriedman-Rafsky Test, FAME
Relatives Robert Rafsky (brother)
AwardsTheory and Methods Award, American Statistical Association Lifetime Achievement Award Nominee, Software and Information Industry Association

Lawrence C. Rafsky (Larry Rafsky), is an American data scientist, inventor, and entrepreneur. Rafsky created search algorithms and methodologies for the financial and news information industries. He is co-inventor of the Friedman-Rafsky Test commonly used to test goodness-of-fit for the multivariate normal distribution. [1] Rafsky founded and became chief scientist for Acquire Media, a news and information syndication company, now a subsidiary of Moody's. [2]

Contents

Research

Rafsky invented the Friedman-Rafsky Test, along with Jerome H. Friedman, now a fundamental procedure in multivariate data. This Multivariate normality test checks a given set of data for goodness-of-fit to the multivariate normal distribution. The null hypothesis is that the data set is a sample from the normal distribution, therefore a sufficiently small p-value indicates non-normal data. In 1981, Rafsky outlined this algorithm in a study published by the Journal of the American Statistical Association. [3] Rafsky and Friedman qualified their test in a 1983 publication in collaboration with Stanford University and Gemnet Software. [1] The two asserted that interpoint-distance-based graphs are capable of being used to define measures of association that extend Kendall's notion of a correlized co-efficient. For this research, Rafsky was granted the Theory and Methods Award from the American Statistical Association [2] in 1981. Rafsky also holds ten US patents, focused on the syndication, delivery, and aggregation of news content. [4] [5] [6] [7] He has a scholar h-index of 15, a g-index of 37, and an Erdos number of 3.

Business

Gemnet Software and FAME

In 1982, Rafsky founded Gemnet, the creator of the FAME (Forecasting Analysis and Modeling Environment) time series database.[ citation needed ] Rafsky established Gemnet's operations initially in Ann Arbor, Michigan, but was later moved following Citicorp's acquisition of the company. The first version of the software was delivered to Harris Bank in 1983, with a focus on creating a time series-oriented database engine and the 4GL scripting language. [8] This would eventually become known as the FAME model, or Forecasting Analysis and Modeling Environment. In 1984, the company was bought by Citicorp, which led a series of new developments with the FAME program until selling off the unit to private equity firm Warburg Pincus in 1994. [8]

Later ventures

In the 1990s, Rafsky research and management positions at Bell Labs, Citicorp, the IDD Information Services, and ADP. Rafsky worked with future Baidu founder and chief executive officer Robin Li, who helped pioneer early search engine algorithms. [9] IDD, a division of Dow Jones focused on financial news and information's relation to securities pricings. Rafsky partnered with ADP and Townsend-Greenspan, the consulting firm of US economist Alan Greenspan.[ citation needed ] In 1998, Rafsky founded Gari Software, which he later sold to Wavephore Labs. [10] He subsequently formed Acquire Media, a digital content syndication company in 2001. He retired from full-time work at Acquire Media in 2018, and currently serves the firm part-time doing research in Machine Learning, Natural Language Processing, and statistical modeling of the business news ecosystem. [11] He also serves as Director of Research for my4, an organization dedicated to investment analytics focused on ESG - Environmental, social and corporate governance.

Personal life

Rafsky was born in Philadelphia. He is the brother of Robert Rafsky, author and AIDS rights activist known for his televised confrontation with Bill Clinton during the 1992 Presidential Election. [12] [13] Having spent most of his life living in New Jersey, he now resides in Jupiter, Florida.

Awards

Rafsky received the Theory and Methods award from the American Statistical Association following his work in the development of the Friedman-Rafsky Test. He was also a finalist for the Lifetime Achievement Award from the Software and Information Industry Association (SIIA).

Related Research Articles

<span class="mw-page-title-main">Kolmogorov–Smirnov test</span> Non-parametric statistical test between two distributions

In statistics, the Kolmogorov–Smirnov test is a nonparametric test of the equality of continuous, one-dimensional probability distributions that can be used to test whether a sample came from a given reference probability distribution, or to test whether two samples came from the same distribution. Intuitively, the test provides a method to qualitatively answer the question "How likely is it that we would see a collection of samples like this if they were drawn from that probability distribution?" or, in the second case, "How likely is it that we would see two sets of samples like this if they were drawn from the same probability distribution?". It is named after Andrey Kolmogorov and Nikolai Smirnov.

Multivariate statistics is a subdivision of statistics encompassing the simultaneous observation and analysis of more than one outcome variable, i.e., multivariate random variables. Multivariate statistics concerns understanding the different aims and background of each of the different forms of multivariate analysis, and how they relate to each other. The practical application of multivariate statistics to a particular problem may involve several types of univariate and multivariate analyses in order to understand the relationships between variables and their relevance to the problem being studied.

Psychological statistics is application of formulas, theorems, numbers and laws to psychology. Statistical methods for psychology include development and application statistical theory and methods for modeling psychological data. These methods include psychometrics, factor analysis, experimental designs, and Bayesian statistics. The article also discusses journals in the same field.

Data mining is the process of extracting and discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. Data mining is an interdisciplinary subfield of computer science and statistics with an overall goal of extracting information from a data set and transforming the information into a comprehensible structure for further use. Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD. Aside from the raw analysis step, it also involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.

<span class="mw-page-title-main">Multivariate normal distribution</span> Generalization of the one-dimensional normal distribution to higher dimensions

In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions. One definition is that a random vector is said to be k-variate normally distributed if every linear combination of its k components has a univariate normal distribution. Its importance derives mainly from the multivariate central limit theorem. The multivariate normal distribution is often used to describe, at least approximately, any set of (possibly) correlated real-valued random variables, each of which clusters around a mean value.

Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data, and thus perform tasks without explicit instructions. Recently, artificial neural networks have been able to surpass many previous approaches in performance.

<span class="mw-page-title-main">John Tukey</span> American mathematician

John Wilder Tukey was an American mathematician and statistician, best known for the development of the fast Fourier Transform (FFT) algorithm and box plot. The Tukey range test, the Tukey lambda distribution, the Tukey test of additivity, and the Teichmüller–Tukey lemma all bear his name. He is also credited with coining the term bit and the first published use of the word software.

<span class="mw-page-title-main">Density estimation</span> Estimate of an unobservable underlying probability density function

In statistics, probability density estimation or simply density estimation is the construction of an estimate, based on observed data, of an unobservable underlying probability density function. The unobservable density function is thought of as the density according to which a large population is distributed; the data are usually thought of as a random sample from that population.

<span class="mw-page-title-main">Linear discriminant analysis</span> Method used in statistics, pattern recognition, and other fields

Linear discriminant analysis (LDA), normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher's linear discriminant, a method used in statistics and other fields, to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.

Projection pursuit (PP) is a type of statistical technique that involves finding the most "interesting" possible projections in multidimensional data. Often, projections that deviate more from a normal distribution are considered to be more interesting. As each projection is found, the data are reduced by removing the component along that projection, and the process is repeated to find new projections; this is the "pursuit" aspect that motivated the technique known as matching pursuit.

Nonparametric regression is a category of regression analysis in which the predictor does not take a predetermined form but is constructed according to information derived from the data. That is, no parametric form is assumed for the relationship between predictors and dependent variable. Nonparametric regression requires larger sample sizes than regression based on parametric models because the data must supply the model structure as well as the model estimates.

<span class="mw-page-title-main">Robin Li</span> Chinese software engineer and billionaire internet entrepreneur (born 1968)

Robin Li Yanhong is a Chinese software engineer and billionaire internet entrepreneur who is the co-founder and chief executive officer of Chinese multinational technology company Baidu. As of June 2023, his net worth was estimated at US$8.6 billion by Forbes.

In statistics, confirmatory factor analysis (CFA) is a special form of factor analysis, most commonly used in social science research. It is used to test whether measures of a construct are consistent with a researcher's understanding of the nature of that construct. As such, the objective of confirmatory factor analysis is to test whether the data fit a hypothesized measurement model. This hypothesized model is based on theory and/or previous analytic research. CFA was first developed by Jöreskog (1969) and has built upon and replaced older methods of analyzing construct validity such as the MTMM Matrix as described in Campbell & Fiske (1959).

In statistics, multivariate adaptive regression splines (MARS) is a form of regression analysis introduced by Jerome H. Friedman in 1991. It is a non-parametric regression technique and can be seen as an extension of linear models that automatically models nonlinearities and interactions between variables.

<span class="mw-page-title-main">Theodore Wilbur Anderson</span> American mathematician and statistician

Theodore Wilbur Anderson was an American mathematician and statistician who specialized in the analysis of multivariate data. He was born in Minneapolis, Minnesota. He was on the faculty of Columbia University from 1946 until moving to Stanford University in 1967, becoming Emeritus Professor in 1988. He served as Editor of Annals of Mathematical Statistics from 1950 to 1952. He was elected President of the Institute of Mathematical Statistics in 1962.

<span class="mw-page-title-main">Computer algebra</span> Scientific area at the interface between computer science and mathematics

In mathematics and computer science, computer algebra, also called symbolic computation or algebraic computation, is a scientific area that refers to the study and development of algorithms and software for manipulating mathematical expressions and other mathematical objects. Although computer algebra could be considered a subfield of scientific computing, they are generally considered as distinct fields because scientific computing is usually based on numerical computation with approximate floating point numbers, while symbolic computation emphasizes exact computation with expressions containing variables that have no given value and are manipulated as symbols.

Jerome Harold Friedman is an American statistician, consultant and Professor of Statistics at Stanford University, known for his contributions in the field of statistics and data mining.

Feature engineering is a preprocessing step in supervised machine learning and statistical modeling which transforms raw data into a more effective set of inputs. Each input comprises several attributes, known as features. By providing models with relevant information, feature engineering significantly enhances their predictive accuracy and decision-making capability.

<span class="mw-page-title-main">Homoscedasticity and heteroscedasticity</span> Statistical property

In statistics, a sequence of random variables is homoscedastic if all its random variables have the same finite variance; this is also known as homogeneity of variance. The complementary notion is called heteroscedasticity, also known as heterogeneity of variance. The spellings homoskedasticity and heteroskedasticity are also frequently used. Skedasticity comes from the Ancient Greek word skedánnymi, meaning “to scatter”. Assuming a variable is homoscedastic when in reality it is heteroscedastic results in unbiased but inefficient point estimates and in biased estimates of standard errors, and may result in overestimating the goodness of fit as measured by the Pearson coefficient.

References

  1. 1 2 Friedman, Jerome H.; Rafsky, Lawrence C. (1983-06-01). "Graph-Theoretic Measures of Multivariate Association and Prediction". The Annals of Statistics. 11 (2): 377–391. doi: 10.1214/aos/1176346148 . ISSN   0090-5364.
  2. 1 2 "Lifetime Award". www.siia.net. Retrieved 2017-04-10.
  3. Friedman, Jerome H.; Rafsky, Lawrence C. (1981-06-01). "Graphics for the Multivariate Two-Sample Problem". Journal of the American Statistical Association. 76 (374): 277–287. doi:10.1080/01621459.1981.10477643. ISSN   0162-1459. OSTI   1447031.
  4. US 20120259846,Rafsky, Lawrence C.&Donchez, Thomas B.,"Method for selecting a subset of content sources from a collection of content sources",published Oct 11, 2012
  5. YS 20120243534,Rafsky, Lawrence C.; Ungar, Robert E.& Donchez, Thomas B.,"Method and system for pacing, acking, timing, and handicapping (path) for simultaneous receipt of documents",published Sep 27, 2012
  6. US 20160239494,Rafsky, Lawrence C.; Marshall, Jonathan Alan& Sun, Raymond,"Determining and maintaining a list of news stories from news feeds most relevant to a topic",published Aug 18, 2016
  7. US 9313149,Rafsky, Lawrence C.; Ungar, Robert E.& Donchez, Thomas B.,"Method and system tracking and determining public sentiment relating to social media",published Apr 12, 2016
  8. 1 2 Hallman, Jeff (2015-07-12), fame: Interface for FAME Time Series Database , retrieved 2017-04-10
  9. Barboza, David (2010-01-13). "Baidu's Gain May Be China's Loss if Google Departs". The New York Times. ISSN   0362-4331 . Retrieved 2017-04-10.
  10. Hane, Paula J. (1999-06-14). "WavePhore Changes its Name and Introduces NewsPak" . Retrieved 2017-04-10.
  11. "Lawrence C. Rafsky: Executive Profile & Biography - Bloomberg". www.bloomberg.com. Retrieved 2017-04-10.
  12. Mathews, Jay; Mathews, Jay (1993-02-23). "ROBERT RAFSKY, WRITER AND ACTIVIST IN AIDS FIGHT, DIES". The Washington Post. ISSN   0190-8286 . Retrieved 2017-04-10.
  13. Howe, Marvine (1993-02-23). "Robert Rafsky, 47, Media Coordinator For AIDS Protesters". The New York Times. ISSN   0362-4331 . Retrieved 2017-04-10.