RevoScaleR

Last updated
RevoScaleR
Original author(s) Microsoft
Initial release2016;5 years ago (2016)
Written in Python
Platform Windows, Linux
Available inR
Website docs.microsoft.com/en-us/machine-learning-server/r-reference/revoscaler/revoscaler

RevoScaleR is a machine learning package in R created by Microsoft. It is available as part of Machine Learning Server, Microsoft R Client, and Machine Learning Services in Microsoft SQL Server 2016.

Contents

The package contains functions for creating linear model, logistic regression, random forest, decision tree and boosted decision tree, and K-means, in addition to some summary functions for inspecting and visualizing data. [1]

It has a Python package counterpart called revoscalepy. Another closely related package is MicrosoftML, which contains machine learning algorithms that RevoScaleR does not have, such as neural network and SVM.

In June 2021, Microsoft announced to open source the RevoScaleR and revoscalepy packages, making them freely available under the MIT License. [2]

Concepts

Many R packages are designed to analyze data that can fit in the memory of the machine and usually do not make use of parallel processing. RevoScaleR was designed to address these limitations. The functions in RevoScaleR orientate around three main abstraction concepts that users can specify to process large amount of data that might not fit in memory and exploit parallel resources to speed up the analysis.

Compute Contexts

A compute context refers to the location where the computation on the data happens. It could be "local" (on the client machine) or "remote" (on a data platform such as a SQL server, or Spark). Pushing the computation to a remote server allows people to take advantage of the greater compute resources that a remote machine may have. If the data being analyzed reside on the same machine, using a remote compute context also removes the need to pull data across the network onto the client machine. [3]

Data source

Data source defines where the data comes from. There are various data sources available in RevoScaleR, such as text data, Xdf data, in-SQL data, and a spark dataframe. People can wrap their data in a data source object and use that as run analytics in different compute context. Different data sources are available in different compute context. For example, if the compute context is set to SQL server, then the only data source one can use would be an in-SQL data source.

Analytics

Analytic functions in RevoScaleR takes in data source object, a compute context, and the other parameters needed to build the specific model, such as formula for the logistic regression or the number of trees in a decision tree. In addition to those parameters, one can also specify the level of parallelism, such as the size of the data chunk for each process or number of processes to build the model. However, parallelism is only available in non-express edition.

Limitations

The package is mostly meant to be used with a SQL server or other remote machines. To fully leverage the abstractions it uses to process a large dataset, one needs a remote server and non-Express free edition of the package. It cannot be easily installed such as by running "install.packages("RevoScaleR")" like most open source R packages. It's available only through Microsoft R Client, a distribution of R for data science, or Microsoft Machine Learning Server (stand-alone with no SQL server attached), or Microsoft Machine Learning Services (a SQL server services). However, one can still use the analytics functions in an Express, free version of the package.

See also

Related Research Articles

IBM Db2 Family Relational model database server

Db2 is a family of data management products, including database servers, developed by IBM. They initially supported the relational model, but were extended to support object–relational features and non-relational structures like JSON and XML. The brand name was originally styled as DB/2, then DB2 until 2017 and finally changed to its present form.

Online analytical processing, or OLAP, is an approach to answer multi-dimensional analytical (MDA) queries swiftly in computing. OLAP is part of the broader category of business intelligence, which also encompasses relational databases, report writing and data mining. Typical applications of OLAP include business reporting for sales, marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting and similar areas, with new applications emerging, such as agriculture.

R (programming language) Language and environment for statistical computing and graphics

R is a programming language and free software environment for statistical computing and graphics supported by the R Core Team and the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and data analysis. Polls, data mining surveys, and studies of scholarly literature databases show substantial increases in R's popularity; as of July 2021, R ranks 12th in the TIOBE index, a measure of popularity of programming languages.

Manifold System

Manifold System is a geographic information system (GIS) software package developed by Manifold Software Limited that runs on Microsoft Windows. Manifold System handles both vector and raster data, includes spatial SQL, a built-in Internet Map Server (IMS), and other general GIS features.

SAP SQL Anywhere is a proprietary relational database management system (RDBMS) product from SAP. SQL Anywhere was known as Sybase SQL Anywhere prior to the acquisition of Sybase by SAP.

Oracle Data Mining (ODM) is an option of Oracle Database Enterprise Edition. It contains several data mining and data analysis algorithms for classification, prediction, regression, associations, feature selection, anomaly detection, feature extraction, and specialized analytics. It provides means for the creation, management and operational deployment of data mining models inside the database environment.

Microsoft SQL Server is a relational database management system developed by Microsoft. As a database server, it is a software product with the primary function of storing and retrieving data as requested by other software applications—which may run either on the same computer or on another computer across a network. Microsoft markets at least a dozen different editions of Microsoft SQL Server, aimed at different audiences and for workloads ranging from small single-machine applications to large Internet-facing applications with many concurrent users.

An embedded database system is a database management system (DBMS) which is tightly integrated with an application software; it is "embedded in the application". It is actually a broad technology category that includes

The Base One Foundation Component Library (BFC) is a rapid application development toolkit for building secure, fault-tolerant, database applications on Windows and ASP.NET. In conjunction with Microsoft's Visual Studio integrated development environment, BFC provides a general-purpose web application framework for working with databases from Microsoft, Oracle, IBM, Sybase, and MySQL, running under Windows, Linux/Unix, or IBM iSeries or z/OS. BFC also includes facilities for distributed computing, batch processing, queuing, and database command scripting, and these run under Windows or Linux with Wine.

Microsoft Azure, commonly referred to as Azure, is a cloud computing service created by Microsoft for building, testing, deploying, and managing applications and services through Microsoft-managed data centers. It provides software as a service (SaaS), platform as a service (PaaS) and infrastructure as a service (IaaS) and supports many different programming languages, tools, and frameworks, including both Microsoft-specific and third-party software and systems.

Vertica

Vertica Systems is an analytic database management software company. Vertica was founded in 2005 by database researcher Michael Stonebraker and Andrew Palmer. Palmer was the founding CEO. Ralph Breslauer and Christopher P. Lynch served as later CEOs.

XLeratorDB

XLeratorDB is a suite of database function libraries that enable Microsoft SQL Server to perform a wide range of additional (non-native) business intelligence and ad hoc analytics. The libraries, which are embedded and run centrally on the database, include more than 450 individual functions similar to those found in Microsoft Excel spreadsheets. The individual functions are grouped and sold as six separate libraries based on usage: finance, statistics, math, engineering, unit conversions and strings. WestClinTech, the company that developed XLeratorDB, claims it is "the first commercial function package add-in for Microsoft SQL Server."

Revolution Analytics is a statistical software company focused on developing open source and "open-core" versions of the free and open source software R for enterprise, academic and analytics customers. Revolution Analytics was founded in 2007 as REvolution Computing providing support and services for R in a model similar to Red Hat's approach with Linux in the 1990s as well as bolt-on additions for parallel processing. In 2009 the company received nine million in venture capital from Intel along with a private equity firm and named Norman H. Nie as their new CEO. In 2010 the company announced the name change as well as a change in focus. Their core product, Revolution R, would be offered free to academic users and their commercial software would focus on big data, large scale multiprocessor computing, and multi-core functionality.

Programming with Big Data in R (pbdR) is a series of R packages and an environment for statistical computing with big data by using high-performance statistical computation. The pbdR uses the same programming language as R with S3/S4 classes and methods which is used among statisticians and data miners for developing statistical software. The significant difference between pbdR and R code is that pbdR mainly focuses on distributed memory systems, where data are distributed across several processors and analyzed in a batch mode, while communications between processors are based on MPI that is easily used in large high-performance computing (HPC) systems. R system mainly focuses on single multi-core machines for data analysis via an interactive mode such as GUI interface.

Apache Spark

Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.

Distributed R is an open source, high-performance platform for the R language. It splits tasks between multiple processing nodes to reduce execution time and analyze large data sets. Distributed R enhances R by adding distributed data structures, parallelism primitives to run functions on distributed data, a task scheduler, and multiple data loaders. It is mostly used to implement distributed versions of machine learning tasks. Distributed R is written in C++ and R, and retains the familiar look and feel of R. As of February 2015, Hewlett-Packard (HP) provides enterprise support for Distributed R with proprietary additions such as a fast data loader from the Vertica database.

Microsoft Azure Stream Analytics is a serverless scalable complex event processing engine by Microsoft that enables users to develop and run real-time analytics on multiple streams of data from sources such as devices, sensors, web sites, social media, and other applications. Users can set up alerts to detect anomalies, predict trends, trigger necessary workflows when certain conditions are observed, and make data available to other downstream applications and services for presentation, archiving, or further analysis.

revoscalepy is a machine learning package in Python created by Microsoft. It is available as part of Machine Learning Services in Microsoft SQL Server 2017 and Machine Learning Server 9.2.0 and later.

References

  1. "RevoScaleR package". Microsoft Corporation. Retrieved 2018-04-12.
  2. Looking to the future for R in Azure SQL and SQL Server - Microsoft SQL Server Blog
  3. "Compute context for script execution in Machine Learning Server". Microsoft Corporation. Retrieved 2018-04-12.