HOME

TheInfoList



OR:

RevoScaleR is a
machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...
package in R created by
Microsoft Microsoft Corporation is an American multinational technology corporation producing computer software, consumer electronics, personal computers, and related services headquartered at the Microsoft Redmond campus located in Redmond, Washin ...
. It is available as part of Machine Learning Server, Microsoft R Client, and Machine Learning Services in
Microsoft SQL Server Microsoft SQL Server is a relational database management system developed by Microsoft. As a database server, it is a software product with the primary function of storing and retrieving data as requested by other software applications—which ...
2016. The package contains functions for creating
linear model In statistics, the term linear model is used in different ways according to the context. The most common occurrence is in connection with regression models and the term is often taken as synonymous with linear regression model. However, the term ...
,
logistic regression In statistics, the logistic model (or logit model) is a statistical model that models the probability of an event taking place by having the log-odds for the event be a linear combination of one or more independent variables. In regression a ...
, random forest, decision tree and boosted decision tree, and
K-means ''k''-means clustering is a method of vector quantization, originally from signal processing, that aims to partition ''n'' observations into ''k'' clusters in which each observation belongs to the cluster with the nearest mean (cluster centers o ...
, in addition to some summary functions for inspecting and visualizing data. It has a Python package counterpart called revoscalepy. Another closely related package is MicrosoftML, which contains machine learning algorithms that RevoScaleR does not have, such as neural network and SVM. In June 2021, Microsoft announced to open source the RevoScaleR and revoscalepy packages, making them freely available under the
MIT License The MIT License is a permissive free software license originating at the Massachusetts Institute of Technology (MIT) in the late 1980s. As a permissive license, it puts only very limited restriction on reuse and has, therefore, high license comp ...
.Looking to the future for R in Azure SQL and SQL Server - Microsoft SQL Server Blog
/ref>


Concepts

Many R packages are designed to analyze data that can fit in the memory of the machine and usually do not make use of parallel processing. RevoScaleR was designed to address these limitations. The functions in RevoScaleR orientate around three main abstraction concepts that users can specify to process large amount of data that might not fit in memory and exploit parallel resources to speed up the analysis.


Compute Contexts

A compute context refers to the location where the computation on the data happens. It could be "local" (on the client machine) or "remote" (on a data platform such as a SQL server, or
Spark Spark commonly refers to: * Spark (fire), a small glowing particle or ember * Electric spark, a form of electrical discharge Spark may also refer to: Places * Spark Point, a rocky point in the South Shetland Islands People * Spark (surname) * ...
). Pushing the computation to a remote server allows people to take advantage of the greater compute resources that a remote machine may have. If the data being analyzed reside on the same machine, using a remote compute context also removes the need to pull data across the network onto the client machine.


Data source

Data source defines where the data comes from. There are various data sources available in RevoScaleR, such as text data, Xdf data, in-SQL data, and a spark dataframe. People can wrap their data in a data source object and use that as run analytics in different compute context. Different data sources are available in different compute context. For example, if the compute context is set to SQL server, then the only data source one can use would be an in-SQL data source.


Analytics

Analytic functions in RevoScaleR takes in data source object, a compute context, and the other parameters needed to build the specific model, such as formula for the logistic regression or the number of trees in a decision tree. In addition to those parameters, one can also specify the level of parallelism, such as the size of the data chunk for each process or number of processes to build the model. However, parallelism is only available in non-express edition.


Limitations

The package is mostly meant to be used with a SQL server or other remote machines. To fully leverage the abstractions it uses to process a large dataset, one needs a remote server and non-Express free edition of the package. It cannot be easily installed such as by running "install.packages("RevoScaleR")" like most open source R packages. It's available only through Microsoft R Client, a distribution of R for data science, or Microsoft Machine Learning Server (stand-alone with no SQL server attached), or Microsoft Machine Learning Services (a SQL server services). However, one can still use the analytics functions in an Express, free version of the package.


See also

* Microsoft Machine Learning Services


References


External links


Samples for using revoscalepy and microsoftml
{{Microsoft FOSS Applied machine learning Microsoft software