Dplyr

dplyr
Original author(s) Hadley Wickham, Romain François, Lionel Henry, Kirill Müller, Davis Vaughan
Initial release January 7, 2014
Stable release 1.1.0 / January 29, 2023
Written in R
License MIT License
Website dplyr.tidyverse.org

One of the core packages of the tidyverse in the R programming language, dplyr is primarily a set of functions designed to make dataframe manipulation intuitive and user-friendly. Data analysts typically use dplyr to transform existing datasets into a format better suited to a particular type of analysis or data visualization. [1] [2]

For instance, someone seeking to analyze an enormous dataset may wish to only view a smaller subset of the data. Alternatively, a user may wish to rearrange the data in order to see the rows ranked by some numerical value, or even based on a combination of values from the original dataset.

dplyr was launched in 2014. [3] On the dplyr web page, the package is described as "a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges." [4]

The five core verbs

While dplyr actually includes several dozen functions that enable various forms of data manipulation, the package features five primary verbs: [5]

filter(), which is used to extract rows from a dataframe, based on conditions specified by a user;

select(), which is used to subset a dataframe by its columns;

arrange(), which is used to sort rows in a dataframe based on attributes held by particular columns;

mutate(), which is used to create new variables, by altering and/or combining values from existing columns; and

summarize(), also spelled summarise(), which is used to collapse values from a dataframe into a single summary.
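As a minimal sketch of the five verbs listed above, the following applies each one to a small dataframe. The `scores` data and its column names are hypothetical, invented purely for illustration; only the dplyr functions themselves come from the package.

```r
library(dplyr)

# Hypothetical toy data, invented for this illustration
scores <- data.frame(
  name   = c("Ana", "Ben", "Cal", "Dee"),
  team   = c("red", "blue", "red", "blue"),
  points = c(12, 7, 9, 15),
  games  = c(3, 2, 3, 5)
)

# filter(): keep only the rows that satisfy a condition
red_team <- filter(scores, team == "red")

# select(): keep only the named columns
names_points <- select(scores, name, points)

# arrange(): sort rows by a column; desc() requests descending order
ranked <- arrange(scores, desc(points))

# mutate(): add a new column computed from existing columns
with_rate <- mutate(scores, points_per_game = points / games)

# summarise(): collapse all rows into a single summary row
overall <- summarise(scores,
                     mean_points = mean(points),
                     total_games = sum(games))
```

Each verb takes a dataframe as its first argument and returns a new dataframe, which is why the calls are often chained with a pipe operator in practice.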

Additional functions

In addition to its five main verbs, dplyr also includes several other functions that enable exploration and manipulation of dataframes. Included among these are:

count(), which is used to count the number of observations (rows) that share each distinct value of one or more variables, such as a categorical attribute;

rename(), which enables a user to alter the column names for variables, often to improve ease of use and intuitive understanding of a dataset;

slice_max(), which returns a data subset that contains the rows with the highest values of some particular variable;

slice_min(), which returns a data subset that contains the rows with the lowest values of some particular variable.
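The additional functions above can be sketched in the same way. The `scores` dataframe is again hypothetical toy data, not something shipped with dplyr.

```r
library(dplyr)

# Hypothetical toy data, invented for this illustration
scores <- data.frame(
  name   = c("Ana", "Ben", "Cal", "Dee"),
  team   = c("red", "blue", "red", "blue"),
  points = c(12, 7, 9, 15)
)

# count(): number of rows for each distinct value of `team`
# (the result has one row per team and a count column named `n`)
team_sizes <- count(scores, team)

# rename(): the syntax is new_name = old_name
renamed <- rename(scores, player = name)

# slice_max(): the n rows with the largest values of `points`
top_two <- slice_max(scores, points, n = 2)

# slice_min(): the n rows with the smallest values of `points`
lowest <- slice_min(scores, points, n = 1)
```

By default slice_max() and slice_min() keep tied rows, so they can return more than `n` rows when values are equal; the toy data here has no ties.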

Built-in datasets

The dplyr package comes with five datasets. These are: band_instruments, band_instruments2, band_members, starwars, and storms.
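One way to inspect these datasets, sketched here with base R's `data()` and `head()` functions, is to list what the installed package provides and then preview one of them. The exact set of names returned depends on the installed dplyr version.

```r
library(dplyr)

# data() with a package argument returns metadata about that package's
# datasets; the "Item" column of $results holds the dataset names
dataset_names <- data(package = "dplyr")$results[, "Item"]
print(dataset_names)

# Once dplyr is attached, the datasets can be used directly by name
head(band_members)
```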

License

The copyright to dplyr is held by Posit PBC, formerly RStudio PBC. dplyr was originally released under a GPL license [citation needed], but in 2022 Posit changed the license terms for the package to the "more permissive" MIT License. [6] The chief difference between the two types of license is that the MIT License allows code to be re-used within proprietary software, whereas the GPL requires distributed derivative works to be released under the GPL as well.

References

  1. Yadav, Rohit (2019-10-29). "Python's Pandas vs R's Tidyverse: Who Comes Out On Top?". Analytics India Magazine. Retrieved 2021-02-06.
  2. Krill, Paul (2015-06-30). "Why R? The pros and cons of the R language". InfoWorld. Retrieved 2021-02-06.
  3. "Introducing dplyr". blog.rstudio.com. 17 January 2014. Retrieved 2020-09-02.
  4. "Function reference". dplyr.tidyverse.org. Retrieved 2021-02-06.
  5. Grolemund, Garrett; Wickham, Hadley. "5 Data transformation". R for Data Science.
  6. "A Grammar of Data Manipulation". tidyverse.org. Retrieved 2023-01-14.