Dplyr

dplyr
Original author(s): Hadley Wickham, Romain François, Lionel Henry, Kirill Müller, Davis Vaughan
Initial release: January 7, 2014
Stable release: 1.1.0 / January 29, 2023
Written in: R
License: MIT License
Website: dplyr.tidyverse.org

dplyr is an R package whose functions are designed to make manipulation of dataframes (spreadsheet-like data structures) intuitive and user-friendly. It is one of the core packages of the tidyverse, a popular collection of packages for the R programming language. [1] Data analysts typically use dplyr to transform existing datasets into a format better suited to a particular type of analysis or data visualization. [2] [3]

For instance, someone analyzing a large dataset may wish to view only a smaller subset of the data. Alternatively, a user may wish to rearrange the data to see the rows ranked by some numerical value, or by a combination of values from the original dataset. Functions within the dplyr package allow a user to perform such tasks.
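The two tasks just described can be sketched in a few lines of R. This is a minimal illustration rather than a canonical recipe: the `sales` data frame and its column names are invented for the example, while `filter()`, `arrange()`, and `desc()` are dplyr functions.

```r
library(dplyr)

# Hypothetical sales data, invented purely for illustration
sales <- data.frame(
  region = c("North", "South", "East", "West", "North"),
  units  = c(120, 85, 240, 60, 310),
  price  = c(2.5, 4.0, 1.5, 4.0, 2.0)
)

# View a smaller subset: keep only the rows with more than 100 units
large_orders <- filter(sales, units > 100)

# Rank rows by a combination of original values (units * price), highest first
ranked <- arrange(sales, desc(units * price))

print(large_orders)
print(ranked)
```

Neither call modifies `sales`; each returns a new data frame, so intermediate results can be kept or discarded as needed.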

dplyr was launched in 2014. [4] On the dplyr web page, the package is described as "a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges." [5]

The five core verbs

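While dplyr includes several dozen functions that enable various forms of data manipulation, the package features five primary verbs, or actions: filter(), which extracts rows that meet specified conditions; select(), which subsets a dataframe by its columns; arrange(), which sorts rows by the values of particular columns; mutate(), which creates new variables from existing columns; and summarise() (also spelled summarize()), which collapses values into a single summary. [6]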
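As a rough sketch of how the five primary verbs, `filter()`, `select()`, `mutate()`, `arrange()`, and `summarise()`, can be chained together, consider the example below. The `plants` data frame and its columns are hypothetical, and the `%>%` pipe (re-exported by dplyr) is only one of several ways to combine the calls.

```r
library(dplyr)

# Hypothetical measurements, invented for illustration
plants <- data.frame(
  species   = c("fern", "fern", "moss", "moss", "ivy"),
  height_cm = c(30, 50, 2, 4, 120),
  width_cm  = c(20, 25, 10, 12, 40),
  plot_id   = c(1, 2, 1, 2, 1)
)

result <- plants %>%
  filter(height_cm > 3) %>%                  # keep rows matching a condition
  select(species, height_cm, width_cm) %>%   # keep only these columns
  mutate(ratio = height_cm / width_cm) %>%   # add a column derived from others
  arrange(desc(ratio))                       # sort rows, largest ratio first

# Collapse the remaining rows into a one-row summary
overall <- summarise(result, mean_height = mean(height_cm), n_rows = n())

print(result)
print(overall)
```

Each verb takes a data frame as its first argument and returns a data frame, which is what makes this kind of chaining possible.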

Additional functions

In addition to its five main verbs, dplyr includes other functions for exploring and manipulating dataframes, such as count(), distinct(), group_by(), rename(), and slice().
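A few of dplyr's other functions are illustrated below. The selection (`distinct()`, `count()`, `group_by()`, `rename()`, and `slice()`) is an assumption made for this sketch and is not exhaustive; the `visits` data frame is hypothetical.

```r
library(dplyr)

# Hypothetical visitor log, invented for illustration
visits <- data.frame(
  city       = c("Lyon", "Lyon", "Oslo", "Oslo", "Oslo"),
  day        = c("Mon", "Mon", "Tue", "Wed", "Wed"),
  n_visitors = c(10, 10, 7, 3, 5)
)

unique_rows <- distinct(visits)                    # drop duplicate rows
per_city    <- count(visits, city)                 # number of rows per city, in column `n`
renamed     <- rename(visits, visitors = n_visitors)  # new_name = old_name
first_two   <- slice(visits, 1:2)                  # pick rows by position

# group_by() makes later verbs operate within each group
totals <- visits %>%
  group_by(city) %>%
  summarise(total = sum(n_visitors))

print(per_city)
print(totals)
```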

Built-in datasets

The dplyr package comes with five datasets: band_instruments, band_instruments2, band_members, starwars, and storms.
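One way to inspect the bundled datasets is sketched below. It relies on base R's `data()` to list what the installed package provides, so the exact list may differ between dplyr versions; `glimpse()` is a dplyr function that prints a compact overview of a data frame.

```r
library(dplyr)

# List the names of the datasets shipped with the installed dplyr version
dataset_names <- data(package = "dplyr")$results[, "Item"]
print(dataset_names)

# The datasets are available by name once dplyr is attached
glimpse(starwars)
```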

License

The copyright to dplyr is held by Posit PBC, formerly RStudio PBC. dplyr was originally released under a GPL license [citation needed], but in 2022 Posit changed the license terms for the package to the "more permissive" MIT License. [7] The main difference between the two types of license is that the MIT License allows code to be re-used within proprietary software, whereas a GPL license does not permit this unless the resulting software is itself distributed under the GPL.

References

  1. Wickham, Hadley; Averick, Mara; Bryan, Jennifer; Chang, Winston; McGowan, Lucy D'Agostino; François, Romain; Grolemund, Garrett; Hayes, Alex; Henry, Lionel; Hester, Jim; Kuhn, Max; Pedersen, Thomas Lin; Miller, Evan; Bache, Stephan Milton; Müller, Kirill (2019-11-21). "Welcome to the Tidyverse". Journal of Open Source Software. 4 (43): 1686. doi:10.21105/joss.01686. ISSN 2475-9066.
  2. Yadav, Rohit (2019-10-29). "Python's Pandas vs R's Tidyverse: Who Comes Out On Top?". Analytics India Magazine. Retrieved 2021-02-06.
  3. Krill, Paul (2015-06-30). "Why R? The pros and cons of the R language". InfoWorld. Retrieved 2021-02-06.
  4. "Introducing dplyr". blog.rstudio.com. 17 January 2014. Retrieved 2020-09-02.
  5. "Function reference". dplyr.tidyverse.org. Retrieved 2021-02-06.
  6. Grolemund, Garrett; Wickham, Hadley. 5 Data transformation | R for Data Science.
  7. "A Grammar of Data Manipulation". tidyverse.org. Retrieved 2023-01-14.