Tidyverse

Written in R
Type Package collection
Website www.tidyverse.org

The tidyverse is a collection of open-source packages for the R programming language, introduced by Hadley Wickham [1] and his team, that "share an underlying design philosophy, grammar, and data structures" built around tidy data. [2] Characteristic features of tidyverse packages include extensive use of non-standard evaluation and an emphasis on piping. [3] [4] [5]
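
To make the piping idea concrete, here is a minimal sketch written in Python rather than R. It is not tidyverse code: the `pipe` helper, the step functions, and the sample records are all hypothetical, introduced only to show how a pipe turns nested function calls into a left-to-right chain of steps, similar in spirit to the forward-pipe operator cited above.

```python
from functools import reduce


def pipe(value, *steps):
    """Pass `value` through each step in order, left to right (hypothetical helper)."""
    return reduce(lambda acc, step: step(acc), steps, value)


# Hypothetical sample records: one observation per row, one variable per key.
rows = [
    {"species": "a", "mass": 10},
    {"species": "b", "mass": 30},
    {"species": "a", "mass": 20},
]


def keep_species_a(records):
    # Filter step: keep only the rows for species "a".
    return [r for r in records if r["species"] == "a"]


def masses(records):
    # Select step: extract a single variable from each row.
    return [r["mass"] for r in records]


def mean(values):
    # Summarise step: arithmetic mean of the remaining values.
    return sum(values) / len(values)


# Nested form: reads inside-out.
nested = mean(masses(keep_species_a(rows)))

# Piped form: reads in the order the steps are applied.
piped = pipe(rows, keep_species_a, masses, mean)
```

Both forms compute the same value; the piped form simply lists the steps in the order they run, which is the readability argument usually made for pipes.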

As of November 2018, the tidyverse package and some of its individual packages accounted for 5 of the 10 most downloaded R packages. [6] The tidyverse is the subject of multiple books and papers. [7] [8] [9] [10] In 2019, a paper describing the ecosystem was published in the Journal of Open Source Software. [11]

Its syntax has been referred to as "supremely readable". [12] Critics of the tidyverse have argued that it promotes tools that are harder to teach and learn than their base-R equivalents and that are too dissimilar to other programming languages. [13] [14] On the other hand, some [15] have argued that the tidyverse is a very effective way to introduce complete beginners to programming, because it allows students to begin doing powerful data processing tasks quickly. [16] [15] In addition, some practitioners have pointed out that data processing tasks are intuitively much easier to chain together with the tidyverse than with the Python library pandas. [17]

Packages

The core packages provide functionality to model, transform, and visualize data. [18]

Additional packages assist the core collection. [19] Other packages based on the tidy data principles are regularly developed, such as tidytext [20] for text analysis, tidymodels [21] for machine learning, or tidyquant [22] for financial operations.

Related Research Articles

R (programming language) Programming language for statistics

R is a programming language for statistical computing and data visualization. It has been adopted in the fields of data mining, bioinformatics, and data analysis.

Exploratory data analysis Approach of analyzing data sets in statistics

In statistics, exploratory data analysis (EDA) is an approach of analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling, and thereby contrasts with traditional hypothesis testing. Exploratory data analysis has been promoted by John Tukey since 1970 to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments. EDA is different from initial data analysis (IDA), which focuses more narrowly on checking assumptions required for model fitting and hypothesis testing, handling missing values, and making transformations of variables as needed. EDA encompasses IDA.

The following tables compare general and technical information for a number of statistical analysis packages.

A software repository, or repo for short, is a storage location for software packages. Often a table of contents is also stored, along with metadata. A software repository is typically managed by source or version control, or repository managers. Package managers allow automatically installing and updating repositories, sometimes called "packages".

GGobi is a free statistical software tool for interactive data visualization. GGobi allows extensive exploration of the data with interactive dynamic graphics. It is also a tool for looking at multivariate data. R can be used in sync with GGobi. The GGobi software can be embedded as a library in other programs and program packages using an application programming interface (API) or as an add-on to existing languages and scripting environments, e.g., with the R command line or from Perl or Python scripts. GGobi prides itself on its ability to link multiple graphs together.

ggplot2 Data visualization package for R

ggplot2 is an open-source data visualization package for the statistical programming language R. Created by Hadley Wickham in 2005, ggplot2 is an implementation of Leland Wilkinson's Grammar of Graphics—a general scheme for data visualization which breaks up graphs into semantic components such as scales and layers. ggplot2 can serve as a replacement for the base graphics in R and contains a number of defaults for web and print display of common scales. Since 2005, ggplot2 has grown in use to become one of the most popular R packages.

Snake case Words joined with underscores

Snake case is the naming convention in which each space is replaced with an underscore (_) character, and words are written in lowercase. It is a commonly used naming convention in computing, for example for variable and subroutine names, and for filenames. One study has found that readers can recognize snake case values more quickly than camel case. However, "subjects were trained mainly in the underscore style", so the possibility of bias cannot be eliminated.

RStudio Integrated development environment for R

RStudio is an integrated development environment for R, a programming language for statistical computing and graphics. It is available in two formats: RStudio Desktop is a regular desktop application while RStudio Server runs on a remote server and allows accessing RStudio using a web browser. The RStudio IDE is a product of Posit PBC.

pandas (software) Python library for data analysis

Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license. The name is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals, as well as a play on the phrase "Python data analysis". Wes McKinney started building what would become Pandas at AQR Capital while he was a researcher there from 2007 to 2010.

Hadley Wickham New Zealand statistician

Hadley Alexander Wickham is a New Zealand statistician known for his work on open-source software for the R statistical programming environment. He is the chief scientist at Posit, PBC and an adjunct professor of statistics at the University of Auckland, Stanford University, and Rice University. His work includes the data visualisation system ggplot2 and the tidyverse, a collection of R packages for data science based on the concept of tidy data.

Yihui Xie is a Chinese statistician, data scientist and software engineer who formerly worked for RStudio. He is the principal author of the open-source software package Knitr for data analysis in the R programming language, and has also written the book Dynamic Documents with R and knitr.

Heike Hofmann is a statistician and Professor in the Department of Statistics at Iowa State University.

Mine Çetinkaya-Rundel is a Turkish-American statistician and professor of the practice at Duke University, and a professional educator at RStudio. She is the author of several open source statistics textbooks and is an instructor for Coursera. She is the chair-elect of the Statistical Education Section of the American Statistical Association. Previously, she was a senior lecturer at University of Edinburgh.

Julia Silge American data scientist and software engineer

Julia Silge is an American data scientist and software engineer. She has developed tools for statistical modelling in the R programming language, including the text mining package tidytext. Silge currently works for Posit, formerly known as RStudio.

rnn (software) Machine Learning framework written in the R language

rnn is an open-source machine learning framework that implements recurrent neural network architectures, such as LSTM and GRU, natively in the R programming language. It has been downloaded over 100,000 times.

Jennifer "Jenny" Bryan is a data scientist and an associate professor of statistics at the University of British Columbia, where she developed the Master in Data Science Program. She is a statistician and software engineer at RStudio from Vancouver, Canada, and is known for creating open source tools that connect R to Google Sheets and Google Drive.

One of the core packages of the tidyverse in the R programming language, dplyr is primarily a set of functions designed to enable dataframe manipulation in an intuitive, user-friendly way. Data analysts typically use dplyr in order to transform existing datasets into a format better suited for some particular type of analysis, or data visualization.

R package Extensions to the R statistical programming language

R packages are extensions to the R statistical programming language. R packages contain code, data, and documentation in a standardised collection format that can be installed by users of R, typically via a centralised software repository such as CRAN. The large number of packages available for R, and the ease of installing and using them, has been cited as a major factor driving the widespread adoption of the language in data science.

Jamovi Graphical user interface for R programming language

jamovi is a free and open-source computer program for data analysis and performing statistical tests. The core developers of jamovi are Jonathon Love, Damian Dropmann, and Ravi Selker, who are developers for the JASP project. jamovi is a fork of JASP.

Easystats Software package for the R language

The easystats collection of open source R packages was created in 2019 and primarily includes tools dedicated to the post-processing of statistical models. As of May 2022, the 10 packages composing the easystats ecosystem have been downloaded more than 8 million times, and have been used in more than 1000 scientific publications. The ecosystem is the topic of several statistical courses, video tutorials and books.

References

  1. "Welcome to the Tidyverse". Revolutions. Retrieved 2018-11-26.
  2. "Tidyverse". www.tidyverse.org. Retrieved 2018-11-26.
  3. Bache, Stefan Milton; Wickham, Hadley (2014-11-22), magrittr: A Forward-Pipe Operator for R, retrieved 2020-04-20
  4. Wickham, Hadley. 4 Pipes | The tidyverse style guide.
  5. Wickham, Hadley (2019). Advanced R (Second ed.). Boca Raton. ISBN 978-0815384571.
  6. "RDocumentation". www.rdocumentation.org. Retrieved 2018-11-26.
  7. Duggan, Jim (2018-09-07). "Input and output data analysis for system dynamics modelling using the tidyverse libraries of R". System Dynamics Review. 34 (3): 438–461. doi:10.1002/sdr.1600. hdl:10379/15029. ISSN 0883-7066. S2CID 70005357.
  8. Chang, Winston (2013). R Graphics Cookbook. O'Reilly Media, Inc. ISBN 9781449316952.
  9. Boehmke, Bradley C. (2016-11-17). Data wrangling with R. Cham. ISBN 9783319455990. OCLC 964404346.
  10. Wickham, Hadley; Grolemund, Garrett (2017). R for data science: import, tidy, transform, visualize, and model data (First ed.). Sebastopol, CA. ISBN 9781491910399. OCLC 968213225.
  11. Wickham, Hadley; Averick, Mara; Bryan, Jennifer; Chang, Winston; McGowan, Lucy D'Agostino; François, Romain; Grolemund, Garrett; Hayes, Alex; Henry, Lionel; Hester, Jim; Kuhn, Max; Pedersen, Thomas Lin; Miller, Evan; Bache, Stephan Milton; Müller, Kirill; Ooms, Jeroen; Robinson, David; Seidel, Dana Paige; Spinu, Vitalie; Takahashi, Kohske; Vaughan, Davis; Wilke, Claus; Woo, Kara; Yutani, Hiroaki (21 November 2019). "Welcome to the Tidyverse". Journal of Open Source Software. 4 (43): 1686. Bibcode:2019JOSS....4.1686W. doi:10.21105/joss.01686. S2CID 214002773.
  12. Steinmetz, Art (2024-04-10). "Outsider Data Science - The Truth About Tidy Wrappers". outsiderdata.netlify.app. Retrieved 2024-04-11.
  13. Matloff, Norm (30 September 2019). "An opinionated view of the Tidyverse "dialect" of the R language". GitHub. Retrieved 28 October 2019.
  14. Muenchen, Bob (23 March 2017). "The Tidyverse Curse". r4stats.com.
  15. Heppler, Jason (2018-02-27). "Teaching the tidyverse to R novices". Medium. Retrieved 2023-08-24.
  16. "Teach the tidyverse to beginners". Variance Explained. 5 July 2017. Retrieved 2022-07-15.
  17. "Why pandas feels clunky when coming from R". Rasmus Bååth's Blog. Retrieved 2024-03-30.
  18. "Tidyverse packages - Tidyverse". Retrieved 2018-11-26.
  19. "Tidyverse packages". www.tidyverse.org. Retrieved 2020-12-22.
  20. Silge, Julia (2023-02-01), tidytext: Text mining using tidy tools, retrieved 2023-02-03
  21. "Tidymodels". www.tidymodels.org. Retrieved 2023-02-03.
  22. "Tidy Quantitative Financial Analysis". business-science.github.io. Retrieved 2023-02-03.