Repository | github |
---|---|
Written in | R |
Type | Package collection |
License | MIT |
Website | www |
The tidyverse is a collection of open-source packages for the R programming language, introduced by Hadley Wickham [1] and his team, that "share an underlying design philosophy, grammar, and data structures" of tidy data. [2] Characteristic features of tidyverse packages include extensive use of non-standard evaluation and the encouragement of piping. [3] [4] [5]
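As an illustration of these two features, the following minimal sketch uses dplyr (one of the tidyverse packages) on R's built-in `mtcars` dataset. The choice of dataset, columns, and the variable names `summary_tbl`, `subset_df`, and `base_means` are assumptions made for the example, not taken from the cited sources.

```r
library(dplyr)

# Piping: the result of each step is passed as the first argument of the next.
# Non-standard evaluation: cyl and mpg are columns of mtcars, written as bare
# names rather than as quoted strings or as mtcars$cyl.
summary_tbl <- mtcars %>%
  filter(cyl %in% c(4, 6)) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), n = n()) %>%
  arrange(desc(mean_mpg))

print(summary_tbl)

# A rough base R counterpart, without the pipe or bare column names
subset_df <- mtcars[mtcars$cyl %in% c(4, 6), ]
base_means <- tapply(subset_df$mpg, subset_df$cyl, mean)
print(base_means)
```

Both versions compute the mean fuel economy per cylinder count; the piped version reads top to bottom as a sequence of steps, while the base R version relies on explicit indexing and intermediate objects.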
As of November 2018, the tidyverse package and some of its individual packages comprised 5 of the 10 most downloaded R packages. [6] The tidyverse is the subject of multiple books and papers. [7] [8] [9] [10] In 2019, the ecosystem was published in the Journal of Open Source Software. [11]
Its syntax has been referred to as "supremely readable", [12] and some [13] have argued that the tidyverse is an effective way to introduce complete beginners to programming, because it allows students to begin doing data processing tasks quickly. [14] [13] Some practitioners have also pointed out that data processing tasks are intuitively easier to chain together with the tidyverse than with pandas, Python's equivalent data processing package. [15] There is also an active R community around the tidyverse. For example, the TidyTuesday social data project, organised by the Data Science Learning Community (DSLC), [16] releases varied real-world datasets each week so that the community can participate, share, practice, and make learning to work with data easier. [17] Critics of the tidyverse have argued that it promotes tools that are harder to teach and learn than their built-in base R equivalents and that are too dissimilar to some other programming languages. [18] [19]
More generally, the tidyverse principles encourage a universe of streamlined packages, which in principle helps alleviate dependency issues and maintain compatibility with current and future features. [20] An example of this approach is the pharmaverse, a collection of R packages for clinical reporting in the pharmaceutical industry. [21]
The core tidyverse packages, which provide functionality to model, transform, and visualize data, include: [22]
Additional packages assist the core collection. [23] Other packages based on the tidy data principles are regularly developed, such as tidytext [24] for text analysis, tidymodels [25] for machine learning, or tidyquant [26] for financial operations.
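As a brief sketch of how the collection is used in practice, the `tidyverse` meta-package can be attached in one call, and its `tidyverse_packages()` function lists the packages that belong to the collection. The exact list depends on the installed version, so no output is shown; the variable name `pkgs` is an assumption made for the example.

```r
library(tidyverse)  # attaches the core packages in a single call

# Names of all packages that are part of the tidyverse, core and additional
pkgs <- tidyverse_packages()
print(pkgs)

# Non-core members are installed with the meta-package
# but have to be attached explicitly before use
library(readxl)
```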