Data version control

Data version control is a method of working with data sets. It is similar to the version control systems used in traditional software development, but is adapted to handling data and to collaboration in the context of data analytics, research, and other forms of data analysis. Data version control may also include specific features and configurations designed to facilitate work with large data sets and data lakes. [1]

History

Background

As early as 1985, researchers recognized the need to define temporal attributes in database tables in order to track changes to databases. [2] This research continued into the 1990s, when the theory was formalized into practical methods for managing temporal data in relational databases, [3] providing some of the foundational concepts for what would later become data version control.

In the early 2010s the size of data sets was expanding rapidly, and relational databases were no longer sufficient to manage the amounts of data organizations were accumulating. The Apache Hadoop ecosystem, with HDFS as its storage layer, and later object storage, became dominant in big data operations. [4] Research into data management tools and data version control systems increased sharply, along with demand for such tools from academia as well as the private and public sectors. [5]

Version controlled databases

The first versioned database was proposed in 2012 for the SciDB database; the work demonstrated that chains and trees of different versions of a database could be maintained while reducing both the storage overhead and the access times associated with previous methods. [6] In 2014, a proposal was made to generalize these principles into a platform that could be used for any application. [7]

In 2016, a prototype for a data version control system was developed during a Kaggle competition. This software was later used internally at an AI firm, and eventually spun off as a startup. [8] Since then, a number of data version control systems, both open and closed source, have been developed and offered commercially, [9] with a subset dedicated specifically to machine learning. [10]

Use cases

Reproducibility

A wide range of scientific disciplines have adopted automated analysis of large quantities of data, including astrophysics, seismology, biology and medicine, social sciences and economics, and many other fields. The principle of reproducibility is an important aspect of formalizing findings in scientific disciplines, and in the context of data science it presents a number of challenges. Most datasets are constantly changing, whether through the addition of more data or through changes in the structure and format of the data, and small changes can have significant effects on the outcome of experiments. Data version control allows the exact state of a data set at a particular moment in time to be recorded, making it easier to reproduce and understand experimental outcomes. [11] If data practitioners can only know the present state of the data, they may face challenges such as difficulty in debugging problems or in complying with data audits.
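
As an illustration of how recording the exact state of a data set supports reproducibility, the following minimal sketch (in Python; the file names and helper function are hypothetical, not taken from any particular tool) fingerprints a data file when an experiment is run and verifies the fingerprint before the experiment is reproduced:

    import hashlib
    import json
    from pathlib import Path

    def dataset_fingerprint(path: str) -> str:
        """Return a SHA-256 digest identifying the exact contents of a data file."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # Record the data set version alongside the experiment's results
    # ("train.csv" and "experiment.json" are placeholder names).
    record = {"dataset": "train.csv", "dataset_sha256": dataset_fingerprint("train.csv")}
    Path("experiment.json").write_text(json.dumps(record, indent=2))

    # Before re-running the experiment later, fail fast if the data has changed.
    saved = json.loads(Path("experiment.json").read_text())
    assert dataset_fingerprint(saved["dataset"]) == saved["dataset_sha256"], \
        "data set has changed since the original experiment was recorded"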

Development and testing

Data version control is sometimes used in testing and development of applications that interact with large quantities of data. Some data version control tools allow users to create replicas of their production environment for testing purposes. This approach allows them to test data integration processes such as extract, transform and load (ETL) and understand the changes made to data without having a negative impact on the consumers of the production data.
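
As a simplified illustration (the file names and the transform below are hypothetical, not drawn from a specific tool), an ETL change can be executed against a test copy of the data and its effect inspected before anything is applied to the production data:

    import csv
    import shutil

    # Hypothetical files: a production extract and a scratch copy used for the test run.
    shutil.copy("orders_production.csv", "orders_test.csv")

    def keep_completed_orders(in_path: str, out_path: str) -> None:
        """The ETL change under test: keep only rows whose status is 'completed'."""
        with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
            reader = csv.DictReader(src)
            writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
            writer.writeheader()
            for row in reader:
                if row["status"] == "completed":
                    writer.writerow(row)

    keep_completed_orders("orders_test.csv", "orders_filtered.csv")

    # Inspect the effect of the change on the replica; the production file is never modified.
    rows_before = sum(1 for _ in open("orders_test.csv")) - 1
    rows_after = sum(1 for _ in open("orders_filtered.csv")) - 1
    print(f"transform would keep {rows_after} of {rows_before} rows")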

Machine learning and artificial intelligence

In the context of machine learning, data version control can be used to optimize the performance of models. It can automate the process of analyzing outcomes across different versions of a data set in order to continuously improve performance. [12] It is possible that open source data version control software could eliminate the need for proprietary AI platforms by extending tools like Git and CI/CD for use by machine learning engineers. [13] Many open-source solutions provide these capabilities by building on Git-like semantics, since Git itself was designed for small text files and does not handle the very large data sets typical of machine learning.
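
The underlying pointer-file idea can be sketched as follows (a simplified illustration in the spirit of tools such as Git LFS or DVC, not the implementation of either; the cache directory and ".ptr" suffix are invented): a large data file is stored in a content-addressed cache and replaced, for Git's purposes, by a small text file containing its hash.

    import hashlib
    import shutil
    from pathlib import Path

    # Hypothetical local cache; real tools typically also sync it to remote storage.
    CACHE = Path(".data_cache")

    def add(data_path: str) -> Path:
        """Copy a large file into the content-addressed cache and write a small
        pointer file that Git can track like ordinary text."""
        data = Path(data_path)
        digest = hashlib.sha256(data.read_bytes()).hexdigest()
        CACHE.mkdir(exist_ok=True)
        shutil.copy2(data, CACHE / digest)                 # stored under its own hash
        pointer = data.with_suffix(data.suffix + ".ptr")   # e.g. images.tar -> images.tar.ptr
        pointer.write_text(digest + "\n")                  # tiny file: this is what Git versions
        return pointer

    def checkout(pointer_path: str, dest: str) -> None:
        """Restore the large file a pointer refers to, e.g. after switching Git branches."""
        digest = Path(pointer_path).read_text().strip()
        shutil.copy2(CACHE / digest, dest)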

CI/CD for data

CI/CD methodologies can be applied to datasets using data version control. [14] Version control enables users to integrate with automation servers to establish a CI/CD process for data. By adding testing platforms to the process, they can help ensure the quality of the data product. In this scenario, teams execute continuous integration (CI) tests on the data and put checks in place so that the data is promoted to production only if all of the defined data quality and data governance criteria are met.
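
As an illustration (the checks, column names, and promotion step below are hypothetical), a CI job might run a set of data quality checks against a candidate version and allow promotion only if all of them pass:

    import csv

    def load_rows(path: str) -> list[dict]:
        """Read a candidate data set (CSV for simplicity) into memory."""
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    # Hypothetical quality checks; order matters so later checks can assume non-empty data.
    def check_not_empty(rows): return len(rows) > 0
    def check_schema(rows): return set(rows[0].keys()) >= {"id", "amount", "created_at"}
    def check_no_null_ids(rows): return all(r.get("id") for r in rows)

    def promote_if_valid(candidate_path: str) -> bool:
        """Gate step of a data CI pipeline: promote only if every check passes."""
        rows = load_rows(candidate_path)
        checks = [check_not_empty, check_schema, check_no_null_ids]
        if all(check(rows) for check in checks):
            # A real pipeline would merge the candidate version into production here.
            print(f"{candidate_path}: all checks passed, promoting to production")
            return True
        print(f"{candidate_path}: checks failed, promotion blocked")
        return False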

Experimentation in isolated environments

To experiment on a dataset without impacting production data, one can use data version control to create replicas of the production environment where tests can be carried out. Such replicas allow changes to the data to be tested and understood safely.

Data version control tools allow such replica environments to be created without time- and resource-consuming copying and maintenance; instead, the underlying objects are shared between environments using metadata.
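
A minimal sketch of this metadata-based approach (the catalog structure, paths, and hashes below are invented for illustration): a "branch" is only a named manifest listing which stored objects it contains, so creating a test replica copies metadata rather than data.

    # Hypothetical catalog: each branch is a manifest mapping logical paths to object hashes.
    catalog = {
        "main": {
            "events/2023-01.parquet": "sha256:aa11",
            "events/2023-02.parquet": "sha256:bb22",
        }
    }

    def create_branch(catalog: dict, source: str, name: str) -> None:
        """Create a zero-copy replica: only the manifest (metadata) is copied, not the objects."""
        catalog[name] = dict(catalog[source])

    create_branch(catalog, "main", "etl-test")

    # Changes on the test branch rewrite its manifest only; "main" keeps pointing at the
    # original objects, so production consumers are unaffected.
    catalog["etl-test"]["events/2023-02.parquet"] = "sha256:cc33"
    assert catalog["main"]["events/2023-02.parquet"] == "sha256:bb22"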

Rollback

Continuous changes in data sets can sometimes cause functionality issues or lead to undesired outcomes, especially when applications are using the data. Data version control tools make it possible to roll back a data set to an earlier state. This can be used to restore or improve the functionality of an application, or to correct errors and bad data that have been mistakenly included. [15]
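
Conceptually, a rollback moves a branch reference back to an earlier recorded version rather than deleting or rewriting data, as in the following minimal sketch (the version history and names are invented for illustration):

    # Hypothetical linear history of data set versions (oldest first) and a branch pointer.
    history = ["v1-initial-load", "v2-march-update", "v3-bad-import"]
    refs = {"main": "v3-bad-import"}

    def rollback(refs: dict, branch: str, target: str) -> None:
        """Point the branch at an earlier version; v3 stays in history but is no longer served."""
        if target not in history:
            raise ValueError(f"unknown version: {target}")
        refs[branch] = target

    # Consumers reading "main" now see the state prior to the bad import.
    rollback(refs, "main", "v2-march-update")
    assert refs["main"] == "v2-march-update"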

Examples

Version controlled data sources:

Data version control for data lakes:

ML-Ops systems that implement data version control:

See also

References

  1. "A guide to open source data version control - Fuzzy Labs". www.fuzzylabs.ai. Retrieved 2023-01-05.
  2. Snodgrass, Richard; Ahn, Ilsoo (1985-05-01). "A taxonomy of time in databases". ACM SIGMOD Record. 14 (4): 236–246. doi:10.1145/971699.318921. ISSN 0163-5808.
  3. Temporal databases : theory, design, and implementation. Redwood City, Calif.: Benjamin/Cummings Pub. Co. 1993. ISBN   978-0-8053-2413-6.
  4. "Apache Hadoop turns 10: The Rise and Glory of Hadoop". ProjectPro. Retrieved 2023-01-18.
  5. Bryan, Jennifer (2018-01-02). "Excuse Me, Do You Have a Moment to Talk About Version Control?". The American Statistician. 72 (1): 20–27. doi:10.1080/00031305.2017.1399928. ISSN   0003-1305. S2CID   40975582.
  6. Seering, Adam; Cudre-Mauroux, Philippe; Madden, Samuel; Stonebraker, Michael (2012-04-01). "Efficient Versioning for Scientific Array Databases". 2012 IEEE 28th International Conference on Data Engineering. pp. 1013–1024. doi:10.1109/ICDE.2012.102. hdl:1721.1/90380. ISBN   978-0-7695-4747-3. S2CID   9144420.
  7. Bhardwaj, Anant; Bhattacherjee, Souvik; Chavan, Amit; Deshpande, Amol; Elmore, Aaron J.; Madden, Samuel; Parameswaran, Aditya G. (2014-09-02). "DataHub: Collaborative Data Science & Dataset Version Management at Scale". arXiv: 1409.0798 [cs.DB].
  8. "neptune.ai | About us, our story, team and Neptune in the news". neptune.ai. Retrieved 2023-01-04.
  9. StartupStash. "Top 16 Data Versioning Tools". Startup Stash. Retrieved 2023-01-04.
  10. Aryan Jadon (26 December 2022). "Survey of Data Versioning Tools for Machine Learning Operations". Medium. Retrieved 2023-06-27.
  11. National Academies of Sciences, Engineering, and Medicine (2019). Reproducibility and Replicability in Science. Washington, DC: The National Academies Press. p. 114. ISBN 978-0-309-48617-0. OCLC 1122461743.
  12. "Versionskontrolle für Machine-Learning-Projekte". Informatik Aktuell (in German). Retrieved 2023-01-05.
  13. "Streamlining data science with open source: Data version control and continuous machine learning". ZDNET. Retrieved 2023-01-05.
  14. "The Ultimate Guide to Database Version Control, CI/CD, and Deployment". Database Star. 2020-02-01. Retrieved 2023-01-18.
  15. "Version Control for Data — The Turing Way". the-turing-way.netlify.app. Retrieved 2023-01-05.
  16. "Day 1: Data Versioning & Creating Datasets". kaggle.com. Retrieved 2023-01-18.
  17. "Quilt Data". Quilt Data. Retrieved 2023-01-18.
  18. Hall, Susan (2020-08-19). "Dolt, a Relational Database with Git-Like Cloning Features". The New Stack. Retrieved 2023-01-05.
  19. "X-MOL". en.x-mol.com. Retrieved 2023-01-18.
  20. "Treeverse raises $23M to bring Git-like version control to data lakes". VentureBeat. 2021-07-28. Retrieved 2023-01-05.
  21. "About Nessie - Project Nessie: Transactional Catalog for Data Lakes with Git-like semantics". projectnessie.org. Retrieved 2023-01-18.
  22. "Git Large File Storage". Git Large File Storage. Retrieved 2023-01-05.
  23. Lardinois, Frederic (2022-06-01). "Iterative launches MLEM, a tool to simplify ML model deployment". TechCrunch. Retrieved 2023-01-18.
  24. "Top AI startup news of the week: InstaDeep, DeepL, Pachyderm and more". VentureBeat. 2023-01-13. Retrieved 2023-01-18.
  25. Ingle, Prathamesh (2022-10-21). "Top Data Version Control Tools for Machine Learning Research in 2022". MarkTechPost. Retrieved 2023-01-18.
  26. Miller, Ron (2021-11-02). "Activeloop snags $5M seed to build streaming database for AI applications". TechCrunch. Retrieved 2023-01-18.
  27. "Edward Cui, Founder & CEO of Graviti - Interview Series - Unite.AI". www.unite.ai. Retrieved 2023-01-18.
  28. Ingle, Prathamesh (2022-10-06). "Top Tools for Machine Learning (ML) Experiment Tracking and Management". MarkTechPost. Retrieved 2023-01-18.
  29. "How to Set Yourself Apart from Other Applicants with Data-Centric AI". KDnuggets. Retrieved 2023-01-18.
  30. Shields, Ronan (2023-01-05). "The Trade Desk attempts to woo advertisers at CES with 'Galileo' — a bid to chart the 'Open Internet' without cookies". Digiday. Retrieved 2023-01-18.
  31. Wiggers, Kyle (2022-09-21). "Voxel51 lands funds for its platform to manage unstructured data". TechCrunch. Retrieved 2023-01-18.
  32. Cheptsov, Andrey. "Reproducible ML workflows for teams - dstack". docs.dstack.ai. Retrieved 2023-01-18.
  33. Katz, William T.; Plaza, Stephen M. (2019). "DVID: Distributed Versioned Image-Oriented Dataservice". Frontiers in Neural Circuits. 13: 5. doi: 10.3389/fncir.2019.00005 . ISSN   1662-5110. PMC   6371063 . PMID   30804760.