Data Version Control (software)

DVC
Original author(s) Dmitry Petrov
Developer(s) Iterative.ai
Initial release May 4, 2017
Stable release 2.30.0 / October 10, 2022
Repository https://github.com/iterative/dvc
Written in Python
Type Machine Learning CLI
License Apache License 2.0
Website dvc.org

DVC is a free and open-source, platform-agnostic version system for data, machine learning models, and experiments. [1] It is designed to make ML models shareable, experiments reproducible, [2] and to track versions of models, data, and pipelines. [3] [4] [5] DVC works on top of Git repositories [6] and cloud storage. [7]

The first (beta) version of DVC 0.6 was launched in May 2017. [8] In May 2020, DVC 1.0 was publicly released by Iterative.ai. [9]

Overview

DVC is designed to incorporate the best practices of software development [10] into machine learning workflows. [11] It does this by extending Git, a traditional software development tool, with cloud storage for datasets and machine learning models. [12]

Specifically, DVC makes Machine Learning operations:  

DVC and Git

DVC stores large files and datasets in separate storage, outside of Git. [3] This storage can be on the user’s computer or hosted on any major cloud storage provider, [16] [5] such as Amazon S3, Google Cloud Storage, and Microsoft Azure Blob Storage. [17] [18] DVC users may also set up a remote repository on any server and connect to it remotely. [3]

When a user stores their data and models in the remote repository, a text file is created in their Git repository that points to the actual data in remote storage. [2]
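The pointer-file idea can be illustrated with a minimal Python sketch. This is not DVC's implementation or file format: the `.pointer` suffix, the JSON layout, the field names, and the choice of SHA-256 are all assumptions made for illustration. The sketch only shows how a small text file can identify a large file by its content hash so that the text file, rather than the data, is committed to Git.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def write_pointer(data_path: Path) -> Path:
    """Write a small text file that identifies data_path by its content hash."""
    content = data_path.read_bytes()
    pointer = {
        "path": data_path.name,                        # file the pointer stands for
        "sha256": hashlib.sha256(content).hexdigest(),  # hypothetical hash field
        "size": len(content),
    }
    # Hypothetical naming scheme: "<data file name>.pointer"
    pointer_path = data_path.with_name(data_path.name + ".pointer")
    pointer_path.write_text(json.dumps(pointer, indent=2))
    return pointer_path


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "train.csv"
    data.write_bytes(b"x,y\n1,2\n")
    pointer_path = write_pointer(data)
    pointer = json.loads(pointer_path.read_text())
    print(pointer["path"], pointer["size"])
```

In this sketch only the pointer file would be tracked by Git, while the data file itself would be uploaded to separate storage under a name derived from its hash.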

Features

DVC's features can be divided into three categories: data management, pipelines, and experiment tracking. [19] [20] [18]

Data management

Data and model versioning is the base layer [21] of DVC for large files, datasets, and machine learning models. It allows the use of a standard Git workflow, but without the need to store those files in the repository. Large files, directories and ML models are replaced with small metafiles, which in turn point to the original data. Data is stored separately, allowing data scientists to transfer large datasets or share a model with others. [6]

DVC enables data versioning through codification. [22] When a user creates metafiles, describing what datasets, ML artifacts and other features to track, DVC makes it possible to capture versions of data and models, create and restore from snapshots, record evolving metrics, switch between versions, etc. [6]

Unique versions of data files and directories are cached [23] in a systematic way (also preventing file duplication). The working datastore is separated from the user’s workspace to keep the project light, but stays connected via file links handled automatically by DVC. [24]
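The combination of caching, deduplication, and file links described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not DVC's actual cache code: the function name `add_to_cache`, the two-character directory sharding, the use of SHA-256, and the use of symbolic links are all choices made for the example (symbolic links may require extra privileges on Windows).

```python
import hashlib
import os
import tempfile
from pathlib import Path


def add_to_cache(workspace_file: Path, cache_dir: Path) -> Path:
    """Store a file under its content hash and leave a link in the workspace."""
    digest = hashlib.sha256(workspace_file.read_bytes()).hexdigest()
    cached = cache_dir / digest[:2] / digest[2:]  # hypothetical cache layout
    cached.parent.mkdir(parents=True, exist_ok=True)
    if cached.exists():
        workspace_file.unlink()          # identical content is already cached
    else:
        workspace_file.replace(cached)   # first copy moves into the cache
    os.symlink(cached, workspace_file)   # the workspace keeps only a link
    return cached


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    cache = root / "cache"
    a, b = root / "a.bin", root / "b.bin"
    a.write_bytes(b"same bytes")
    b.write_bytes(b"same bytes")
    first = add_to_cache(a, cache)
    second = add_to_cache(b, cache)
    cached_files = [p for p in cache.rglob("*") if p.is_file()]
    restored = a.read_bytes()            # read through the link
```

Because both files have the same content, they resolve to the same cache entry, so the data is stored once while both workspace paths remain readable.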

Pipelines

DVC provides a mechanism to define and execute pipelines. [25] [26] Pipelines represent the process of building ML datasets and models, from how data is preprocessed to how models are trained and evaluated. [27] Pipelines can also be used to deploy models into production environments.

DVC pipelines focus on the experimentation phase of the ML process. Users can run multiple copies of a DVC pipeline by cloning a Git repository that contains the pipeline or by running ML experiments. They can also record a workflow as a pipeline and reproduce [28] it in the future.

Pipelines are represented in code as YAML [29] configuration files. These files define the stages of the pipeline and how data and information flow from one stage to the next.
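How stage definitions imply an execution order can be sketched in Python. The stage names, file paths, and the `deps`/`outs` dictionary layout below are hypothetical and stand in for a parsed configuration file; the sketch does not read real DVC files. It links each stage to the stages that produce its inputs and sorts them with the standard-library `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical three-stage pipeline; names and paths are invented for illustration.
stages = {
    "prepare": {"deps": ["data/raw.csv"], "outs": ["data/clean.csv"]},
    "train": {"deps": ["data/clean.csv"], "outs": ["model.pkl"]},
    "evaluate": {"deps": ["model.pkl", "data/clean.csv"], "outs": ["metrics.json"]},
}


def execution_order(stages: dict) -> list[str]:
    """Order stages so each runs after the stages that produce its inputs."""
    # Map every output file to the stage that produces it.
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    # For each stage, collect the stages it depends on through shared files.
    graph = {
        name: {producer[d] for d in s["deps"] if d in producer}
        for name, s in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(stages)
print(order)  # ['prepare', 'train', 'evaluate']
```

Files with no producing stage, such as `data/raw.csv` here, are treated as external inputs and add no ordering constraint.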

When a pipeline is run, the artifacts produced by that pipeline are registered in a dvc.lock file. [30] The lockfile records the stages that were run and stores a hash of the resulting output for each stage. [25] It serves as a record of the pipeline's execution and is also used to decide which stages must be rerun on subsequent executions of the pipeline. [27] [19]
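The rerun decision can be sketched as a comparison between current hashes and a stored record. The following Python example is a simplified illustration, not DVC's logic or lockfile format: the stage names, the in-memory `files` mapping, the shape of `lock`, and the use of SHA-256 are assumptions, and the sketch compares only each stage's direct dependencies without propagating changes to downstream stages.

```python
import hashlib


def digest(content: bytes) -> str:
    """Return a hex digest identifying the given content."""
    return hashlib.sha256(content).hexdigest()


def stages_to_rerun(stages: dict, files: dict, lock: dict) -> list[str]:
    """Return the stages whose dependency hashes differ from the lock record."""
    stale = []
    for name, deps in stages.items():
        current = {dep: digest(files[dep]) for dep in deps}
        if lock.get(name) != current:
            stale.append(name)
    return stale


# Hypothetical stages, each mapped to the files it depends on.
stages = {"prepare": ["raw.csv"], "train": ["clean.csv", "params.json"]}
files = {"raw.csv": b"1,2\n", "clean.csv": b"1,2\n", "params.json": b'{"lr": 0.1}'}

# Record written by a previous run: one hash per dependency of each stage.
lock = {name: {d: digest(files[d]) for d in deps} for name, deps in stages.items()}
unchanged = stages_to_rerun(stages, files, lock)   # nothing changed yet

files["params.json"] = b'{"lr": 0.01}'             # edit one dependency
changed = stages_to_rerun(stages, files, lock)
print(unchanged, changed)  # [] ['train']
```

Only the stage whose inputs changed is reported, which is the property that lets unchanged stages be skipped on later runs.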

Experiment tracking

Experiment tracking allows developers to explore, iterate and compare different machine learning experiments. [21] [19]

Each experiment represents a variation of a data science project defined by changes in the workspace. Experiments maintain a link to the commit in the current branch (Git HEAD) [31] as their parent or baseline. However, they do not form part of the regular Git tree (unless they are made persistent). [32] This stops temporary commits and branches from overflowing a user's repository.

Common use cases [33] for experiments are:

  1. Comparison of model architectures
  2. Comparison of training or evaluation datasets
  3. Selection of model hyperparameters

DVC experiments can be managed and visualized either from the VS Code IDE [34] or online using Iterative Studio. [35] Visualization [36] allows each user to compare experiment results visually, track plots and generate them with library integrations.

DVC offers several options [36] for using visualization in a regular workflow:

The DVC VS Code extension

In 2022, Iterative released a free extension [39] for Visual Studio Code (VS Code), a source-code editor made by Microsoft, which provides VS Code users with the ability to use DVC in their editors with additional user interface functionality. [40] [41]

History

In 2017, [42] [43] the first (beta) version of DVC, 0.6, [44] was publicly released as a simple command-line tool. [43] It allowed data scientists to keep track of their machine learning processes and file dependencies using Git-like commands. It also allowed them to transform existing machine learning processes into reproducible DVC pipelines. DVC 0.6 targeted common problems that machine learning engineers and data scientists were facing: poor reproducibility of machine learning experiments, the lack of data versioning, and low levels of collaboration between teams.

Created by ex-Microsoft data scientist Dmitry Petrov, DVC aimed to integrate the best existing software development practices into machine learning operations. [45]

In 2018, [46] Dmitry Petrov together with Ivan Shcheklein, an engineer and entrepreneur, founded Iterative.ai, [4] [47] an MLOps company that continued the development of DVC. Besides DVC, Iterative.ai is also behind open source tools like CML, MLEM, and Studio, the enterprise version of the open source tools.

In June 2020, [48] the Iterative.ai team released DVC 1.0. New features like multi-stage DVC files, run cache, plots, data transfer optimizations, hyperparameter tracking, and stable release cycles were added as a result of discussions and contributions from the community.

In March 2021, [49] Iterative.ai released DVC 2.0, which introduced ML experiments (experiment management), model checkpoints, and metrics logging.

ML experiments: To solve the problem of Git overhead, when hundreds of experiments need to be run in a single day and each experiment run requires additional Git commands, DVC 2.0 introduced the lightweight experiments feature. It allows its users to auto-track ML experiments and capture code changes.

This eliminated the dependence upon additional services [50] by saving data versions as metadata in Git, as opposed to relegating it to external databases or APIs. [51]

ML model checkpoints versioning: The new release also enables versioning of all checkpoints with corresponding code and data.

Metrics logging: DVC 2.0 introduced a new open-source library, DVCLive, which tracks model metrics and organizes them so that DVC can visualize them and navigate them through Git history.

Alternative solutions to DVC

There are several open source projects that provide data version control capabilities similar to those of DVC, [52] such as Git LFS, Dolt, Nessie, and lakeFS. These projects vary in how well they fit the different needs of data engineers and data scientists, including scalability, supported file formats, support for tabular and unstructured data, and the volume of data they can handle.


References

  1. Hewage, Nipuni; Meedeniya, Dulani (February 2022). "Machine Learning Operations: A Survey on MLOps Tool Support". ResearchGate. arXiv:2202.10169.
  2. Barrak, Amine; Eghan, Ellis E.; Adams, Bram (March 2021). "On the Co-evolution of ML Pipelines and Source Code - Empirical Study of DVC Projects". IEEE Xplore. Archived from the original on 2022-10-05. Retrieved 2022-10-05.
  3. Ivancic, Kristijan. "Data Version Control With Python and DVC". Real Python. Archived from the original on 2022-10-05. Retrieved 2022-10-05.
  4. Wiggers, Kyle. "MLOps startup Iterative.ai nabs $20M". VentureBeat. Archived from the original on 2022-10-05. Retrieved 2022-10-05.
  5. "MLOps Company Iterative Achieves Significant Customer and Company Growth in 2021". Business Wire. Archived from the original on 2022-10-05. Retrieved 2022-10-05.
  6. Hall, Susan (4 February 2021). "Iterative.ai: Git-Based Machine Learning Tools for ML Engineers". The New Stack. Archived from the original on 5 October 2022. Retrieved 5 October 2022.
  7. "What is DVC?". MLOps Guide. Archived from the original on 2022-10-05. Retrieved 2022-10-05.
  8. Petrov, Dmitry. "DVC 3 Years and 1.0 Pre-release". Iterative.ai. Archived from the original on 2022-10-05. Retrieved 2022-10-05.
  9. Anadiotis, George. "Streamlining data science with open source: Data version control and continuous machine learning". ZDNET. Archived from the original on 2022-10-05. Retrieved 2022-10-05.
  10. Petrov, Dmitry. "The Road to AI Hell Starts with Good MLOps Intentions". The New Stack. Archived from the original on 2022-10-06. Retrieved 2022-10-06.
  11. Ejaz, Nimra (6 October 2021). "Data Version Control Explained". Crowdbotics. Archived from the original on 7 October 2022. Retrieved 7 October 2022.
  12. Lardinois, Frederic. "Iterative raises $20M for its MLOps platform". TechCrunch. Archived from the original on 2022-10-06. Retrieved 2022-10-06.
  13. "AITech interview with Dmitry Petrov, Co-Founder & CEO at Iterative.ai". AI Tech Park. 20 July 2022. Archived from the original on 6 October 2022. Retrieved 6 October 2022.
  14. "Data Versioning for CD4ML – Part 2". AI Singapore. Archived from the original on 2022-10-06. Retrieved 2022-10-06.
  15. Baena, Daniel (2 March 2022). "How to build an efficient Machine Learning project workflow using Data Version Control (DVC)". Rappi Tech. Archived from the original on 6 October 2022. Retrieved 6 October 2022.
  16. "DVC Documentation. Supported storage types". dvc.org/doc. Archived from the original on 2022-10-06. Retrieved 2022-10-06.
  17. Vizard, Michael. "Iterative.ai updates MLOps platform to streamline and support cloud provisioning". VentureBeat. Archived from the original on 2022-10-06. Retrieved 2022-10-06.
  18. Kulkarni, Amit (17 June 2021). "Tracking ML Experiments With Data Version Control". Analytics Vidhya. Archived from the original on 6 October 2022. Retrieved 6 October 2022.
  19. "Introduction to Data Version Control (DVC)". Kaggle. Archived from the original on 2022-10-06. Retrieved 2022-10-06.
  20. Guerrapin, Basile (12 July 2019). "Using DVC to create an efficient version control system for data projects". The Qonto Way. Archived from the original on 6 October 2022. Retrieved 6 October 2022.
  21. "DVC Documentation. Get Started". dvc.org/doc. Archived from the original on 2022-10-06. Retrieved 2022-10-06.
  22. "DVC Documentation. Versioning Data and Models". dvc.org/doc. Archived from the original on 2022-10-06. Retrieved 2022-10-06.
  23. "DVC Documentation. Internal Directories and Files". dvc.org/doc. Archived from the original on 2022-10-06. Retrieved 2022-10-06.
  24. "DVC Documentation. Large Dataset Optimization". dvc.org/doc. Archived from the original on 2022-10-06. Retrieved 2022-10-06.
  25. "Working with Pipelines". MLOps Guide. Archived from the original on 2022-10-06. Retrieved 2022-10-06.
  26. "DVC Documentation. Get Started: Data Pipelines". dvc.org/doc. Archived from the original on 2022-10-06. Retrieved 2022-10-06.
  27. Idowu, Samuel; Strüber, Daniel; Berger, Thorsten (2021). "Asset Management in Machine Learning: A Survey". Astrophysics Data System (ADS). arXiv:2102.06919.
  28. Kapoor, Sayash; Narayanan, Arvind (2022). "Leakage and the Reproducibility Crisis in ML-based Science". ResearchGate. arXiv:2207.07048. Archived from the original on 2024-06-28. Retrieved 2022-10-07.
  29. "DVC Documentation. dvc.yaml". dvc.org/doc. Archived from the original on 2022-10-06. Retrieved 2022-10-06.
  30. "DVC Documentation. dvc.lock file". dvc.org/doc. Archived from the original on 2022-10-06. Retrieved 2022-10-06.
  31. "DVC Documentation. DVC Experiments Overview". dvc.org/doc. Archived from the original on 2022-10-07. Retrieved 2022-10-07.
  32. "DVC Documentation. Persisting Experiments". dvc.org/doc.
  33. "How we keep track of our data experiments". Kapernikov. 26 January 2022. Archived from the original on 6 October 2022. Retrieved 6 October 2022.
  34. "DVC Extension for Visual Studio Code". Visual Studio. Marketplace. Archived from the original on 2022-10-07. Retrieved 2022-10-07.
  35. "Iterative Introduces First Git-based Machine Learning Model Registry". Yahoo Finance. Archived from the original on 2022-10-07. Retrieved 2022-10-07.
  36. "DVC Documentation. Get Started: Visualization with Plots". dvc.org/doc. Archived from the original on 2022-10-07. Retrieved 2022-10-07.
  37. "DVC Documentation. Metrics and Plots outputs". dvc.org/doc. Archived from the original on 2022-10-07. Retrieved 2022-10-07.
  38. "DVC Documentation. DVCLive with DVC". dvc.org/doc. Archived from the original on 2022-10-07. Retrieved 2022-10-07.
  39. Nicholls, Emily (14 June 2022). "Iterative Announces A Free Extension To Microsoft Visual Studio Code To Accelerate ML Model Development Experience". TFiR. Archived from the original on 7 October 2022. Retrieved 7 October 2022.
  40. Bhartiya, Swapnil (28 June 2022). "Iterative's DVC Extension Turns VS Code Into ML Experimentation Platform". TFiR. Archived from the original on 28 June 2024. Retrieved 7 October 2022.
  41. Awan, Abid Ali. "12 Essential VSCode Extensions for Data Science". KDnuggets. Archived from the original on 2024-06-28. Retrieved 2022-10-07.
  42. "DVC 3 Years and 1.0 Pre-release". Iterative.ai. 4 May 2020. Archived from the original on 5 October 2022. Retrieved 5 October 2022.
  43. "Data Version Control Explained". Crowdbotics. 6 October 2021. Archived from the original on 7 October 2022. Retrieved 7 October 2022.
  44. Petrov, Dmitry. "Data Version Control: iterative machine learning". KDnuggets. Archived from the original on 2022-12-02. Retrieved 2022-12-02.
  45. Vázquez, Favio (17 April 2019). "Data version control with DVC. What do the authors have to say?". Towards Data Science. Archived from the original on 2 December 2022. Retrieved 2 December 2022.
  46. Smolaks, Max. "Iterative.ai pitches open source alternative to AWS SageMaker and Azure ML Engineer". AI Business. Archived from the original on 2022-12-02. Retrieved 2022-12-02.
  47. Singh, Swastik (3 June 2021). "An open-source startup Iterative.ai raises USD 20 Million". VCBay. Archived from the original on 2 December 2022. Retrieved 2 December 2022.
  48. "DVC 1.0 release: new features for MLOps". Iterative.ai. 22 June 2020. Archived from the original on 2 December 2022. Retrieved 2 December 2022.
  49. "DVC 2.0 Release". Iterative.ai. 3 March 2021. Archived from the original on 2 December 2022. Retrieved 2 December 2022.
  50. "DVC Documentation. Experiment Management". dvc.org/doc. Archived from the original on 2022-10-08. Retrieved 2022-10-07.
  51. "DVC Documentation. Related Technologies". dvc.org/doc. Archived from the original on 2022-12-02. Retrieved 2022-12-02.
  52. Orr, Einat (25 July 2022). "Data versioning as your 'Get out of jail' card – DVC vs. Git-LFS vs. dolt vs. lakeFS". lakeFS. Archived from the original on 23 November 2022. Retrieved 23 November 2022.