Distributed version control

In software development, distributed version control (also known as distributed revision control) is a form of version control in which the complete codebase, including its full history, is mirrored on every developer's computer. [1] Compared to centralized version control, this enables automatic management of branching and merging, speeds up most operations (except pushing and pulling), improves the ability to work offline, and does not rely on a single location for backups. [1] [2] [3] Git, the world's most popular version control system, [4] is a distributed version control system.

In 2010, software development author Joel Spolsky described distributed version control systems as "possibly the biggest advance in software development technology in the [past] ten years". [2]

Distributed vs. centralized

Distributed version control systems (DVCS) use a peer-to-peer approach to version control, as opposed to the client–server approach of centralized systems. Distributed revision control synchronizes repositories by transferring patches from peer to peer. There is no single central version of the codebase; instead, each user has a working copy and the full change history.
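
The peer-to-peer synchronization described above can be illustrated with a minimal Python sketch. It is a toy model, not how any particular DVCS is implemented: the `Repository` class, its `commit` and `pull_from` methods, and the choice to identify a change by a hash of its text alone are all assumptions made for illustration.

```python
import hashlib


class Repository:
    """A toy peer that holds the full set of changes, keyed by content hash."""

    def __init__(self, name):
        self.name = name
        self.changes = {}  # change id -> patch text

    def commit(self, patch):
        # Identify each change by a hash of its content. This is an
        # assumption of the sketch; real systems also hash parents and metadata.
        change_id = hashlib.sha1(patch.encode("utf-8")).hexdigest()
        self.changes[change_id] = patch
        return change_id

    def pull_from(self, peer):
        # Transfer only the changes this repository does not have yet.
        missing = {
            cid: patch
            for cid, patch in peer.changes.items()
            if cid not in self.changes
        }
        self.changes.update(missing)
        return len(missing)


alice = Repository("alice")
bob = Repository("bob")
alice.commit("add README")
bob.commit("fix typo in parser")

# Each peer pulls directly from the other; no central server is involved.
alice.pull_from(bob)
bob.pull_from(alice)
print(alice.changes == bob.changes)
```

After the two pulls, both peers hold the same set of changes, and neither one is privileged as the "central" copy.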

Advantages of DVCS (compared with centralized systems) include:

Disadvantages of DVCS (compared with centralized systems) include:

Some originally centralized systems now offer some distributed features. Team Foundation Server and Visual Studio Team Services now host both centralized and distributed version control repositories, the latter by hosting Git.

Similarly, some distributed systems now offer features that mitigate the issues of checkout times and storage costs, such as the Virtual File System for Git developed by Microsoft to work with very large codebases, [8] which exposes a virtual file system that downloads files to local storage only as they are needed.
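
The idea of downloading files only as they are needed can be sketched as follows. This is a simplified illustration in Python and does not reflect how the Virtual File System for Git is actually implemented; the `LazyWorkingTree` class, its methods, and the in-memory dictionary standing in for a remote server are hypothetical.

```python
class LazyWorkingTree:
    """Toy working tree that fetches file contents on first access."""

    def __init__(self, remote_files):
        self._remote = remote_files  # stands in for the remote server
        self._local = {}             # files already downloaded
        self.downloads = 0

    def listdir(self):
        # The full file listing is visible without downloading any content.
        return sorted(self._remote)

    def read(self, path):
        if path not in self._local:
            # First access: download the file and keep a local copy.
            self._local[path] = self._remote[path]
            self.downloads += 1
        return self._local[path]


tree = LazyWorkingTree({
    "src/main.c": "int main(void) { return 0; }",
    "docs/guide.txt": "User guide",
})
tree.read("src/main.c")
tree.read("src/main.c")  # served from the local copy, no second download
print(tree.downloads)
```

Only the file that was actually read is copied to local storage; `docs/guide.txt` appears in the listing but is never downloaded in this run.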

Work model

The distributed model is generally better suited for large projects with partly independent developers, such as the Linux kernel project, because developers can work independently and submit their changes for merge (or rejection). This flexibility allows adopting custom source code contribution workflows, such as the integrator workflow, which is the most widely used.

Unlike the centralized model, where developers must serialize their work to avoid problems with different versions, in the distributed model developers can clone the entire history of the code to their local machines. They commit their changes to their local repositories first, creating "change sets", before pushing them to the master repository. This approach enables developers to work locally and disconnected, making it more convenient for distributed teams. [9]
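
The cycle of cloning, committing locally, and then pushing can be sketched in a few lines of Python. This is a toy model under stated assumptions: history is a plain list of change sets, the `Repo` class and its methods are hypothetical, and a push is accepted only when the shared repository has not moved on since the clone.

```python
class Repo:
    """Toy repository: an ordered list of change sets (hypothetical model)."""

    def __init__(self, history=None):
        self.history = list(history or [])

    def clone(self):
        # A clone copies the entire history, not just the latest snapshot.
        return Repo(self.history)

    def commit(self, change_set):
        # Committing touches only the local repository; no network is needed.
        self.history.append(change_set)

    def push(self, remote):
        # Accept the push only if the remote history is a prefix of ours,
        # that is, nobody else has pushed in the meantime.
        n = len(remote.history)
        if self.history[:n] != remote.history:
            raise RuntimeError("remote has changes we lack; pull and merge first")
        new = self.history[n:]
        remote.history.extend(new)
        return len(new)


shared = Repo(["initial import"])
local = shared.clone()
local.commit("add parser")
local.commit("add tests")
before_push = list(shared.history)  # the shared repository is still unchanged
pushed = local.push(shared)
```

Both commits exist only in `local` until `push` is called, which is what allows work to continue while disconnected.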

Central and branch repositories

In a truly distributed project, such as Linux, every contributor maintains their own version of the project, with different contributors hosting their own respective versions and pulling in changes from other users as needed, resulting in a general consensus emerging from multiple different nodes. This also makes the process of "forking" easy, as all that is required is for one contributor to stop accepting pull requests from other contributors and let the codebases gradually grow apart.

This arrangement, however, can be difficult to maintain, so many projects shift to a paradigm in which one contributor's repository is the universal "upstream", the repository from which changes are almost always pulled. Under this paradigm, development is somewhat recentralized, as every project now has a central repository that is informally considered the official repository, managed by the project maintainers collectively. While distributed version control systems make it easy for new developers to "clone" a copy of any other contributor's repository, in a central model, new developers always clone the central repository to create identical local copies of the code base. Under this system, changes in the central repository are periodically synchronized with the local repository, and once development is done, the change should be integrated into the central repository as soon as possible.

Organizations using this centralized pattern often choose to host the central repository on a third-party service like GitHub, which not only offers more reliable uptime than self-hosted repositories but can also add centralized features like issue trackers and continuous integration.

Pull requests

Contributions to a source code repository that uses a distributed version control system are commonly made by means of a pull request, also known as a merge request. [10] The contributor requests that the project maintainer pull the source code change, hence the name "pull request". The maintainer has to merge the pull request if the contribution should become part of the source base. [11]

The developer creates a pull request to notify maintainers of a new change; a comment thread is associated with each pull request. This allows for focused discussion of code changes. Submitted pull requests are visible to anyone with repository access. A pull request can be accepted or rejected by maintainers. [12]

Once the pull request is reviewed and approved, it is merged into the repository. Depending on the established workflow, the code may need to be tested before being included in an official release. Therefore, some projects contain a special branch for merging untested pull requests. [11] [13] Other projects run an automated test suite on every pull request, using a continuous integration tool, and the reviewer checks that any new code has appropriate test coverage.
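
One way to picture the review-and-merge decision described above is the following Python sketch. The policy it encodes is only an illustration, not the behavior of any hosting service: the `PullRequest` fields, the `target_branch` function, and the branch names `main` and `staging` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PullRequest:
    title: str
    approved: bool = False      # set by a maintainer after review
    tests_passed: bool = False  # set by an automated test run, if any
    comments: list = field(default_factory=list)  # the discussion thread


def target_branch(pr):
    """Decide where a pull request may be merged (hypothetical policy)."""
    if not pr.approved:
        return None        # rejected or still under review: not merged
    if pr.tests_passed:
        return "main"      # approved and tested
    return "staging"       # approved but untested: special branch


pr = PullRequest("Add parser", approved=True)
pr.comments.append("Looks good, but please add tests.")
print(target_branch(pr))
```

In this sketch an approved but untested contribution lands on a separate branch, mirroring projects that keep a special branch for untested pull requests; a project that runs its test suite on every pull request would set `tests_passed` from the continuous integration result before merging.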

History

The first open-source DVCSs included Arch, Monotone, and Darcs. However, open-source DVCSs did not become widely popular until the release of Git and Mercurial.

BitKeeper was used in the development of the Linux kernel from 2002 to 2005. [14] The development of Git, now the world's most popular version control system, [4] was prompted by the decision of the company that made BitKeeper to rescind the free license that Linus Torvalds and some other Linux kernel developers had previously taken advantage of. [14]

References

  1. Chacon, Scott; Straub, Ben (2014). "About version control". Pro Git (2nd ed.). Apress. Chapter 1.1. Retrieved 4 June 2019.
  2. Spolsky, Joel (17 March 2010). "Distributed Version Control Is Here to Stay, Baby". Joel on Software. Retrieved 4 June 2019.
  3. "Intro to Distributed Version Control (Illustrated)". www.betterexplained.com. Retrieved 7 January 2018.
  4. "Version Control Systems Popularity in 2016". www.rhodecode.com. Retrieved 7 January 2018.
  5. O'Sullivan, Bryan. "Distributed revision control with Mercurial". Retrieved 13 July 2007.
  6. Chacon, Scott; Straub, Ben (2014). "Distributed workflows". Pro Git (2nd ed.). Apress. Chapter 5.1.
  7. "What is version control: centralized vs. DVCS". www.atlassian.com. 14 February 2012. Retrieved 7 January 2018.
  8. Allen, Jonathan (8 February 2017). "How Microsoft Solved Git's Problem with Large Repositories". Retrieved 6 August 2019.
  9. Upadhaye, Annu (22 February 2023). "Centralized vs Distributed Version Control". GFG. Retrieved 4 April 2024.
  10. Sijbrandij, Sytse (29 September 2014). "GitLab Flow". GitLab. Retrieved 4 August 2018.
  11. Johnson, Mark (8 November 2013). "What is a pull request?". Oaawatch. Retrieved 27 March 2016.
  12. "Using pull requests". GitHub. Retrieved 27 March 2016.
  13. "Making a Pull Request". Atlassian. Retrieved 27 March 2016.
  14. McAllister, Neil. "Linus Torvalds' BitKeeper blunder". InfoWorld. Retrieved 19 March 2017.