Horovod (machine learning)

Horovod
Developer: Uber
Initial release: August 9, 2017 [1]
Stable release: v0.28.1 / June 12, 2023 [2]
Written in: Python, C++, CUDA
Platform: Linux, macOS, Windows
Type: Artificial intelligence ecosystem
License: Apache License 2.0
Website: horovod.ai

Horovod is a free and open-source distributed deep learning training framework for TensorFlow, Keras, PyTorch and Apache MXNet. [3] [4]

It is designed to scale existing single-GPU training scripts to run efficiently on multiple GPUs and compute nodes with minimal code changes, using synchronous data-parallel training based on the ring-allreduce communication pattern. [5] Horovod was initially developed at Uber and released as an open-source project in 2017; it is now hosted by the LF AI & Data Foundation, a project of the Linux Foundation. [1]

History

Horovod was created at Uber as part of the company's internal machine learning platform Michelangelo to simplify scaling TensorFlow models across many GPUs. [1] The first public release of the library, version 0.9.0, was tagged on GitHub in August 2017 under the Apache 2.0 license. [2] In October 2017, Uber Engineering publicly introduced Horovod as an open-source component of its deep learning toolkit. [1]

In February 2018 Alexander Sergeev and Mike Del Balso published a technical paper describing Horovod's design and benchmarking its performance on up to 512 GPUs, showing near-linear scaling for several image-classification models when compared with single-GPU baselines. [1]

In December 2018 Uber contributed Horovod to the LF Deep Learning Foundation (later LF AI & Data), making it a Linux Foundation project. [6] [7] [8] Horovod entered incubation under LF AI & Data and graduated as a full foundation project in 2020. [9]

Since its initial release the project has expanded beyond TensorFlow to provide APIs for PyTorch, Keras and Apache MXNet, as well as integrations with frameworks such as Apache Spark and Ray, support for elastic training, and tooling for automated performance tuning and profiling. [10] [11]

Design and features

Horovod implements synchronous data-parallel training, in which each worker process maintains a replica of the model and computes gradients on different mini-batches of data. [1] The gradients are aggregated across workers using the ring-allreduce communication pattern rather than a central parameter server, which reduces communication bottlenecks and can improve scaling on multi-GPU clusters. [1] Communication is built on top of collective-communication libraries such as MPI, NCCL, Gloo and Intel oneCCL, and supports both GPU and CPU training. [12]
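
In practice, adapting an existing script involves a handful of changes: initializing the library, pinning each worker process to a GPU, wrapping the optimizer so that gradients are averaged with allreduce, and broadcasting the initial model state from rank 0. The following is a minimal sketch of this pattern using Horovod's PyTorch API; the toy model, data and hyperparameters are placeholders for illustration, and a GPU is assumed to be available:

    import torch
    import torch.nn as nn
    import torch.optim as optim
    import horovod.torch as hvd

    # Initialize Horovod; the launcher starts one such process per GPU.
    hvd.init()

    # Pin each worker process to its local GPU.
    torch.cuda.set_device(hvd.local_rank())

    model = nn.Linear(10, 1).cuda()

    # A common convention is to scale the learning rate by the worker count.
    optimizer = optim.SGD(model.parameters(), lr=0.01 * hvd.size())

    # Wrap the optimizer so gradients are averaged across all workers
    # with an allreduce before each update step.
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters())

    # Broadcast initial parameters and optimizer state from rank 0 so that
    # every model replica starts from identical weights.
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

    loss_fn = nn.MSELoss()
    for step in range(100):
        # A real job would give each worker a distinct shard of the dataset,
        # e.g. via torch.utils.data.distributed.DistributedSampler.
        x = torch.randn(32, 10).cuda()
        y = torch.randn(32, 1).cuda()
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()

Such a script is typically launched with one worker process per GPU using the bundled launcher, for example horovodrun -np 4 python train.py to run four workers on a single machine.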

In the benchmark experiments reported in the original paper, Horovod achieved around 90% scaling efficiency on 512 GPUs for the ResNet-101 and Inception v3 convolutional neural networks, and around 68% scaling efficiency for the VGG-16 model. [1]
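
Scaling efficiency here is the ratio of the measured aggregate throughput on N GPUs to N times the throughput of a single GPU. A short illustrative calculation, with hypothetical throughput figures that are not taken from the paper:

    # Scaling efficiency: measured aggregate throughput on n GPUs divided by
    # the ideal throughput of n perfectly independent single-GPU workers.
    def scaling_efficiency(throughput_n, throughput_1, n):
        return throughput_n / (n * throughput_1)

    # Hypothetical figures: one GPU processes 100 images/s, and 512 GPUs
    # together reach 46,080 images/s.
    print(scaling_efficiency(46_080, 100, 512))  # 0.9, i.e. 90% efficiency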

Horovod can be deployed on-premises or in cloud environments and is distributed as a Python package with optional GPU support via CUDA. [11] [13] The official documentation provides guides for running Horovod with Docker, Kubernetes (including via Kubeflow and the MPI Operator), commercial platforms such as Databricks, and cluster schedulers such as LSF. [11]

Adoption and use cases

Within Uber, Horovod has been used for applications including autonomous driving research, fraud detection and trip forecasting. [14] [8]

Major cloud providers have integrated Horovod into their managed machine learning offerings. Amazon Web Services supports distributed training with Horovod in services such as Amazon SageMaker and AWS Deep Learning Containers, [15] [16] while Microsoft Azure documents Horovod-based training workflows for Azure Synapse Analytics. [17]

Technical guides from academic and research computing centers, including Purdue University and the NASA Advanced Supercomputing Division, describe Horovod-based workflows for multi-GPU training on supercomputers and clusters. [18] [12]

Horovod is also used in conjunction with Apache Spark and dedicated storage systems as part of end-to-end data processing and model-training pipelines. [19] Industry blogs and technical tutorials describe deployments of Horovod on Kubernetes, on-premises clusters and cloud-managed Kubernetes services such as Amazon EKS. [19] [16]

References

  1. Sergeev, Alexander (October 17, 2017). "Meet Horovod: Uber's Open Source Distributed Deep Learning Framework for TensorFlow". Uber Engineering Blog. Retrieved November 28, 2025.
  2. "Releases · horovod/horovod". GitHub. Retrieved July 11, 2023.
  3. "Overview". Horovod documentation. LF AI & Data Foundation. Retrieved November 28, 2025.
  4. Johnson, Khari (December 13, 2018). "Uber brings Horovod project for distributed deep learning to Linux Foundation". VentureBeat. Retrieved July 9, 2020.
  5. Sergeev, Alexander; Del Balso, Mike (February 15, 2018). "Horovod: fast and easy distributed deep learning in TensorFlow". arXiv:1802.05799. Retrieved November 28, 2025.
  6. "Projects – LF AI & Data". LF AI & Data Foundation. Retrieved November 28, 2025.
  7. Johnson, Khari (December 13, 2018). "Uber brings Horovod project for distributed deep learning to Linux Foundation". VentureBeat. Retrieved November 28, 2025.
  8. "Horovod: an open-source distributed training framework by Uber for TensorFlow, Keras, PyTorch, and MXNet". Packt. April 9, 2019. Retrieved November 28, 2025.
  9. "LF AI Foundation Announces Graduation of Horovod Project". LF AI & Data Foundation. September 9, 2020. Retrieved November 28, 2025.
  10. "Elastic Deep Learning with Horovod on Ray". Uber Engineering Blog. March 8, 2021. Retrieved November 28, 2025.
  11. "Horovod documentation". Horovod. LF AI & Data Foundation. Retrieved November 28, 2025.
  12. "Using Horovod for Distributed Training". NASA Advanced Supercomputing Division. October 6, 2022. Retrieved November 28, 2025.
  13. "Horovod". Horovod. The Linux Foundation. Retrieved November 28, 2025.
  14. "LF Deep Learning Welcomes Horovod Distributed Training Framework as Newest Project". Linux Foundation. December 13, 2018. Retrieved November 28, 2025.
  15. "Launching TensorFlow distributed training easily with Horovod or parameter servers in Amazon SageMaker". AWS Machine Learning Blog. September 13, 2019. Retrieved November 28, 2025.
  16. "How to run distributed training using Horovod and MXNet in Amazon SageMaker". AWS Machine Learning Blog. September 1, 2020. Retrieved November 28, 2025.
  17. "Tutorial: Distributed training with Horovod and TensorFlow (deprecated)". Microsoft Learn. June 3, 2025. Retrieved November 28, 2025.
  18. "Distributed Deep Learning with Horovod". Purdue University Research Computing. Retrieved November 28, 2025.
  19. "Deep learning with Apache Spark and NetApp AI—distributed DL with Horovod". NetApp Blog. February 6, 2023. Retrieved November 28, 2025.