MLOps

MLOps is the set of practices at the intersection of Machine Learning, DevOps and Data Engineering

MLOps or ML Ops is a paradigm that aims to deploy and maintain machine learning models in production reliably and efficiently. [1] The word is a compound of "machine learning" and DevOps, the continuous development practice from the software field. Machine learning models are developed and tested in isolated experimental systems. When an algorithm is ready to be launched, data scientists, DevOps engineers, and machine learning engineers practice MLOps to transition it to production systems. [2] Similar to DevOps or DataOps approaches, MLOps seeks to increase automation and improve the quality of production models, while also focusing on business and regulatory requirements.

While MLOps started as a set of best practices, it is slowly evolving into an independent approach to ML lifecycle management. MLOps applies to the entire lifecycle, from integrating with model generation (software development lifecycle, continuous integration/continuous delivery), orchestration, and deployment, to health, diagnostics, governance, and business metrics. According to Gartner, MLOps is a subset of ModelOps: MLOps is focused on the operationalization of ML models, while ModelOps covers the operationalization of all types of AI models. [3]

Definition

MLOps is a paradigm that encompasses best practices, sets of concepts, and a development culture for the end-to-end conceptualization, implementation, monitoring, deployment, and scalability of machine learning products. Most of all, it is an engineering practice that leverages three contributing disciplines: machine learning, software engineering (especially DevOps), and data engineering. MLOps is aimed at productionizing machine learning systems by bridging the gap between development (Dev) and operations (Ops). Essentially, MLOps aims to facilitate the creation of machine learning products by leveraging these principles: CI/CD automation; workflow orchestration; reproducibility; versioning of data, model, and code; collaboration; continuous ML training and evaluation; ML metadata tracking and logging; continuous monitoring; and feedback loops. [4]
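
The versioning and metadata-tracking principles can be illustrated with a minimal sketch. It is not taken from the cited source: the names `fingerprint`, `RunRecord`, and `log_run`, the JSON log format, and the use of SHA-256 content hashes are all assumptions made for illustration. The underlying idea is that a training run can only be reproduced if the data version, code version, and parameters that produced a model are recorded together with its metrics.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


def fingerprint(payload: bytes) -> str:
    """Return a short content hash; identical bytes always give the same id."""
    return hashlib.sha256(payload).hexdigest()[:12]


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical metadata record tying a model to the inputs that made it."""
    data_version: str
    code_version: str
    params: dict
    metrics: dict


def log_run(data: bytes, code_version: str, params: dict, metrics: dict) -> str:
    """Serialize one training run as a JSON line (an assumed log format)."""
    record = RunRecord(fingerprint(data), code_version, params, metrics)
    return json.dumps(asdict(record), sort_keys=True)


if __name__ == "__main__":
    # Illustrative values only; "abc123" stands in for a commit identifier.
    print(log_run(b"x,y\n1,2\n", "abc123", {"lr": 0.1}, {"accuracy": 0.9}))
```

Because the data version is derived from the content itself, two runs on byte-identical data share a data version, while any change to the data yields a new one.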

History

The challenges of the ongoing use of machine learning in applications were highlighted in a 2015 paper. [5] The predicted growth in machine learning included an estimated doubling of ML pilots and implementations from 2017 to 2018, and again from 2018 to 2020. [6]

Reports show that a majority (up to 88%) of corporate machine learning initiatives struggle to move beyond test stages. [4] However, organizations that did put machine learning into production saw profit margin increases of 3-15%. [7] The MLOps market was estimated at $23.2 billion in 2019 and is projected to reach $126 billion by 2025 due to rapid adoption. [8]

Architecture

Machine learning systems can be divided into eight categories: data collection, data processing, feature engineering, data labeling, model design, model training and optimization, endpoint deployment, and endpoint monitoring. Each step in the machine learning lifecycle is built in its own system, but the systems must be interconnected. These are the minimum systems that enterprises need to scale machine learning within their organization.
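
The interconnection of the eight systems can be sketched as an ordered pipeline in which each stage receives the output of the previous one. This is an illustrative sketch only: `run_pipeline` and `passthrough` are hypothetical names, and real systems would exchange datasets and model artifacts rather than a plain dictionary.

```python
from typing import Callable, Dict, List, Tuple

# The eight categories named in the text, in lifecycle order.
STAGES: List[str] = [
    "data collection",
    "data processing",
    "feature engineering",
    "data labeling",
    "model design",
    "model training and optimization",
    "endpoint deployment",
    "endpoint monitoring",
]

Step = Callable[[Dict], Dict]


def passthrough(name: str) -> Step:
    """Hypothetical placeholder system: it only records that it ran."""
    return lambda ctx: {**ctx, "history": ctx.get("history", []) + [name]}


def run_pipeline(steps: Dict[str, Step], context: Dict) -> Tuple[Dict, List[str]]:
    """Run every stage in order, handing each stage's output to the next.

    Raises KeyError if a stage has no registered system, reflecting the
    point that all eight systems are required.
    """
    completed: List[str] = []
    for name in STAGES:
        context = steps[name](context)
        completed.append(name)
    return context, completed


if __name__ == "__main__":
    steps = {name: passthrough(name) for name in STAGES}
    result, done = run_pipeline(steps, {})
    print(done)
```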

Goals

Enterprises want to achieve a number of goals by using MLOps systems to implement ML successfully across the organization, including: [9]

A standard practice, such as MLOps, takes into account each of the aforementioned areas, which can help enterprises optimize workflows and avoid issues during implementation.

A common architecture of an MLOps system would include data science platforms where models are constructed and the analytical engines where computations are performed, with the MLOps tool orchestrating the movement of machine learning models, data and outcomes between the systems. [9]
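
The arrangement described above can be sketched in a few lines. The class names `DataSciencePlatform`, `AnalyticalEngine`, and `Orchestrator` are hypothetical stand-ins rather than real products, and the "model" is deliberately trivial; the sketch only shows the orchestrating tool moving a model, data, and outcomes between the two systems.

```python
from typing import Callable, List, Optional

Model = Callable[[float], float]


class DataSciencePlatform:
    """Stand-in for the system where models are constructed."""

    def build_model(self, slope: float) -> Model:
        # A trivial "model": multiply the input by a fixed slope.
        return lambda x: slope * x


class AnalyticalEngine:
    """Stand-in for the system where computations are performed."""

    def __init__(self) -> None:
        self.model: Optional[Model] = None

    def load(self, model: Model) -> None:
        self.model = model

    def score(self, data: List[float]) -> List[float]:
        if self.model is None:
            raise RuntimeError("no model has been deployed to this engine")
        return [self.model(x) for x in data]


class Orchestrator:
    """Hypothetical MLOps tool that moves models, data, and outcomes."""

    def __init__(self, platform: DataSciencePlatform, engine: AnalyticalEngine) -> None:
        self.platform = platform
        self.engine = engine
        self.outcomes: List[List[float]] = []

    def deploy(self, slope: float) -> None:
        # Move a freshly built model from the platform to the engine.
        self.engine.load(self.platform.build_model(slope))

    def run(self, data: List[float]) -> List[float]:
        # Send data to the engine and keep the outcomes for later review.
        result = self.engine.score(data)
        self.outcomes.append(result)
        return result


if __name__ == "__main__":
    orchestrator = Orchestrator(DataSciencePlatform(), AnalyticalEngine())
    orchestrator.deploy(slope=2.0)
    print(orchestrator.run([1.0, 2.5]))
```

Keeping the platform and the engine unaware of each other is a design choice of this sketch: only the orchestrator knows both, so either side could be replaced without changing the other.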

Related Research Articles

AnthillPro is a software tool originally developed and released as one of the first continuous integration servers. AnthillPro automates the process of building code into software projects and testing it to verify that project quality has been maintained. Software developers are able to identify bugs and errors earlier by using AnthillPro to track, collate, and test changes in real time to a collectively maintained body of computer code.

Release management: Process of software building

Release management is the process of managing, planning, scheduling and controlling a software build through different stages and environments; it includes testing and deploying software releases.

DevOps is a methodology in the software development and IT industry. Used as a set of practices and tools, DevOps integrates and automates the work of software development (Dev) and IT operations (Ops) as a means for improving and shortening the systems development life cycle.

Continuous delivery (CD) is a software engineering approach in which teams produce software in short cycles, ensuring that the software can be reliably released at any time by following a pipeline through a "production-like environment", without doing so manually. It aims at building, testing, and releasing software with greater speed and frequency. The approach helps reduce the cost, time, and risk of delivering changes by allowing for more incremental updates to applications in production. A straightforward and repeatable deployment process is important for continuous delivery.

Application-release automation (ARA) refers to the process of packaging and deploying an application or update of an application from development, across various environments, and ultimately to production. ARA solutions must combine the capabilities of deployment automation, environment management and modeling, and release coordination.

Cloud management is the management of cloud computing products and services.

BuildMaster

BuildMaster is an application release automation tool designed by the software development team Inedo. It combines build management and ARA capabilities to manage and automate processes primarily related to continuous integration, database change scripts, and production deployments, with the overall aim of releasing applications reliably. The tool is browser-based and can be used "out-of-the-box". Its feature set and scope put it in line with the DevOps movement, and it is marketed as "more than a release automation tool" that "brings together the people, processes, and practices that allow teams to deliver software rapidly, reliably, and responsibly." It is a tool that embodies incremental DevOps adoption.

Dynatrace: American technology company

Dynatrace, Inc. is a global technology company that provides a software observability platform based on artificial intelligence (AI) and automation. Dynatrace technologies are used to monitor, analyze, and optimize application performance, software development and security practices, IT infrastructure, and user experience for businesses and government agencies throughout the world.

XebiaLabs is an independent software company specializing in DevOps and continuous delivery for large enterprise organizations. XebiaLabs offers a DevOps Platform for application-release automation (ARA). Its components include release orchestration, deployment automation and DevOps intelligence.

Infrastructure as code (IaC) is the process of managing and provisioning computer data center resources through machine-readable definition files, rather than physical hardware configuration or interactive configuration tools. The IT infrastructure managed by this process comprises both physical equipment, such as bare-metal servers, and virtual machines, together with associated configuration resources. The definitions may be kept in a version control system rather than maintained through manual processes. The code in the definition files may use either scripts or declarative definitions, but IaC more often employs declarative approaches.
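
The declarative approach can be illustrated with a small sketch, which is not modelled on any particular IaC tool. The function `plan` and the resource names are hypothetical: the definition states only the desired end state, and the tooling works out which actions are needed by comparing it with the current state.

```python
from typing import Dict, List


def plan(desired: Dict[str, dict], current: Dict[str, dict]) -> List[str]:
    """Compare desired and current state and list the actions needed."""
    actions: List[str] = []
    for name in sorted(desired):
        if name not in current:
            actions.append(f"create {name}")
        elif desired[name] != current[name]:
            actions.append(f"update {name}")
    for name in sorted(current):
        if name not in desired:
            actions.append(f"delete {name}")
    return actions


if __name__ == "__main__":
    # Illustrative resources; in practice the desired state would be read
    # from definition files kept under version control.
    desired = {"web": {"cpus": 2}, "db": {"cpus": 4}}
    current = {"web": {"cpus": 1}, "cache": {"cpus": 1}}
    print(plan(desired, current))
```

Applying the same definition to an environment that already matches it yields an empty plan, which is what makes a declarative definition safe to apply repeatedly.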

DevOps toolchain

A DevOps toolchain is a set or combination of tools that aid in the delivery, development, and management of software applications throughout the systems development life cycle, as coordinated by an organisation that uses DevOps practices.

Tricentis: Austrian software testing company

Tricentis is a software testing company founded in 2007 and headquartered in Austin, Texas. It provides software testing automation and software quality assurance products for enterprise software.

Continuous configuration automation (CCA) is the methodology or process of automating the deployment and configuration of settings and software for both physical and virtual data center equipment.

DataOps is a set of practices, processes and technologies that combines an integrated and process-oriented perspective on data with automation and methods from agile software engineering to improve quality, speed, and collaboration and promote a culture of continuous improvement in the area of data analytics. While DataOps began as a set of best practices, it has now matured to become a new and independent approach to data analytics. DataOps applies to the entire data lifecycle from data preparation to reporting, and recognizes the interconnected nature of the data analytics team and information technology operations.

Automated machine learning (AutoML) is the process of automating the tasks of applying machine learning to real-world problems.

Automated Artificial Intelligence (AutoAI) is a variation of the automated machine learning or AutoML technology, which extends the automation of model building towards automation of the full life cycle of a machine learning model. It applies intelligent automation to the task of building predictive machine learning models by preparing data for training and identifying the best type of model for the given data, then choosing the features or columns of data that best support the problem the model is solving. Finally, automation evaluates a variety of tuning options to reach the best result as it generates, then ranks, model-candidate pipelines. The best performing pipelines can be put into production to process new data and deliver predictions based on the model training. Automated artificial intelligence can also be applied to making sure the model doesn't have inherent bias and to automating the tasks for continuous improvement of the model. Managing an AutoAI model requires frequent monitoring and updating, managed by a process known as model operations or ModelOps.

Artificial Intelligence for IT Operations (AIOps) is a term coined by Gartner in 2016 as an industry category for machine learning analytics technology that enhances IT operations analytics. AIOps is the acronym of "Artificial Intelligence Operations". Such operation tasks include automation, performance monitoring and event correlations among others.

ModelOps

ModelOps, as defined by Gartner, "is focused primarily on the governance and lifecycle management of a wide range of operationalized artificial intelligence (AI) and decision models, including machine learning, knowledge graphs, rules, optimization, linguistic and agent-based models". "ModelOps lies at the heart of any enterprise AI strategy". It orchestrates the lifecycles of all models in production across the entire enterprise, from putting a model into production to evaluating and updating the resulting application according to a set of governance rules, including both technical and business KPIs. It grants business domain experts the capability to evaluate AI models in production, independent of data scientists.

TestOps refers to the discipline of managing the operational aspects of testing within the software delivery lifecycle.

Data Version Control (software)

DVC is a free and open-source, platform-agnostic version system for data, machine learning models, and experiments. It is designed to make ML models shareable, experiments reproducible, and to track versions of models, data, and pipelines. DVC works on top of Git repositories and cloud storage.

References

  1. Breuel, Cristiano. "ML Ops: Machine Learning as an Engineering Discipline". Towards Data Science. Retrieved 6 July 2021.
  2. Talagala, Nisha. "Why MLOps (and not just ML) is your Business' New Competitive Frontier". AITrends. Retrieved 30 January 2018.
  3. Vashisth, Shubhangi; Brethenoux, Erick; Choudhary, Farhan; Hare, Jim. "Use Gartner's 3-Stage MLOps Framework to Successfully Operationalize Machine Learning Projects". Gartner. Retrieved 30 October 2020.
  4. Kreuzberger, Dominik; Kühl, Niklas; Hirschl, Sebastian (2023). "Machine Learning Operations (MLOps): Overview, Definition, and Architecture". IEEE Access. 11: 31866–31879. arXiv:2205.02302. doi:10.1109/ACCESS.2023.3262138. ISSN 2169-3536. S2CID 248524628.
  5. Sculley, D.; Holt, Gary; Golovin, Daniel; Davydov, Eugene; Phillips, Todd; Ebner, Dietmar; Chaudhary, Vinay; Young, Michael; Crespo, Jean-Francois; Dennison, Dan (7 December 2015). "Hidden Technical Debt in Machine Learning Systems" (PDF). NIPS Proceedings (2015). Retrieved 14 November 2017.
  6. Sallomi, Paul; Lee, Paul. "Deloitte Technology, Media and Telecommunications Predictions 2018" (PDF). Deloitte. Retrieved 13 October 2017.
  7. Bughin, Jacques; Hazan, Eric; Ramaswamy, Sree; Chui, Michael; Allas, Tera; Dahlström, Peter; Henke, Nicolaus; Trench, Monica. "Artificial Intelligence The Next Digital Frontier?". McKinsey. McKinsey Global Institute. Retrieved 1 June 2017.
  8. "2021 MLOps Platforms Vendor Analysis Report" (PDF). Neu.ro. Retrieved 18 March 2024.
  9. Walsh, Nick. "The Rise of Quant-Oriented Devs & The Need for Standardized MLOps". Slides. Nick Walsh. Retrieved 1 January 2018.
  10. "Code to production-ready machine learning in 4 steps". DAGsHub Blog. 2021-02-03. Retrieved 2021-02-19.
  11. Warden, Pete. "The Machine Learning Reproducibility Crisis". Pete Warden's Blog. Pete Warden. Retrieved 19 March 2018.
  12. Vaughan, Jack. "Machine learning algorithms meet data governance". SearchDataManagement. TechTarget. Retrieved 1 September 2017.
  13. Lorica, Ben. "How to train and deploy deep learning at scale". O'Reilly. Retrieved 15 March 2018.
  14. Garda, Natalie. "IoT and Machine Learning: Why Collaboration is Key". IoT Tech Expo. Encore Media Group. Retrieved 12 October 2017.
  15. Manyika, James. "What's now and next in analytics, AI, and automation". McKinsey. McKinsey Global Institute. Retrieved 1 May 2017.
  16. Haviv, Yaron. "MLOps Challenges, Solutions and Future Trends". Iguazio. Retrieved 19 February 2020.