Site reliability engineering

Last updated April 16, 2024

Site reliability engineering (SRE) is a set of principles and practices that applies aspects of software engineering to IT infrastructure and operations.^[1] Its stated goal is to create highly reliable and scalable software systems. Although the two are closely related, SRE differs slightly from DevOps.^[2]^[3]^[4]

History

The field of site reliability engineering originated at Google with Ben Treynor Sloss,^[5]^[6] who founded a site reliability team after joining the company in 2003.^[7] In 2016, Google employed more than 1,000 site reliability engineers.^[8] The concept subsequently spread into the broader software development industry, and other companies began to employ site reliability engineers.^[9] The position is more common at larger web companies, as small companies often do not operate at a scale that would require dedicated SREs.^[9] Organizations that have adopted the concept include Airbnb, Dropbox, IBM,^[10] LinkedIn,^[11] Netflix,^[8] and Wikimedia.^[12] According to a 2021 report by the DevOps Institute, 22% of organizations in a survey of 2,000 respondents had adopted the SRE model.^[13]^[14]

Definition

Site reliability engineering, as a job role, may be performed by individual contributors or by teams within a broader engineering organization. They are responsible for some combination of the following: system availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.^[15] Site reliability engineers often have backgrounds in software engineering, system engineering, or system administration.^[16] Focuses of SRE include automation, system design, and improvements to system resilience.^[16]

Site reliability engineering, as a set of principles and practices, can be performed by anyone. As with security engineering, everyone is expected to contribute to good practice, but a company may eventually hire dedicated specialists and engineers for the job.^[citation needed]

Site reliability engineering has also been described as a specific implementation of DevOps, although the two differ slightly. SRE focuses specifically on building reliable systems, whereas DevOps has a broader focus.^[2]^[3]^[4] Despite this difference, some companies have rebranded their operations teams as SRE teams with little meaningful change.^[9]

Principles and practices

There have been multiple attempts to define a canonical list of site reliability engineering principles. Consensus is lacking, but most definitions include the following characteristics:^[1]^[17]

Automating or eliminating anything repetitive, in a cost-effective way.
Avoiding the pursuit of more reliability than is strictly necessary. Defining what is necessary is a practice in itself (see the list of practices below).
Designing systems with a bias toward reducing risks to availability, latency, and efficiency.
Observability: the ability to ask arbitrary questions about a system without having to know ahead of time what to ask.^[18]

Site reliability engineering practices also vary widely, but the following are commonly implemented, at least in part:

Toil management, which implements the first principle outlined above.
Defining and measuring reliability goals: service level indicators (SLIs), service level objectives (SLOs), and error budgets.
Non-Abstract Large Scale Systems Design (NALSD) with a focus on reliability.
Designing for and implementing observability.
Defining, testing, and running an incident management process.
Capacity planning.
Change and release management, including CI/CD.
Chaos engineering.
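
The relationship between an SLI, an SLO, and an error budget in the list above can be illustrated with a small amount of arithmetic. The following Python sketch is only an illustration under stated assumptions: it uses a request-based SLI (good requests divided by total requests), and the names `SLO`, `sli`, `error_budget`, and `budget_remaining` are hypothetical rather than taken from any particular SRE tool or from the cited sources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A reliability target; 0.999 means 99.9% of requests should succeed."""
    target: float

    def __post_init__(self) -> None:
        # Assumption: a target of exactly 100% leaves no error budget,
        # so this sketch only accepts values strictly between 0 and 1.
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")


def sli(good_requests: int, total_requests: int) -> float:
    """Service level indicator: the fraction of requests that were good."""
    if total_requests == 0:
        return 1.0  # assumption: a window with no traffic counts as compliant
    return good_requests / total_requests


def error_budget(slo: SLO, total_requests: int) -> float:
    """Number of bad requests the SLO tolerates over the measurement window."""
    return (1.0 - slo.target) * total_requests


def budget_remaining(slo: SLO, good_requests: int, total_requests: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    if total_requests == 0:
        return 1.0  # nothing was served, so nothing was spent
    bad_requests = total_requests - good_requests
    return 1.0 - bad_requests / error_budget(slo, total_requests)


if __name__ == "__main__":
    slo = SLO(target=0.999)
    good, total = 999_400, 1_000_000
    print(f"SLI: {sli(good, total):.4%}")
    print(f"Budget (bad requests allowed): {error_budget(slo, total):.0f}")
    print(f"Budget remaining: {budget_remaining(slo, good, total):.1%}")
```

In this sketch, a service that has spent its whole budget returns zero or a negative value from `budget_remaining`. How a team reacts to that, for instance by slowing releases, is a policy decision that the code does not model.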

Implementations

Site reliability engineering teams engage with other teams in their companies, and apply the SRE principles and practices, in various ways. The following is a high-level overview of common SRE team implementations:^[19]

Kitchen Sink, a.k.a. “Everything SRE”

The scope of services or workflows covered is usually unbounded.

Infrastructure

These teams focus on the reliability of behind-the-scenes systems that help make other teams' jobs more efficient. They are often confused with "Platform" or "Platform Operations" teams. Infrastructure SRE teams may pair up with one or more platform engineering teams, but they differ in that Infrastructure SRE teams perform most, if not all, of the work described in the principles and practices listed above. Platform teams tend to focus on building the platform, and while reliability is desirable, it is not their sole priority.

Tools

These teams focus on tools to measure, maintain, and improve system reliability, for example Nagios Core or Prometheus.

Product or application

An SRE team dedicated to a product or application. Large companies often staff several of these.

Embedded

Usually one or two SREs staffed within a software engineering team, applying most of the principles and practices described above.

Consulting

These teams consult on how to implement SRE principles and practices. Their members are usually experienced SREs who have worked on teams in one or several of the implementations above. SREs on external-facing consulting teams are sometimes called "Customer Reliability Engineers".

Large companies that have adopted SRE tend to combine the implementations described above, often with multiple teams of the same kind. For example, a company may have several product or application SRE teams to meet the specific demands of individual products, alongside an infrastructure SRE team that pairs up with a platform engineering group to meet the reliability goals of a platform shared by those products.

Industry

The USENIX organization has held an annual SREcon conference since 2014 for site reliability engineers in the industry and also holds regional conferences with similar themes.^[20]

References

[1] "Evaluating where your team lies on the SRE spectrum". Google Cloud Blog. Retrieved June 26, 2021.
[2] Beyer, Betsy; Jones, Chris; Petoff, Jennifer; Murphy, Niall, eds. (2016). Site Reliability Engineering: How Google Runs Production Systems. Sebastopol, CA: O'Reilly Media. ISBN 978-1-4919-5118-7. OCLC 945577030.
[3] Vargo, Seth; Fong-Jones, Liz (March 1, 2018). What's the Difference Between DevOps and SRE? (class SRE implements DevOps) (Video). Google.
[4] "What is SRE? - SRE Explained - AWS". Amazon Web Services, Inc. Retrieved November 5, 2022.
[5] Hill, Patrick. "Love DevOps? Wait until you meet SRE". Atlassian. Retrieved June 17, 2021.
[6] "What is SRE?". Red Hat. Retrieved June 17, 2021.
[7] Treynor, Ben (2014). "Keys to SRE". USENIX SREcon14. Retrieved June 17, 2021.
[8] Fischer, Donald (March 2, 2016). "Are site reliability engineers the next data scientists?". TechCrunch. Retrieved June 17, 2021.
[9] Gossett, Stephen (June 1, 2020). "What Is a Site Reliability Engineer? What Does an SRE Do?". Built In. Retrieved June 17, 2021.
[10] "Site Reliability Engineering". IBM Cloud Education. IBM. November 12, 2020. Retrieved June 21, 2021.
[11] "Site Reliability Engineering (SRE)". engineering.linkedin.com. Retrieved March 12, 2024.
[12] "SRE - Wikitech". wikitech.wikimedia.org. Retrieved October 17, 2021.
[13] Oehrlich, Eveline; Groll, Jayne; Garbani, Jean-Pierre (2021). Upskilling 2021 Enterprise DevOps Skills Report (PDF) (Report). DevOps Institute. Retrieved June 17, 2021.
[14] Oehrlich, Eveline (May 4, 2021). "What it takes to be a site reliability engineer". TechBeacon. Micro Focus. Retrieved June 17, 2021.
[15] Treynor, Ben. "In Conversation" (Interview). Interviewed by Niall Murphy. Google Site Reliability Engineering.
[16] Jones, Chris; Underwood, Todd; Nukala, Shylaja (June 2015). "Hiring Site Reliability Engineers" (PDF). ;login:. Vol. 40, no. 3. pp. 35–39. Retrieved June 17, 2021.
[17] "The 7 SRE Principles [And How to Put Them Into Practice]". www.blameless.com. Retrieved June 26, 2021.
[18] "Learn about observability | Honeycomb". docs.honeycomb.io. Retrieved June 26, 2021.
[19] "SRE at Google: How to structure your SRE team". Google Cloud Blog. Retrieved June 26, 2021.
[20] "Usenix SREcon". USENIX. 2021. Retrieved June 17, 2021.

External links

Awesome Site Reliability Engineering: resources list
How they SRE: resources list
SRE Weekly: weekly newsletter devoted to SRE
SRE at Google: landing page for learning more about SRE at Google
Komodor K8s Reliability: learning center with resources for SREs working with Kubernetes

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
