Site reliability engineering

Site reliability engineering (SRE) is a set of principles and practices that applies aspects of software engineering to IT infrastructure and operations. [1] SRE aims to create highly reliable and scalable IT systems. Although they are closely related, SRE is slightly different from DevOps. [2] [3] [4]

History

The field of site reliability engineering originated at Google with Ben Treynor Sloss, [5] [6] who founded a site reliability team after joining the company in 2003. [7] In 2016, Google employed more than 1,000 site reliability engineers. [8] The concept subsequently spread into the broader software development industry, and other companies began to employ site reliability engineers. [9] The position is more common at larger web companies, as small companies often do not operate at a scale that would require dedicated SREs. [9] Organizations that have adopted the concept include Airbnb, Dropbox, IBM, [10] LinkedIn, [11] Netflix, [8] and Wikimedia. [12] According to a 2021 report by the DevOps Institute, 22% of respondents in a survey of 2,000 IT professionals worldwide had adopted the SRE model, compared to 15% the previous year. [13] [14]

Definition

Site reliability engineering, as a job role, may be performed by individual contributors or organized in teams that are responsible, within a broader engineering organization, for a combination of the following: system availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning. [15] Site reliability engineers often have backgrounds in software engineering, system engineering, or system administration. [16] Focuses of SRE include automation, system design, and improvements to system resilience. [16]
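
As an illustration of how an availability responsibility translates into concrete numbers, the minimal sketch below converts an availability target into the downtime it permits per 30-day month. The specific targets shown are hypothetical examples, not figures from this article.

```python
# A minimal sketch (not from the article): converting hypothetical
# availability targets into the downtime they permit per 30-day month.

def allowed_downtime_minutes(availability: float, days: int = 30) -> float:
    """Return the minutes of downtime permitted by an availability target."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability)

if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9999):  # hypothetical targets
        print(f"{target:.2%} availability -> "
              f"{allowed_downtime_minutes(target):.1f} min/month of downtime")
```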

Site reliability engineering, as a set of principles and practices, can be performed by anyone. As with security engineering, everyone is expected to contribute to good practices, but a company may eventually hire specialists and engineers to do the job.[citation needed]

Site reliability engineering is considered a specific implementation of DevOps; [17] SRE focuses specifically on building reliable systems, whereas DevOps has a broader scope. [2] [3] [4] Despite the difference in focus, some companies have rebranded their operations teams as SRE teams with little meaningful change. [9]

Principles and practices

There have been multiple attempts to define a canonical list of site reliability engineering principles. Although consensus is lacking, the following characteristics are included in most definitions: [1] [18]

Site reliability engineering practices likewise vary widely, but the following are commonly seen as at least partially implemented:

Implementations

SRE teams collaborate with other departments within an organization to implement SRE principles effectively. Below is an overview of common implementations: [20]

Kitchen Sink, a.k.a. “Everything SRE”

In site reliability engineering, "kitchen sink" refers to an expansive and often unbounded scope of services and workflows overseen by a single SRE team. Unlike roles with clearly defined boundaries, these SREs take on a wide range of responsibilities, from system design and performance optimization to incident management and automation. This holistic approach lets the team address many kinds of challenges, keeping systems running efficiently and evolving them in response to changing demands and complexity.

Infrastructure

Infrastructure SRE teams focus on maintaining and improving the reliability of the key systems that support other teams' workflows. They sometimes collaborate with platform engineering teams, but the two roles differ in emphasis: platform teams primarily build and maintain the software, tools, and services used across the organization, whereas infrastructure SRE teams are responsible for ensuring that those systems meet uptime, performance, and efficiency standards.

Tools

SRE teams use a variety of tools to measure, maintain, and improve system reliability; these tools support performance monitoring, issue detection, and proactive maintenance. For instance, Nagios Core is widely used for system monitoring and alerting, while Prometheus is popular for collecting and querying metrics in cloud-native environments, helping teams respond quickly to reliability issues.
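
As a concrete illustration of the metrics-collection workflow mentioned above, the sketch below exposes a simple request counter for a Prometheus server to scrape. It assumes the official prometheus_client Python library is installed; the metric name, port, and simulated workload are arbitrary choices for the example, not details from the article.

```python
# A minimal sketch: exposing an application metric for Prometheus to scrape.
# Assumptions: prometheus_client is installed; port 8000 and the metric name
# are arbitrary example values.
import random
import time

from prometheus_client import Counter, start_http_server

# Define a counter metric that a Prometheus server could scrape and query.
REQUESTS_TOTAL = Counter(
    "example_requests_total",
    "Total number of example requests handled",
)

def handle_request() -> None:
    # Stand-in for real request handling; only increments the counter.
    REQUESTS_TOTAL.inc()

if __name__ == "__main__":
    # Serve metrics at http://localhost:8000/metrics for Prometheus to scrape.
    start_http_server(8000)
    while True:
        handle_request()
        time.sleep(random.uniform(0.1, 0.5))
```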

Product or application

SRE teams dedicated to specific products or applications are common in large organizations. These teams are responsible for ensuring the reliability, scalability, and performance of key services. Larger companies typically have multiple SRE teams, each focusing on a different product or application, so that each area receives the specialized attention needed to meet its performance and availability targets.

Embedded

In an embedded model, individual SREs or small SRE pairs are integrated directly into software engineering teams. These SREs work closely with developers, applying core SRE principles such as automation, monitoring, and incident response directly to the software development lifecycle. This approach helps improve reliability and performance while fostering collaboration between SREs and developers.

Consulting

Consulting SRE teams specialize in advising organizations on the implementation of SRE principles and practices. Typically composed of seasoned SREs with extensive experience across various implementations, these teams provide valuable insights and guidance tailored to specific organizational needs. When working directly with clients, these SREs are often referred to as 'Customer Reliability Engineers.'

In large organizations that have adopted SRE, a hybrid model is common, combining several of the implementations above. For example, multiple product or application SRE teams may address the reliability needs of different products, while an infrastructure SRE team collaborates with a platform engineering group on shared reliability goals for a unified platform that supports all products and applications.

Industry

Since 2014, the USENIX organization has hosted the annual SREcon conference, bringing together site reliability engineers from various industries. This conference serves as a platform for professionals to share knowledge, explore best practices, and discuss the latest trends in site reliability engineering. [21]

See also

References

  1. 1 2 "Evaluating where your team lies on the SRE spectrum". Google Cloud Blog. Retrieved 2021-06-26.
  2. 1 2 Beyer, Betsy; Jones, Chris; Petoff, Jennifer; Murphy, Niall, eds. (2016). Site Reliability Engineering: How Google Runs Production Systems. Sebastopol, CA: O'Reilly Media. ISBN   978-1-4919-5118-7. OCLC   945577030.
  3. 1 2 Vargo, Seth; Fong-Jones, Liz (March 1, 2018). What's the Difference Between DevOps and SRE? (class SRE implements DevOps) (Video). Google.
  4. 1 2 "What is SRE? - SRE Explained - AWS". Amazon Web Services, Inc. Retrieved 2022-11-05.
  5. Hill, Patrick. "Love DevOps? Wait until you meet SRE". Atlassian . Retrieved June 17, 2021.
  6. "What is SRE?". Red Hat . Retrieved June 17, 2021.
  7. Treynor, Ben (2014). "Keys to SRE". USENIX SREcon14. Retrieved June 17, 2021.
  8. 1 2 Fischer, Donald (March 2, 2016). "Are site reliability engineers the next data scientists?". TechCrunch . Retrieved June 17, 2021.
  9. 1 2 3 Gossett, Stephen (June 1, 2020). "What Is a Site Reliability Engineer? What Does an SRE Do?". Built In. Retrieved June 17, 2021.
  10. "Site Reliability Engineering". IBM Cloud Education. IBM. November 12, 2020. Retrieved June 21, 2021.
  11. "Site Reliability Engineering (SRE)". engineering.linkedin.com. Retrieved March 12, 2024.
  12. "SRE - Wikitech". wikitech.wikimedia.org. Retrieved 2021-10-17.
  13. Oehrlich, Eveline; Groll, Jayne; Garbani, Jean-Pierre (2021). Upskilling 2021 Enterprise DevOps SkillsReport (PDF) (Report). DevOps Institute. Retrieved June 17, 2021.
  14. Oehrlich, Eveline (May 4, 2021). "What it takes to be a site reliability engineer". TechBeacon. Micro Focus . Retrieved June 17, 2021.
  15. Treynor, Ben. "In Conversation" (Interview). Interviewed by Niall Murphy. Google Site Reliability Engineering.
  16. 1 2 Jones, Chris; Underwood, Todd; Nukala, Shylaja (June 2015). "Hiring Site Reliability Engineers" (PDF). ;login: . Vol. 40, no. 3. pp. 35–39. Retrieved June 17, 2021.
  17. Dave Harrison (9 Oct 2018). "Interview with Betsy Beyer, Stephen Thorne of Google" . Retrieved 24 July 2024.
  18. "The 7 SRE Principles [And How to Put Them Into Practice]". www.blameless.com. Retrieved 2021-06-26.
  19. "Learn about observability | Honeycomb". docs.honeycomb.io. Retrieved 2021-06-26.
  20. "SRE at Google: How to structure your SRE team". Google Cloud Blog. Retrieved 2021-06-26.
  21. "Usenix SREcon". USENIX . 2021. Retrieved June 17, 2021.

Further reading