SRE and DevOps

2019-06-26 15:41 Author: wsgzao

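Preface

While searching for material on SRE and DevOps, I came across an article on the Google Cloud blog written specifically on this topic. There are quite a few Chinese translations of it in mainland China, but none of them fully meets the translator's ideals of faithfulness, expressiveness, and elegance. So here I repost Google's official article and its YouTube videos, together with a careful translation by a writer from Taiwan, and I have mirrored the videos to bilibili (B站) for easier viewing. I hope this gives everyone a deeper understanding of SRE and DevOps.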

SRE vs. DevOps: competing standards or close friends?

Change history

June 25, 2019 - First draft

Read the original - https://wsgzao.github.io/post/sre-vs-devops/

Further reading

SRE vs. DevOps: competing standards or close friends? - https://cloud.google.com/blog/products/gcp/sre-vs-devops-competing-standards-or-close-friends
DevOps and SRE - https://blog.alswl.com/2018/09/devops-and-sre/

English original

SRE vs. DevOps: competing standards or close friends?

Seth Vargo: Staff Developer Advocate
Liz Fong-Jones: Site Reliability Engineer
May 9, 2018

Site Reliability Engineering (SRE) and DevOps are two trending disciplines with quite a bit of overlap. In the past, some have called SRE a competing set of practices to DevOps. But we think they’re not so different after all.

What exactly is SRE and how does it relate to DevOps? Earlier this year, we (Liz Fong-Jones and Seth Vargo) launched a video series to help answer some of these questions and reduce the friction between the communities. This blog post summarizes the themes and lessons of each video in the series to offer actionable steps toward better, more reliable systems.

1. The difference between DevOps and SRE

It’s useful to start by understanding the differences and similarities between SRE and DevOps to lay the groundwork for future conversation.

The DevOps movement began because developers would write code with little understanding of how it would run in production. They would throw this code over the proverbial wall to the operations team, which would be responsible for keeping the applications up and running. This often resulted in tension between the two groups, as each group’s priorities were misaligned with the needs of the business. DevOps emerged as a culture and a set of practices that aims to reduce the gaps between software development and software operation. However, the DevOps movement does not explicitly define how to succeed in these areas. In this way, DevOps is like an abstract class or interface in programming. It defines the overall behavior of the system, but the implementation details are left up to the author.

SRE, which evolved at Google to meet internal needs in the early 2000s independently of the DevOps movement, happens to embody the philosophies of DevOps, but has a much more prescriptive way of measuring and achieving reliability through engineering and operations work. In other words, SRE prescribes how to succeed in the various DevOps areas. For example, the table below illustrates the five DevOps pillars and the corresponding SRE practices:

DevOps: Reduce organization silos
SRE: Share ownership with developers by using the same tools and techniques across the stack

DevOps: Accept failure as normal
SRE: Have a formula for balancing accidents and failures against new releases

DevOps: Implement gradual change
SRE: Encourage moving quickly by reducing costs of failure

DevOps: Leverage tooling & automation
SRE: Encourages “automating this year’s job away” and minimizing manual systems work to focus on efforts that bring long-term value to the system

DevOps: Measure everything
SRE: Believes that operations is a software problem, and defines prescriptive ways for measuring availability, uptime, outages, toil, etc.

If you think of DevOps like an interface in a programming language, class SRE implements DevOps. While the SRE program did not explicitly set out to satisfy the DevOps interface, both disciplines independently arrived at a similar set of conclusions. But just like in programming, classes often include more behavior than just what their interface defines, or they might implement multiple interfaces. SRE includes additional practices and recommendations that are not necessarily part of the DevOps interface.
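To make the analogy concrete, here is a minimal Python sketch of "class SRE implements DevOps". It is only an illustration of the metaphor: the method names and the returned strings are paraphrased from the five pillars above and are not an official API of any kind.

```python
from abc import ABC, abstractmethod


class DevOps(ABC):
    """The five DevOps pillars, written as an abstract interface."""

    @abstractmethod
    def reduce_organizational_silos(self) -> str: ...

    @abstractmethod
    def accept_failure_as_normal(self) -> str: ...

    @abstractmethod
    def implement_gradual_change(self) -> str: ...

    @abstractmethod
    def leverage_tooling_and_automation(self) -> str: ...

    @abstractmethod
    def measure_everything(self) -> str: ...


class SRE(DevOps):
    """One concrete implementation of the interface (illustrative only)."""

    def reduce_organizational_silos(self) -> str:
        return "share ownership with developers"

    def accept_failure_as_normal(self) -> str:
        return "balance failures against new releases with a formula"

    def implement_gradual_change(self) -> str:
        return "move quickly by reducing the cost of failure"

    def leverage_tooling_and_automation(self) -> str:
        return "automate this year's job away"

    def measure_everything(self) -> str:
        return "measure availability, uptime, outages, and toil"

    def customer_reliability_engineering(self) -> str:
        # Extra behavior that the DevOps interface does not define.
        return "teach SRE practices to customers"


sre = SRE()
print(isinstance(sre, DevOps))  # an SRE instance satisfies the interface
```

Instantiating DevOps directly raises TypeError, which mirrors the point that DevOps on its own does not say how to succeed; a concrete practice has to supply the implementation.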

DevOps and SRE are not two competing methods for software development and operations, but rather close friends designed to break down organizational barriers to deliver better software faster. If you prefer books, check out How SRE relates to DevOps (Betsy Beyer, Niall Richard Murphy, Liz Fong-Jones) for a more thorough explanation.

2. SLIs, SLOs, and SLAs

The SRE discipline collaboratively decides on a system’s availability targets and measures availability with input from engineers, product owners and customers.

It can be challenging to have a productive conversation about software development without a consistent and agreed-upon way to describe a system’s uptime and availability. Operations teams are constantly putting out fires, some of which end up being bugs in developers’ code. But without a clear measurement of uptime and a clear prioritization on availability, product teams may not agree that reliability is a problem. This very challenge affected Google in the early 2000s, and it was one of the motivating factors for developing the SRE discipline.

SRE ensures that everyone agrees on how to measure availability, and what to do when availability falls out of specification. This process includes individual contributors at every level, all the way up to VPs and executives, and it creates a shared responsibility for availability across the organization. SREs work with stakeholders to decide on Service Level Indicators (SLIs) and Service Level Objectives (SLOs).

SLIs are metrics over time such as request latency, throughput of requests per second, or failures per request. These are usually aggregated over time and then converted to a rate, average or percentile subject to a threshold.
SLOs are targets for the cumulative success of SLIs over a window of time (like “last 30 days” or “this quarter”), agreed-upon by stakeholders.
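The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, assuming a request-based availability SLI; the names WindowCounts, availability_sli, and meets_slo, as well as the sample numbers, are hypothetical and do not come from the article.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Request counts aggregated over one SLO window (e.g. the last 30 days)."""
    total_requests: int
    failed_requests: int


def availability_sli(counts: WindowCounts) -> float:
    """SLI: the fraction of requests in the window that succeeded."""
    if counts.total_requests == 0:
        return 1.0  # assumption: a window with no traffic counts as available
    return 1 - counts.failed_requests / counts.total_requests


def meets_slo(counts: WindowCounts, slo_target: float) -> bool:
    """SLO check: is the SLI over the window at or above the agreed target?"""
    return availability_sli(counts) >= slo_target


last_30_days = WindowCounts(total_requests=1_000_000, failed_requests=700)
print("SLI:", availability_sli(last_30_days))
print("SLO met:", meets_slo(last_30_days, slo_target=0.999))
```

The SLI is the measurement and the SLO is the threshold applied to it; keeping them as separate functions reflects that stakeholders can renegotiate the target without changing how the indicator is computed.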

The video also discusses Service Level Agreements (SLAs). Although not specifically part of the day-to-day concerns of SREs, an SLA is a promise by a service provider, to a service consumer, about the availability of a service and the ramifications of failing to deliver the agreed-upon level of service. SLAs are usually defined and negotiated by account executives for customers and offer a lower availability than the SLO. After all, you want to break your own internal SLO before you break a customer-facing SLA.

SLIs, SLOs and SLAs tie back closely to the DevOps pillar of “measure everything” and one of the reasons we say class SRE implements DevOps.

3. Risk and error budgets

We focus here on measuring risk through error budgets, which are quantitative ways in which SREs collaborate with product owners to balance availability and feature development. This video also discusses why 100% is not a viable availability target.

Maximizing a system’s stability is both counterproductive and pointless. Unrealistic reliability targets limit how quickly new features can be delivered to users, and users typically won’t notice extreme availability (like 99.999999%) because the quality of their experience is dominated by less reliable components like ISPs, cellular networks or WiFi. Having a 100% availability requirement severely limits a team or developer’s ability to deliver updates and improvements to a system. Service owners who want to deliver many new features should opt for less stringent SLOs, thereby giving them the freedom to continue shipping in the event of a bug. Service owners focused on reliability can choose a higher SLO, but accept that breaking that SLO will delay feature releases. The SRE discipline quantifies this acceptable risk as an “error budget.” When error budgets are depleted, the focus shifts from feature development to improving reliability.
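As a rough sketch of how an error budget can drive that decision, the Python below derives the budget from an SLO and reports how much of it is left. The article gives no formula, so this assumes a request-based budget (allowed failures = (1 - SLO) × total requests); the function names and sample figures are hypothetical.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the window's error budget still unspent (negative if overspent)."""
    allowed_failures = (1 - slo_target) * total
    if allowed_failures == 0:
        # A 100% target leaves no budget at all: any failure overspends it.
        return 0.0 if failed == 0 else float("-inf")
    return 1 - failed / allowed_failures


def release_policy(remaining: float) -> str:
    """When the budget is depleted, shift from features to reliability."""
    return "ship features" if remaining > 0 else "focus on reliability"


# A 99.9% SLO over 1,000,000 requests allows about 1,000 failed requests.
healthy = error_budget_remaining(0.999, total=1_000_000, failed=400)
overspent = error_budget_remaining(0.999, total=1_000_000, failed=1_200)
print(release_policy(healthy))
print(release_policy(overspent))
```

The special case for a 100% target shows in code why such a target is impractical: with zero allowed failures, a single failed request exhausts the budget and blocks every release.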

As mentioned in the second video, leadership buy-in is an important pillar in the SRE discipline. Without this cooperation, nothing prevents teams from breaking their agreed-upon SLOs, forcing SREs to work overtime or waste too much time toiling to just keep the systems running. If SRE teams do not have the ability to enforce error budgets (or if the error budgets are not taken seriously), the system fails.

Risk and error budgets quantitatively accept failure as normal and enforce the DevOps pillar to implement gradual change. Non-gradual changes risk exceeding error budgets.

4. Toil and toil budgets

An important component of the SRE discipline is toil, toil budgets and ways to reduce toil. Toil occurs each time a human operator needs to manually touch a system during normal operations—but the definition of “normal” is constantly changing.

Toil is not simply “work I don’t like to do.” For example, the following tasks are overhead, but are specifically not toil: submitting expense reports, attending meetings, responding to email, commuting to work, etc. Instead, toil is specifically tied to the running of a production service. It is work that tends to be manual, repetitive, automatable, tactical and devoid of long-term value. Additionally, toil tends to scale linearly as the service grows. Each time an operator needs to touch a system, such as responding to a page, working a ticket or unsticking a process, toil has likely occurred.

The SRE discipline aims to reduce toil by focusing on the “engineering” component of Site Reliability Engineering. When SREs find tasks that can be automated, they work to engineer a solution to prevent that toil in the future. While minimizing toil is important, it’s realistically impossible to completely eliminate. Google aims to ensure that at least 50% of each SRE’s time is spent doing engineering projects, and these SREs individually report their toil in quarterly surveys to identify operationally overloaded teams. That being said, toil is not always bad. Predictable, repetitive tasks are great ways to onboard a new team member and often produce an immediate sense of accomplishment and satisfaction with low risk and low stress. Long-term toil assignments, however, quickly outweigh the benefits and can cause career stagnation.
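The 50% guideline and the quarterly survey can be sketched as a simple check. This is an assumed model, not Google's actual survey tooling: the SurveyResponse fields, the overloaded_teams function, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

TOIL_CAP = 0.5  # from the text: at least 50% of time goes to engineering projects


@dataclass
class SurveyResponse:
    """One SRE's self-reported hours for the quarter (hypothetical shape)."""
    team: str
    toil_hours: float
    total_hours: float


def overloaded_teams(responses: Iterable[SurveyResponse],
                     cap: float = TOIL_CAP) -> List[str]:
    """Return the teams whose aggregate toil share exceeds the cap."""
    toil: Dict[str, float] = {}
    total: Dict[str, float] = {}
    for r in responses:
        toil[r.team] = toil.get(r.team, 0.0) + r.toil_hours
        total[r.team] = total.get(r.team, 0.0) + r.total_hours
    return sorted(
        team for team in total
        if total[team] > 0 and toil[team] / total[team] > cap
    )


survey = [
    SurveyResponse("storage", toil_hours=300, total_hours=480),
    SurveyResponse("storage", toil_hours=200, total_hours=480),
    SurveyResponse("ads", toil_hours=100, total_hours=480),
]
print(overloaded_teams(survey))
```

Aggregating per team rather than per person is a design choice made here for illustration; it matches the stated goal of identifying operationally overloaded teams, while a per-person view would catch individuals stuck on long-term toil assignments.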

Toil and toil budgets are closely related to the DevOps pillars of “measure everything” and “reduce organizational silos.”

5. Customer Reliability Engineering (CRE)

Finally, Customer Reliability Engineering (CRE) completes the tenets of SRE (with the help in the video of a futuristic friend). CRE aims to teach SRE practices to customers and service consumers.

In the past, Google did not talk publicly about SRE. We thought of it as a competitive advantage we had to keep secret from the world. However, every time a customer had a problem because they used a system in an unexpected way, we had to stop innovating and help solve the problem. That tiny bit of friction, spread across billions of users, adds up very quickly. It became clear that we needed to start talking about SRE publicly and teaching our customers about SRE practices so they could replicate them within their organizations.

Thus, in 2016, we launched the CRE program as both a means of helping our Google Cloud Platform (GCP) customers with improving their reliability, and a means of exposing Google SREs directly to the challenges customers face. The CRE program aims to reduce customer anxiety by teaching them SRE principles and helping them adopt SRE practices.

CRE aligns with the DevOps pillars of “reduce organization silos” by forcing collaboration across organizations, and it also closely relates to the concepts of “accepting failure as normal” and “measure everything” by creating a shared responsibility among all stakeholders in the form of shared SLOs.

Looking forward with SRE

We are working on some exciting new content across a variety of mediums to help showcase how users can adopt DevOps and SRE on Google Cloud, and we cannot wait to share them with you. What SRE topics are you interested in hearing about? Please give us a tweet or watch our videos.

Posted in:

DevOps & SRE
Application Development

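Chinese translation

The translation was originally written in Traditional Chinese; I converted it to Simplified Chinese and replaced the videos with their bilibili versions.

[Good article, translated] Are you looking for SRE or DevOps?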

Neil Wei in KKStream
Aug 3, 2018

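Over the past six months our company has been hiring aggressively, including for DevOps and SRE positions. However, HR (and colleagues in other departments) do not really understand how the two differ. Some even think SRE is just the traditional operations unit under a new name, having moved from managing server rooms to managing the cloud. So what exactly is the difference?

This confusion leaves a bad impression on candidates who come in to interview, as if we ourselves were unclear about what kind of people we are looking for. So I spent some time explaining the differences to HR, and after four months supporting the SRE team I am leaving this translation plus a few thoughts of my own.

Please remember first...

SRE is a DevOps (a banana is a fruit)

DevOps is NOT an SRE (a fruit is not a banana)

DevOps is not a “job title”; SRE is

(This translation is published with the permission of Seth Vargo, one of the original authors.)

Original article: https://cloudplatform.googleblog.com/2018/05/SRE-vs-DevOps-competing-standards-or-close-friends.html?m=1

Main text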

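Site Reliability Engineering (SRE) and DevOps are two very popular development and operations cultures with a high degree of similarity. Early on, some people regarded SRE as a set of practices different from DevOps and believed you had to choose one or the other, but people now increasingly take the view that the two are actually very similar.

So what do SRE and DevOps have in common? Earlier this year, Google engineers Liz Fong-Jones and Seth Vargo prepared a series of videos to answer these questions and to try to reduce disagreement between the communities. This article summarizes the topics covered in the videos and how to go about building a more reliable system in practice.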

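1. The difference between SRE and DevOps

Before we start, let's look at what SRE and DevOps have in common and where they differ.

The DevOps culture arose because in the early days (about ten years ago) many developers knew little about how their programs actually ran in the real world. A developer's job was to package the program and throw it over to the operations department, at which point their part of the cycle was over. The operations department was responsible for installing and deploying the program on every production machine, and had to use every method and tool available to keep it running normally, even though operations knew nothing about the program's implementation details.

This way of working easily sets the two departments against each other. Each has its own goals, and those goals may not line up with the company's business needs. DevOps emerged to bring a new software development culture that narrows the gap between development and operations.

However, DevOps by nature does not teach people how to succeed. It sets out some basic principles and leaves each team to work out the rest. In programming terms, DevOps is more like an abstract class or an interface: it defines what behavior this culture should have, while the implementation is decided jointly by the members of each department. Anything that satisfies this “interface” can be called a practice of DevOps culture.

The term SRE was coined by Google. It is a body of rules and practices that Google developed over the past ten-plus years to cope with its ever-growing internal systems. Unlike DevOps, it implements the abstract methods that DevOps defines, and it goes further by prescribing how to use software engineering methods, starting from an operations perspective, to make systems stable. Put simply, SRE implements the DevOps interface. Below are the five “interface” items that DevOps defines and how SRE “implements” them: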

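DevOps: Reduce silo effects between organizations

SRE: Throughout the development cycle, use the same tools as the development team and share ownership with it. (Note: infrastructure as code, configuration as code)

DevOps: Accept failure and treat it as one element of the development cycle

SRE: For new releases, establish a set of quantifiable metrics for measuring “accidents” and “failures”

DevOps: Change gradually

SRE: Encourage teams to deliver quickly by lowering the cost of troubleshooting failures (that is, there is no need to get everything perfect in one go; change gradually)

DevOps: Make good use of tooling and automation

SRE: Encourage teams to automate this year's work away, minimize what has to be done by hand, and put their effort into medium- and long-term system improvement.

DevOps: Everything can be measured

SRE: Believe that operations belongs to the domain of software engineering, and prescribe measurements for availability, uptime, outages, what counts as toil, and so on.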

If you already accept that DevOps is an “interface”, then in programming-language terms:

class SRE implements DevOps

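In practice the two still have many independent principles, and SRE does not implement every DevOps concept 1:1, but in the end the two arrive at the same conclusions. And just as in a programming language, a class that implements an interface can extend it further and can also implement other interfaces; SRE includes many details that DevOps never defined.

In software development and operations, DevOps and SRE are not competing over which one is the industry standard. On the contrary, both are methods designed to reduce barriers between organizations and to deliver better software faster. If you want more detail, How SRE relates to DevOps (Betsy Beyer, Niall Richard Murphy, Liz Fong-Jones) is a book worth reading.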

2. SLIs, SLOs, and SLAs

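One of SRE's principles is to provide different measurements for different roles. For engineers, PMs, and customers, how available the whole system is and how that should be measured are each presented differently.

If you cannot measure a system's uptime and availability, it is very hard to operate a system that is already live. The operations team often ends up permanently in firefighting mode, and when the root cause is finally found, it may turn out to be a problem in code written by the development team.

Without a defined way to measure uptime and availability, development teams tend not to see “stability” as a potential problem. This problem troubled Google for many years, and it is one of the motivations for developing the SRE principles.

SRE makes sure everyone knows how to measure reliability and what to do when the service fails. This goes into enough detail that, when a problem occurs, everyone from VPs and CxOs down to every relevant employee in the organization has something they are supposed to do. What each “person” should “do” is clearly specified. SRE talks with all stakeholders to decide on Service Level Indicators (SLIs) and Service Level Objectives (SLOs).

SLIs define metrics tied to how the system responds, such as response time, throughput per second, and request volume; these metrics are usually converted into a rate or an average.

SLOs are the figures, agreed after discussion with stakeholders, that the SLIs are expected to maintain over a time window, for example “what level the SLIs should reach each month”. They are more of an internal metric.

The video also discusses Service Level Agreements (SLAs), even though these are not numbers SREs care about day to day. For an online service provider, an SLA is a promise to the customer that guarantees the percentage of time the service keeps running. It is usually “negotiated” with the customer, for example that downtime per year (or per month) must not exceed a certain number of minutes.

The concepts of SLI, SLO, and SLA closely match the DevOps principle that “everything can be measured”, which is one of the reasons we say class SRE implements DevOps.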

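3. Risk and error budgets

We assess risk with error budgets. An error budget is a quantified value that describes how long a service may be unavailable per day (or per month): if the service's SLO is 99.9%, the development team effectively has a 0.1% error budget to spend. This value is a balance reached after talking with the Product Owner and the development team. The video below also explains why a zero error budget is not a suitable value.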
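As a worked example of the 99.9% figure, the short Python sketch below converts an availability target into the downtime it allows per day and per 30-day month. The function name allowed_downtime and the choice of a 30-day month are assumptions made for illustration.

```python
from datetime import timedelta


def allowed_downtime(availability_target: float, window: timedelta) -> timedelta:
    """Time-based error budget: the share of the window that may be downtime."""
    return window * (1 - availability_target)


per_day = allowed_downtime(0.999, timedelta(days=1))
per_month = allowed_downtime(0.999, timedelta(days=30))
print("per day:", per_day)      # roughly 86 seconds
print("per month:", per_month)  # roughly 43 minutes
```

Seen this way, a 0.1% budget is small but not zero: it leaves room for about 43 minutes of failed rollouts or incidents in a month before the focus has to move to reliability.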

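Striving to keep a system 100% available is exhausting and pointless. Unrealistic targets limit how quickly the development team can get new features into users' hands, and most users will not notice anyway (say, at 99.999999% reliability), because their ISP, 3G/4G network, or home WiFi is probably less reliable than that. Striving for a 100% uninterrupted service severely limits the time the development team has to ship new features. To meet such a harsh constraint, developers often choose not to fix bugs, not to add features, and not to improve the system. Instead, some slack should be kept so the development team has room to work.

One of SRE's principles is to work out the tolerable “error budget”. Only once this budget is exhausted should the focus shift to improving reliability rather than continuing to develop new features.

As mentioned in the second video, getting management to buy into this culture is the most important thing, because the SLIs are set by everyone together. If people do not play by the rules, SRE once again ends up doing endless grunt work just to keep the system at a certain level of stability, and no one knows about it (because no standard was set); in the end the service is bound to fail. Risk and error budgets treat making mistakes as normal, and one way to improve is to release new features continuously and in small increments, which is also consistent with DevOps principles.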

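4. Toil and toil budgets

Another SRE principle is controlling toil. How do we reduce toil? And what is toil?

Operations work that is manual, repetitive, and automatable, or one-off work with no lasting value, all counts as toil.

Toil, however, is not “things I don't want to do”. For example, a company has plenty of routine matters that come up again and again, such as meetings, communication, and replying to email; none of these are toil.

By contrast, logging in to a particular machine by hand every day, fetching a file, processing it, and then turning it into a report and sending it out is toil, because it is manual, repetitive, and automatable.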
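A task like the daily report above is a typical candidate for automation. The Python below is a minimal sketch under stated assumptions: the file is a plain-text log whose lines start with a level such as ERROR or INFO, and the function name build_daily_report is hypothetical. Fetching the file from the remote machine and mailing the result are left out, since those depend on the environment.

```python
from collections import Counter
from pathlib import Path


def build_daily_report(log_path: Path) -> str:
    """Count log lines per level (the first word of each line) and format a report."""
    levels = Counter()
    for line in log_path.read_text(encoding="utf-8").splitlines():
        if line.strip():  # skip blank lines
            levels[line.split(maxsplit=1)[0]] += 1
    lines = [f"Daily report for {log_path.name}"]
    lines += [f"{level}: {count}" for level, count in sorted(levels.items())]
    return "\n".join(lines)


# Usage (the path is hypothetical), e.g. from a scheduled job instead of by hand:
#   print(build_daily_report(Path("/var/log/myservice/app.log")))
```

Once the steps live in a function, a scheduler can run them every day, and the operator's time goes to improving the report instead of producing it.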

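SRE's principle is to try to eliminate this kind of work using software engineering methods. When SREs find something that can be automated, they set about building the automation so that they never have to do the same thing again. Minimizing toil matters, but in practice it can never be eliminated entirely. Google works to keep each SRE's day-to-day toil below 50% so that SREs can spend their time on more meaningful things, and the results are reviewed every quarter.

Toil is not entirely bad, though. For new team members, starting with these routine tasks helps them learn what the service involves, with relatively low risk and low stress. In the long run, however, no engineer should be doing toil all the time.

Toil management also matches the DevOps principles that everything can be measured and that silo effects between organizations should be reduced.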

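5. Customer Reliability Engineering (CRE)

Personally I feel this topic goes a bit far afield for now, so I will not translate it sentence by sentence.

In short, it is about how to spread SRE concepts so that GCP customers know how to use GCP services correctly, and about promoting an SRE culture.

Personal afterword

Our company is in fact gradually transforming. We are on the road from traditional, mutually independent development and operations toward gradually putting DevOps culture into practice. After four months supporting the SRE department, I have taken part in many real-world challenges and worked with everyone to define automated processes and reduce existing toil, and we are gradually moving toward a DevOps culture. I hope everyone can come to understand that:

SRE is software engineering; it should not be just operations staff or system administrators.

DevOps is not a job title; SRE is. You would not walk up to a vegetable stall at the market and tell the owner you want to buy “vegetables”; you would say whether you want cabbage or bok choy!

Ideals are always perfect, though, and we still have to face reality. Our company is not called Google, and most people cannot get into Google. A Google SRE can probably write code better than most companies' software developers, knows networking better than network engineers, and knows operations better than operations engineers. The SRE positions opened around us are often far less rosy than imagined: toil and manual work may still make up the majority, barriers between departments persist, there may be many SREs who cannot write code, and operations still takes up most of the daily work.

Traditional operations staff or IT network administrators who want to move toward SRE must also change their thinking and step out of their comfort zone. In an age where everything is “as code” and everything is “as a service”, those who do not write code are just waiting to be phased out.

Change is slow and has to be cultivated gradually, so let us... huh... a P0 incident just came in! That's all for now!

Further reading

Thanks to everyone for sharing and for driving technology forward.

First-hand experience from the lead of Google's storage SRE team
https://rickhw.github.io/categories/DevOps/SRE/
Understanding the whole story of DevOps and SRE in one article (translated)
Site Reliability Engineering
DevOps and SRE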
