"Fundamentally, it's what happens when you ask a software engineer to design an operations function." - Ben Treynor Sloss, VP of Engineering, Google, and founder of Google SRE
Please take a look at the contribution guidelines first.
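- What is Site Reliability Engineering?
- Keys to SRE by Ben Treynor
- Google SRE Resources
- Production Engineering Explained by Pedro Canahuati
- PostOps: Recovering from Operations
- Love DevOps? Wait Until You Meet SRE [video]
- How Google Does Planet-Scale Engineering for Planet-Scale Infrastructure
- Site Reliability Engineering at Facebook
- A History of Site Reliability Engineering at Uber
- Case Study: Adopting SRE Principles at StackOverflow
- Site Reliability Engineering at Dropbox
- Site Reliability Engineers - Keeping Google Up 24/7
- Site Reliability Engineering at Salesforce
- From Sys Admin to Netflix SRE - video and slides
- SRE@Google: Thousands of DevOps Since 2004
- Transactional System Administration Is Killing Us and Must Be Stopped
- The Hierarchy of SRE Needs
- PostOps: A Non-Surgical Tale of Software, Fragility, and Reliability
- The Makings of a Great SRE Team
- Working at Google: Meet Our Site Reliability Engineers in a Live Hangout
- Toil: A Word Every Engineer Should Know
- Engineering Reliability into Web Sites: Google SRE
- DEVOPS & SRE AMA - Building High-Performing Organizations
- AMA with John Allspaw on Incident Analysis and Postmortems
- Site Reliability Engineering with Paul Newson - Part 1 & Part 2
- How SysAdmins Devalue Themselves
- SRE, noun. See also: confidence, trust.
- Site Reliability Engineering by Stephen Weinberg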
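- We are the Google Site Reliability team. We make Google's websites work. Ask us anything!
- We are the Google Site Reliability Engineering team. Ask us anything!
- Ops Identity Crisis
- The Irreproducibility of Bugs in Large Production Systems
- Talking about Site Reliability Engineering
- Microservices, DevOps, and Production Complexity
- Introducing Google Customer Reliability Engineering
- Evolution or Rebellion? The Rise of the Site Reliability Engineer (SRE)
- The Difference Between Site Reliability Engineering, System Administration, and DevOps
- SRE, Small and Large
- SRE Meetup: Different SRE Roles and Challenges (Netflix)
- Panel: Who/What Is SRE?
- Hope Is Not a Strategy
- Tenets of SRE
- Site Reliability Engineering Demystified
- Is Site Reliability Engineering the Real "Opportunity" in DevOps?
- SRE vs. DevOps vs. Cloud Native: The Server Cage Match
- SRE: What's the Big Idea?
- Building the SRE Culture at LinkedIn
- RackN's SRE Whitepaper
- SRE: Occasionally Maintaining Infrastructure You Hate
- Splicing SRE DNA Sequences in the Biggest Software Company in the World
- Why should your app get SRE support? - CRE life lessons
- How SREs find the landmines in a service - CRE life lessons
- Making the most of an SRE service takeover - CRE life lessons
- SRE and Infrastructure Operations
- The SRE Model
- Onboarding New Site Reliability Engineers
- The Fundamentals of Site Reliability at Google
- Beyond Google SRE: What Is Site Reliability Engineering Like at Medium?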
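- Intelligent Site Reliability Engineering - A Machine Learning Perspective
- A Crash Course in LinkedIn's Global Site Operations
- Site Reliability Engineering at Google by Todd Underwood
- The Rise of the Site Reliability Engineer
- Defining Modern Software Roles: SRE at New Relic
- What Tools Do Site Reliability Engineers Use?
- What is Site Reliability Engineering? (VMware)
- An Introduction to SRE
- Understanding Site Reliability Engineering Through Movies and Books
- GOTO 2017 • Site Reliability Engineering at Google
- Composition of a Successful Geographically Distributed SRE Team - Part 1 & Part 2
- Technical Leadership in SRE
- Microsoft Azure SRE
- The Scalability of "DevOps"
- Doing Site Reliability Management
- How System Reliability Is Driven at Knowlarity
- Getting Started with Site Reliability Engineering
- Nat Welch's "Practical Applications of the Dickerson Pyramid"
- Kurt Andersen of LinkedIn on Finding Blind Spots in SRE Implementations
- Interview with Betsy Beyer, Stephen Thorne of Google
- Less Risk Through Greater Humanity - Dave Rensin
- Getting Started with SRE - Stephen Thorne, Google
- Building Successful SRE in Large Enterprises
- Solving Reliability Fears with Site Reliability Engineering
- SRE vs. DevOps: Competing Standards or Close Friends?
- How to Avoid the 5 SRE Implementation Pitfalls That Catch Even the Best Teams
- Reliability Engineering - Fundamental Guidelines for Complex Systems
- A Modern Site Reliability Workbench on Top of OCI
- SRE in the Third Age
- On SRE and How (Not) to Apply It
- Transitioning a Typical Engineering Ops Team to SRE
- SRE in Banking
- Identifying and Tracking Toil Using SRE Principles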
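- Educating SRE
- From Zero to Hero: Recommended Practices for Training Your Ever-Evolving SRE Teams
- The Systems Engineering Side of Site Reliability Engineering
- Graduating from Bootcamp and Interested in Becoming a Site Reliability Engineer?
- So, You Want to Be a Site Reliability Engineer?
- The Operational Debt Spiral and the SRE Coding Imperative
- So You Want to Be an SRE?
- Career Profile / Site Reliability Engineer
- What Is the Role of a Site Reliability Engineer?
- DevOps Foundations: Site Reliability Engineering
- Incident Management Training: Wheel of Misfortune
- Site Unreliability Engineering
- The Ultimate Guide to Building a 90-Day Onboarding Plan
- SRE fundamentals: SLIs, SLAs and SLOs
- How to Get into SRE
- Do you have an SRE team yet? How to start and assess your journey
- How SRE teams are organized, and how to get started
- Why SRE Documents Matter
- How to Get Started with Site Reliability Engineering
- Responsibilities of a Site Reliability Engineering Manager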
- Practical Linux Infrastructure
- Site Reliability Engineering: How Google Runs Production Systems
- The Site Reliability Workbook: Practical Ways to Implement SRE
- The Practice Of Cloud System Administration: Designing and Operating Large Distributed Systems
- Web Operations - Keeping the Data On Time
- The Checklist Manifesto: How to Get Things Right
- Microservices in Production - Standard Principles and Requirements
- Production-Ready Microservices - Building Standardized Systems Across an Engineering Organization
- Systems Performance: Enterprise and the Cloud [Sample chapter titled CPUs]
- Monitoring Distributed Systems: Case Studies from Google's SRE Teams
- The Human Side of Postmortems: Managing Stress and Cognitive Biases
- Chaos Engineering: Building Confidence in System Behavior through Experiments
- Post-Incident Reviews: Learning from Failure for Improved Incident Responses
- Antifragile Systems and Teams
- How to Monitor the SRE Golden Signals (E-Book)
- Incident Management for Operations
- Distributed Systems Observability
- Real-World SRE
- Seeking SRE
- What is SRE?
- Engineering Reliable Mobile Applications: Strategies for Developing Resilient Native Mobile Applications
- SRE Hiring
- Hiring SREs at LinkedIn
- Hiring Site Reliability Engineers
- Hiring your first SRE
- Growing the Site Reliability Team at LinkedIn: Hiring is Hard
- How to Find and Hire Site Reliability Engineers (SREs)
- Engineering Manager - Site Reliability Engineering Interview Preparation
- The Realities of the Job of Delivering Reliability
- Fail at Scale by Ben Maurer
- Embracing Failure: Fault-Injection and Service Reliability
- 10 Years of Crashing Google
- How we break things at Twitter: failure testing
- Reliable Cron across the Planet
- Push our limits - reliability testing at Twitter
- The Verification of a Distributed System by Caitie McCaffrey
- Weathering the Unexpected
- The Remediation Ballet
- SRE Hour: Tech Talks by Box & Yelp
- Simplicity: A Prerequisite for Reliability
- The Two Sides to Google Infrastructure for Everyone Else
- How Embracing Continuous Release Reduced Change Complexity
- Making "Push On Green" a Reality
- BeyondCorp: A New Approach to Enterprise Security
- Brainstorming Failure by Jeff Smith
- The Ripple Effect Of Outages And Downtime Cannot Be Underestimated
- The infrastructure behind Twitter: efficiency and optimization
- Dickerson's Hierarchy of Reliability
- The Morning Paper on Operability
- Production is all that matters
- Using load shedding to survive a success disaster - CRE life lessons
- How to avoid a self-inflicted DDoS Attack - CRE life lessons
- Don't gamble when it comes to reliability
- Resilience Engineering: Learning to Embrace Failure
- The Infrastructure Behind Twitter: Scale
- Scaling Reliability at Twitter: So You Want to Add a 9
- Principles Of Chaos Engineering
- Chaos Engineering
- Available...or not? That is the question - CRE life lessons
- How Google Backs Up The Internet Along With Exabytes Of Other Data
- Performance, Scalability, And High Availability: 3 Key Infrastructure Adaptability Requirements
- The Production Environment at Google - Part 1 & Part 2
- Reliable releases and rollbacks - CRE life lessons
- How release canaries can save your bacon - CRE life lessons
- Things I Learned Managing Site Reliability for Some of the World’s Busiest Gambling Sites
- Every Day Is Monday in Operations
- Under the Hood: Ensuring Site Reliability
- Designing reliable systems with cloud infrastructure (Google Cloud Next '17)
- A Google SRE explores GitHub reliability with BigQuery
- Know thy enemy: how to prioritize and communicate risks - CRE life lessons
- Chaos Engineering resources
- CRE life lessons: What is a dark launch, and what does it do for me?
- Why you should pick strong consistency, whenever possible
- The Network is Reliable
- Are You Load Balancing Wrong?
- How production engineers support global events on Facebook
- Google: A Collection Of Best Practices For Production Services
- Canary Analysis Service
- Tips for High Availability
- Progressive Service Architecture At Auth0
- Google Cloud Production Guideline
- production readiness
- Trust By Design: The Fusion of Operational Maturity and Risk Modeling
- Top Seven Myths of Robust Systems
- Taming chaos: Preparing for your next incident
- PID Loops and the Art of Keeping Systems Stable
- Are you ready for production? - Slides
- Production Checklist for Web Apps on Kubernetes
- A Working Theory-of-Monitoring
- The Evolution of Monitoring Systems at Google - Tony Rippy
- Monitoring without Infrastructure @ Airbnb
- Monitoring distributed systems
- Observability at Uber Engineering: Past, Present, Future
- The 4 Golden Signals of API Health and Performance in Cloud-Native Applications
- My Philosophy on Alerting by Rob Ewaschuk
- Time To Detect - Netflix
- Why Percentiles Don’t Work the Way you Think
- Building Twitter’s Next-Gen Alerting System
- Instrumentation: Worst case performance matters
- Instrumentation: What does 'uptime' mean?
- Incidents + Outages at CircleCI: Our Playbook and What We’ve Learned
- An introduction to monitoring and alerting with timeseries at scale, with Prometheus
- Detecting outliers and anomalies in realtime at Datadog
- How to Monitor the SRE Golden Signals
- Monitoring in a DevOps World
- Monitoring Your Monitoring’s Monitoring
- Observability: the new wave or buzzword?
- Monitoring Isn't Observability
- Monitoring in the time of Cloud Native
- Principles of Monitoring Microservices
- The Many Ways Your Monitoring Is Lying to You
- GitOps Part 3 - Observability
- Want to Debug Latency?
- Debugging Latency in Go 1.11
- Alerting on SLOs like Pros
- Applied Alerting Philosophy
- Observations on Observability
- Deploys: It's Not Actually About Fridays
- Site Reliability Engineering Best Practices for Data Pipelines
- Being an On-Call Engineer: A Google SRE Perspective
- Inside Atlassian: how our site reliability engineers do incident management
- Inside Atlassian: how IT & SRE use ChatOps to run incident management
- Incident Response at Heroku
- Who's On Call?
- SysAdvent - Day 6 - No More On-Call Martyrs
- On Being On Call
- The On-Call Handbook
- Incident management at Google — adventures in SRE-land
- How Spotify and GOV.UK handle on call, and more
- Run Book / Operations Manual template
- Automating Your Oncall: Open Sourcing Fossor and Ascii Etch
- Project STAR*: Streamlining Our On-Call Process
- SRE@Xero: Managing Incidents Part I
- SRE@Xero: Managing Incidents Part II
- How To Establish a High Severity Incident Management Program
- How Your Systems Keep Running Day After Day - John Allspaw
- On-call doesn’t have to suck
- Why, as a Netflix infrastructure manager, am I on call?
- Oncall and Sustainable Software Development
- On Call Rotations: How Best to Wake Devs Up in the Middle of the Night
- Understanding The Role Of The Incident Manager On-Call (IMOC)
- 3 Ways to Minimize the Impact of High Severity Incidents
- Advice to Management Teams While Enrolling Changes to On-Call Systems
- Moving Past Shallow Incident Data
- Sustainable On-Call
- dotScale 2017 - Aish Raj Dahal - Chaos management during a major incident
- Incident Management at Netflix Velocity
- Incidents, fixes, and the day after
- 10 Steps to Develop an Incident Response Plan You’ll ACTUALLY Use
- Checklists: a stupidly simple but valuable operational gift
- How to write a status page update
- Atlassian Incident Handbook
- PagerDuty Incident Response Handbook
- Avoiding Burnout for SREs
- Better On-Call the SRE way
- Managing Incidents at Monzo
- Making On-Call Not Suck
- How we (Monzo) respond to incidents
- Code Yellow: When Operations Isn’t Perfect
- MTTR is dead, long live CIRT
- Extended Dreyfus Model for Incident Lifecycles
- Inhumanity of Root Cause Analysis
- Incident insights from NASA, NTSB, and the CDC
- How to avoid On-Call Burnout the SRE Way
- My week shadowing a GitLab Site Reliability Engineer
- How our production team runs the weekly on-call handover
- A collection of post-mortems
- Collection of Kubernetes Failure Stories
- Blameless PostMortems and a Just Culture
- A Tale of Postmortems
- Building a Blameless Post-Mortem Culture with Jason Hand
- The infinite hows
- Failure is Always An Option: How a Blameless Culture Leads to Better Results
- How to write an Incident Report / Postmortem
- SysAdvent - Day 1 - Why You Need a Postmortem Process
- Etsy’s Debriefing Facilitation Guide for Blameless Postmortems
- Writing Your First Postmortem
- How to Write Great Outage Post-Mortems
- A collection of postmortem templates
- Embracing Feedback
- Postmortem Action Items: Plan the Work and Work the Plan
- Social Issues In Postmortems
- Google Has an Official Process in Place for Learning From Failure--and It's Absolutely Brilliant
- Postmortem culture: how you can learn from failure
- re:Work - Postmortem discussion template
- Post-mortems to the rescue
- Why Every Company Can Benefit from a Blameless Culture
- "It's dead, Jim": How we write an incident postmortem
- Our incident postmortem template
- Capacity Planning
- SouthBay SRE: Cloud Capacity Planning
- Intent-based Capacity Planning and Autoscaling with Kubernetes
- SLA Aware Maintenance for Operators - Joe Smith
- If It's in the Cloud, Get It on Paper: Cloud Computing Contract Issues
- Service Level Agreements in the Cloud: Who cares?
- Making a point with SLAs
- SysAdvent - Day 20 - How to set and monitor SLAs
- SLOs, SLIs, SLAs, oh my - CRE life lessons
- Service Levels and Error Budgets
- (Un)Reliability Budgets - Finding Balance between Innovation and Reliability
- The Calculus of Service Availability
- Availability Calculator: Calculate how much downtime should be permitted in your SLA
- Standardize cloud SLA availability with numerical performance data
- Best practices to develop SLAs for cloud computing
- A Practical Guide to SLAs
- Building good SLOs - CRE life lessons
- No Grumpy Humans and Other Site Reliability Engineering Lessons from Google
- Consequences of SLO violations — CRE life lessons
- Service Level Objectives in Practice
- SRE Consensus Building
- An example escalation policy — CRE life lessons
- Error Budget Calculator
- Understanding error budget overspend - part one - CRE life lessons
- Good housekeeping for error budgets - part two - CRE life lessons
- SRE fundamentals: SLIs, SLAs and SLOs
- SLOs & You: A Guide To Service Level Objectives
- Earning Our Wings: Stories and Findings From Operating a Large-scale Concourse Deployment
- Nines are Not Enough: Meaningful Metrics for Clouds
- How many nines is my storage system?
- Don't follow the sun.
- The Tyranny of the SLA
- Backblaze Durability is 99.999999999% — And Why It Doesn’t Matter
- DevOpsDays Chicago 2019 - The Art of SLOs
- The Art of SLOs Workshop Materials
- How to Include Latency in SLO-Based Alerting
- Performance Checklists for SREs
- South Bay SRE Meetup - Netflix Cloud Performance Team
- Software Performance Analysis Guided By SLOs
- A framework for pragmatic performance engineering
- Go Language for Ops and Site Reliability Engineering
- Go for SREs using Python
- Operability in Go
- Go Reliability and Durability at Dropbox
- What is SRE (Site Reliability Engineering)?
- Here’s How Google Makes Sure It (Almost) Never Goes Down
- Are site reliability engineers the next data scientists?
- Site Reliability Engineers: "solving the most interesting problems"
- Site Reliability Engineers: the "world’s most intense pit crew"
- Site reliability engineering kicks rote tasks out of IT ops
- Notes on Site Reliability Engineering
- Adventures in SRE-land: Welcome to Google Mission Control
- LinkedIn Preps Site Reliability Engineers (SREs) For Exciting Careers
- Book Review: Site Reliability Engineering - How Google Runs Production Systems
- Site Reliability Engineers: “We solve cooler problems”
- SREcon17: Brave new world of site reliability engineering
- Open AWS guide
- 20 SRE / Devops / System Engineer Tricks
- Commentary on Site Reliability Engineering
- Site Reliability Engineering: 4 Things to Know
- Looking for SRE Success? Then Find the Intrapreneurs!
- What Team Structure is Right for DevOps to Flourish?
- Injured on Vacation? Applying Principles from Site Reliability Engineering to a Travel Emergency
- Building blameless working environment
- SRE Adoption Report
- SREs: The Happiest – and Highest Paid – in the Industry
- The Role of Site Reliability Engineering, Today and Tomorrow
- SRE as a Lifestyle Choice
- SRECon EMEA 2019 Recap
- #sre channel at Hangops Slack - Discussion of Site Reliability Engineering generally.
- #incident_response channel at Hangops Slack - Discussion about Incident Response.
- USENIX SREcon Slack
- Brendan Gregg's Blog - Highly Technical Blog Posts About Systems Internals, Performance and SRE.
- Everything Sysadmin - Blog Posts About SysAdmin/DevOps/SRE by Tom Limoncelli.
- High Scalability - Technical Blog Posts About Systems Architecture.
- rachelbythebay - Technical Blog Posts.
- SRE Weekly - Weekly Site Reliability Newsletter.
- Production Ready - A mailing list about building resilient infrastructure and tools.
- Susan J. Fowler - Various blog posts about SRE, Software Engineering and Microservices.
- SysAdvent - One article for each day of December, ending on the 25th article.
- Operations for Developers - A collection of resources for developers to strengthen their Ops skills.
- Stephen Thorne's Blog - Blog Posts About SRE.
- Increment - A digital magazine about how teams build and operate software systems at scale.
- O’Reilly Systems Engineering and Operations Newsletter - Weekly systems engineering and operations news and insights from industry insiders.
- GopherSRE - Blog Posts about Go and SRE.
- Cindy Sridharan - Blog posts about distributed systems and their management.
- Blameless Blog - Blog posts about SRE culture and practices.
- Resilience Roundup - Weekly analysis of Resilience Engineering and Human Factors research designed for software systems.
- Squadcast Blog - Blog posts about SRE best practices, reliability, on-call and incident management.
- SRECon Conferences - The Official SRE Conference.
- LISA Conferences - Prominent Conference About SysAdmin/DevOps/SRE.
- SRE Tech Talks - SRE Talks Hosted by Google.
- South Bay Site Reliability Engineering (Sunnyvale, CA) Meetup - A Group For Individuals Who Tackle Reliability Challenges For Web-Scale Systems.
- San Francisco Reliability Engineering - A Group Of People Who Are Passionate About Reliable, Performant Software Systems.
- Front Range Site Reliability Engineering - SRE Meetup in Boulder/Denver/Golden/DTC/FoCo area.
- Site Reliability Engineering Munich, Germany - SRE Meetup in the greater area of Oktoberfest city.
- ADDO - All Day DevOps - A 24 hour conference that is completely online and free.
- Site Reliability Engineering Paris, France - SRE Meetup in the city of light.
- Google SRE Twitter Account - Google's SRE Twitter Account.
- SREBook - The Official Twitter Account of Site Reliability Engineering Book.
- SREcon - SRECon's Official Twitter Account.
- SREWorkbook - The Official Twitter Account of Site Reliability Workbook.
- The SRE Dev - SRE-related Posts from dev.to.
- Twitter SRE - The Official Twitter Account of Twitter's SRE team.
- Twitter SRE Weekly - The Official Twitter Account of SRE Weekly Newsletter.
- USENIX Association - The Official USENIX Twitter Account.
- Awesome SRE Tools - A curated list of Site Reliability and Production Engineering tools.
- List of Continuous Integration services