SRE - Site Reliability Engineering

By Keith Thuerk | SCIFI Future | 17 Aug 2022

The term SRE came out of Google in 2003; it was coined by Ben Treynor Sloss, VP of Engineering.

The two main goals of SRE: scalable and highly reliable enterprise infrastructures.

What is SRE? Site Reliability Engineering is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goals are to create scalable and highly reliable software systems. Operators who are responsible for reliability typically use a DevOps toolset. 100% reliability is not expected; failure is planned for and accepted!


  • Observability
  • Experimentation
  • Reporting
  • Culture
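
The idea that 100% reliability is not expected is commonly expressed as an "error budget": the amount of unreliability a reliability target leaves room for. Below is a minimal Python sketch of that arithmetic. The function names, the 30-day window, and the 99.9% target are assumptions chosen for illustration, not something the article prescribes.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime (in minutes) that an availability target allows over a window.

    slo_target is a fraction, e.g. 0.999 for "three nines" (assumed example).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # A 99.9% target over 30 days leaves roughly 43 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # After about 21.6 minutes of outage, roughly half the budget is left.
    print(round(budget_remaining(0.999, 21.6), 2))
```

A team might slow down releases when the remaining budget approaches zero and speed up when plenty is left, which is one way to plan for failure instead of trying to avoid it entirely.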

What makes up an SRE?

  • Software engineers who specialize in reliability, with no more than 50% of their time dedicated to ops (the best balance between velocity and stability)
  • They are the bridge between development and ops, and they are heavy into investigation. Taking over traditional ops with engineering skills, SREs combine the skills of software engineers with production and operations management to achieve high reliability and ensure that SLO/SLA/SLI targets are met.
  • Complements DevOps: leverages Infrastructure as Code (IaC) to automate, represents an evolution of IT Ops, works with 'Lean', and stores everything in a monorepo
  • Can incorporate chaos engineering, which SREs use to help identify weaknesses in deployments via tools such as LitmusChaos; AI can be added for automation
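
For readers new to the SLO/SLA/SLI terms above: an SLI (service level indicator) is a measurement, an SLO (service level objective) is the internal target for that measurement, and an SLA (service level agreement) is the contractual promise made to customers. The following is a rough Python sketch of an availability SLI checked against an SLO. The request counts and the 99.5% target are made-up example values, and the function names are hypothetical.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Availability SLI: the fraction of requests that were served successfully."""
    if total_requests < 0 or good_requests < 0 or good_requests > total_requests:
        raise ValueError("counts must satisfy 0 <= good_requests <= total_requests")
    if total_requests == 0:
        # No traffic means nothing failed; treat the SLI as fully met.
        return 1.0
    return good_requests / total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli >= slo_target


if __name__ == "__main__":
    # Assumed example: 9,990 good responses out of 10,000 requests.
    sli = availability_sli(9_990, 10_000)
    print(f"SLI = {sli:.4f}, meets 99.5% SLO: {meets_slo(sli, 0.995)}")
```

The SLO is usually set stricter than any SLA, so the team gets a warning before a contractual commitment is at risk.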

Engagement Model

Concerns production only, which includes:
  • System architecture and inter-service dependencies
  • Instrumentation, metrics and monitoring
  • Emergency response
  • Capacity planning
  • Change management
  • Performance — availability, latency and efficiency
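
Latency, named in the last item above, is usually tracked as a percentile rather than an average, because a handful of very slow requests can hide behind a healthy-looking mean. Here is a small Python sketch using the nearest-rank method. The sample values (in milliseconds) are invented for illustration, and `percentile` is a hypothetical helper, not part of any monitoring product.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    # Assumed request latencies in milliseconds, including one slow outlier.
    latencies_ms = [120, 80, 95, 300, 110, 105, 90, 2000, 100, 115]
    for pct in (50, 90, 99):
        print(f"p{pct} = {percentile(latencies_ms, pct)} ms")
```

In this sample the median stays close to 100 ms while the 99th percentile exposes the 2000 ms outlier, which is the kind of tail behavior a latency target is meant to catch.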

Not all SRE engagements are the same, nor are they all long term:

  • Baseline support: tactical, reactive, ad hoc support such as office hours or consulting projects, where the developers execute based on the advice received, or participation in the incident response team during larger-scale outages
  • Assisted engagement: SRE provides strategic, proactive, product-focused consultancy, with a dedicated SRE point of contact and a shared production roadmap; this can be an SRE temporarily embedded on a dev team for a critical product, where an SRE can be a force multiplier
  • Full support: SRE is the effective owner of production and takes on-call rotations to solve complex, less obvious production problems. The goal is for the SREs to automate themselves out of a job within 18 months.

Use Cases:

  • Chaos Engineering: understanding what breaks systems (not just the hardware, but the software stack too), which should dictate what is allowed to run in production.
  • Relied upon heavily by enterprises going through Digital Transformation (AKA DX).
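
To make the chaos engineering use case more concrete, here is a toy Python simulation of a single experiment: confirm the steady state, inject a failure, and check whether the system still serves traffic. It is only a sketch. `ReplicaPool`, `run_experiment`, and the 99% threshold are hypothetical, and the code does not use LitmusChaos or any real chaos tool.

```python
import random


class ReplicaPool:
    """Toy service: a request succeeds while at least one replica is healthy."""

    def __init__(self, replicas: list[str]):
        self.healthy = set(replicas)

    def kill(self, name: str) -> None:
        self.healthy.discard(name)

    def handle_request(self) -> bool:
        return len(self.healthy) > 0


def run_experiment(pool: ReplicaPool, requests: int = 100, seed: int = 0,
                   threshold: float = 0.99) -> dict:
    """Kill one random replica and compare the success rate to a threshold."""
    rng = random.Random(seed)
    # 1. Steady state: everything should work before any fault is injected.
    baseline_ok = all(pool.handle_request() for _ in range(requests))
    # 2. Inject the failure.
    victim = rng.choice(sorted(pool.healthy))
    pool.kill(victim)
    # 3. Observe behavior under failure.
    successes = sum(pool.handle_request() for _ in range(requests))
    success_rate = successes / requests
    return {
        "baseline_ok": baseline_ok,
        "victim": victim,
        "success_rate_after": success_rate,
        # Assumed gate: only deployments that survive the fault go to production.
        "allowed_in_production": baseline_ok and success_rate >= threshold,
    }


if __name__ == "__main__":
    print(run_experiment(ReplicaPool(["a", "b", "c"])))  # survives the fault
    print(run_experiment(ReplicaPool(["only"])))         # single point of failure
```

The three-replica pool passes because the survivors keep serving requests, while the single-replica pool fails the gate, which mirrors the point that experiment outcomes should dictate what is allowed to run in production.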

Summary - SRE roles continue to mature, are playing a huge role in digital transformation, and show how traditional IT skill silos are being broken down. How are your skills evolving to incorporate SRE? Are you keeping up?
