Site Reliability Engineering
Training Mode: Live Instructor-led (online) | Corporate Training | Job Support
Course Duration: 48 hours
Trainer: Certified and expert instructors
Site Reliability Engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goals are to create scalable and highly reliable software systems. According to Ben Treynor, founder of Google’s Site Reliability Team, SRE is “what happens when a software engineer is tasked with what used to be called operations.”
- Training by experienced, real-time trainers
- Live online classes
- Free study material
- Online virtual classes available in the morning, in the evening, and on weekends
There are no specific prerequisites, but IT, operations, or DevOps experience is recommended.
We guarantee learning at your convenience & pace.
- Instant Access: Get instant access to self-paced training after signup.
- Streaming video recording: Watch free recordings of the lessons any time, on your own schedule.
- Exercises: Practical exercises help you test what you are learning as you go.
- Free Demo: Sign up for a free demo to check whether the course is right for you and interact with the faculty live.
- Experienced Trainers: We only hire the industry’s best trainers.
- Live free interactive web sessions: Ask the expert trainers about career prospects and clarify your questions any time after you complete the course.
- Structured Curriculum Schedule: Progress through daily interactive lessons and assignments.
- Faculty Mentoring: Turn in daily and weekly homework for personalized feedback from faculty.
- Virtual Office Hours: Interact live with the faculty and other students around the world.
- Hands-on Live Projects: Work on live lab sessions to tackle real-world projects, with full faculty guidance and ratings.
Introduction
- The Sysadmin Approach to Service Management
- Google’s Approach to Service Management: Site Reliability Engineering
- Tenets of SRE
- Demand Forecasting and Capacity Planning
- Efficiency and Performance
The Production Environment at Google, from the Viewpoint of an SRE
- Hardware
- System Software That “Organizes” the Hardware
- Storage
- Networking
- Monitoring and Alerting
Principles
- Embracing Risk
- Managing Risk
- Motivation for Error Budgets
- Benefits
Service Level Objectives
- Service Level Terminology
- Indicators in Practice
- What Do You and Your Users Care About?
- Agreements in Practice
Eliminating Toil
Monitoring Distributed Systems
- Why Monitor?
- Setting Reasonable Expectations for Monitoring
- Symptoms Versus Causes
- Black-Box Versus White-Box
- As Simple as Possible, No Simpler
- Bigtable SRE: A Tale of Over-Alerting
- Gmail: Predictable, Scriptable Responses from Humans
The Evolution of Automation at Google
- The Value of Automation
- A Platform
- Faster Repairs
- Faster Action
- Automate Yourself Out of a Job: Automate ALL the Things!
- Resolving Inconsistencies Idempotently
- Borg: Birth of the Warehouse-Scale Computer
Release Engineering
- The Role of a Release Engineer
- Philosophy
- Self-Service Model
- Testing
- Packaging
- Configuration Management
Simplicity
- System Stability Versus Agility
- The Virtue of Boring
- Minimal APIs
- Modularity
- Release Simplicity
Practical Alerting from Time-Series Data
- Instrumentation of Applications
- Collection of Exported Data
- Storage in the Time-Series Arena
- Labels and Vectors
- Alerting
Being On-Call
- Life of an On-Call Engineer
- Balanced On-Call
- Balance in Quantity
- Balance in Quality
Effective Troubleshooting
- Theory
- In Practice
- Problem Report
- Triage
Emergency Response
- What to Do When Systems Break
- Test-Induced Emergency
- Response
- Keep a History of Outages
Managing Incidents
- Unmanaged Incidents
- Poor Communication
- Freelancing
- Live Incident State Document
- Clear, Live Handoff
- A Managed Incident
Postmortem Culture: Learning from Failure
- Google’s Postmortem Philosophy
- Collaborate and Share Knowledge
Tracking Outages
- Escalator
- Outalator
- Aggregation
- Tagging
- Analysis
Testing for Reliability
- Types of Software Testing
- Traditional Tests
- Production Tests
- Testing at Scale
Software Engineering in SRE
- Why Is Software Engineering Within SRE Important?
- Traditional Capacity Planning
- Intent-Based Capacity Planning
- Fostering Software Engineering in SRE
Load Balancing at the Frontend
- Power Isn’t the Answer
- Load Balancing Using DNS
- Load Balancing at the Virtual IP Address
Load Balancing in the Datacenter
- The Ideal Case
- Identifying Bad Tasks: Flow Control and Lame Ducks
- A Simple Approach to Unhealthy Tasks: Flow Control
- A Robust Approach to Unhealthy Tasks: Lame Duck State
Handling Overload
- The Pitfalls of “Queries per Second”
- Per-Customer Limits
- Client-Side Throttling
- Criticality
- Handling Overload Errors
- Deciding to Retry
- Load from Connections
Addressing Cascading Failures
- Causes of Cascading Failures and Designing to Avoid Them
- Server Overload
- Resource Exhaustion
- Service Unavailability
- Preventing Server Overload
- Queue Management
- Planned Changes, Drains, or Turndowns
- Testing for Cascading Failures
Managing Critical State: Distributed Consensus for Reliability
- Motivating the Use of Consensus: Distributed Systems Coordination Failure
- Case Study 1: The Split-Brain Problem
- Case Study 2: Failover Requires Human Intervention
- Case Study 3: Faulty Group-Membership Algorithms
- How Distributed Consensus Works
- Paxos Overview: An Example Protocol
- System Architecture Patterns for Distributed Consensus
- Reliable Replicated State Machines
- Reliable Replicated Datastores and Configuration Stores
Distributed Periodic Scheduling with Cron
- Cron
- Reliability Perspective
- Cron Jobs and Idempotency
Data Processing Pipelines
- Origin of the Pipeline Design Pattern
- Initial Effect of Big Data on the Simple Pipeline Pattern
- Challenges with the Periodic Pipeline Pattern
Data Integrity: What You Read Is What You Wrote
- Data Integrity’s Strict Requirements
- Choosing a Strategy for Superior Data Integrity
- Data Integrity Is the Means; Data Availability Is the Goal
- The 24 Combinations of Data Integrity Failure Modes
- First Layer: Soft Deletion
- Second Layer: Backups and Their Related Recovery Methods
- Overarching Layer: Replication
- Third Layer: Early Detection
- Trust but Verify
- Hope Is Not a Strategy
Reliable Product Launches at Scale
- Launch Coordination Engineering
- The Role of the Launch Coordination Engineer
- Setting Up a Launch Process
- Capacity Planning
- Failure Modes
Accelerating SREs to On-Call and Beyond
- You’ve Hired Your Next SRE(s), Now What?
- Initial Learning Experiences: The Case for Structure Over Chaos
- Learning Paths That Are Cumulative and Orderly
Dealing with Interrupts
- Managing Operational Load
- Factors in Determining How Interrupts Are Handled
- Imperfect Machines
- Cognitive Flow State
Embedding an SRE to Recover from Operational Overload
- Phase 1: Learn the Service and Get Context
- Identify the Largest Sources of Stress
- Identify Kindling
- Phase 2: Sharing Context
- Write a Good Postmortem for the Team
- Sort Fires According to Type
- Phase 3: Driving Change
- Start with the Basics
Communication and Collaboration in SRE
- Communications: Production Meetings
- Collaboration within SRE
- Team Composition
The Evolving SRE Engagement Model
- SRE Engagement: What, How, and Why
- The PRR Model
- The SRE Engagement Model
- Alternative Support
- Production Readiness Reviews: Simple PRR Model
- Engagement
- Analysis
Lessons Learned from Other Industries
- Meet Our Industry Veterans
- Preparedness and Disaster Testing
- Relentless Organizational Focus on Safety