Chapter 1 Introduction Part 1

10 Dec 2019

What comprises of a Google SRE Team?

A Google SRE team is responsible for the service’s:

A Google SRE team has around 50-60% SE. The rest are people with 85% SE skillset required for SE with additional knowledge useful to SRE but uncommon to SE (Unix internals & networking layer 1-3).

SRE places a heavy focus on Engineering. They are people who are willing to develop program to solve complex problems.

To ensure SRE teams have enough time for development tasks, Google limit the time spent on aggregate ops work to 50%. Services should run & repair themselves - Google wants automatic system, not just automated.

How do they enforce the 50% time on development tasks?

A few note on operating at GG:

Google operates under a blame-free postmortem culture, with the goal of exposing faults and applying engineering to fix these faults, rather than avoiding or minimizing them.

Thoughts and Reflections

Our DevOps team has 03 members, including me.

I have background on software development and I love coding. It was actually just a year ago that I decided to be a DevOps Engineer - let’s not debate whether or not it’s a role here.

My decision was not due to fear of coding/software development. I changed because I wanted different perspectives and more indepth knowledge of system internal workings.

So getting back to coding sounds amazing. Actually, after nearly a year tweaking configurations, provisioning AWS resources, setting up monitoring tools, hacking away at bash scripts (to be fair, I always try to apply whatever I know as a software developer to writing scripts), going back to software development feels refreshing.

Google mentions alot about having SREs involve in development tasks and how it enables easier transfering members between product development and SRE.

I wonder what exact development work are they talking about?

Are they letting the SRE members taking part in developing product services like Gmail or do they men SRE develop tools for use internally in operation?

If so, the next big question is:

For medium startup like my company, is spending resources on building internal tools for operation better than making use of community-avaiabile tools?

I hope following chapters will shed light on this.