Site Reliability Part 1
At my company, my title is a DevOps Engineer. For now, I accept that DevOps is not a role, but rather a culture. But that’s not important right now.
What’s more notable is that my job involves a lot more than just building pipelines. To be precise, my day-to-day tasks mostly include:
- Maintaining and enhancing CI & CD pipeline - modifying Jenkinsfile…
- Deploying - modifying Dockerfiles, docker-compose, building and deploying custom builds…
- IT - whitelisting IP addresses, grant ssh accesses…
- Ops - monitoring, troubleshooting incidents…
among other things.
As such, I have realized that what I do is not DevOps. When you hear DevOps, you think CI & CD. But there’s this one critical mission that I’m also assigned - Ops.
Essentially, if the server suffers outtage, I’ll have to bring it back up and report.
The bad news is we have suffered many outtages lately that led me to think there’s something wrong with the way our DevOps team works.
Enters Site Reliability Engineer.
I heard this word sometimes ago and thought SRE is going to be in my career path eventually, but did not really look into it. Now that Ops works are giving me headache, I know it’s time I read about this.
I’m following a book from Google, called
Site Reliability Engineering - How Google Runs Production Systems
Having read only the Preface and Chapter 1 but I can tell it’s a worthwhile reading.
That’s why I’m starting this series - to follow this book and optionally put in my two cents where possible.
I want to update it at least once a month (promise to self).