Site Reliability Engineer (SRE): Whoops, tomorrow is Friday and I forgot to tell you I am going on vacation for two weeks starting next week.
Full Stack Developer: No big deal, I know how things work. And besides, everything is in Slack, in #walk-the-floor.
On our team, everyone is considered DevOps. We all “Walk the floor”. I hope this post gives you an idea of how it works in practice, how it helps quality, and how it helps us.
First, here is a little background context: It is 2019. I am a product owner (PO) on a remote-first team that was born in 2015 and is made up of about fourteen people. We are organized into a few squads. We span all North American time zones, India Standard Time, and Greenwich Mean Time. We operate like a startup within a large corporation. We focus on continuous delivery of features and enhancements and on 100% uptime for our customers. Committed code for our custom-built application (let’s call it the “A” system) goes immediately to production 90+ times a workday on a Kubernetes platform. We have full control of our pipeline and of the automated “gating” tests which prevent deployment to production if any one of them fails.
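To give a flavor of what a “gating” test can look like, here is a minimal, purely illustrative sketch in Python; the endpoint, paths, and thresholds are placeholders rather than our actual checks. The idea is simply that if any assertion fails, the CI job exits non-zero and nothing reaches production.

```python
# gating_smoke_test.py: hypothetical example of a pipeline "gating" test.
# If any assertion fails, the test run (and therefore the deploy) is blocked.
import requests

BASE_URL = "https://a-system.example.com"  # placeholder, not a real endpoint


def test_health_endpoint_responds_quickly():
    # The application must answer its health check within a tight budget.
    response = requests.get(f"{BASE_URL}/health", timeout=5)
    assert response.status_code == 200
    assert response.elapsed.total_seconds() < 2


def test_status_api_returns_expected_shape():
    # A representative read path must return well-formed JSON.
    response = requests.get(f"{BASE_URL}/api/v1/status", timeout=5)
    assert response.status_code == 200
    assert "version" in response.json()
```

The real gating suite is broader than a sketch like this, but the principle is the same: the deploy step refuses to run until every gate passes.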
As PO, I accept that every once in a while someone will unintentionally introduce something into production which creates a problem. I embrace that trade-off because it allows the whole system to move quicker. I accept responsibility for any fallout of that speed, for any impact when and if something bad happens in production. Everyone on the team participates in the weekly on-call rotation managed in PagerDuty. Everyone == me, the SRE, front-end developers, and full-stack engineers. “Walking the floor” is what the on-call person does: virtually walking the floor of the shop on a daily basis, in large part to notice what “normal” looks like. This novel approach is powerful given our different areas of expertise and individual lenses.
Quality Matters: The rotation of fresh eyes and diverse roles across both the front and back ends of the application yields some excellent discoveries. Examples: UI bugs some of us missed, logging that could be improved to pinpoint an error, an unaddressed memory leak that could eventually topple the system and cause an outage, or an important job that errored out and stopped without us finding out in a timely manner. When someone sees something interesting, undesirable, or likely to become a problem, they either fix it right away or automate it away so we never have to think about it again. All of this happens transparently and collaboratively via Slack. Our rotation fosters a deeper shared sense of ownership of quality.
As the person on-call, you do not want to be awakened by PagerDuty unless there is truly a problem. You do not want to have to awaken your teammates either. There was pain in this area early on with false alerts, so we quickly tackled the accuracy of our automated monitoring. In making the system better for ourselves from a supportability perspective, via continuous improvements in areas such as accurate monitoring, reliability, scalability, and performance, we raise the level of internal team happiness. And these boosts make the system better for the people who matter most: our customers. “Walking the floor” has fostered creativity, autonomy, and ingenuity, and thus nurtured even more “teaminess”.
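As an illustration of the kind of tuning that cuts down false pages, here is a hedged sketch: instead of paging on a single blip, the check only wakes someone after several consecutive failures. The health URL and routing key below are placeholders, and the PagerDuty call simply follows the public Events API v2; this is not our actual monitoring code.

```python
# flaky_check_guard.py: illustrative sketch, page only after N consecutive failures.
import time

import requests

HEALTH_URL = "https://a-system.example.com/health"  # placeholder
PAGERDUTY_ROUTING_KEY = "YOUR-ROUTING-KEY"          # placeholder
FAILURES_BEFORE_PAGING = 3                          # tolerate brief blips
CHECK_INTERVAL_SECONDS = 60


def check_once() -> bool:
    """Return True if the health endpoint looks healthy right now."""
    try:
        return requests.get(HEALTH_URL, timeout=5).status_code == 200
    except requests.RequestException:
        return False


def page_on_call(summary: str) -> None:
    """Trigger a PagerDuty incident via the Events API v2."""
    requests.post(
        "https://events.pagerduty.com/v2/enqueue",
        json={
            "routing_key": PAGERDUTY_ROUTING_KEY,
            "event_action": "trigger",
            "payload": {"summary": summary, "source": "a-system", "severity": "critical"},
        },
        timeout=10,
    )


def main() -> None:
    consecutive_failures = 0
    while True:
        if check_once():
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures == FAILURES_BEFORE_PAGING:
                page_on_call("A system health check failed several times in a row")
        time.sleep(CHECK_INTERVAL_SECONDS)


if __name__ == "__main__":
    main()
```

The design choice is the point rather than the code: a little hysteresis in the alerting means the person on call is only woken for problems that persist.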
Rolling it out: The two sections which follow, “Guidelines” and “Field Notes Please”, are what we wrote for our team when we formalized this operational practice. I hope they give you an idea of how this works.
Guidelines
The “A” system is fairly simple, but we need to make sure it’s working well, as our jobs rely on it. The whole team should work together to support each other in making this happen.
When you are on the “A” production schedule in PagerDuty you get to be the person who guides the system on that day. If there is an outage then you should gather people who can help in #swat-prod-down. Your experience with the system and the team should help you work out who can be useful.
You want to avoid emergencies once you are finished for the day, so it pays to look at the state of the world:
- when you go on duty
- when you start work for the day
- when you are finishing up for the day
- if the team are doing something exciting such as releasing a major feature
The sorts of things to investigate change over time:
- we use different technologies
- different parts are under stress
- people are focused on different areas (such as during demos)
You need to exercise your judgement over what is worth investigating. You have access to all the things you need to review. You can find links to everything you need in Amazing tools. Get familiar with the tools, suggest improvements, discuss with your teammates.
Here are some examples of things we look at today:
- “loglevel:error” in logmet to see if things are falling over
- resources for wdc0105wtp00* in New Relic
- outages in things tagged with “amazing” in Pingdom
- problems reported in #a-monitoring
- the pulse of the system in #a-node-chatter
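Purely as an illustration, a sweep like this can be partially scripted. The sketch below, with placeholder URLs and a Slack incoming webhook, runs a couple of basic checks and posts a one-line summary per check to #walk-the-floor. The human judgement described above is still the point; a script like this just saves some clicks.

```python
# daily_sweep.py: hypothetical helper for the on-call "walk the floor" sweep.
# Runs a few basic checks and posts a summary to Slack via an incoming webhook.
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder
CHECKS = {
    "health endpoint": "https://a-system.example.com/health",    # placeholder
    "status API": "https://a-system.example.com/api/v1/status",  # placeholder
}


def run_checks() -> list[str]:
    """Hit each URL and record a one-line result per check."""
    results = []
    for name, url in CHECKS.items():
        try:
            code = requests.get(url, timeout=5).status_code
            status = "OK" if code == 200 else "WARN"
            results.append(f"{status} - {name}: HTTP {code}")
        except requests.RequestException as exc:
            results.append(f"DOWN - {name}: {exc}")
    return results


def post_to_slack(lines: list[str]) -> None:
    """Share the sweep results in #walk-the-floor."""
    text = "Walk-the-floor sweep:\n" + "\n".join(lines)
    requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=10)


if __name__ == "__main__":
    post_to_slack(run_checks())
```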
The main thing to remember is that we’re all here to help each other. If you’re not sure then get help – don’t panic!
Field Notes Please
Note and share what you notice (URLs, screenshots!) in #walk-the-floor as you take a look at things. It keeps your teammates informed and also lets them help you learn how to proactively monitor and explore the health of the “A” system ecosystem. Your teammates will read and learn from your #walk-the-floor field notes and gain insight into what can be inspected and why it matters. They will probe, ask questions, and give feedback. This builds the foundation for you and for the team for continuous improvement in our operations.
In the event of an emergency, because you have practiced and have experience, you know who to partner with to inspect multiple aspects of the system. As you gather people who can help in #swat-prod-down, you and the folks helping post notes and screenshots of the details of your investigation, just as you practiced making field notes in #walk-the-floor, while you pinpoint the issue and bring things back to a happy state. This makes for swift root cause determinations, also known as Blameless Post-Mortems.
Parting thoughts: This team has a culture of operational excellence, a shared sense of ownership for multiple dimensions of quality, accountability to one another, and autonomy. We have fun with this, too. #walk-the-floor commentary is often entertaining and sometimes leads to short, casual, ad hoc video conference sessions where we hang out and chitchat while inspecting things.
- Does everyone on the team love being on call? No, and yet I have seen individuals grow in experience and confidence.
- Was it beneficial anytime our gating tests stopped the pipeline? Yes, we consistently learn from these events and have more tuning to do.
- Are critical situations handled well? Yes, in that we isolate the issue quickly with an “all hands on deck” swarm in video conference and #swat-prod-down, and we communicate status immediately to our stakeholders.
Attributions: David Bowen is the author of the “Guidelines” section. Thank you, David, for your “good-finding-mind” and encouragement on this post. David Bowen, Mitch Ferrer, Diego Cadenas, Rob Sebestyen, thank you for being champions of the process.
As always, here is a musical accompaniment: Coldplay’s “Don’t Panic.”