Throwing Seven Different Kinds of Smoke

How the Site Reliability Engineers are changing the face of one small eCommerce platform.

By Ashley McGee • Published 3 years ago • Updated 3 years ago • 7 min read

SRE stands for "pulled in many directions" --Co-Worker

About six months ago, the business that I work for transitioned our tier 2 support team into a full-scale site reliability engineering team. Up until that point, there was no real liaison between Support, Engineering and DevOps to resolve issues of site reliability, and our DevOps team was fully investing their time in infrastructure maintenance, monitoring, and automation of routine maintenance tasks.

About two months ago, I joined the brand new SRE team, moving up from Tier 1 Support. Since joining that team, the positive impact I was experiencing in my work in Support hasn't dropped off in the slightest. If anything, my degree of separation from our merchants gives me the time I need to pursue problems in full to see if I can funnel platform issues up to Engineering, address bugs in new software rollouts, respond to priority incidents as they occur, and assist DevOps and Database in building a better architecture.

But what is Site Reliability Engineering, and how could five people be making that big of a difference?

What is Site Reliability?

Site Reliability was sort of all-the-way invented by Google. Google needed a way to make sysadmins and developers work together to ensure the highest amount of uptime possible while not choking the life out of new features. Investors want to see innovation. Users want to see stability. SRE is the solution to that problem. The role is by no means pure in many organizations, but the goal is the same: protect the health of the software, the hardware, and the infrastructure to maintain the highest amount of stability so the business can thrive.

For our org, when I say "site reliability" I mean exactly that: since we host the eCommerce websites, our job as SREs is literally to ensure that those sites have the maximum amount of uptime possible. This might be a little different from a traditional SRE role, where the uptime of an app, deliverable, or service, along with the maintenance of software and hardware that it entails, is the primary--or sole--concern of the organization. Our service is websites, and websites need to be up. On the other end of those websites are people trying to run a business. On the other end of their businesses are customers prepared to spend their hard-earned money. In addition to being up, those sites need to work as intended from the standpoint of every active user, from the most highly skilled CMS ninja to someone selling their artwork on the side.

Our organization does differ--at the moment--from other SRE orgs because we are customer-facing. Our team was not originally SRE; it was more of a Tier 2 Support team. Because we have so much experience with the software, though, we're pulled in a lot of directions. Developers need new servers for staging, support reps need DNS records installed or updated, and DevOps requires routine maintenance for PCI compliance. Add the needs of the QA team, and we don't sit still for long.

In an effort to reshape our product, we've gotten back to our roots (whether or not that's a good thing remains to be seen), and we've sent out a number of feature updates to improve the quality of our product for some of our highest performing users. Sadly, other internal things sprang up that prevented us from properly vetting the user experience of some of those features.

This is where SRE came in clutch, and not in a vague statistical sense. I'm talkin' just yesterday.

The Never-Ending Tale of Bashing the Bug

If there was ever any proof that our jobs directly impacted the success of a medium-sized business, it was literally yesterday morning.

All week we were getting reports of a very specific bug, which we had plenty of experience replicating from the customer's perspective, but that the QA team was having a dog of a time replicating in the QA environment. When customers report bugs to merchants, we do as much due diligence as we can about the issue and document it so the QA team can replicate it and provide proof to the developers so the team can correct it with a hot fix. This occurs most often on our newer features, and this was no minor inconvenience; it was breaking the checkout page for hundreds of people.

This is the hard part: when QA can't make the bug happen under normal conditions. The abnormal conditions are typically what produce these errors, but until we get someone to show us exactly what someone was doing when they got the error or bug, we get left guessing what the abnormal condition could have been. We get taken down a lot of rabbit holes that go nowhere. That behavior may coincide with certain circumstances, but can't be proven to be caused by them. This is the never-ending struggle between customers, merchants, and support: it's happening, but we can't prove it.

That was not the case with this bug. I knew exactly how I had replicated it, so I got with a member of QA yesterday morning to prove this was indeed happening and that it deserved to get sent to Engineering. Since I knew I'd forced the bug in a single set of circumstances, I was quick to recreate the scenario, but of course, as usual, I couldn't replicate it in front of the QA engineer. However, neither of us was willing to give up. We hit the site with two different user agents, with three different devices. Eventually I did force the error on both a mobile and desktop device, and at almost the exact same time, the QA engineer did as well. I say "eventually". The whole thing took maybe 20 minutes.

I was able to provide the documentation and screenshots of the request and response to and from a particular security service, which was reporting that we were updating the service too many times and was flagging the attempts as malicious. This explained why no one was able to replicate it during random testing. We weren't hitting the sites often enough at one time. Most common tests select a very specific scenario and may get one or two attempts in before confirming or denying the behavior, and then we move on. We hit that site about 6 times in the span of about 10-15 minutes, which is far less than the site typically sees for regular traffic. Turns out it had to do with velocity, not settings or user error, as we all had thought.
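For anyone who wants to see why velocity matters, here is a minimal sketch in Python of a sliding-window check. It is not the security service's actual logic, which I never saw. The class and function names, the threshold of 5 hits, and the 15-minute window are all made-up assumptions for illustration. It only shows how a one-or-two-attempt spot check can pass while a burst of six attempts in a short span gets flagged.

```python
from collections import deque


class VelocityCheck:
    """Hypothetical check: flag a client that exceeds max_hits in window_seconds."""

    def __init__(self, max_hits, window_seconds):
        self.max_hits = max_hits
        self.window_seconds = window_seconds
        self.hits = deque()  # timestamps of recent requests, oldest first

    def allow(self, now):
        # Forget requests that have aged out of the window.
        while self.hits and now - self.hits[0] >= self.window_seconds:
            self.hits.popleft()
        # Every request counts toward the window, even one that gets flagged.
        self.hits.append(now)
        return len(self.hits) <= self.max_hits


def first_flagged(timestamps, max_hits=5, window_seconds=900):
    """Return the 1-based position of the first flagged request, or None."""
    check = VelocityCheck(max_hits, window_seconds)
    for position, t in enumerate(timestamps, start=1):
        if not check.allow(t):
            return position
    return None


# A typical spot check: two attempts, then move on (times in seconds).
spot_check = [0, 60]
# A replication session: six attempts spread over about twelve minutes.
replication = [0, 120, 240, 400, 560, 720]

print(first_flagged(spot_check))   # None
print(first_flagged(replication))  # 6
```

With these assumed numbers, the spot check never trips the limit, while the sixth attempt in the replication session does. That matches the shape of what we saw, even if the real thresholds were different.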

We reported the replication immediately. Engineering traced the source of the problem, and corrected it in real time. I sent reports out to my merchants requesting an update, but never heard back. However, this isn't to say they stopped paying attention. As someone who had previously worked in a small business with an eCommerce store, I know that merchants have to trust that the people making the software are going to fix the issue. Once it was resolved, my attention would immediately have been diverted elsewhere, like packing orders or the financial close. The fact that no one was jumping for joy or raising the roof is not an indication that the fix didn't work. Arguably, it's an indication that it did work. If we never hear back from anyone about it, that's fine by us.

Without SRE to advocate for the merchants to the QA team, who then advocates for us to Engineering, we would not have resolved this platform-breaking issue, and our merchants would have lost more sales. That not only directly impacts the people running the business, it impacts us, because they would have no choice but to find another vendor. One small error can lead to thousands, if not hundreds of thousands, in lost revenue for everyone, and finding the source or trigger for that bug or error is perhaps one of the most satisfying feelings I can imagine.

Sounds Exhausting. Why Do It?

My home office consisting of my personal MacBook Air and my work rig.

Things are probably going to stay pretty hectic for a while. That is our present reality. But I'm going to show up to that keyboard anyway, ready to come onto the floor throwing seven different types of smoke to make sure questions get answered, problems get solved, and we're putting out the best quality product possible. I'm not exactly building flashy landing pages or designing multi-million dollar click funnels, but at the end of the day, I'm helping to protect the health of our business by protecting the health of thousands of others.

But what is it that really makes this job special? I'm not just sitting in a chair watching monitors all day. I'm one of the first people to respond to a problem and one of the last people to leave it, and I get to solve those problems every day. When those problems get solved, our merchants feel it. When I succeed, my merchants succeed, and there are merchants that I have watched grow their business from the ground up. I've been present for their first sale on their freshly launched site. I've met amazing business owners achieving their goals against--literally--all odds, and I've seen that the people behind the software can and do make a difference in the lives of the users.

It's the people that drive industry, not the other way around, and at least at our organization, helping people is and always will be the first function of SRE.
