From Data Outage to Four Nines Uptime — Shopify App Development (2021)


Thousands of Shopify stores rely on Littledata’s ecommerce platform to get the most complete information about how, why, and when a customer ordered from their store.

Our core product to accomplish this is server-side tracking for Shopify stores. We do it by listening to webhooks from Shopify and relaying enriched, de-duplicated, formatted data to the chosen data destination (such as Google Analytics or Segment).

Littledata now processes hundreds of millions of data points a month.

But it’s not always smooth sailing. 

When we hit a rare outage or similar data dilemma, our team’s principles ensure we learn from it. That’s helped us improve not only our processes, but our app itself.

I’d like to explain how we learned from one particular outage to build a data pipeline with four nines of reliability (99.99 percent), and earned the trust necessary to handle that volume of financial and customer data.

Our biggest outage

Let me take you back to October 2019, in what feels like a bygone age when international travel and IRL events were possible. I was attending our partner ReCharge’s ChargeX conference in Santa Monica, CA.

As I was enjoying some post-conference downtime on Santa Monica pier, I started getting Slack alerts that a couple of customers were missing orders in their analytics.

Our core data pipeline had been running reliably on a DigitalOcean server setup for over a year. We were pretty confident that over 30 Docker engines behind a load balancer were a robust way to spread the requests. So, my initial assumption was that a setup issue was blocking these customers from getting orders in their data destination.

Our entire engineering team works from Europe, but luckily one of our senior engineers was up late and we started investigating the issue in detail. By now, we had yet more reports of customers losing order data, so it seemed to be a bigger outage.

What we found was disturbing!

Mistake #1: No in-depth server metrics

Here’s a high-level view of how our event processing infrastructure was set up.

Littledata’s first attempt at a server-side tracking setup spread requests across 30 Docker engines. Source: Littledata.

We had uptime monitoring pinging the endpoints and receiving data from Shopify every minute—but we weren’t monitoring server response time. While the event volumes our servers handled had been ticking up by about 20 percent a month, the server infrastructure hadn’t grown proportionally.

Our servers were taking longer and longer to respond to the webhooks coming from Shopify, and once the response time went past 30 seconds, Shopify considered them to have timed out.

This lack of awareness about server response time hid another issue: as the incoming requests piled up, all the extra concurrent Node.js request handlers were eating up memory. Node is a strong choice for high-throughput processing, but is weak when the event loop gets blocked.
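To illustrate the event-loop weakness: a handler that processes a large batch synchronously holds the event loop until it finishes, so incoming webhooks queue up behind it. A common mitigation is to yield between batches. This is a hypothetical sketch, not Littledata’s production code:

```javascript
// Hypothetical sketch: `enrich` stands in for the per-event work a
// tracking pipeline does.
function enrich(event) {
  return { ...event, enriched: true };
}

// Synchronous version: blocks the event loop for the whole array, so no
// other request can be served until it completes.
function processEventsBlocking(events) {
  return events.map(enrich);
}

// Chunked version: awaiting setImmediate() between batches hands control
// back to the event loop, so pending I/O (such as new webhook requests)
// gets a chance to run.
async function processEventsChunked(events, batchSize = 100) {
  const out = [];
  for (let i = 0; i < events.length; i += batchSize) {
    for (const e of events.slice(i, i + batchSize)) out.push(enrich(e));
    await new Promise((resolve) => setImmediate(resolve));
  }
  return out;
}
```

The chunked version trades a little latency per batch for responsiveness under load.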

To save on server costs, we’d been hosting multiple Docker engines on a single DigitalOcean “droplet.” The memory metrics on the droplet itself looked healthy, but underneath, many of the engines had exhausted their memory allocation and become unresponsive. So the Docker swarm manager was routing incoming requests to a shrinking pool of workers, and we didn’t know.

Our emergency solution was to shift to one Docker engine per droplet. That way, we could at least see which engines were genuinely out of memory.

Mistake #2: No queueing of incoming requests

 


We scaled up the infrastructure on DigitalOcean and kept monitoring every hour for a day. But the next day, the server metrics hadn’t improved. That’s when we realized we were getting a cascading wave of webhook retries.

Shopify automatically retries webhooks with exponential back-off up to 72 hours after the first failure. So by day two, we were getting retries from 48 hours ago, 24 hours ago, 12 hours ago, and so on, hitting us wave after wave.
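The doubling pattern is easy to see in a few lines. The exact intervals Shopify uses are an assumption here; only the exponential back-off and the 72-hour window come from the text above:

```javascript
// Hypothetical retry schedule: each delay doubles, and retries stop once
// the next attempt would fall outside the 72-hour window.
function retryDelaysHours(firstDelayHours = 1, windowHours = 72) {
  const delays = [];
  let elapsed = 0;
  let delay = firstDelayHours;
  while (elapsed + delay <= windowHours) {
    delays.push(delay);
    elapsed += delay;
    delay *= 2;
  }
  return delays;
}
// e.g. retryDelaysHours() -> [1, 2, 4, 8, 16, 32]
```

The practical consequence: an endpoint that recovers on day two doesn’t see a gentle trickle, it sees every overdue retry wave arriving on top of live traffic.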

And scaling to more than 20 DigitalOcean droplets was becoming hard to manage by hand, with all the repetitive clicking required to restart droplets.

Mistake #3: Lack of a crisis response plan

At this point, only one other engineer and I were managing the crisis. There was a 10-hour time difference between us, and I was also in meetings with partners and organizing my family for the flight home. Frankly, we were stretched.

On top of that, we were fielding a flurry of questions from our internal team—and customers—on the outage. What we lacked was a battle-tested response plan to follow.

As a temporary measure, I took on the internal communications, delegated the external communications to our head of customer service, and left our senior engineer to focus on the infrastructure. This helped get us into day three of the crisis.

Mistake #4: Too many false alerts

Shopify sends email notifications to app developers that webhooks are failing for individual stores. We were now getting hundreds of these alerts daily, so I told our customer support team to ignore and archive the alert emails. We knew what the problem was, and we were knee-deep in fixing it.

What I didn’t realize was that hidden in this flood of emails from Shopify was a different kind of email (one which looked very similar when forwarded to our Intercom support inbox) telling us that the webhooks had failed so many times that Shopify had removed them.

Mistake #5: No end-to-end measurement of orders

We’d reached a point where the server metrics we’d hurriedly added were looking healthy, but stores were still not getting orders in their data destination, because Shopify had stopped pinging us.

It took us way too long to figure out the extent of the outage because we were not measuring end-to-end event delivery. We knew that our servers were relaying event data, but not whether they were receiving all the data they should be.

We were also spending too long summing up snapshots of logs to guess what the order throughput was for a particular store rather than being able to see the bigger picture hour by hour.

Mistake #6: No fallback processing

When we realized our mistake in letting Shopify remove the unhandled webhooks (and added them back via Shopify’s admin API), we had no way to recover the lost data.

Without a fallback system that could go back in time and pull out orders our main server infrastructure had missed, we had to leave a permanent gap of a few days in those stores’ analytics.

After four long days of anxiety, we finally had all the data flowing normally. But many of our customers were understandably angry at how long this outage lasted. The breakdown in trust took some time for our customer success team to repair.


What we learned

For major issues at Littledata, we’ve copied a technique from hedge fund mastermind Ray Dalio’s book Principles: run a root cause analysis so we can learn from each one.

Work Principle #3: “Create a culture where it is acceptable to make mistakes and unacceptable not to learn from them.” Source: Principles by Ray Dalio.

Now when a large-scale mistake happens, we get everyone involved on a call and ask the Five Whys to trace how it went wrong.

In the case of this outage, the root causes were not so much a particular technical component failing, but more that our whole attitude to reliability needed to change.

“In the case of this outage, the root causes were not so much a particular technical component failing, but more that our whole attitude to reliability needed to change.”

To make sure this kind of outage could never happen again, we needed to fix not just one thing, but multiple aspects of processes, technology, and our team’s work.

Learning #1: Prioritize internal dashboards

Running production software without having good metrics on how the services are performing is like driving a car with a blindfold on. However skilled you are, it’s only a matter of time before you crash.

“Running production software without having good metrics on how the services are performing is like driving a car with a blindfold on. However skilled you are, it’s only a matter of time before you crash.”

Over the months following the outage, we built health checks for all kinds of internal services. To this day, a key objective of our engineering team is to find out about outages before our customers ever notice.

This was helped by moving to AWS Lambda, where we can easily set up alerts from AWS CloudWatch.

Learning #2: Add a message queue

Working with a high-frequency data stream requires queueing incoming messages, so that in the event of an infrastructure outage our customers experience a data delay rather than data loss.

We chose AWS SQS as a proven solution. Initially, we built a small web service to accept webhooks from Shopify and dump them to SQS, but we were able to replace it within a few months when Shopify began natively supporting AWS EventBridge.
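That interim shim can be sketched in a few lines. The queue client is injected so the sketch stays self-contained; in production it would wrap the AWS SDK’s SQS SendMessage call, and all names below are illustrative rather than our actual service:

```javascript
// Hypothetical sketch of the accept-and-enqueue webhook service: push the
// payload to the queue and return 200 immediately, so Shopify never sees
// a slow response or a timeout. Heavy processing happens later, off the
// queue, at whatever pace the workers can sustain.
function makeWebhookHandler(queue) {
  return async function handle(topic, body) {
    await queue.send(JSON.stringify({ topic, body, receivedAt: Date.now() }));
    return { statusCode: 200 };
  };
}
```

With this split, a downstream outage fills the queue instead of dropping events.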

In 2020, we moved to a serverless setup. The workers that handle messages off the queue scale automatically with load, and are much easier to manage than dozens of cloned droplets.

Littledata’s serverless setup uses AWS EventBridge and SQS to handle outages and peak load, with a separate reconciliation job to pick up missed webhooks. Source: Littledata

If incoming events fail to get processed, we send them to a dead-letter queue for further analysis.

Learning #3: Create a crisis response plan

The middle of a crisis is too late to be thinking about management responsibilities and reporting lines.

So we’ve evolved a crisis-response plan which includes:

  • Who leads on the investigation, internal communication, and external communication
  • How to triage an incident and decide when we need to pull more team members into the response
  • What the expected response times are at each stage
  • When to notify customers
  • Maintaining a status page

Every incident is a bit different, but we are now much faster in communicating status updates during a data disruption. This becomes even more important as we scale our global team.

Learning #4: Reduce error alerts in production

Ironically, we had been notified of the secondary effect of the outage (Shopify removing the webhooks after 72 hours of failures) but never spotted that notification in the thicket of other alerts.

Since then, we’ve reduced alerting noise by fixing production errors even when they have no impact on customer data. That way, when a real problem comes along, the alert stands out.

Ultimately, having 100 alerts per day that something is wrong is worse than having no alerts—it’s only human to ignore them all.

Learning #5 and #6: Throughput reporting and reconciliation

Our head of product saw the biggest problem not as the infrastructure weakness itself, but as the fact that it was too time-consuming to work out whether there was any problem with end-to-end throughput from our data source (Shopify) to the data destinations (Google Analytics and Segment).

When the next crisis came, the root cause would be different, but we still needed to know where in the processing chain the issue was.

Over the following six months we built out a backup system which, on an hourly basis: 

  • Checked orders in Shopify
  • Checked orders we had logged as processed
  • Checked orders received in the destination
  • Queued any orders that were missed for reprocessing
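The four steps above can be sketched as a single reconciliation function. The three data sources are injected here so the sketch stays self-contained; the real versions would call the Shopify Admin API, our processing log, and the destination’s reporting API respectively, and every name below is illustrative:

```javascript
// Hypothetical hourly reconciliation: compare order IDs at each stage of
// the chain, requeue anything missed, and report throughput for the hour.
async function reconcileHour(sources, requeue) {
  const inShopify = await sources.shopifyOrderIds();              // step 1
  const processed = new Set(await sources.processedOrderIds());   // step 2
  const delivered = new Set(await sources.destinationOrderIds()); // step 3
  // Locate where in the chain each order went missing:
  const neverReceived = inShopify.filter((id) => !processed.has(id));
  const lostDownstream = inShopify.filter(
    (id) => processed.has(id) && !delivered.has(id)
  );
  // Step 4: queue every missed order for reprocessing.
  for (const id of [...neverReceived, ...lostDownstream]) await requeue(id);
  // Hourly throughput: orders in the destination / orders on Shopify.
  const deliveredCount = inShopify.filter((id) => delivered.has(id)).length;
  const throughput = inShopify.length ? deliveredCount / inShopify.length : 1;
  return { neverReceived, lostDownstream, throughput };
}
```

Splitting the missed orders into “never received” and “lost downstream” is what tells you whether the problem is in webhook delivery or in your own processing.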

This had the bonus of also pushing new data to Amplitude, which we use for many aspects of our product analytics. We could now report hourly on order throughput per store: the number of orders received in the destination divided by the number of orders processed on Shopify, expressed as a percentage.


The impact: order throughput up from 98 percent to 99.99 percent


Measuring the order throughput showed us how much room there was for improvement. Even after this crisis was resolved we were only processing 98 percent of orders across all customers—still better than a typical client-side tracking setup, but not great.

The lean startup mantra is “build, measure, learn.” Now that we were measuring throughput, we could build a better system and learn from each interruption.

Order throughput is one of the guiding metrics for our engineering team. With every bug and outage, we have been able to improve, iteration after iteration.

For example, an outage with AWS Kinesis in November 2020 took down half the internet, along with Littledata’s AWS services. But with a queue and backup in place, when the surge of webhooks resumed our Lambda functions scaled up effortlessly and not a single event was lost.

Yet, the biggest change has been to our culture. Knowing about reliability means caring about reliability. As Littledata has transitioned from a startup to a scaleup, we’ve gotten more serious about all aspects of the software development process.

“The biggest change has been to our culture. Knowing about reliability means caring about reliability.”

One of our mantras is to make it easier for the engineers who follow after us. That means better code quality, testing, documentation, deployment, and standardization, making it easier for a newbie on the team to make changes without risk of breakage.


The road to five nines

The gold standard for system uptime is five nines: 99.999 percent uptime. This equates to barely five minutes annually where the system is unavailable, versus around 52 minutes a year for a four nines system.
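These downtime budgets follow directly from the number of minutes in a year, and are easy to sanity-check:

```javascript
// Annual downtime budget implied by an availability target.
function downtimeMinutesPerYear(availabilityPercent) {
  const minutesPerYear = 365 * 24 * 60; // 525,600 (ignoring leap years)
  return minutesPerYear * (1 - availabilityPercent / 100);
}
// downtimeMinutesPerYear(99.99)  -> about 52.6 minutes (four nines)
// downtimeMinutesPerYear(99.999) -> about 5.3 minutes (five nines)
```

Each extra nine cuts the budget by a factor of ten, which is why every step up demands qualitatively better tooling, not just more of the same.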

Littledata wants to be a trusted data partner for the larger direct-to-consumer brands, so how could we get to this next level? 

I see a few areas for us to work on.

1. Multi-region redundancy

The 2020 AWS outage only affected the us-east-1 region. If we had redundant services ready in another cloud region, we could have continued some data processing, although I believe the EventBridge feed from Shopify would still have been affected.

2. Less reliance on third parties

Littledata exists as a technology partner to some of tech’s well-known giants. When those giants fail—as they do—we fail too. For example, if Google Analytics stops processing incoming events, our users will still complain, even though we are powerless to fix it. 

As Littledata moves toward becoming an ecommerce data platform in its own right, perhaps pushing data directly into Amazon Redshift or a data warehouse of the customer’s choice, we will have more control over the data in the destination, and can guarantee a greater level of availability.

3. 24/7 timezone coverage for DevOps

Our engineering team is still all located in Europe. While this greatly benefits our collaboration across only two time zones, the downside is that it is hard to respond within an hour in the middle of the European night.

As we scale, we are planning to hire DevOps specialists in other time zones. That way, if something does happen, we can get the systems back up within those critical first 50 minutes.

Learning from failure means learning to succeed

While it may seem contradictory at first, it’s absolutely true that mistakes can be your best teacher. Our team’s response to a single data outage, the lessons we took from it, and the process and infrastructure changes we made in response have helped us build a four nines data pipeline. 

“While it may seem contradictory at first, it’s absolutely true that mistakes can be your best teacher.”

We took stock of each mistake we made, including oversights that helped lead to the outage in the first place. That helped us better measure throughput, build a more effective system, and most importantly, learn from our mistakes.

When things go wrong, it’s a chance to test your process and improve upon it. Your team will gain new skills and knowledge in responding to a crisis, and with the right attitude, you can find a solution to prevent future crises. 

Look at mistakes as learning opportunities and you’ll find the path to constant improvement and continued success. Following that approach helped us not only manage a major data crisis, but innovate both our product and process to reach four nines of uptime for our app.


