Incident on November 22nd, 2022 UTC
Incident Report for Boost AI Search & Discovery
Postmortem

Overview

With more than 13,000 customers, including a large proportion of high-inventory and Shopify Plus stores, we know that BFCM is the most important sales season of the year, and preparing our systems for it has long been an annual operational task. This year was no different: starting November 11th, 2022, the engineering team pre-scaled the whole system to 4x capacity in anticipation of peak BFCM traffic, including the Search Engine Cluster, the Web App API Servers, and the Application Load Balancer. Although total requests had increased by 50% since the beginning of the month, our system handled the spike, and the average response time was reduced by 70% after the scale-up.
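
For illustration, here is a minimal sketch of what pre-scaling a component to 4x could look like, assuming the API servers run in an AWS Auto Scaling group; the group name, region, and sizes below are hypothetical examples, not our actual configuration:

```python
# Hypothetical sketch: bump an Auto Scaling group to 4x its normal capacity
# ahead of BFCM. Group name, region, and numbers are examples only.
import boto3

NORMAL_DESIRED = 4   # assumed everyday capacity
BFCM_FACTOR = 4      # pre-scale factor used for the season

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="web-app-api",          # hypothetical group name
    MinSize=NORMAL_DESIRED * BFCM_FACTOR,        # keep at least 4x running
    DesiredCapacity=NORMAL_DESIRED * BFCM_FACTOR,
    MaxSize=NORMAL_DESIRED * BFCM_FACTOR * 2,    # leave headroom for further spikes
)
```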

Unfortunately, during the peak traffic at the start of the BFCM campaign, we encountered unexpected circumstances. Our approach is to learn from them continuously and prevent similar cases from happening again. To be transparent about what happened, we are publishing this post-mortem so you can understand it better.

What happened and what we have done to solve it

At 19:30 UTC (14:30 EST) on November 22nd, 2022, two of our Search Engine Client Nodes became unexpectedly overloaded, causing a bottleneck in one of our database clusters. This was the first time we had encountered this issue, as the client nodes technically act as data transporters and do not perform any computational tasks.
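
As an illustration of how such an overload surfaces, the sketch below polls per-node CPU usage, assuming an Elasticsearch-style node stats API; the host name and threshold are hypothetical examples rather than our actual setup:

```python
# Hypothetical sketch: list client/coordinating nodes whose CPU is running hot,
# assuming an Elasticsearch-style "_nodes/stats/os" endpoint.
import requests

SEARCH_ENGINE_URL = "http://search-cluster.internal:9200"  # hypothetical host
CPU_ALERT_THRESHOLD = 90  # percent

stats = requests.get(f"{SEARCH_ENGINE_URL}/_nodes/stats/os", timeout=10).json()

for node_id, node in stats["nodes"].items():
    cpu = node["os"]["cpu"]["percent"]
    if cpu >= CPU_ALERT_THRESHOLD:
        print(f"Node {node['name']} ({node_id}) is overloaded: CPU {cpu}%")
```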

Since we run multiple database clusters, the incident affected only the stores served by that one particular cluster. Our engineering and customer success teams worked case by case to troubleshoot and support the affected customers during the incident.

Our engineering team quickly identified the root cause and added replacement servers. It took us 1.5 hours to get the new servers ready to serve traffic and to route requests away from the overloaded servers.
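
As a rough illustration of that traffic move, the sketch below registers replacement servers with a load balancer target group and then drains the overloaded ones, assuming an AWS Application Load Balancer; the ARN and instance IDs are placeholders, not our real resources:

```python
# Hypothetical sketch: shift traffic off overloaded servers and onto replacements
# behind an Application Load Balancer. ARN and instance IDs are placeholders.
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")

TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:...:targetgroup/search-clients/abc123"
OVERLOADED = ["i-0aaa111", "i-0bbb222"]    # example IDs of the two hot nodes
REPLACEMENTS = ["i-0ccc333", "i-0ddd444"]  # example IDs of the new servers

# Put the new servers into rotation first so capacity never drops.
elbv2.register_targets(
    TargetGroupArn=TARGET_GROUP_ARN,
    Targets=[{"Id": i} for i in REPLACEMENTS],
)

# Then drain the overloaded servers; the ALB stops sending them new requests.
elbv2.deregister_targets(
    TargetGroupArn=TARGET_GROUP_ARN,
    Targets=[{"Id": i} for i in OVERLOADED],
)
```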

At 21:00 UTC (16:00 EST) on November 22nd, 2022, the system was back to normal.

Lessons learned

After the incident, our team has taken away lessons about managing infrastructure before, during, and after the sales season, and we immediately optimized our workflow and action plan. In detail, we have:

  • Reviewed the entire infrastructure for all components, especially data server nodes.
  • Kept monitoring the performance of the system to make sure it runs stably at all times.
  • Prepared detailed plans for scaling or transferring servers when they approach full CPU capacity (a minimal monitoring sketch follows this list).
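
For example, a minimal version of that CPU guardrail could be a metric alarm that alerts the on-call engineer before a server saturates, assuming the servers are EC2 instances monitored with CloudWatch; the names, instance ID, and thresholds below are illustrative assumptions:

```python
# Hypothetical sketch: alarm when a server's CPU stays above 80% for 10 minutes,
# so the scaling/transfer plan can be triggered before it hits full capacity.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="search-client-high-cpu",          # illustrative alarm name
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0aaa111"}],  # example instance
    Statistic="Average",
    Period=300,                                  # 5-minute datapoints
    EvaluationPeriods=2,                         # two periods in a row = 10 minutes
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-alerts"],  # example SNS topic
)
```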

Once again, we apologize for letting this issue happen. Please rest assured that we take it seriously and are working hard to improve our service. This post-mortem is our way of being transparent and offering a sincere apology to you, our valued customers. Thank you for the trust you have placed in us.

Posted Nov 30, 2022 - 09:13 UTC

Resolved
Two database client servers suddenly surged to 100% CPU and became unresponsive, creating a bottleneck in one of our database clusters. It took us 1.5 hours to recover the system back to normal. A small portion of stores were impacted.
Posted Nov 22, 2022 - 19:30 UTC