Data Sync is partially down
Incident Report for Boost AI Search & Discovery
Postmortem

Overview

As a leading SaaS company, Boost Commerce always strives to deliver e-commerce solutions that meet both functional requirements and non-functional ones such as security and performance. As a result, we have developed a large, constantly evolving ecosystem that lets us find better ways to serve our customers.

In this process of improvement and innovation, we do encounter unexpected circumstances. We continuously learn from them and work to prevent similar cases from happening again. To be transparent about what happened, we are publishing this post-mortem so you can understand it better.

What happened

We’re developing a major feature that supports multiple data centers in different regions. Once fully functional, it will speed up storefront navigation and search, which should lead to a better conversion rate on webstores. The feature will also reduce the risk of downtime if one data center goes offline.

At 2022-10-04T03:00:00 UTC, during implementation and testing of this feature, a configuration was deployed that unexpectedly overwrote the configuration in the production sync database cluster. A few minutes later, our engineers started receiving error notifications from our sync services, reporting that they could not locate the expected configurations.

What we have done to resolve it

The code was immediately reverted to bring the system back to normal.

The engineers in charge also investigated the impact further and worked out solutions. We realized that a full sync for the impacted stores was necessary to recover from the misconfiguration. This was a long process, so you may have experienced slower sync than usual during that time. We truly apologize for this inconvenience.

Our engineering and customer success teams have been doing their best to troubleshoot case by case. At the same time, we have:

  • kept a close eye on system performance
  • made some optimizations to speed up the sync
  • raised a ticket with Shopify Support for assistance

We have also been working to recover the overwritten product ranking settings for some stores from every source we have, such as backup data, daily logs, analytics tools, and our weekly reports. In addition, we displayed in-app notifications to impacted customers so that they could review their settings and contact us if needed.

What we learned from it

For the deployment practice: at Boost, we always follow a strict testing process and have checklists in place for all releases of our application. Still, we were missing documentation for deploying that particular configuration to the database cluster. We have reviewed our engineering workflows and checklists and added the necessary documents. From now on, we will also review these procedures regularly to make sure nothing else is missing.

For the overwritten product ranking settings: we have improved our logic so that all app settings, including product ranking, are now backed up.
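
As an illustration only, the idea of backing up settings before they are overwritten can be sketched as follows. This is a minimal in-memory example, not Boost's actual implementation; the `SettingsStore` class, its method names, and the `product_ranking` key are hypothetical.

```python
import copy
import time


class SettingsStore:
    """Hypothetical in-memory settings store that snapshots before every write."""

    def __init__(self):
        self._current = {}   # store_id -> settings dict
        self._backups = {}   # store_id -> list of (timestamp, settings dict)

    def save(self, store_id, new_settings):
        # Snapshot the existing settings (if any) before overwriting them.
        if store_id in self._current:
            snapshot = copy.deepcopy(self._current[store_id])
            self._backups.setdefault(store_id, []).append((time.time(), snapshot))
        self._current[store_id] = copy.deepcopy(new_settings)

    def get(self, store_id):
        # Return a copy so callers cannot mutate stored settings by accident.
        return copy.deepcopy(self._current.get(store_id, {}))

    def restore_latest(self, store_id):
        # Roll back to the most recent snapshot; returns False if none exists.
        history = self._backups.get(store_id)
        if not history:
            return False
        _, snapshot = history.pop()
        self._current[store_id] = snapshot
        return True
```

With a store like this, an accidental overwrite of a shop's settings can be undone by calling `restore_latest` with that shop's ID, provided at least one earlier version was saved.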

For the slow sync processes: we already have several tasks in progress to improve different parts of the sync logic. Our goal is to eventually make the sync smarter, so that it only syncs what is needed. This may look easy, but it is challenging because it requires deep knowledge of both e-commerce and software algorithms. We are committed to achieving it.
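
To make "only sync what is needed" concrete, here is a minimal sketch of one common approach: comparing a content fingerprint of each record against the fingerprint stored at the previous sync. It is an assumption for illustration, not a description of Boost's sync service; the function names and the record shapes are hypothetical.

```python
import hashlib
import json


def fingerprint(record):
    # Stable hash of a record: keys are sorted so equal dicts hash equally.
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def plan_incremental_sync(source, synced_fingerprints):
    """Return (ids to upsert, ids to delete).

    source: record_id -> current record in the source system.
    synced_fingerprints: record_id -> fingerprint saved at the previous sync.
    """
    # A record needs syncing if it is new or its content has changed.
    to_upsert = [
        record_id
        for record_id, record in source.items()
        if synced_fingerprints.get(record_id) != fingerprint(record)
    ]
    # A record needs deleting if it was synced before but no longer exists.
    to_delete = [rid for rid in synced_fingerprints if rid not in source]
    return to_upsert, to_delete
```

Records whose fingerprints match are skipped entirely, so the cost of a sync scales with the number of changes rather than with the size of the catalog. A full sync remains the fallback when the stored fingerprints themselves are lost or untrusted.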

Conclusion

Once again, we apologize for the inconvenience. Please rest assured that we take this seriously and are doing our best to improve our service. We are publishing this post-mortem to be transparent about what happened and to offer our sincere apology to you, our customers. Thank you for staying with us.

Posted Oct 07, 2022 - 10:47 UTC

Resolved
This incident has been resolved.
Posted Oct 04, 2022 - 10:43 UTC
Update
Some premium customers are already back to normal. We continue to monitor the situation and are resyncing the rest.
Posted Oct 04, 2022 - 07:40 UTC
Monitoring
The issue is fixed, and we are now backfilling data by triggering a full sync process.
Posted Oct 04, 2022 - 05:39 UTC
Update
We have found the root cause and are fixing it now.
Posted Oct 04, 2022 - 04:36 UTC
Update
We are continuing to investigate this issue.
Posted Oct 04, 2022 - 04:35 UTC
Investigating
We are experiencing partial downtime in our sync processes. Some stores may fail to trigger the sync process. Other systems (Search & Filter, Admin) are not affected.
Posted Oct 04, 2022 - 03:20 UTC
This incident affected: Data Sync.