At Boost Commerce, a leading SaaS company, our engineers always strive for optimal e-commerce solutions, in both product functionality and non-functional requirements such as security and performance. As a result, we’ve developed a large, constantly evolving ecosystem that lets us find better ways to serve our customers.
In this process of improvement and innovation, we sometimes encounter unexpected incidents. We learn from each one and work to prevent similar cases from ever happening again. To be transparent about what happened, we are publishing this post-mortem so you can understand it better.
We’re developing a big feature that supports multiple data centers in different regions. Once fully functional, it will speed up the storefront navigation and search, resulting in a better conversion rate on webstores. The feature will also reduce the risk of downtime if one data center goes offline.
At 2022-10-04T03:00:00 UTC, during implementation and testing of this feature, a configuration was deployed that unexpectedly overwrote the configuration in the production sync database cluster. A few minutes later, our engineers started receiving error notifications from our sync services, reporting that they could not locate the expected configurations.
The code was immediately reverted to bring the system back to normal.
The engineers in charge also investigated the impact further and worked out solutions. We realized that a full sync for the impacted stores was necessary to recover from the misconfiguration. This was a long process, so you may have experienced slower syncs than usual during that time. We truly apologize for this inconvenience.
Our engineering and customer success teams have been doing their best to troubleshoot case by case. At the same time, we have
We also tried to recover the overwritten product ranking settings for some stores from every source available to us, such as backup data, daily logs, analytics tools, and our weekly reports. In addition, we displayed in-app notifications to impacted customers so that they could review their settings and contact us if needed.
For the deployment practice: at Boost, we follow a strict testing process and have checklists in place for all releases of our application. Still, we were missing certain documentation for deploying that particular configuration to the database cluster. We have reviewed our engineering workflows and checklists and added the necessary documents. From now on, we will also review these procedures regularly to make sure nothing else is missing.
For the overwritten product ranking settings: we have improved our logic so that all app settings, including product ranking, are now backed up.
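To make the idea of backing up settings concrete, here is a minimal sketch in Python of a store that snapshots the previous settings before every overwrite. It is an illustration under our own assumptions, not our production code: the `SettingsStore` class, its methods, and the in-memory dictionaries are all hypothetical stand-ins for a real database.

```python
import copy
from datetime import datetime, timezone


class SettingsStore:
    """Hypothetical in-memory settings store that keeps a snapshot on every overwrite."""

    def __init__(self):
        self._current = {}   # store_id -> current settings
        self._history = {}   # store_id -> list of (timestamp, previous settings)

    def write(self, store_id, settings):
        # Snapshot the existing settings before they are replaced.
        previous = self._current.get(store_id)
        if previous is not None:
            snapshot = (datetime.now(timezone.utc), copy.deepcopy(previous))
            self._history.setdefault(store_id, []).append(snapshot)
        self._current[store_id] = copy.deepcopy(settings)

    def read(self, store_id):
        return copy.deepcopy(self._current.get(store_id))

    def restore_previous(self, store_id):
        # Roll back to the most recent snapshot; return False if there is none.
        snapshots = self._history.get(store_id)
        if not snapshots:
            return False
        _, settings = snapshots.pop()
        self._current[store_id] = settings
        return True
```

With a structure like this, an accidental overwrite of a store's product ranking could be undone with a single `restore_previous` call instead of being reconstructed from logs and reports.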
For the slow sync processes: we already have several tasks underway to improve various parts of the sync logic. Our goal is to eventually make the sync smarter, so that it syncs only what is needed. This may look easy, but it is challenging because it requires deep knowledge of both e-commerce and software algorithms. We are committed to achieving it.
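One common way to sync only what is needed is to compare a fingerprint of each record against the fingerprint stored at the last sync. The Python sketch below shows that general technique. It is not a description of our actual sync service: the function names, the record shape, and the choice of SHA-256 over a JSON encoding are assumptions made for illustration.

```python
import hashlib
import json


def fingerprint(record):
    """Stable hash of a record; key order does not affect the result."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def plan_sync(source, synced_fingerprints):
    """Return (ids to upsert, ids to delete) instead of re-syncing everything.

    source: product_id -> current record in the store (hypothetical shape)
    synced_fingerprints: product_id -> fingerprint recorded at the last sync
    """
    to_upsert = [
        product_id
        for product_id, record in source.items()
        if synced_fingerprints.get(product_id) != fingerprint(record)
    ]
    to_delete = [
        product_id for product_id in synced_fingerprints if product_id not in source
    ]
    return sorted(to_upsert), sorted(to_delete)
```

A plan like this touches only new, changed, or removed products. The hard part in practice is deciding what counts as a change for search and ranking purposes, which is where the e-commerce knowledge comes in.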
Once again, we apologize for the inconvenience. Please rest assured that we take this seriously and are doing our best to improve our service. This post-mortem is our way of being transparent and offering a sincere apology to you, our customers. Thank you for staying with us.