On May 13th, at 03:29 UTC, we began routine database upgrades. During the upgrade process, we noticed errors in our logs indicating that certain queries were failing to complete.
What went wrong
After investigating the errors, we found data anomalies in our Production data set that didn't exist in our Staging data set. This difference resulted in slow queries and errors that cascaded into service interruptions for a subset of our customers.
What we're doing to prevent this from happening again
Moving forward, we're taking steps to ensure our Staging data set stays fully in sync with our Production data set. Had the two data sets matched exactly, the anomalies would have been caught in Staging and wouldn't have caused a disruption in service.
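To illustrate the kind of parity check we have in mind, here is a minimal sketch (not our actual tooling). It assumes a PostgreSQL database and the psycopg2 driver; the table names and connection strings are hypothetical placeholders:

    import psycopg2

    # Hypothetical table names; a real check would enumerate them from the schema.
    TABLES = ["accounts", "orders", "events"]

    def row_counts(dsn):
        """Return a table -> row count mapping for one database."""
        counts = {}
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            for table in TABLES:
                # Table names come from the constant list above, so the string
                # interpolation here can't inject untrusted SQL.
                cur.execute(f"SELECT COUNT(*) FROM {table}")
                counts[table] = cur.fetchone()[0]
        return counts

    def check_parity(prod_dsn, staging_dsn):
        """Report any table whose row count differs between the two data sets."""
        prod, staging = row_counts(prod_dsn), row_counts(staging_dsn)
        in_sync = True
        for table in TABLES:
            if prod[table] != staging[table]:
                print(f"DRIFT in {table}: prod={prod[table]} staging={staging[table]}")
                in_sync = False
        return in_sync

    if __name__ == "__main__":
        # Placeholder connection strings.
        ok = check_parity("dbname=app_prod", "dbname=app_staging")
        raise SystemExit(0 if ok else 1)

Row counts alone wouldn't catch every anomaly, so a real pipeline would pair a check like this with a scheduled refresh of Staging from a recent Production snapshot.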
The incident was resolved at 11:49 UTC.