IronWorker degraded performance
Incident Report for Iron.io
Postmortem

The team here prides itself on providing stable services to our customers, and when things go wrong, we take it seriously. On behalf of myself and the entire team, I want to apologize for yesterday's service disruption. Some details about the incident are as follows:

Overview

On February 20th, at around 4:00 pm PST, we noticed elevated CPU usage on our primary MongoDB instance. We immediately contacted our database vendor, mLab, who joined us in chat within minutes to help diagnose the issue. Many customers experienced a significant slow-down in task processing, and some customers' tasks were not processed at all.

What went wrong

We had a significant increase in the number of tasks coming through our system, and our system is designed to scale up in such cases. However, the run time of one query began to climb, which drove CPU on our primary database up rapidly. This caused task processing to slow down, and in some cases, tasks from certain projects weren't being processed at all.

We eventually traced this query to an account setting that sets a maximum task limit for a given account. Since some of our customers process hundreds of millions of tasks a day and have complex deployments, this setting is often set to a very high number. When this setting is enabled, however, it causes an extra collection count query to fire for each task. This influx of queries was the culprit and left our primary database's CPU pegged.
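For clarity, here is a minimal Go sketch of the pattern that caused the problem. It is not our actual code: the driver, collection, and field names (tasks, project_id, and so on) are illustrative assumptions. The point is that the limit check issues a collection count query on every enqueue, so query volume grows linearly with task throughput.

```go
// Hypothetical sketch of the per-task limit check that produced the
// N+1 count pattern. Collection and field names are illustrative,
// not the actual IronWorker schema.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

// enqueueTask checks a project's task count against its configured limit
// before accepting a new task. Because the check runs once per enqueued
// task, every incoming task adds a collection count query on the primary.
func enqueueTask(ctx context.Context, tasks *mongo.Collection, projectID string, maxTasks int64) error {
	// This count is the expensive part: under a task surge it is issued
	// once per task, pegging CPU on the primary database.
	current, err := tasks.CountDocuments(ctx, bson.M{"project_id": projectID})
	if err != nil {
		return err
	}
	if maxTasks > 0 && current >= maxTasks {
		return fmt.Errorf("project %s is at its task limit (%d)", projectID, maxTasks)
	}
	_, err = tasks.InsertOne(ctx, bson.M{
		"project_id": projectID,
		"status":     "queued",
		"created_at": time.Now(),
	})
	return err
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(ctx)

	tasks := client.Database("ironworker").Collection("tasks")
	if err := enqueueTask(ctx, tasks, "example-project", 1_000_000); err != nil {
		log.Println(err)
	}
}
```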

What we're doing to prevent this from happening again

  • A frontend cache is being implemented to prevent an N+1 query problem with this collection count query. This will prevent resource starvation and a possible thundering herd scenario (a rough sketch of this approach appears after this list).

  • A higher-level account flag is being added to remove the need for this collection count query. This will reduce query volume and yield a platform-wide performance benefit.

  • We're adding more permutations to our test suite to cover this case as well as other possible resource starvation scenarios.
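To make the caching item above more concrete, the sketch below shows one way a short-TTL, per-project count cache could collapse the N+1 pattern into at most one count query per project per window. It is an assumption-laden illustration (the package, field names, and TTL are hypothetical), not our production implementation.

```go
// Hypothetical sketch of the kind of frontend cache described above:
// the collection count is memoized per project with a short TTL so a
// burst of task submissions issues at most one count query per window.
package taskcache

import (
	"context"
	"sync"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
)

type countEntry struct {
	value     int64
	expiresAt time.Time
}

type CountCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]countEntry
}

func NewCountCache(ttl time.Duration) *CountCache {
	return &CountCache{ttl: ttl, entries: make(map[string]countEntry)}
}

// TaskCount returns a cached count while it is still fresh and only falls
// back to the database when the entry has expired, so the per-task limit
// check no longer issues a count query for every enqueue.
func (c *CountCache) TaskCount(ctx context.Context, tasks *mongo.Collection, projectID string) (int64, error) {
	c.mu.Lock()
	entry, ok := c.entries[projectID]
	c.mu.Unlock()
	if ok && time.Now().Before(entry.expiresAt) {
		return entry.value, nil
	}

	n, err := tasks.CountDocuments(ctx, bson.M{"project_id": projectID})
	if err != nil {
		return 0, err
	}

	c.mu.Lock()
	c.entries[projectID] = countEntry{value: n, expiresAt: time.Now().Add(c.ttl)}
	c.mu.Unlock()
	return n, nil
}
```

In practice, an expired entry would also need a singleflight-style guard so that many concurrent requests don't all re-issue the count at once, which is the thundering-herd concern noted above.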

Resolution time

The incident was resolved at approximately 11:00 pm PST.

Posted Feb 21, 2018 - 12:04 PST

Resolved
This incident has been resolved.
Posted Feb 21, 2018 - 09:13 PST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Feb 20, 2018 - 21:13 PST
Investigating
We are currently investigating this issue.
Posted Feb 20, 2018 - 18:03 PST