DNS issue with an upstream provider
Incident Report for Iron.io
Postmortem

Overview

On August 6th, at 15:07 UTC, we noticed connectivity issues across our network. These connectivity issues caused IronMQ to degrade into an unhealthy state, rendering the service unusable.

What went wrong

At 12:49 AM PDT, the vendor we rely on for DNS (AWS Route 53) experienced issues. In-network connectivity was broken, and many components of our network were unable to communicate with each other. When the vendor issue was resolved at 1:04 AM PDT, the problem persisted within our network due to DNS caching and TTL behavior.
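For illustration only, the Go sketch below (not our production code) shows the kind of in-process lookup cache that can produce this behavior: if failed lookups are cached alongside successes, callers keep seeing the cached failure after the upstream resolver has recovered, until the entry expires. The cache type, TTL value, and hostname are hypothetical.

    // naivecache.go: a deliberately simplified illustration of stale DNS results.
    package main

    import (
        "context"
        "fmt"
        "net"
        "sync"
        "time"
    )

    // cachedResult holds one lookup outcome, including failures.
    type cachedResult struct {
        addrs   []string
        err     error
        expires time.Time
    }

    // naiveCache caches every lookup result, successes and errors alike,
    // for a fixed TTL. Caching the errors is the pitfall: after the
    // upstream resolver recovers, callers keep seeing the cached failure
    // until the entry expires.
    type naiveCache struct {
        mu  sync.Mutex
        ttl time.Duration
        m   map[string]cachedResult
    }

    func newNaiveCache(ttl time.Duration) *naiveCache {
        return &naiveCache{ttl: ttl, m: make(map[string]cachedResult)}
    }

    func (c *naiveCache) LookupHost(ctx context.Context, host string) ([]string, error) {
        c.mu.Lock()
        if r, ok := c.m[host]; ok && time.Now().Before(r.expires) {
            c.mu.Unlock()
            return r.addrs, r.err // may return a stale error long after the outage ends
        }
        c.mu.Unlock()

        addrs, err := net.DefaultResolver.LookupHost(ctx, host)

        c.mu.Lock()
        c.m[host] = cachedResult{addrs: addrs, err: err, expires: time.Now().Add(c.ttl)}
        c.mu.Unlock()
        return addrs, err
    }

    func main() {
        cache := newNaiveCache(5 * time.Minute) // hypothetical TTL
        addrs, err := cache.LookupHost(context.Background(), "example.com")
        fmt.Println(addrs, err)
    }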

What we're doing to prevent this from happening again

  • We identified the places within our network that could have caused this issue and reviewed their caching strategies and TTL values. Several cache durations were too aggressive, and we’ve increased the timeouts where necessary. We’re testing various failure scenarios within our staging network to validate these timeout values.
  • We’re currently discussing backup DNS strategies as a team (one possible approach is sketched after this list) and will be posting updates on our blog about our strategy moving forward and our continued progress.
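As a rough illustration of what a backup DNS strategy can look like, the Go sketch below tries a primary resolver first and only falls back to a secondary nameserver when the primary errors out or times out. The server addresses, timeout, and function names are placeholders, not our actual configuration.

    // fallbackresolver.go: a sketch of one possible backup-DNS approach.
    package main

    import (
        "context"
        "fmt"
        "net"
        "time"
    )

    // resolverFor returns a net.Resolver pinned to a single DNS server.
    func resolverFor(server string) *net.Resolver {
        return &net.Resolver{
            PreferGo: true,
            Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
                d := net.Dialer{Timeout: 2 * time.Second}
                return d.DialContext(ctx, network, server)
            },
        }
    }

    // lookupWithFallback tries the primary resolver and falls back to the
    // secondary when the primary fails or times out.
    func lookupWithFallback(ctx context.Context, host string) ([]string, error) {
        primary := resolverFor("10.0.0.2:53")  // placeholder: primary resolver
        secondary := resolverFor("1.1.1.1:53") // placeholder: independent backup

        cctx, cancel := context.WithTimeout(ctx, 2*time.Second)
        defer cancel()
        if addrs, err := primary.LookupHost(cctx, host); err == nil {
            return addrs, nil
        }
        return secondary.LookupHost(ctx, host)
    }

    func main() {
        addrs, err := lookupWithFallback(context.Background(), "example.com")
        fmt.Println(addrs, err)
    }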

Resolution time

The incident was resolved at 16:04 UTC.

Posted Aug 07, 2018 - 18:00 PDT

Resolved
This incident has been resolved.
Posted Aug 06, 2018 - 10:23 PDT
Update
We are continuing to investigate this issue.
Posted Aug 06, 2018 - 09:15 PDT
Investigating
We are currently investigating this issue.
Posted Aug 06, 2018 - 07:14 PDT
This incident affected: IronMQ v3 (AWS US-East), IronWorker Dedicated, and IronWorker Public.