On August 6th at 15:07 UTC, we noticed connectivity issues across our network. These issues caused IronMQ to degrade into an unhealthy state, rendering the service unusable.
What went wrong
At 12:49 AM PDT, the vendor we rely on for DNS (AWS Route 53) experienced an outage. In-network connectivity was broken, and many components of our network were unable to communicate with each other. When the vendor issue was resolved at 1:04 AM PDT, the problem persisted within our network because stale DNS records remained cached until their TTLs expired.
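The caching behavior described above can be sketched with a minimal TTL cache: once a resolver has cached an answer, it keeps serving that answer until the record's TTL expires, even after the upstream source has recovered. This is an illustrative sketch, not our actual resolver code; the class name, the `lookup` callback, and the hostname used below are hypothetical.

```python
import time

class TtlDnsCache:
    """Minimal sketch of resolver-side caching: a cached record stays
    valid until its TTL expires, even if the upstream answer changes."""

    def __init__(self):
        self._cache = {}  # name -> (address, expires_at)

    def resolve(self, name, lookup, ttl_seconds):
        entry = self._cache.get(name)
        now = time.monotonic()
        if entry and now < entry[1]:
            # Cached (possibly stale) answer is served until the TTL lapses.
            return entry[0]
        address = lookup(name)  # hypothetical upstream lookup callback
        self._cache[name] = (address, now + ttl_seconds)
        return address
```

The consequence is that recovery of the upstream DNS provider does not immediately fix clients: each resolver keeps returning whatever it cached during the outage until the TTL on that record runs out.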
The incident was fully resolved at 16:04 UTC.
What we're doing to prevent this from happening again