The system has now almost fully caught up. We are continuing to scan for any residual jobs that may not have run; all jobs should have run already or be queued to run shortly.
Thank you for your patience as AWS recovered their core services.
We will be evaluating options for running core IaaS outside of AWS.
Posted Feb 28, 2017 - 15:16 PST
Monitoring
Job processing is almost fully up to speed again. It may take a while to get through the backlog of jobs.
Posted Feb 28, 2017 - 13:54 PST
Update
We are now seeing recovery of IronWorker and are working through the backlog of jobs.
Posted Feb 28, 2017 - 13:32 PST
Update
We are seeing jobs go through again. None should be lost, but jobs have been queuing up since the issues started this morning.
Posted Feb 28, 2017 - 12:59 PST
Update
Update from AWS below. We are working quickly to restore our own services as well:
Update at 12:52 PM PST: We are seeing recovery for S3 object retrievals, listing and deletions. We continue to work on recovery for adding new objects to S3 and expect to start seeing improved error rates within the hour.
Posted Feb 28, 2017 - 12:58 PST
Update
Unfortunately, the issue has now cascaded to over 45 AWS services, causing unrecoverable issues upstream. At this point we have to wait on AWS; afterward we will begin a full multi-cloud initiative.
Posted Feb 28, 2017 - 12:48 PST
Update
We are considering bypassing S3, but even then, Docker Hub is down, which would block any upstream updating of code packages, as they are all built with Docker.