BlastIQ major outage
Incident Report for BlastIQ
Postmortem

We sincerely apologise for the major incident that impacted BlastIQ customers globally on Monday morning AEST.

Consequences

During the incident an expired SSL certificate prevented users, integrated systems and field devices from logging in to BlastIQ systems.

Background

BlastIQ uses Microsoft’s Azure Front Door (AFD) service to provide reverse proxy services for BlastIQ systems and to accelerate customer access to BlastIQ. When connecting to BlastIQ, a user is automatically routed to the nearest AFD edge node, which then routes traffic privately to BlastIQ using Azure’s global fibre network.

BlastIQ uses Azure’s managed SSL service to automatically issue and update the security certificates used by BlastIQ. After a new certificate is issued, Azure automatically deploys the new certificate to all AFD edge nodes.

Failure

The BlastIQ team have worked with engineers at Microsoft who investigated the cause of the issue. Microsoft have identified that the SSL management automation correctly issued a new certificate to replace the expiring certificate. However, a bug in their deployment tools prevented the new certificate from being deployed to the AFD edge nodes.

When BlastIQ engineers informed Microsoft of the issue impacting customers, Microsoft engineers performed a manual update of the certificate on the AFD edge nodes. However, approximately 10% of AFD edge nodes were not updated successfully, resulting in intermittent failures for some users. Microsoft engineers then performed additional checks to manually identify the AFD edge nodes which still had an expired certificate, and updated them.

Corrective Actions

Microsoft have identified the bug in their automated certificate deployment tooling and will correct it to prevent recurrence for any Azure customers.

BlastIQ will continue to use Azure’s automated management tools and global network services to provide fast, reliable service for BlastIQ customers globally. We will introduce additional monitoring checks so that, where possible, we detect when automation has failed to perform its functions correctly.
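
As an illustration only, a monitoring check of this kind could look something like the Python sketch below. It is not the check BlastIQ has deployed. The function names, the 14-day warning threshold and the hostname in the usage note are all assumptions made for the example. A single probe only reaches the AFD edge node nearest to where it runs, so a check like this would need to run from several regions to notice a partially deployed certificate.

```python
import socket
import ssl
from datetime import datetime, timezone

# Hypothetical threshold: warn when fewer than this many days remain.
WARN_DAYS = 14


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the days between `now` and a certificate's notAfter value.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    for example "May 11 00:00:00 2020 GMT". A negative result means the
    certificate has already expired.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Connect to host:port over TLS and return the served certificate's notAfter."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def check(host: str) -> str:
    """Return a one-line status for the certificate served by `host`."""
    try:
        not_after = fetch_not_after(host)
    except ssl.SSLCertVerificationError as exc:
        # An expired certificate fails verification during the handshake,
        # so it is reported here rather than by the date comparison below.
        return f"CRITICAL: {host}: certificate rejected ({exc.verify_message})"
    remaining = days_until_expiry(not_after, datetime.now(timezone.utc))
    if remaining < WARN_DAYS:
        return f"WARNING: {host}: certificate expires in {remaining:.1f} days"
    return f"OK: {host}: certificate valid for {remaining:.1f} more days"
```

Calling `check("example.com")` (a placeholder hostname) on a schedule and alerting on any result that does not start with `OK` would flag a certificate that has been renewed but not deployed while there is still time to act.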

Summary

We sincerely apologise to our customers who experienced interruption and disruption in their work due to this outage. We have investigated this issue and are confident that the root cause is being addressed.

We would like to thank the engineers at Microsoft who worked with us to resolve the issue and restore service.

Posted May 15, 2020 - 11:21 AEST

Resolved
This incident has been resolved.
Posted May 11, 2020 - 14:27 AEST
Monitoring
The BlastIQ team have worked with Microsoft engineers to resolve this issue and all BlastIQ systems appear to be operating normally.

We will now investigate the incident to ensure we understand the cause of the failure that has occurred in our infrastructure automation systems, and make improvements to ensure the failure does not occur in the future.

We sincerely apologise for the impact of this outage on your work.
Posted May 11, 2020 - 12:21 AEST
Update
We are continuing to work with Microsoft engineers as they resolve this issue.

We apologise for the inconvenience this is causing to our customers and will update you as soon as our systems are operational.
Posted May 11, 2020 - 10:39 AEST
Identified
The BlastIQ team are currently working to resolve a networking issue that is impacting all customers globally.

We will provide updates as soon as we have an estimated time to resolve.
Posted May 11, 2020 - 07:17 AEST
This incident affected: User Sign In, Public API, BlastIQ Mobile, BlastIQ Insights, BlastIQ Insights Quarries USA, FRAGTrack, Administration Portal, and SHOTPlus.