CloudBees Rollout service incident

Incident Report for CloudBees

Resolved

All CloudBees Rollout (Feature Flags) services are now fully operational. We have not identified any data loss or security impact from this outage.

An outage post-mortem and corrective actions will be performed in due course.

Thank you for your patience.

Posted Jul 16, 2020 - 01:39 UTC

Update

The Rollout core service (API/login/web) outage has been resolved and these services are now fully operational.

However, Impression Analytics is not currently available. The engineering team is working to resolve this issue.

We will continue to provide service updates on the status of Impression Analytics until the issue is resolved.

Posted Jul 16, 2020 - 01:18 UTC

Monitoring

Apart from Impression Analytics, which is currently not working, the service is operational again. We're monitoring the situation.

Posted Jul 16, 2020 - 00:40 UTC

Update

Our engineering team has successfully restored the database, but the service is still not 100% operational.

More updates to follow.

Posted Jul 16, 2020 - 00:05 UTC

Update

IBM Compose update: "Virtual networking is up across all hosts in the cluster and the situation appears to be stable. We are slowly starting data/member capsules. Once those are up, we will start portals which will restore customer access."

In parallel, CloudBees engineering teams are now working to restore the database service to our own infrastructure, with a view to failing over if Compose is not able to restore access in a timely manner.

Posted Jul 15, 2020 - 22:42 UTC

Update

We've received this message from our service provider: “At this point we are cautiously optimistic. Our engineers are close to having virtual networking up across all hosts in the cluster. So far so good. Once stable we will start bringing capsules back up.”

More updates to follow.

Posted Jul 15, 2020 - 21:07 UTC

Update

We are monitoring https://status.compose.com/ for further updates.

Posted Jul 15, 2020 - 16:02 UTC

Update

Our service provider has announced that it is going to take longer to recover the system. The current rough estimate for recovery is 5 to 9 hours.

Posted Jul 15, 2020 - 14:08 UTC

Update

Our service provider has updated us about the situation and the rough estimate for recovery is 2 to 3 hours.

Posted Jul 15, 2020 - 13:11 UTC

Update

We have confirmed the database issue with our service provider and they are working to restore service.

The service impact is that experiments can’t be updated, but existing flags are unaffected.

Posted Jul 15, 2020 - 12:38 UTC

Identified

There is an outage with a third-party service. We're contacting the provider to find out what the situation is.

Posted Jul 15, 2020 - 11:41 UTC