RCA P1 Incident 2025-07-09

Incident Summary

Product: CERRIX Platform

Date: 2025-07-09

Reference ID: RCA-20250709-P1-ReleaseDowntime

Incident Timeline

| Timestamp | Event |
| --- | --- |
| 2025-07-08 22:00 | Release window opens for sprint 46 and sprint 23 on all production environments |
| 2025-07-08 22:13 | Downtime starts for the first customer environment |
| 2025-07-08 23:59 | All customer environments unavailable |
| 2025-07-09 07:20 | Production environments detected as unavailable |
| 2025-07-09 07:30 | Release requeued for all production environments |
| 2025-07-09 08:00 | Customer environments start becoming available again, one by one |
| 2025-07-09 08:45 | Mail sent to all customers informing them of the downtime |
| 2025-07-09 09:37 | All production environments up and running, except for the incidents module |
| 2025-07-09 11:00 | All production environments up and running, including the incidents module |

Root Cause

The incident was caused by a configuration error related to deployment permissions. As a safety measure, the environments were automatically shut down to prevent further issues and required manual intervention to restore service.

Detection Gaps

Although the monitoring and alerting systems functioned as designed, the responsible team members did not notice the alerts at the time of deployment.

Reason the Issue Was Not Identified Before Release

The incident was not related to the code changes deployed, but rather to the deployment configuration itself. As a result, it was not identified during pre-release testing or acceptance testing.

Impact Analysis

Customers experienced 8.5 to 11 hours of downtime across all their production environments, mostly outside business hours. The environments were not accessible; they displayed an error page instead, which made it impossible for customers to log in.

The release concerned the production environments only; all test and acceptance environments remained up and running without any issues.
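As a rough cross-check of the downtime range, the bounds can be estimated from the timestamps in the Incident Timeline. This is a minimal sketch, not the method used to produce the reported figure: the timeline does not list per-environment times, so the variable names and the assumption about which events bound the shortest and longest outage are illustrative only.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Timestamps copied from the Incident Timeline.
first_env_down = datetime.strptime("2025-07-08 22:13", FMT)
all_envs_down = datetime.strptime("2025-07-08 23:59", FMT)
first_env_back = datetime.strptime("2025-07-09 08:00", FMT)
# All environments back, excluding the incidents module.
all_envs_back = datetime.strptime("2025-07-09 09:37", FMT)


def hours_between(start, end):
    """Return the elapsed time between two datetimes, in hours."""
    return (end - start).total_seconds() / 3600


# Assumed lower bound: the last environment to go down
# is also the first one to recover.
shortest = hours_between(all_envs_down, first_env_back)

# Assumed upper bound: the first environment to go down
# is the last one to recover.
longest = hours_between(first_env_down, all_envs_back)

print(f"Shortest possible outage: {shortest:.1f} h")
print(f"Longest possible outage: {longest:.1f} h")
```

These bounds are only approximate, because the 08:00 entry marks when environments started to come back rather than an exact recovery time for any single environment.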

Resolution & Recovery Steps

Short-term fixes implemented

A new deployment was executed on PRD for all environments, with the correct configuration.

Confirmation of stability

To make sure the environments were back to normal, stability was verified through confirmation from customers, monitoring of the environments, and health checks and smoke tests on internal production environments.

Lessons Learned

What worked well

The Product Team and Consultancy handled the communication with great care. The time between detection and resolution was short.

What didn’t work well

One of the main issues highlighted by this incident is that, at the time it occurred, the alerting did not notify a team member when manual intervention was required to complete the update.

Preventive & Corrective Actions

To prevent similar incidents in the future and further strengthen our service reliability, we are taking the following actions:

| Action | Due date |
| --- | --- |
| We will review and improve our release configurations to prevent this specific issue from occurring again in future deployments. | 2025-07-14 |
| We are enhancing our release tooling to allow for more streamlined and reliable requeuing of releases, ensuring smoother and faster resolution in case of unexpected issues. | 2025-08-04 |
| We are providing further training and knowledge sharing to ensure that senior team members are fully up to speed with all aspects of the release process. | 2025-07-14 |
| We are revising our release planning and support structure to ensure that technical support is always available during release windows, including outside regular business hours when needed. | 2025-08-04 (before the next PRD release) |

We remain fully committed to continuously improving our processes and minimizing any potential impact on our customers. Thank you for your understanding and trust.

If you have any questions about this Root Cause Analysis, please get in touch with us.
