Root Cause Analysis for Riva Cloud Insight Outage on May 29, 2024
Summary:
An issue with the Riva Cloud Insight web servers caused users to receive 504 errors on May 29, 2024. The outage was intermittent throughout the day and was resolved by migrating to new web servers.
Timeline:
- 10:00 AM EST: The team identified a 504 error impacting Riva Cloud Insight.
- 10:20 AM EST: An investigation into the cause of the error began.
- 12:29 PM EST: The issue was identified as originating from the web servers and was causing intermittent outages.
- 2:56 PM EST: Emergency updates were deployed to the web servers in an attempt to restore service.
- 5:57 PM EST: The move to new web servers was completed, resolving the issue.
Root Cause:
The root cause of the outage is suspected to be a failed Windows Update on the Riva Cloud Insight web server. This failed update likely caused a process or service to malfunction, resulting in high CPU usage. The high CPU usage overloaded the server and caused the 504 error, preventing users from accessing the service.
Analysis:
- The reboot of the web server did not resolve the issue, indicating the problem likely originated from the operating system or a service running on it.
- Monitoring tools in AWS identified high CPU usage on the web server, suggesting a resource constraint within the server itself.
- The successful migration to new web servers points towards an issue with the original server's configuration or software state.
- Further investigation in AWS found that a failed Windows Update may have prevented the instance from booting successfully.
Improvements Implemented:
- Improved health checks: To prevent similar outages in the future, the team has already implemented more comprehensive health checks for the Riva Cloud Insight web servers. These checks monitor resource utilization, service health, and application functionality, allowing for earlier detection of issues and facilitating faster failover procedures.
By having these improved health checks in place, the team is better equipped to proactively identify potential problems before they disrupt service.
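As an illustration of how a check covering all three signals might drive a failover decision, here is a minimal sketch. It is not the team's actual implementation: the names HealthSample, is_sample_healthy, and should_fail_over, the 90% CPU limit, and the three-consecutive-failures rule are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HealthSample:
    """One probe of a web server (hypothetical structure for illustration)."""
    cpu_percent: float      # resource utilization
    service_running: bool   # service health, e.g. the web server process is up
    http_status: int        # application functionality, status of a probe request


def is_sample_healthy(sample: HealthSample, cpu_limit: float = 90.0) -> bool:
    """A sample is healthy only if all three signals look normal."""
    return (
        sample.cpu_percent < cpu_limit
        and sample.service_running
        and 200 <= sample.http_status < 400
    )


def should_fail_over(
    samples: List[HealthSample],
    max_consecutive_failures: int = 3,
    cpu_limit: float = 90.0,
) -> bool:
    """Return True once enough unhealthy samples occur in a row.

    Requiring consecutive failures (an assumed policy) avoids failing over
    on a single transient spike.
    """
    consecutive = 0
    for sample in samples:
        if is_sample_healthy(sample, cpu_limit):
            consecutive = 0
        else:
            consecutive += 1
            if consecutive >= max_consecutive_failures:
                return True
    return False
```

In this sketch, a server that returns 504 with high CPU on three probes in a row would be flagged for failover, while an isolated failed probe would not.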
May 29, 2024, 5:57 PM EST
The move to the new web servers has been completed and confirmed to be working. This has resolved the issue that was experienced throughout the day when connecting to the Riva Cloud Insight service.
May 29, 2024, 2:56 PM EST
To restore service, the team is in the process of making emergency updates to the web servers for the US data center. An update will be posted when this is completed and tested.
May 29, 2024, 12:29 PM EST
The 504 issues are continuing intermittently. The team is working to isolate the root cause and restore functionality.
Note that this does not impact Riva Cloud Sync.
May 29, 2024, 10:20 AM EST
At 10:00 AM EST this morning, the team identified an issue with Riva Cloud Insight that is causing a 504 error. We are investigating the problem and will provide an update in the next hour.