Root Cause Analysis for the Microsoft Graph outage reported on July 22, 2024
Summary
An expired client secret within our system caused an interruption in the connection to Microsoft Graph, resulting in a service outage for all customers using Microsoft Graph as their email connection. This led to synchronization failures for affected users.
Timeline
- July 21, 2024, 8:56 AM EDT: The issue began with users experiencing errors related to expired certificates.
- July 22, 2024, 9:53 AM EDT: Riva notified customers of the critical issue affecting all Microsoft Graph clients.
- July 22, 2024, 2:25 PM EDT: Riva identified and resolved the issue for US pods, restoring synchronization for affected clients.
- July 22, 2024, 5:40 PM EDT: The issue was resolved for all affected Canada (CA) pods. No accounts on the Europe (EU) and Australia (AU) pods were affected.
Root Cause
An expired client secret prevented Riva from successfully authenticating with the Microsoft Graph service, disrupting data synchronization.
Analysis
The outage underscores the importance of credential management in critical infrastructure. When a credential expires, authentication fails, and every customer whose synchronization depends on that connection is affected. The Riva team restored synchronization for the US pods first, then for the remaining affected pods later the same day.
Improvements Implemented
To prevent a recurrence, Riva has transitioned from using client secrets to certificate-based connections for Microsoft Graph authentication. This change allows for more efficient monitoring and management of credentials. Additionally, a rigorous certificate rotation process has been implemented to ensure the ongoing security and reliability of the service.
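As an illustration of the kind of expiry monitoring described above, the following is a minimal Python sketch, not Riva's actual implementation. It assumes a hypothetical inventory of credentials with known expiry dates and flags any that have expired or will expire within a warning window. The names `Credential`, `credentials_needing_rotation`, and `warn_days`, along with the sample dates, are assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Credential:
    """Hypothetical inventory record for a secret or certificate."""
    name: str
    expires_at: datetime  # timezone-aware expiry timestamp


def days_until_expiry(cred: Credential, now: datetime) -> float:
    """Days remaining before the credential expires (negative if already expired)."""
    return (cred.expires_at - now) / timedelta(days=1)


def credentials_needing_rotation(creds, now, warn_days=30):
    """Return (credential, days_left) pairs that are expired or expire
    within warn_days, sorted with the most urgent first."""
    with_days = [(c, days_until_expiry(c, now)) for c in creds]
    due = [(c, d) for c, d in with_days if d <= warn_days]
    return sorted(due, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Illustrative data only; these names and dates are made up.
    now = datetime(2024, 7, 1, tzinfo=timezone.utc)
    inventory = [
        Credential("graph-auth-cert", datetime(2024, 7, 21, tzinfo=timezone.utc)),
        Credential("other-service-cert", datetime(2025, 7, 1, tzinfo=timezone.utc)),
    ]
    for cred, days_left in credentials_needing_rotation(inventory, now):
        status = "EXPIRED" if days_left < 0 else "expiring soon"
        print(f"{cred.name}: {status} ({days_left:.0f} days left)")
```

A check like this would typically run on a schedule and feed an alerting channel, so that rotation starts well before the expiry date rather than after authentication begins to fail.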
Status Updates
- 2024-07-22 5:40 PM EDT: The issue has been resolved for all affected CA pods. Note that there were no affected accounts on our EU and AU pods.
- 2024-07-22 2:25 PM EDT: Our Engineering team has identified and resolved the issue for all affected US pods. We have confirmed that synchronization has resumed for all affected clients.
- 2024-07-22 9:53 AM EDT: This message is to inform you of a critical issue currently affecting all graph clients. The error has been identified and is impacting the performance and functionality of the graph services.
Issue Details:
- Date/Time of Occurrence: 2024-07-21 8:56 AM EDT
- Description: Users are receiving errors in regard to expired certificates.
- Affected Clients: All customers with Microsoft Graph set as their email connection.
- Impact: Affected users are failing to synchronize.
Our Response: Our technical team is actively investigating the issue to identify the root cause and implement a resolution as quickly as possible. We are working diligently to minimize the disruption and restore normal service.
Next Steps: We will provide regular updates on our progress and any significant developments. In the meantime, please monitor your services and report any additional issues to our support team at Support@RivaEngine.com