Issue summary:
On November 14, 2024, Splash experienced a major service degradation that caused intermittent disruptions to touchpoint loading and other service functionality. The issue affected the logged-in experience, event RSVPs, and Virtual Event Page check-in.
Issue timeframe:
November 14, 2024, 12:03 PM EST to 12:33 PM EST (30 minutes)
Sequence of events:
- November 14, 2024 12:03 PM EST: Internal alerts triggered; investigation began.
- November 14, 2024 12:13 PM EST: First customer-reported issues received by Support.
- November 14, 2024 12:25 PM EST: System impact and root cause identified; recovery efforts initiated by Splash Engineering.
- November 14, 2024 12:33 PM EST: Services stabilized; monitoring continued to ensure resolution.
- November 14, 2024 1:02 PM EST: Incident marked as resolved.
Root cause:
Splash encountered a traffic spike well above normal levels, which resulted in increased API errors. The autoscaling mechanisms in place, which are designed to adjust resources based on demand, did not perform as anticipated.
Steps to prevent recurrence:
- Splash to adjust internal processes to flag high-traffic events in advance, so that systems can be prepared for potential surges.
- Splash to implement, QA, and adjust the autoscaling policy to manage traffic dynamically.
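
For readers unfamiliar with demand-based autoscaling, the general idea can be sketched as a target-tracking rule: capacity is scaled in proportion to how far the observed load is from a target, then clamped to configured limits. This is an illustration only and does not describe Splash's actual policy; the function name, parameters, and limits below are hypothetical.

```python
import math


def desired_replicas(current_replicas, observed_load, target_load,
                     min_replicas=1, max_replicas=100):
    """Hypothetical target-tracking rule (not Splash's actual policy).

    Scales the replica count in proportion to observed_load / target_load,
    rounds up, and clamps the result to [min_replicas, max_replicas].
    """
    if current_replicas < 1:
        raise ValueError("current_replicas must be at least 1")
    if target_load <= 0:
        raise ValueError("target_load must be positive")
    raw = math.ceil(current_replicas * observed_load / target_load)
    return max(min_replicas, min(max_replicas, raw))


# Load per replica is 50% above target, so capacity grows by 50%.
print(desired_replicas(4, observed_load=90.0, target_load=60.0))
# A large spike is capped by max_replicas; a cap set too low is one way
# a policy can fail to keep up with a surge.
print(desired_replicas(4, observed_load=300.0, target_load=60.0, max_replicas=10))
```

A rule like this only reacts after load has risen, which is why flagging high-traffic events in advance complements it.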