Issue summary:
The platform experienced an unexpected surge in traffic that significantly exceeded normal levels, resulting in slower-than-usual response times across all Splash services.
Issue timeframe:
April 3, 2024, 12:51 PM EST to 1:57 PM EST (1 hour 6 minutes)
Sequence of events:
- April 3rd, 12:51 PM EST - First internal alert received; investigation started.
- April 3rd, 12:56 PM EST - System impact identified; root cause traced to an abnormally high volume of incoming traffic and the resulting API calls.
- April 3rd, 1:13 PM EST - Mitigation steps taken to reduce impact on the platform.
- April 3rd, 1:25 PM EST - Database resources manually scaled (an illustrative sketch follows this timeline).
- April 3rd, 1:30 PM EST - Performance returned to 100%; the incident was moved to Monitoring status.
- April 3rd, 1:57 PM EST - Incident moved to Resolved status.
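For illustration only: Splash has not published details of its database infrastructure, so the following is a minimal sketch of the kind of manual scale-up performed at 1:25 PM, assuming an AWS RDS instance managed via the boto3 SDK. The instance identifier and target class are hypothetical, not Splash's actual values.

    import boto3

    # Minimal sketch of a manual database scale-up (all values hypothetical).
    rds = boto3.client("rds")
    rds.modify_db_instance(
        DBInstanceIdentifier="splash-prod-db",  # hypothetical instance name
        DBInstanceClass="db.r6g.4xlarge",       # hypothetical larger instance class
        ApplyImmediately=True,                  # apply now, not at the next maintenance window
    )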
Root cause:
Splash encountered an unexpected spike in traffic that significantly exceeded normal levels.
The autoscaling mechanisms in place, designed to adjust resources based on demand, did not perform as anticipated.
This resulted in slower-than-usual response times across all Splash services.
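For context on how such a mechanism can fall behind: demand-based autoscaling typically tracks a utilization metric against a target value, with cooldown periods between scaling actions, so a spike that outpaces the metric's evaluation window and the scale-out cooldown leaves capacity lagging demand. The sketch below shows a target-tracking policy of this general kind, assuming AWS Application Auto Scaling on an ECS service via boto3; the policy name, resource ID, and numbers are hypothetical, not Splash's actual configuration.

    import boto3

    autoscaling = boto3.client("application-autoscaling")

    # Hypothetical target-tracking policy: hold average CPU near 50%.
    # If traffic rises faster than one scale-out per cooldown can absorb,
    # capacity lags demand and response times degrade platform-wide.
    autoscaling.put_scaling_policy(
        PolicyName="splash-api-cpu-target",            # hypothetical policy name
        ServiceNamespace="ecs",
        ResourceId="service/prod-cluster/splash-api",  # hypothetical service
        ScalableDimension="ecs:service:DesiredCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": 50.0,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
            },
            "ScaleOutCooldown": 60,   # seconds to wait between scale-outs
            "ScaleInCooldown": 300,   # seconds to wait before scaling back in
        },
    )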
Steps to prevent recurrence:
- Splash to adjust autoscaling and alerting metrics (see the sketch following this list).
- Splash to update platform logic to better accommodate increased traffic.
- Splash to investigate the source of the traffic observed during the period of degraded performance.
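As one illustration of the first item, an alarm on raw request volume can fire before autoscaling falls behind, rather than after latency has already degraded. A minimal sketch assuming Amazon CloudWatch via boto3; the alarm name, load balancer dimension, threshold, and notification topic are all hypothetical.

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Hypothetical alarm: notify when request volume runs well above baseline
    # for three consecutive minutes, ahead of user-visible slowdowns.
    cloudwatch.put_metric_alarm(
        AlarmName="splash-api-request-surge",  # hypothetical alarm name
        Namespace="AWS/ApplicationELB",
        MetricName="RequestCount",
        Dimensions=[
            {"Name": "LoadBalancer", "Value": "app/splash-prod/0123456789abcdef"}  # hypothetical
        ],
        Statistic="Sum",
        Period=60,                # one-minute windows
        EvaluationPeriods=3,      # sustained for three consecutive periods
        Threshold=50000.0,        # hypothetical surge threshold
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall"],  # hypothetical SNS topic
    )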