Issue summary:
On March 11, 2024, Splash experienced performance degradation, including delayed and stuck email sends, failed RSVPs, and touchpoint loading issues. Customers first reported failures at 12:00 PM ET. The interruption was caused by a long-running query that prevented other processes, such as RSVP and Contacts operations, from completing correctly. Services were fully restored by 1:11 PM ET.
Issue timeframe:
March 11, 2024 from 11:11 AM ET to 1:11 PM ET (2 hours)
Sequence of events:
- March 11, 2024, 11:11 AM ET: Internal system monitoring flagged slowing queries to the Splash contacts database.
- 12:00 PM ET: First customer reports of failures received by Support.
- 12:09 PM ET: Internal alerts received; investigation started.
- 12:49 PM ET: System impact identified, and root cause traced to a long-running query that blocked other processes.
- 1:11 PM ET: All problematic queries and processes were terminated, and services were fully restored.
Root cause:
A long-running query consumed excessive Splash resources and blocked other processes from executing properly, which degraded performance in other areas as a side effect.
Steps to prevent recurrence:
- Splash will run query processes in contained environments so that their performance impact can be monitored and any significant impact on other services prevented.