Issue summary:
The platform experienced intermittent errors affecting multiple features. The errors were caused by an elevated volume of incoming API calls sent by a third-party automated process.
Issue timeframe:
February 6, 2024 03:18 AM EST to 04:58 AM EST (1 hour 40 minutes)
Sequence of events:
- February 6, 03:18 AM EST First internal alert received; investigation started.
- February 6, 03:40 AM EST System impact identified, and root cause traced to an abnormally elevated number of incoming API calls.
- February 6, 04:10 AM EST Source IP addresses of the requests identified.
- February 6, 04:22 AM EST IP addresses blocked; system began returning to stability.
- February 6, 04:58 AM EST Functionalities fully restored.
Root cause:
An automated process in an external application sent an excessive number of API calls to Splash. The volume exceeded our native limits and caused intermittent issues in several parts of the platform.
Once the IP addresses making these calls were identified, they were blocked, and all platform functionality was restored.
Steps to prevent recurrence:
- Splash has contacted the third party generating the API calls to ensure the process is corrected and does not recur.
- Splash will enhance our IP address-level API rate limiting.
- Splash will implement additional quality assurance and testing controls around rate limiting.
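
For illustration only, IP address-level rate limiting of the kind mentioned above is often built on a token bucket per source address. The sketch below is a minimal, hypothetical example and does not describe Splash's actual implementation: the class name `PerIPRateLimiter`, the `capacity` and `refill_rate` parameters, and the sample IP addresses (taken from documentation-reserved ranges) are all assumptions.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class TokenBucket:
    tokens: float        # tokens currently available
    last_refill: float   # clock reading at the last refill


class PerIPRateLimiter:
    """Hypothetical per-IP token bucket; not an actual Splash component."""

    def __init__(self, capacity: int, refill_rate: float,
                 clock: Callable[[], float] = time.monotonic):
        self.capacity = float(capacity)   # maximum burst size per IP
        self.refill_rate = refill_rate    # tokens added per second
        self.clock = clock                # injectable so tests can fake time
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, ip: str) -> bool:
        """Return True if a request from `ip` is within its limit."""
        now = self.clock()
        bucket = self.buckets.get(ip)
        if bucket is None:
            # A new IP starts with a full bucket.
            bucket = TokenBucket(tokens=self.capacity, last_refill=now)
            self.buckets[ip] = bucket

        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - bucket.last_refill
        bucket.tokens = min(self.capacity,
                            bucket.tokens + elapsed * self.refill_rate)
        bucket.last_refill = now

        if bucket.tokens >= 1.0:
            bucket.tokens -= 1.0
            return True
        return False  # caller would typically respond with HTTP 429


if __name__ == "__main__":
    limiter = PerIPRateLimiter(capacity=3, refill_rate=1.0)
    for i in range(5):
        print(i, limiter.allow("203.0.113.7"))
```

Because each source address has its own bucket, a single misbehaving client exhausts only its own allowance, while requests from other addresses continue to be served. A production version would also need to evict idle buckets and share state across servers, which this sketch omits.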