Explorer API outage

TL;DR

The endpoint responsible for projectId validation saw an increased latency on the AP region. In an attempt to improve latency on that endpoint, our Cloud engineering team decided to merge 2 database queries by using an inner join.

The join turned out to be inefficient as opposed to the existing solution and caused our Cloud Database to be overloaded and unresponsive.

Summary

When did it start/end
- It started on May 28 at 16:11 UTC and ended at 16:45
Who found the issue
- Our alarms got triggered and our team got notified right away
How was it mitigated
- We rolled back the change and restarted our read replica
What was the (customer) impact
- An increased latency as well as a high error rate on relay connections.

Root Cause

Refactoring the endpoint responsible for projectId validation to make a single database query instead of 2 by using an inner join without an index on the foreign key.

5 Whys

Why did the endpoint latency increase initially? Because we attempted to optimize the endpoint by reducing the number of database queries from two to one using an inner join that turned out to be inefficient.

Why did the inner join increase latency rather than decrease it?

The inner join was inefficient because it likely caused a full table scan or combined rows excessively due to non-optimized join conditions or lack of proper indexing. It caused our read replica to be overloaded and unresponsive.

Why was there a lack of indexing ?

We naively assumed there was an existing index for a highly used column.
Due to lack of query planning analysis

Why wasn't there thorough testing?

Our current tests are limited to unit and integration tests and although our tests covered the expected behaviour, it didn’t cover testing under production workloads and it didn’t take query plan results into consideration.

Why were the testing protocols insufficient?