TL;DR

The endpoint responsible for projectId validation saw an increased latency on the AP region. In an attempt to improve latency on that endpoint, our Cloud engineering team decided to merge 2 database queries by using an inner join.

The join turned out to be inefficient as opposed to the existing solution and caused our Cloud Database to be overloaded and unresponsive.

Summary

Root Cause

Refactoring the endpoint responsible for projectId validation to make a single database query instead of 2 by using an inner join without an index on the foreign key.

5 Whys

Why did the endpoint latency increase initially? Because we attempted to optimize the endpoint by reducing the number of database queries from two to one using an inner join that turned out to be inefficient.

Why did the inner join increase latency rather than decrease it?

The inner join was inefficient because it likely caused a full table scan or combined rows excessively due to non-optimized join conditions or lack of proper indexing. It caused our read replica to be overloaded and unresponsive.

Why was there a lack of indexing ?

Why wasn't there thorough testing?

Our current tests are limited to unit and integration tests and although our tests covered the expected behaviour, it didn’t cover testing under production workloads and it didn’t take query plan results into consideration.

Why were the testing protocols insufficient?