TL;DR
Issue 1:
All projects created between April 18 11:30 UTC and April 22 11:39 UTC were treated as invalid when making requests to api.web3modal.com. The API was still querying the temporary replica, which was out of sync and did not contain the most recently created projects. 494 developers created a project during that period.
Issue 2:
While we were fixing the first issue and deploying a new version of the worker, a second bug caused requests for all projects to receive a 400 Bad Request response.
Summary
- The incident started on April 18 at 11:30 UTC and ended on April 22 at 11:39 UTC
- Web3Modal Flutter developers and Alfredo raised the issue
- Fixed by updating the environment variables so that the worker queries data from our original Supabase DB (read replica)
- New projects attempting to use Web3Modal were unable to do so
- All projects received a 400 response from 09:45 UTC until 11:39 UTC on April 22
Root Cause
While upgrading our Supabase instance to the latest Postgres version, we had to point our workers to a temporary replica to avoid downtime.
That process involves manually updating environment variables through the Cloudflare dashboard UI twice: once before the upgrade to point to the temporary replica, and once after the upgrade to point back to the original DB.
Issue 1
The first issue occurred because the environment variables were not properly updated to point back to the original DB.
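A stale target like this could in principle be caught by a freshness check run after each environment variable change. The following is a minimal sketch, not our actual tooling: it assumes we can obtain the creation time of the newest project from both the database the worker is configured to query and the original DB. The function name, the `max_lag` threshold, and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def is_out_of_sync(
    newest_on_target: datetime,
    newest_on_primary: datetime,
    max_lag: timedelta = timedelta(minutes=5),
) -> bool:
    """Return True when the worker's target DB lags the original DB by more than max_lag.

    newest_on_target:  creation time of the newest project visible to the worker.
    newest_on_primary: creation time of the newest project in the original DB.
    """
    return newest_on_primary - newest_on_target > max_lag


# Illustrative timestamps only (not taken from the incident data).
target_newest = datetime(2000, 1, 1, 11, 30, tzinfo=timezone.utc)
primary_newest = datetime(2000, 1, 2, 9, 0, tzinfo=timezone.utc)

if is_out_of_sync(target_newest, primary_newest):
    print("Worker target is stale: check the database environment variables")
```

A check of this kind only compares data, so it would flag the problem regardless of whether the cause was a missed manual step or a dashboard update that silently failed.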
Issue 2
The second issue occurred because there was a connection issue between our Cloudflare worker and Supabase.
5 Whys
- Why were new projects considered invalid when making requests to the API?
- The API was querying data from the temporary replica used during the Postgres upgrade. That replica was out of sync with the original database: the most recent projects it knew about were the ones created before the upgrade process.
- Why was the API querying data from the temporary replica?
- Environment variables were not updated to point back to the original database after completing the database upgrade.
- Why were the environment variables not updated?
- Manual updates through the Cloudflare dashboard UI were required and were overlooked and/or mishandled.
- Why were the updates to the environment variables mishandled?
- There seems to be a bug in the Cloudflare UI where updating an encrypted secret and a non-encrypted secret at the same time results in the encrypted secret not being updated. We are also missing logging and alerting for when such a situation occurs.
- Why was there no logging or alerting in place?
- We explicitly decided to exclude 4XX errors from our log ingestion
- We only monitor and alert on 5XX errors
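Because some 4XX responses are expected in normal operation, alerting on them usually means alerting on the rate rather than on individual errors. The sketch below illustrates one possible rule, assuming we have per-window counts of responses by status code and a known baseline 4XX rate. All names and thresholds (`min_requests`, `factor`, `floor`) are hypothetical and would need tuning against real traffic.

```python
def client_error_rate(status_counts: dict[int, int]) -> float:
    """Fraction of responses in the window with a 4XX status code."""
    total = sum(status_counts.values())
    if total == 0:
        return 0.0
    client_errors = sum(
        count for status, count in status_counts.items() if 400 <= status < 500
    )
    return client_errors / total


def should_alert(
    status_counts: dict[int, int],
    baseline_rate: float,
    min_requests: int = 100,  # ignore windows with too little traffic
    factor: float = 3.0,      # alert when the rate is this many times the baseline
    floor: float = 0.05,      # never alert below this absolute rate
) -> bool:
    """Return True when the 4XX rate in the window is abnormally high."""
    if sum(status_counts.values()) < min_requests:
        return False
    threshold = max(floor, baseline_rate * factor)
    return client_error_rate(status_counts) >= threshold


# Illustrative window: every request rejected with 400, as in the second issue.
print(should_alert({400: 1000}, baseline_rate=0.02))
```

Comparing against a baseline keeps ordinary client mistakes from paging anyone, while a jump like the one seen when all projects received 400 responses would cross the threshold within a single window.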
What could we have done better?