Alpha Launch Postmortem

07-10-2020

6 minute read

On May 27 Supabase hit the top of Hacker News and stayed on the front page for more than 24 hours.

Since then, Supabase has been featured on the Stack Overflow podcast, hit the trending page on GitHub, and scaled to over 1000 databases.

Here is everything that went wrong along the way.

Quick stats - launch week

Before we get into the details, here are some high-level numbers for the week following the launch.

We had 30,000 new website visitors to supabase.io:

This graph shows the traffic to our website, that peaks at 15,000 the first day and tails off throughout the week. We had a total of 30,000 visitors to our site. That's according to Google Analytics though, and everyone on HN blocks GA so who knows what the real number is.


We had over 1400 signups in 7 days: This graph shows our signups. Before the launch we had around 100 signups. After the launch it sky-rocketed to 600 in the first day, 900 by the second, and 1000 by the third day. By the end of the week we had over 1400 signups.

GitHub stars rocketed for our two main repos, supabase and realtime. This graph shows our github stars. It's basically a vertical line. Our main repo went from 300 stars to 1300 overnight.

The good

Here are the things that survived well.

Middleware: docker compose up

This was the most surprising survivor. Our middleware was served from a single Ubuntu server with 4 CPUs and 8GB of RAM. This server was running our middleware using docker compose up:

This image shows in middleware architecture. We used docker compose to pull up 5 open source tools: Kong, Realtime, PostgREST, PG-API, and PG-BOSS.

In case you're wondering why any sane company would use that in production, it's because we weren't planning to launch - the HackerNews post was created by an early GitHub follower, while we were alpha testing, and it was too scary to migrate the middleware while it was servicing the thundering herd. All Supabase projects use the same middleware stack (without docker compose), so I guess this counts as a successful load test.

We have since migrated our middleware to multiple ECS clusters, globally load-balanced using AWS's Global Accelerator.

Frontend: Netlify, Vercel, Auth0

We serve our marketing site (supabase.io) from Netlify. It's a static-build Docusaurus (v2) site, so it had no problems (apart from one developer in Russia who couldn't access the site - it looks like some of Netlify's IP addresses are blocked there).

We serve our app (supabase.com/dashboard) using Vercel, and the login system uses Auth0. These were both rock-solid. Before the launch we noticed that Vercel was extremely slow on their free plan, and once we upgraded to their Pro Plan for multi-region deploys it solved performance issues. It looks like they are changing their plans again so buyer beware.

The Bad

Digital Ocean cloud limits

In May we were using Digital Ocean to serve all of our customer databases. We hit the first server limit (400 servers) in the space of a few hours. They bumped our limit up to 1000 and we hit that again a few hours later.

Digital Ocean were very responsive when we asked for increases, each time responding in 30 minutes or less.

Cloudflare cloud limits

Each Supabase project gets a unique URL for their API and database. This is set up using Cloudflare's API. This was a seamless process until we hit the 1000-subdomain limit, at 4am in the morning. My cofounder was awake, managed to identify the problem early, and reached out to the support to increase the limit.

Cloudflare support advised him to upgrade the account to increase the limit, but the upgrade could only be done by the owner (me). Unfortunately my phone was on silent, so for 3 hours our systems were down. Ideally any one of our team could have upgraded our account, but I imagine that Cloudflare have their reasons for this restriction.

The Ugly - migrating 1800 servers

Let me start by saying that Digital Ocean have been great for getting up and running. The experience was simple to start with, but ended painfully.

Digital Ocean production errors

The first sign of problems were the frequent production errors. This is a screenshot of emails from Digital ocean for the month of June.

This image shows all the emails I was receiving from Digital Ocean. I was receiving an email every other day about servers which needed to be migrated.

Each of these emails represents one or more servers that has a critical issue:

We have identified an issue on the physical machine hosting one or more of your Droplets. In the event that we are not able to perform a live migration of a Droplet, we will perform an offline migration during which the Droplet will be powered off and migrated offline during the window.

Luckily we are in alpha, and our community has been extremely patient.

Credit limits

Most of our frustration was due to their internal policy around credits program. We were generously granted $10,000 Digital Ocean credits in February through Stripe Atlas, and so they became our primary cloud provider.

Digital Ocean have another (more generous) credits package for YC companies. Unfortunately when we applied for this we were told we weren't eligible because it was a "Partner switch" from Stripe to YC. This was frustrating because a "Partner switch" is completely arbitrary to us as a customer.

Also we had conveniently just run out of credits. We don’t expect cloud providers to fund our inefficiencies, but we needed time to optimize our infrastructure after our surprise launch. We had assumed that the more generous credits package was guaranteed - an assumption which cost us several thousand dollars. After swift deliberation, we decided to migrate away from Digital Ocean.

Going Forward: AWS t3a

In the past 3 weeks we have migrated 1800 servers over to AWS.

We are on the AWS "Activate" program, which grants us $100,000 credits. Since Firebase has a very generous free-tier, and we want to be able to offer Supabase developers a similar experience, this is ultimately a huge benefit to our community.

The AWS team were helpful, suggesting their new t3a instances. We're already seeing improvements, with database startup times almost halved from ~90s to ~50s. From our research, this is the fastest Postgres setup in the market.

We will release a detailed write-up on these instances in the next few weeks. Sign up to our newsletter if you want to be notified when we release the post.

Share this article

Build in a weekend, scale to millions