Primary keys are important. They uniquely identify rows of data in tables, and make it easy to fetch data. The job of a database is to archive and recall data and you're going to have a hard time finding data without a good primary key or a good index.
Sometimes it makes sense to use a “natural key” (like an email
column in a users
table) and sometimes it's better to use a “surrogate key”, a value made for the purpose of identifying a row (and has no other meaning).
At first glance, the question of which primary key to use is easy! Just throw a integer
/serial
on there, right? Numeric IDs are cool, but what about random value IDs or Universally Unique IDentifiers (UUIDs)?
Turns out the question of which identifier (and in this case, UUID) to use is complicated -- we're going to dive into some of the complexity and inherent trade-offs, and figure things out:
- What are the choices for identifiers?
- If we choose to use/add UUIDs, which ones should we choose?
- How can we get these UUIDs into postgres?
- Which UUIDs perform best?
But first, a quick history lesson.
A brief history of identifiers and why we use them
integer
/biginteger
Let's think about identifying rows of data from first principles. What's the first way you might think of identifying things? Assigning them numbers!
We can set up a table like this:
_12-- Let's enable access to case-insensitive text_12CREATE EXTENSION IF NOT EXISTS citext;_12_12-- Heres a basic users table_12CREATE TABLE users (_12 id integer PRIMARY KEY,_12 email citext NOT NULL CHECK (LENGTH(email) < 255),_12 name text NOT NULL_12);_12_12-- Let's assume we don't want two users with the exact same email_12CREATE UNIQUE INDEX users_email_uniq ON users USING BTREE (email);
This looks great, but what should id
be on new rows? I don't know -- maybe the application can figure it out? If they store some value in memory? That doesn't seem right.
Maybe we could figure out the next integer from what's in the table itself -- we just need to be able to "count" upwards. We do have all the users
tables rows in there, so we should be able to do it:
_10insert into users_10 (id, email, name)_10select count(*) + 1, '[email protected]', 'new_user' from users;
After running that query, we can double check our results:
_10select * from users;
id | name | |
---|---|---|
1 | [email protected] | new user |
2 | [email protected] | new user |
Using COUNT(*)
in our query is not the most efficient (or even easiest) solution though, and hopefully it's clear why -- counting a sequence of numbers for primary keys is a feature built in to Postgres!
serial/bigserial is the right tool in our toolbox to maintain a shared, auto-incrementing sequence of numbers. Let's pretend we read the postgres documentation and use those instead.
serial
/bigserial
serial
is essentially a convenient macro for using Postgres sequences, a database-managed auto-incrementing stream of integer
.
Let's hear it from the docs:
The data types
smallserial
,serial
andbigserial
are not true types, but merely a notational convenience for creating unique identifier columns (similar to theAUTO_INCREMENT
property supported by some other databases).
Using a serial
column to create the users
table would look like this:
_10create table users (_10 id serial primary key,_10 email citext not null check (length(email) < 255),_10 name text not null_10);
OK, now let's try inserting into it - we shouldn't have to specify id
:
_10insert into users_serial_10 (email, name)_10values_10 ('[email protected]', 'new user');
_10select * from users;
id | name | |
---|---|---|
1 | [email protected] | new user |
_10(1 row)_10INSERT 0 1
It works, as you might expect - now the application doesn't have to somehow magically know the right ID to use when inserting.
But what does serial
actually do? Using a serial column is operationally similar to the following SQL:
_10CREATE SEQUENCE tablename_colname_seq AS integer;_10CREATE TABLE tablename (_10 colname integer NOT NULL DEFAULT nextval('tablename_colname_seq')_10);_10ALTER SEQUENCE tablename_colname_seq OWNED BY tablename.colname;
Back in application land, the INSERT
statement returns, and provides the new id
the database assigned our new row. Multiple application instances don't need to coordinate what ID to use -- they just don't, and find out from the database.
We've taken a somewhat meandering path to get here, but this is the standard solution for most reasonable database schemas.
Why not stop at serial
?
There are few issues with sequences:
- When writing automation that simply iterates through id values, note that
serial
columns can have gaps, even if you neverDELETE
(e.x. if anINSERT
was rolled back — sequences live outside transactions). - When used from outside code
serial
may leak some data or give attackers an edge (e.x., ifyoursite.com/users/50
works, how aboutyoursite.com/users/51
?). serial
is PostgreSQL specific (i.e. not SQL standards compliant)
Don't be too put off by these reasons -- serial
is still the go-to for most use-cases.
Even the last point about serial
not being standards compliant is solved in Postgres 10+ by using...
integer
/biginteger
(again!)
Postgres 10 added support for the IDENTITY
column syntax in CREATE TABLE
(EDB has a great writeup on the addition).
This means we can happily go back to using integer
/biginteger
:
_10CREATE TABLE (_10 id integer PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,_10 email citext NOT NULL CHECK (LENGTH(email) < 255),_10 name text NOT NULL_10)
So how does this work? Postgres does the same thing under the covers -- it generates a sequence
. As this syntax is standards compliant, it's generally recommended practice for DBAs going forward, for the sake of the realm.
So integer
s are great, but information leakage is still a problem. How do we fix that? Well, make the numbers random, obviously.
Random Numeric IDs
Let's say the application using the DB has some Python code like the following:
_10from random import randrange_10from models import User_10MAX_RANDOM_USER_ID = 1_000_000_000_10def create_user():_10 """_10 Add new user to the database_10 """_10 user_id = randrange(1, MAX_RANDOM_USER_ID)_10 user = User(id=user_id, email="[email protected]", name="new user")_10 db.save(user)
That looks good, but there's a problem -- random is a pseudorandom generator.
Pseudo-random numbers are not what we want for ensuring user IDs cannot be easily guessed/collide. It's possible to get the exact same sequence of values out of a pseudo-random number generator by using the same seed value.
Sometimes you want pseudo-random behavior (let's say for testing or fuzzing), but it's generally not desired for production systems that might run from a duplicated identical application image, since they could have weak pseudo-random seed initialization.
(Secure) Random Numeric IDs
At the very least we need a properly secure random numbers -- we need Python's secrets module:
_10from secrets import randbelow_10# ..._10def create_user():_10 """_10 Add new user to the database (using secure random numbers)_10 """_10 user_id = randbelow(1, MAX_RANDOM_USER_ID)_10 user = User(id=user_id, email="[email protected]", name="new user")_10 db.save(user)
Now we have a secure random value coming in for our user IDs. But having values like 583247
and 8923916
get generated are cool and all, but there are a few problems:
- These numbers are random and quite inscrutable
- The keyspace is fairly small (maybe good for comments on a popular website, but not for IDs!)
- People can still technically check them all (the guessing space is 1 to
MAX_RANDOM_USER_ID
!)
We need something better.
(Secure) Random UUIDs
Along comes UUIDs -- you're probably used to seeing them now, values like this UUIDv4:
468e8075-5815-4fe2-80d3-45a31827954b
.
They're very random (almost always generated with secure random sources), and while they're even worse for remembering, they're near impossible to practically guess -- the search space is just too large!
More importantly, UUIDs introduce methodology to the madness -- different versions of UUID are derived different ways -- combined with other sources of randomness or known values.
There are a lot of versions of UUID, but let's discuss the ones we're more likely to use/see day to day.
UUIDv1
Version 1 UUIDs have three two components:
- a 60 bit date-time (at nanosecond precision)
- a 48 bit MAC address
But where's the randomness? Well v1s assume that you won't generate a ton of values in the same nanosecond (and there are some extra bits reserved for differentiating even when you do), but another source is the MAC address. MAC addresses uniquely (usually) identify network cards -- which is a security risk -- and those bits can be made random.
Here's what a UUIDv1 looks like:
_10a9957082-0b47-11ed-8a91-3cf011fe32f1
You can generate v1 UUIDs in Postgres natively thanks to the uuid-ossp contrib module. Here's how to generate a v1 UUID with random MAC address:
_10CREATE EXTENSION IF NOT EXISTS "uuid-ossp";_10CREATE EXTENSION_10_10SELECT uuid_generate_v1mc();_10_10 uuid_generate_v1mc_10--------------------------------------_10 dd1bbf10-0b47-11ed-80de-db48f6faaf86_10_10(1 row)
UUIDv4
Version 4 UUIDs use all the available bits for randomness -- 122 bits worth!.
UUIDv4s look like this:
_10ce0b897d-03a0-4f54-8c97-41d29a325a23
These don't have a time component, but they don't have in time they make up for in randomness -- it is very unlikely for them to collide, so they make for excellent Global Unique IDentifiers ("GUID"s).
We can generate them in Postgres like this (with uuid-ossp
):
_10SELECT gen_random_uuid();_10_10 uuid_generate_v4_10--------------------------------------_10 6ca93dde-81d4-4ea0-bfe1-92ecb4d81ee4_10_10(1 row)
Since Postgres would catch a collision on a PRIMARY KEY
or UNIQUE INDEX
column, we're done right? If we want to generate UUIDs all we need to do is choose UUID v1 or V4, and we won't leak any schema structure information to the outside world, right?
This is a workable solution, but as you might expect, it's not that easy.
The Post-UUIDv1/v4 era: A Cambrian explosion of identifiers
UUIDv1 and v4 were a start, but weren't enough for many companies out there. There are a couple shortcomings that plague both v1 and v2:
- UUIDs are twice the size of
bigint
/bigserial
- UUIDv1s contain a time element but they're not lexicographically sortable (this means they
SORT
terribly, relative tointeger
or atimestamp
column) - UUIDv1s are less random than UUIDv4, and can collide/overlap in close enough time intervals, at large scale
- UUIDv4s index terribly, as they're essentially random values (obviously, they
SORT
terribly as well)
Many of the world's biggest companies generated UUIDs at speeds that made all of these deficiencies a problem.
A cambrian explosion of UUIDs resulted, as noticed by the IETF -- this resulted in the new UUID formats (v6,v7,v8) being published in 2021.
Here's a quick list (from that IETF document):
- LexicalUUID by Twitter
- Snowflake by Twitter
- Flake by Boundary (now BMC TrueSight Pulse)
- ShardingID by Instagram
- KSUID by Segment
- Elasticflake by P. Pearcy
- FlakeID by T. Pawlak
- Sonyflake by Sony
- orderedUuid by IT. Cabrera
- COMBGUID by R. Tallent
- ULID by A. Feerasta
- SID by A. Chilton
- pushID by Google
- XID by O. Poitrey
- ObjectID by MongoDB
- CUID by E. Elliott
That's... A lot of UUIDs. They're all slightly different, but the innovation was summed up by the IETF:
An inspection of these implementations details the following trends that help define this standard:
- Timestamps MUST be k-sortable. That is, values within or close to the same timestamp are ordered properly by sorting algorithms.
- Timestamps SHOULD be big-endian with the most-significant bits of the time embedded as-is without reordering.
- Timestamps SHOULD utilize millisecond precision and Unix Epoch as timestamp source. Although, there is some variation to this among implementations depending on the application requirements.
- The ID format SHOULD be Lexicographically sortable while in the textual representation.
- IDs MUST ensure proper embedded sequencing to facilitate sorting when multiple UUIDs are created during a given timestamp.
- IDs MUST NOT require unique network identifiers as part of achieving uniqueness.
- Distributed nodes MUST be able to create collision resistant Unique IDs without a consulting a centralized resource. The IETF went on to introduce three 3 new types of UUIDs that have these properties these companies were looking for: UUIDv6, UUIDv7, and UUIDv8.
So what's the difference you ask?
- UUIDv6 - 62 bits of gregorian time + 48 bits of randomness
- UUIDv7 - 36 bits of big endian unix timestamp (seconds since epoch + leapseconds w/ optional sub-second precision) + variable randomness up to 62 bits
- UUIDv8 - variable size timestamp (32/48/60/64 bits) + variable size clock (8/12 bits) + variable randomness (54/62 bits)
It's not quite easy to work out what this all means but let's boil it down:
- All of these UUIDs sort properly (the "high bits" of time are first, like putting the year before the month -- "2022/07")
- UUIDv6 requires randomness
- The data contained in the UUID can be variable (ex. UUIDv8), this means you can bytes that mean something else (ex. an encoding of the compute region you're running in)
Alright, done hearing about UUIDs? Let's get to the fun part.
Benchmarking ID generation with uuid-ossp
and pg_idkit
With the history lesson behind us, let's benchmark these ID generation mechanisms against each other! For UUIDv1 and UUIDv4 we can use uuid-ossp.
Unfortunately, uuid-ossp
isn't quite so advanced as to have many of these newer UUIDs we've been discussing, so we'll pull in pg_idkit here.
pg_idkit is built with Rust, so it gives us access to the following ID generation crates:
- nanoid (a well known package from the NodeJS ecosystem)
- ksuid
- ulid
- rs-snowflake (Twitter's Snowflake algorithm)
- timeflake-rs (Inspired by Twitter's Snowflake, Instagram's ID and Firebase's PushID)
- sonyflake
- pushid
- xid
- cuid
- uuidv6
- uuid7
For each type of UUID, we can test the following:
- Generation speed: **How fast can I generate IDs (let's say 1,000,000 of them)?
- Table & Index size: How much larger do tables and associated indices get?
Generation speed
Generation speed is pretty easy to test, we can enable \\timing
mode on psql
and run a simple benchmark with generate_series
:
_10\timing -- enable psql timing mode_10_10-- Generate IDs 1 million times with ksuid_10SELECT COUNT(idkit_ksuid_generate()) FROM generate_series(1, 1000000);
Running all of the ID generation mechanisms on a single core of my machine (which happens to be an Oryx Pro), the lowest of 5 runs for each ID looks like this:
To be fair, generation speed shouldn't be a deal breaker as it's unlikely to be the bottle neck for most applications. That said, it is nice to have some data on where each ID generation mechanism lands.
Table & Index size
We can check the size of our tables & related indices with this query (after running VACUUM
):
_10select_10 relname as table_name,_10 pg_size_pretty(pg_total_relation_size(relid)) as "Table Size",_10 pg_size_pretty(pg_indexes_size(relid)) as "Index Size",_10 pg_size_pretty(pg_relation_size(relid)) as "Total Size"_10from pg_catalog.pg_statio_user_tables_10order by pg_total_relation_size(relid) desc;
Here are the sizes in tabular form:
These numbers are mostly a reflection of the length of the default settings of pg_idkit
but probably worth having in front of you anyway.
With this, we probably have enough information to make a decision (and a new library to generate our UUIDs with)!
Which ID should you use?
As usual, it depends -- you didn't think it'd be that easy, did you?
All I can offer are some general rules of thumb that hopefully work for you:
integer
s andserial
have obvious benefits for simplicity, storage, and sortability. You might not want to expose them to the world though.- If you want the ultimate in collision avoidance UUIDv4 is OK
- UUIDv1 could have been great, but it doesn't lexicographically sort.
- The best time-based ID seems to be
xid
, with good performance and sort friendliness - If you want to be a little more standards-oriented, UUID v6/v7
As usual, the best results will come from weighing all the options and finding what's best for your use-case, and doing appropriate testing on your data.
Possible Improvements
We've done some good exploration so far, but here are some ideas for interesting use cases for pg_idkit
and measuring the impact of ID generation using it.
Usecase: Generating our created_at
columns from our IDs
One interesting feature would be using at least partially time-based UUIDs for created_at
columns -- we could save space by virtualizing our created_at
columns:
_11-- At table creation_11CREATE TABLE users (_11 id text PRIMARY KEY DEFAULT idkit_ksuid_generate(),_11 name text,_11 email text,_11);_11_11-- An example query for a specific KSUID that uses created_at_11SELECT *, idkit_ksuid_extract_timestamptz(id)_11FROM users_11WHERE id = '0F755149A55730412B0AEC0E3B5B089C14B5B58D';
Ideally we could use the GENERATED ALWAYS AS ( ... )
syntax for generated columns while creating the table, but as the time of this post Postgres does not yet support virtual generated columns (only stored ones).
Benchmarking: Measuring index fragmentation
How fragmented do our indices get after use of each of these methods?
Luckily for us Postgres has the pgstattuple extension so we can find out -- thanks to Laurenz Albe on StackOverflow).
Integrating and including these tests in the pg_idkit
README would greatly help people looking to make a decision.
Benchmarking: Measuring SORT
friendliness
Another great metric to measure might be performance of these indices on certain common SORT
patterns. While this is inherently workload-specific, it would be great to pick a workload and see what we get.
In most code bases, simple WHERE
queries with SORT
s abound, and one of the big benefits of UUIDv6, UUIDv7 and the other alternatives is lexicographic sorting, after all.
Knowing just how good a certain ID generation method is at maintaining locality would be nice to know.
Creating and using a function like idkit_uuidv1_extract_timestamptz
and using it in a “functional index” (an index on an expression) could resolve the sort unfriendliness of UUIDv1 as well!
Wrap-up
Identifiers have a long history and are surprisingly unsolved at present. It can be confusing but thanks to the power of Postgres we don't have to over-think it -- tables can be migrated from one ID pattern to another, and we can use DDL statements in transactions.
Hopefully this article helps you head off some bikeshedding with your teammates when you next discuss which IDs are best to use.