August 30, 2021 07:41 am GMT

UUID or cached sequences?

One reason why people choose a UUID for the surrogate key is scalability: it generates unique values without having to synchronize with a central generator like a sequence. There are other reasons to use a UUID, but this post is about scalability. In most databases, sequences can be made scalable with a cache: the central catalog object that manages the sequence doesn't have to be read each time we need a next value. Even when the database does not provide a sequence cache, it is very easy to manage from the application: read a value from the sequence, multiply it by 50, and use it as the base to which a local counter from 1 to 50 is added. You then read from the central point of truth only once every 50 value generations. This is what the JPA @SequenceGenerator does, and 50 is the Hibernate default allocation size.
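To make that allocation scheme concrete, here is a minimal sketch of application-side caching in Python. It is a simplification of the hi/lo style allocation described above; the sequence name myseq, the psycopg2 connection, and the class name are assumptions for the illustration, not the article's code.

import threading
import psycopg2  # assumed driver; any DB-API connection works the same way

ALLOCATION_SIZE = 50  # Hibernate's default for @SequenceGenerator

class HiLoGenerator:
    """Allocate ids locally, reading the database sequence once per block."""
    def __init__(self, conn, seq_name='myseq'):
        self.conn = conn
        self.seq_name = seq_name
        self.lock = threading.Lock()
        self.hi = None                 # last value read from the database sequence
        self.lo = ALLOCATION_SIZE      # force a fetch on the first call

    def next_id(self):
        with self.lock:
            if self.lo >= ALLOCATION_SIZE:
                # one round-trip to the central generator every 50 ids
                with self.conn.cursor() as cur:
                    cur.execute(f"select nextval('{self.seq_name}')")
                    self.hi = cur.fetchone()[0]
                self.lo = 0
            self.lo += 1
            # block base is hi * 50, local counter runs from 1 to 50
            return self.hi * ALLOCATION_SIZE + self.lo

# usage sketch:
# conn = psycopg2.connect("dbname=demo")
# gen = HiLoGenerator(conn)
# print([gen.next_id() for _ in range(3)])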

If you have a high throughput of inserts, you can increase the cache. Reading from the central sequence every 32767 inserts is probably a negligible overhead. Why 32767? Because it fits in 2 bytes, out of the 8 bytes of the PostgreSQL bigint. So, even if some cached values are never used (which should happen only when connections in the pool are recycled), the remaining 6 bytes still give on the order of 1e14 cache ranges - enough to ingest a million rows every second for 30 years. You see where I'm going: even with a huge cache, an 8-byte sequence is sufficient for scalability, compared to the 16-byte UUID required to ensure uniqueness without a sequence.
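A quick back-of-the-envelope check of those figures (just arithmetic on the numbers quoted above; the million-rows-per-second-for-30-years workload is the one from the paragraph):

# capacity of a bigint sequence with cache 32767
BIGINT_MAX = 2**63 - 1            # 8 bytes, signed
CACHE = 32767                     # fits in 2 bytes

ranges = BIGINT_MAX // (CACHE + 1)             # distinct cache ranges available
rows_30y = 1_000_000 * 3600 * 24 * 365 * 30    # one million rows/s for 30 years

print(f"cache ranges available : {ranges:.2e}")       # ~2.8e14
print(f"rows in 30 years       : {rows_30y:.2e}")     # ~9.5e14
print(f"values in a bigint     : {BIGINT_MAX:.2e}")   # ~9.2e18
# even if a few ranges are partially wasted on connection recycling,
# the bigint space is orders of magnitude larger than the workload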

PostgreSQL

In PostgreSQL the sequence cache is per session. If you constantly connect and disconnect, you will waste the remainder of the cached range each time. But anyway, this is not how the database should be used. A connection is a process creation on the backend, and if you do that for each transaction, you have a bigger problem than a wasted sequence cache. The application server should use a connection pool, and then the cache is re-used. I've created the following example, in Python, which reads nextval() from a session grabbed from the pool, with 5 concurrent threads and a connection pool that can grow to 5 connections.

import sqlalchemy
from sqlalchemy import Sequence
import threading

# connection pool of 1 permanent connection, growing to 5 under load
yb = sqlalchemy.create_engine(
    'postgresql+psycopg2://franck:[email protected]:5433/yb_demo_northwind',
    pool_size=1, max_overflow=4)

def mythread():
    # each iteration grabs a connection from the pool and reads the next value
    print(threading.current_thread().name)
    for i in range(100):
        nextid = yb.connect().execute(Sequence('myseq'))
        print(f'{nextid:6d} from {threading.current_thread().name:8s} {yb.pool.status()}')

# create the sequence with a large cache before starting the threads
yb.connect().execute('drop sequence if exists myseq; create sequence myseq cache 32767')

mythreads = []
for i in range(5):
    t = threading.Thread(target=mythread)
    mythreads.append(t)
    t.start()
for t in mythreads:
    t.join()

Here is the beginning of the output:

     1 from Thread-1 Pool size: 1  Connections in pool: 1 Current Overflow: 4 Current Checked out connections: 4
     2 from Thread-1 Pool size: 1  Connections in pool: 1 Current Overflow: 4 Current Checked out connections: 4
     3 from Thread-1 Pool size: 1  Connections in pool: 1 Current Overflow: 4 Current Checked out connections: 4
     4 from Thread-1 Pool size: 1  Connections in pool: 1 Current Overflow: 4 Current Checked out connections: 4
     5 from Thread-1 Pool size: 1  Connections in pool: 1 Current Overflow: 4 Current Checked out connections: 4
     6 from Thread-1 Pool size: 1  Connections in pool: 1 Current Overflow: 4 Current Checked out connections: 4
 32768 from Thread-4 Pool size: 1  Connections in pool: 1 Current Overflow: 4 Current Checked out connections: 4
     7 from Thread-1 Pool size: 1  Connections in pool: 1 Current Overflow: 4 Current Checked out connections: 4
 65535 from Thread-5 Pool size: 1  Connections in pool: 1 Current Overflow: 4 Current Checked out connections: 4
 98302 from Thread-2 Pool size: 1  Connections in pool: 1 Current Overflow: 4 Current Checked out connections: 4
131069 from Thread-3 Pool size: 1  Connections in pool: 1 Current Overflow: 4 Current Checked out connections: 4
 32769 from Thread-4 Pool size: 1  Connections in pool: 1 Current Overflow: 4 Current Checked out connections: 4
     8 from Thread-1 Pool size: 1  Connections in pool: 1 Current Overflow: 4 Current Checked out connections: 4
 65536 from Thread-5 Pool size: 1  Connections in pool: 1 Current Overflow: 4 Current Checked out connections: 4
131070 from Thread-3 Pool size: 1  Connections in pool: 1 Current Overflow: 4 Current Checked out connections: 4
 98303 from Thread-2 Pool size: 1  Connections in pool: 1 Current Overflow: 4 Current Checked out connections: 4
 32770 from Thread-4 Pool size: 1  Connections in pool: 1 Current Overflow: 4 Current Checked out connections: 4
     9 from Thread-1 Pool size: 1  Connections in pool: 1 Current Overflow: 4 Current Checked out connections: 4
 65537 from Thread-5 Pool size: 1  Connections in pool: 1 Current Overflow: 4 Current Checked out connections: 4
131071 from Thread-3 Pool size: 1  Connections in pool: 1 Current Overflow: 4 Current Checked out connections: 4
 98304 from Thread-2 Pool size: 1  Connections in pool: 1 Current Overflow: 4 Current Checked out connections: 4
 32771 from Thread-4 Pool size: 1  Connections in pool: 1 Current Overflow: 4 Current Checked out connections: 4
    10 from Thread-1 Pool size: 1  Connections in pool: 1 Current Overflow: 4 Current Checked out connections: 4
...

Even if I grab a different session each time, it comes from the connection pool and the sequence cache from the previous use is still there. Thread-1 started first, grabbed the first connection from the pool, and uses numbers within the cached range of 1-32767. It may not use them all if the connection is recycled, but the waste is at most the cache size. You can see that the other threads grab additional connections, starting at 32768, 65535... and then use the values in those ranges. So, with this solution, you only occasionally waste a small part of a 2-byte range. Nothing to worry about compared to a UUID, which wastes a lot more (a UUID is 16 bytes) to ensure a very low probability of collision.

(There is a Kaggle notebook with this example if you want to play with it.)

YugabyteDB

I said PostgreSQL, but I was actually connected to YugabyteDB, which is fully compatible with PostgreSQL: the same query layer is used, but the storage is different. In YugabyteDB, all sequences are stored in a distributed table which is replicated according to the replication factor, and remains available in the event of a node crash or network failure. Because the leader tablet for the sequence may be on another node, the latency to read it is higher. And because it is replicated to other nodes, the latency to write to the quorum is also higher than in a monolithic PostgreSQL. As a consequence, a large cache is strongly recommended. To be sure that performance is fine even when the cache is not defined explicitly, the default in YugabyteDB is pushed to 100, instead of 1 in PostgreSQL. And if the throughput is high, it is recommended to go much higher. Don't worry about the gaps: the result is still smaller than a UUID.
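For example, the cache of an existing sequence can be raised with a plain ALTER SEQUENCE. Here is a minimal sketch re-using the SQLAlchemy engine yb from the example above (the sequence name myseq and the cache value are just the ones used in this post):

# raise the cache so the distributed catalog is read only once every 32767 nextval() calls
yb.connect().execute('alter sequence myseq cache 32767')

# check the current cache size (pg_sequences is a standard PostgreSQL catalog view)
for row in yb.connect().execute(
        "select sequencename, cache_size from pg_sequences where sequencename = 'myseq'"):
    print(row)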

Despite their name, sequence generators are there to provide unique numbers, not a gap-free series. There's a lot to say about primary keys, surrogate keys... I will discuss this at Joker 2021 in October: SQL primary key, surrogate key, composite keys, foreign keys... and JPA

Of course, with YugabyteDB as with PostgreSQL, you can generate a UUID with gen_random_uuid(), as the pgcrypto extension is available by default. You can also install uuid-ossp.

But let's look at another advantage of the cached sequence in a distributed database. You may want to distribute all rows even when they are generated from the same session. This will be the case if you define the primary key with HASH sharding, the default, like this:

create table t1 (
 id bigint generated always as identity ( start with 1 cache 32767 ),
 value text,
 constraint t1_id primary key ( id HASH )
);

Or you may prefer to keep the rows that are inserted together clustered:

create table t1 (
 id bigint generated always as identity ( start with 1 cache 32767 ),
 value text,
 constraint t1_id primary key ( id ASC )
);

You can even define the ranges yourself. A session's cache range puts its rows at a different place than the other sessions, but still in the same tablet when they all fall within the same range. That's a choice; you can also rely on Automatic Tablet Splitting.
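As a sketch of what defining the ranges could look like, run through the same SQLAlchemy engine yb as above: SPLIT AT VALUES is the YugabyteDB syntax for pre-splitting range-sharded tables, and the table name t2 and the split boundaries here are arbitrary choices for the illustration.

# pre-split a range-sharded table at chosen boundaries so that
# different cache ranges can land in different tablets
yb.connect().execute("""
 create table t2 (
  id bigint generated always as identity ( start with 1 cache 32767 ),
  value text,
  constraint t2_id primary key ( id ASC )
 ) split at values ( (1000000), (2000000) )
""")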

In short

You have the choice. There are reasons to use a UUID which have nothing to do with size or performance. For example, seeing UUIDs in logs rarely raises security audit concerns, but having numbers generated in sequence may allow some guesses about your business information. This can be mitigated by starting at a high number, and the values are already shuffled by the cache, but your security policy may still not allow it. However, when the concern is scalability, we need to think clearly. At first glance, it seems that a cache has a state which must be shared by all sessions, and is therefore not scalable. But with a large cache, querying and updating the sequence becomes an infrequent operation where the latency doesn't matter. And once you have solved this, you get all the advantages of a sequence. The ascending nature of the numbers helps to cluster rows together and accelerate bulk loads. Or, on the contrary, the rows can be distributed by hash sharding if it is preferable to spread the load over many nodes.

The size also matters. 16 bytes in the primary key and in all foreign keys is large. My sequence, with a bigint, is half that size. And operations on it, like comparing or incrementing an integer, are very simple CPU operations compared to generating a UUID with enough randomness. Just run the following on PostgreSQL or YugabyteDB:

create extension pgcrypto;
\timing on
create sequence myseq cache 32767;
select count(nextval('myseq') ) from generate_series(1,10000000);
select count(gen_random_uuid()) from generate_series(1,10000000);

With this large cache, the sequence generates 3 million unique numbers per second, while UUID generation stays below one million per second.

If you have never thought about the sequence cache size, there's a good chance that you think sequences are not scalable. The problem is not the SQL sequence itself, but keeping defaults that are often too small.


Original Link: https://dev.to/yugabyte/uuid-or-cached-sequences-42fi
