Your Web News in One Place

Help Webnuz

Referal links:

Sign up for GreenGeeks web hosting
April 22, 2023 08:48 am GMT

Range indexes for LIKE queries in YugabyteDB

SQL is powerful and indexes can provide fast access even when you only know partially the value you are looking for.

For this example, I'll use a list of popular baby names born in the US between 1880 and 2009 collected by Hadley Wickham:

drop table if exists baby_names;create table baby_names (   primary key (name, year, sex) ,year int , name text, percent float, sex text);\! wget -qc "https://github.com/hadley/data-baby-names/blob/master/baby-names.csv?raw=true"\copy baby_names from 'baby-names.csv?raw=true' with ( skip 1, format csv );

The primary key starts with the name and is HASH sharded on it (the default in YugabyteDB is HASH on the first column):

yugabyte=# \d baby_names                  Table "public.baby_names" Column  |       Type       | Collation | Nullable | Default---------+------------------+-----------+----------+--------- year    | integer          |           | not null | name    | text             |           | not null | percent | double precision |           |          | sex     | text             |           | not null |Indexes:    "baby_names_pkey" PRIMARY KEY, lsm (name HASH, year ASC, sex ASC)

A query for one name is fast, using the index:

yugabyte=# explain (analyze, dist, costs off)           select name, year, sex from baby_names            where name = 'Frank' order by year;                                         QUERY PLAN-------------------------------------------------------------------------------------------- Index Scan using baby_names_pkey on baby_names (actual time=1.743..1.800 rows=188 loops=1)   Index Cond: (name = 'Frank'::text)   Storage Index Read Requests: 1   Storage Index Execution Time: 2.000 ms Planning Time: 0.059 ms Execution Time: 1.834 ms Storage Read Requests: 1 Storage Write Requests: 0 Storage Execution Time: 2.000 ms Peak Memory Usage: 8 kB(10 rows)

However, a common use-case is when we enter only the beginning of the name. For example, with a name LIKE 'Fran%'.

No index: Seq Scan

Without an index, the whole table is scanned:

yugabyte=# explain (analyze, dist, costs off)           select name, year, sex from baby_names            where name like 'Fran%' order by year; Sort (actual time=118.913..118.959 rows=1420 loops=1)   Sort Key: year   Sort Method: quicksort  Memory: 146kB   ->  Seq Scan on baby_names (actual time=118.334..118.620 rows=1420 loops=1)         Remote Filter: (name ~~ 'Fran%'::text)         Storage Table Read Requests: 1         Storage Table Execution Time: 117.999 ms Planning Time: 1.266 ms Execution Time: 119.047 ms Storage Read Requests: 1 Storage Write Requests: 0 Storage Execution Time: 117.999 ms Peak Memory Usage: 254 kB(13 rows)

This is efficient (120 milliseconds) getting many rows (rows=1420) in one call to the storage (Read Requests: 1) thanks to the Remote Filter. Without this optimization (you can set yb_enable_expression_pushdown to false; to disable it) it would take 254 read requests to fetch 258000 rows by batches of 1024 rows (--ysql_prefetch_limit) to apply the filter in YSQL (the postgres backend of YugabyteDB) and discard most of them (Rows Removed by Filter: 256580).

Even if it is fast, the time complexity is O(N) on the number of rows in the table. To be scalable, we want an index access for which the time complexity depends on the result only.

Range index: Index Scan

The like 'Fran%' is a range query: it starts at 'Fran' and stops at 'Frao' (o being the next value after n in en_US.UTF-8 with lc_collate=C ). The query planner knows that, and can add a ((name >= 'Fran'::text) AND (name < 'Frao'::text)) condition.

My primary key index being hashed on name it cannot be used for a range scan over multiple names. I'm creating a range index for it:

create index baby_names_range_name on baby_names ( name ASC , year ASC );yugabyte=# explain (analyze, dist, costs off)           select name, year,sex from baby_names            where name like 'Fran%' order by year;                                               QUERY PLAN--------------------------------------------------------------------------------------------------------- Sort (actual time=13.473..13.522 rows=1420 loops=1)   Sort Key: year   Sort Method: quicksort  Memory: 146kB   ->  Index Scan using baby_names_range_name on baby_names (actual time=9.891..13.202 rows=1420 loops=1)         Index Cond: ((name >= 'Fran'::text) AND (name < 'Frao'::text))         Remote Index Filter: (name ~~ 'Fran%'::text)         Storage Index Read Requests: 2         Storage Index Execution Time: 5.000 ms         Storage Table Read Requests: 2         Storage Table Execution Time: 8.000 ms Planning Time: 2.966 ms Execution Time: 13.605 ms Storage Read Requests: 4 Storage Write Requests: 0 Storage Execution Time: 13.000 ms Peak Memory Usage: 873 kB(16 rows)

The time is faster even if there are more read requests (for the table and the index) because each request has read only the necessary rows.

Range index: Index Only Scan

If I need only the name and year, as they are in the index, it is even faster without the need to read the table:

yugabyte=# explain (analyze, dist, costs off)           select name, year from baby_names            where name like 'Fran%' order by year;                                                   QUERY PLAN                                 -------------------------------------------------------------------------------------------------------------- Sort (actual time=8.182..8.237 rows=1420 loops=1)   Sort Key: year   Sort Method: quicksort  Memory: 115kB   ->  Index Only Scan using baby_names_range_name on baby_names (actual time=5.287..7.926 rows=1420 loops=1)         Index Cond: ((name >= 'Fran'::text) AND (name < 'Frao'::text))         Remote Filter: (name ~~ 'Fran%'::text)         Heap Fetches: 0         Storage Index Read Requests: 2         Storage Index Execution Time: 8.000 ms Planning Time: 3.053 ms Execution Time: 8.315 ms Storage Read Requests: 2 Storage Write Requests: 0 Storage Execution Time: 8.000 ms Peak Memory Usage: 222 kB(15 rows)

Here I have Storage Index Read Requests but no Storage Table Read Requests thanks to the Index Only Scan. I could have the same even when querying the sexcolumn if it was added to the index on baby_names ( name ASC , year ASC ) include ( sex );

GIN index with Trigrams

The above was a range scan because the prefix of the pattern was known ('Fran%). If the first letters are unknown, like '%Fran%, we are back to full table scan:

yugabyte=# explain (analyze, costs off)           select name, year from baby_names            where name like '%Fran%' order by year;                                  QUERY PLAN------------------------------------------------------------------------------- Sort (actual time=119.386..119.429 rows=1420 loops=1)   Sort Key: year   Sort Method: quicksort  Memory: 115kB   ->  Seq Scan on baby_names (actual time=118.954..119.137 rows=1420 loops=1)         Remote Filter: (name ~~ '%Fran%'::text)         Storage Table Read Requests: 1         Storage Table Execution Time: 117.998 ms Planning Time: 2.233 ms Execution Time: 119.996 ms Storage Read Requests: 1 Storage Write Requests: 0 Storage Execution Time: 117.998 ms Peak Memory Usage: 234 kB(13 rows)

For this, the solution is indexing parts of the test with a GIN index. Those parts will be trigrams and the extension pg_trgm adds a gin_trgm_ops operator for the GIN index:

yugabyte=# create extension if not exists pg_trgm;yugabyte=# create index baby_names_trg_name on baby_names           using gin ( name gin_trgm_ops);yugabyte=# explain (analyze, costs off)           select name, year from baby_names            where name like '%Fran%' order by year;                                                  QUERY PLAN-------------------------------------------------------------------------------------------------------------- Sort (actual time=11.024..11.075 rows=1126 loops=1)   Sort Key: year   Sort Method: quicksort  Memory: 101kB   ->  HashAggregate (actual time=10.637..10.803 rows=1126 loops=1)         Group Key: year, name         ->  Index Scan using baby_names_trg_name on baby_names (actual time=6.670..10.273 rows=1420 loops=1)               Index Cond: (name ~~ '%Fran%'::text)               Rows Removed by Index Recheck: 70 Planning Time: 0.890 ms Execution Time: 11.178 ms Storage Read Requests: 0 Storage Write Requests: 0

Here 1490 rows (out of the 258000 rows in my table) have been accessed by index, with a few false positives that were removed after the index scan (Rows Removed by Index Recheck: 70) to get the final rows=1420.

More about Trigrams

In addition to indexing text by trigrams, pg_trgm comes with interesting functions and YugabyteDB being PostgreSQL compatible, supports all of them. Here is an example:

yugabyte=# select distinct name <-> '%Fran%' as distance, name           from baby_names where name like '%Fran%' order by 1; distance |    name----------+------------        0 | Fran 0.428571 | Franc 0.428571 | Frank 0.428571 | Franz      0.5 | Franco 0.555556 | Frances 0.555556 | Francis 0.555556 | Frankie      0.6 | Francina      0.6 | Franklyn      0.6 | Franklin      0.6 | Francine      0.6 | Francies 0.636364 | Francisca 0.636364 | Francesca 0.636364 | Francisco 0.636364 | Francesco 0.666667 | Francisqui(18 rows)

Here, not only I can list names that contain a pattern, but I can also order them so that the closest to the pattern are displayed first. For example, if the name you are looking for was 'Francesca' or 'Francisca' you would have probably searched with a closer pattern, like '%Fran%a':

yugabyte=> select distinct name <-> '%Fran%a' as distance, name           from baby_names where name like '%Fran%a' order by 1; distance |   name----------+----------- 0.666667 | Francina 0.692308 | Francesca 0.692308 | Francisca(3 rows)

Ordering on the similarity distance with a limit can suggest the most probable values. If the user doesn't find it she can add more to the pattern.

What about CockroachDB?

This post was inspired by a user moving from CockroachDB to YugabyteDB. He was creating the index without mentioning the range sharding (the ASC) because CRDB doesn't have the equivalent of YugabyteDB hash sharding and defaults to ASC. In YugabyteDB, the default is HASH on the first column, and the index scan for a prefixed LIKE pattern was not used.

With explicit range sharding (and it makes sense to explicitly define ASC or DESC as the performance of ORDER BY may be different on a backward scan) the index scan is possible as in the demo above. The same features as PostgreSQL are available in YugabyteDB, with horizontal scalability in addition to it.

Note that I tried to run the same example on CockroachDB but I was quickly limited by the lack of PostgreSQL compatibility:

root@localhost:26257/defaultdb> select version();                                       version--------------------------------------------------------------------------------------  CockroachDB CCL v22.2.8 (x86_64-pc-linux-gnu, built 2023/04/17 13:22:08, go1.19.6)(1 row)# loading from a CSV fileroot@localhost:26257/defaultdb> \copy baby_names from 'baby-names.csv?raw=true' with ( skip 1, format csv );ERROR: at or near "baby-names.csv?raw=true": syntax error: unimplemented: this syntaxSQLSTATE: 0A000DETAIL: source SQL:copy baby_names from 'baby-names.csv?raw=true' with ( skip 1, format csv )                     ^HINT: You have attempted to use a feature that is not yet implemented.# creating a GIN indexroot@localhost:26257/defaultdb> create index baby_names_trg_name on baby_names using gin ( name gin_trgm_ops);ERROR: unimplemented: primary key column name cannot be present in an inverted indexSQLSTATE: 0A000HINT: You have attempted to use a feature that is not yet implemented.# using PG_TRGM functionsroot@localhost:26257/defaultdb> select distinct name <-> '%Fran%' as distance, name from baby_names where name like '%Fran%' order by 1;invalid syntax: statement ignored: at or near "-": syntax errorSQLSTATE: 42601DETAIL: source SQL:select distinct name <-> '%Fran%' as distance, name from baby_names where name like '%Fran%' order by 1                      ^HINT: try \h SELECT

The only thing that is working like PostgreSQL is the first example: prefixed LIKE condition on a regular index. I didn't have to create the index as the primary key is already range sharded (I could have done the same in YugabyteDB by creating the table with primary key ( name asc, year, sex )).

YugabyteDB automatic sharding

All those are PostgreSQL commands and work the same in YugabyteDB. The only thing to take care is the sharding method, with the PostgreSQL compatible syntax ASC or DESC to define range sharding when you know you will have range queries. Do not worry about which value to split on: auto-splitting will do its job when the table grows. In the absence of range query access patterns, it is still better to use HASH sharding as it directly distributes the rows. The hash function (you can see it with yb_hash_code()) returns a value from 0x0000 to 0xFFFF so that it can start with multiple shards (ranges of hash values) and can be split further.

If you don't know all your access patterns, looking at the datatype can help. A UUID or number generated from a sequence are probably good candidates for HASH. Timestamps, numbers with arithmetic semantic, and alphabetical text, will probably be queried by ranges and ordered. In the example above, name and year are in this category.


Original Link: https://dev.to/yugabyte/range-indexes-for-like-queries-in-yugabytedb-10kd

Share this article:    Share on Facebook
View Full Article

Dev To

An online community for sharing and discovering great ideas, having debates, and making friends

More About this Source Visit Dev To