In this long read we explore techniques for advanced indexing and high availability fault-tolerant database applications using PostgreSQL 13. Thanks to its reliability, robustness, and high performance, PostgreSQL has become one of the most advanced open source databases on the market. This long read is sampled from Hans-Jürgen Schönig’s book Mastering PostgreSQL 13 (Fourth Edition).
Database performance depends heavily on indexing, and indeed it is no exaggeration to say that bad indexing is the main source of bad performance. Yes, it’s important to adjust memory parameters and all that, but it is all in vain if indexes are not used properly. There is simply no replacement for a missing index, and there is no way to achieve good performance without proper indexing.
For data that can be sorted (using the <, <=, =, >=, and > operators), B-tree indexes generally work well. Sometimes, however, B-trees are just not enough. This is because not every data type can be sorted in a useful way. Just imagine a polygon. How would you sort these objects in a useful way? Sure, you can sort by the area covered, its length, and so on, but doing this won’t allow you to actually find them using a geometric search.
The solution to this problem is to provide more than just one index type. Each index type serves a special purpose and does exactly what is needed. The following six index types are available out of the box (as of PostgreSQL 10.0):

- btree
- hash
- gist
- gin
- spgist
- brin
I’ll assume you already know about B-trees. (If not, you could check out the resources mentioned in Further reading, below.) But when should one contemplate using these other index types? The following guide outlines the purpose of each index type available in PostgreSQL.
Note that there are some extensions out there that can be used on top of what you can see here. Additional index types that are available on the web include rum, vodka, and, in the future, cognac.
Before going further, let’s create some test data to refer to later. Here’s the code snippet to do that:
test=# DROP TABLE IF EXISTS t_test;
test=# CREATE TABLE t_test (id serial, name text);
test=# INSERT INTO t_test (name) SELECT 'hans'
FROM generate_series(1, 2000000);
INSERT 0 2000000
test=# INSERT INTO t_test (name) SELECT 'paul'
FROM generate_series(1, 2000000);
INSERT 0 2000000
In the first statement, any previous copy of the table is dropped; then a simple table is created. Two columns are used: the first is an auto-increment column that just keeps creating numbers, and the second is a column that will be filled with static values. In all, 4 million rows have been added:
test=# SELECT name, count(*) FROM t_test GROUP BY 1;
 name |  count
------+---------
 hans | 2000000
 paul | 2000000
(2 rows)
Hash indexes have been around for many years. The idea is to hash the input value and store it for later lookups. Having hash indexes makes sense, but before PostgreSQL 10.0 it was not advisable to use hash indexes because PostgreSQL had no WAL (write-ahead logging) support for them. In PostgreSQL 10.0, this has changed. Hash indexes are now fully logged and are therefore ready for replication and considered to be 100% crash-safe.
Hash indexes are generally a bit larger than B-tree indexes. Suppose you want to index 4 million integer values. A B-tree will need around 90 MB of storage to do this. A hash index will need around 125 MB on disk. The assumption that’s made by many people is that a hash is super-small on disk, but in many cases that assumption is wrong.
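You can verify this for yourself by building both index types on the test table and comparing their sizes (a quick sketch; the index names are made up, and the exact numbers will vary slightly with the PostgreSQL version and fill factor):

test=# CREATE INDEX idx_id_hash ON t_test USING hash (id);
test=# CREATE INDEX idx_id_btree ON t_test (id);
test=# SELECT relname, pg_size_pretty(pg_relation_size(oid))
         FROM pg_class
         WHERE relname IN ('idx_id_hash', 'idx_id_btree');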
Generalized Search Tree (GiST) indexes are an important index type because they can be applied to a variety of different purposes. GiST indexes can be used to implement R-tree behavior (see https://en.wikipedia.org/wiki/R-tree), and it is even possible for them to act as B-trees. However, abusing GiST for B-tree indexes is not recommended.
Typical use cases for GiSTs are as follows:
- Range types
- Geometric indexes (for example, ones used by the highly popular PostGIS extension)
- Fuzzy searching
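As a small sketch of the geometric use case (the t_geo table and its contents are invented for illustration), a GiST index on a point column supports operators a B-tree cannot handle, such as nearest-neighbor search via the <-> distance operator:

test=# CREATE TABLE t_geo (id serial, location point);
test=# CREATE INDEX idx_geo ON t_geo USING gist (location);
test=# -- the ten points closest to (5, 5)
test=# SELECT * FROM t_geo ORDER BY location <-> point '(5, 5)' LIMIT 10;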
Understanding how GiST works
To many people, GiST is still a black box, so let’s try to outline how GiST works internally. Consider the following diagram:
Take a look at the tree. You can see that R1 and R2 are on top. R1 and R2 are the bounding boxes that contain everything else. R3, R4, and R5 are contained by R1. R8, R9, and R10 are contained by R3, and so on. A GiST index is therefore hierarchically organized. What you can see in the above diagram is that some operations that aren’t available in B-trees are supported, for example overlaps, left of, right of, and so on. The layout of a GiST tree is ideal for geometric indexing.
Of course, it is also possible to come up with your own operator classes. The following strategies are supported:
If you want to write operator classes for GiST, a couple of support functions have to be provided. In the case of a B-tree, there is only one function. GiST indexes provide a lot more:
Generalized Inverted Indexes (GIN) are a good way to index text. Suppose you want to index a million text documents; a certain word may occur millions of times. In a normal B-tree, this would mean that the key is stored millions of times. This is not the case with GIN: each key (or word) is stored once and assigned to a document list. Keys are organized in a standard B-tree, and each entry has a document list pointing to all the rows in the table that contain the same key. A GIN index is very compact. However, it lacks an important feature found in B-trees — sorted data. In GIN, the list of item pointers associated with a certain key is sorted by the position of the row in the table, not by arbitrary criteria.
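As a brief sketch (the t_doc table is hypothetical), full-text search is the classic GIN use case: the index is built on a tsvector expression, and queries use the @@ match operator against the same expression so the index can be used:

test=# CREATE TABLE t_doc (id serial, payload text);
test=# CREATE INDEX idx_gin ON t_doc
         USING gin (to_tsvector('english', payload));
test=# SELECT * FROM t_doc
         WHERE to_tsvector('english', payload)
               @@ to_tsquery('english', 'index');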
Just like any other index, GIN can be extended. The following strategies are available:
On top of this, the following support functions are available:
If you are looking for a good example of how to extend GIN, consider looking at the btree_gin module in the PostgreSQL contrib directory. It is a valuable source of information and a good way for you to start your own implementation.
A Space-Partitioned GiST (SP-GiST) is mainly designed for in-memory use. The reason for this is that an SP-GiST stored on disk needs a fairly high number of disk hits to function, and disk hits are way more expensive than just following a couple of pointers in RAM.
The beauty is that SP-GiST can be used to implement various types of trees, such as quadtrees, k-d trees, and radix trees (tries).
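A radix tree lends itself to prefix searches on text. As a sketch (the t_prefix table is made up; the ^@ starts-with operator exists from PostgreSQL 11 onward), the default SP-GiST operator class for text can serve such queries:

test=# CREATE TABLE t_prefix (id serial, path text);
test=# CREATE INDEX idx_spgist ON t_prefix USING spgist (path);
test=# -- find all rows whose path starts with the given prefix
test=# SELECT * FROM t_prefix WHERE path ^@ '/usr/local';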
The following strategies are provided:
To write your own operator classes for SP-GiST, a couple of functions have to be provided:
Block Range Indexes (BRINs) are of great practical use. All the indexes we’ve discussed up until now need quite a lot of disk space. Although a lot of work has gone into shrinking GIN indexes and the like, they still need quite a lot of disk space because an index pointer is needed for each entry. So if there are 10 million entries, there will be 10 million index pointers. Space is the main concern addressed by BRINs. A BRIN does not keep an index entry for each tuple; instead, it stores the minimum and maximum values for a range of 128 blocks of data (1 MB) by default. The index is therefore very small but lossy: scanning the index will return more data than we asked for, and PostgreSQL has to filter out these additional rows in a later step.
The following example demonstrates how small a BRIN really is:
test=# CREATE INDEX idx_brin ON t_test USING brin(id);
test=# \di+ idx_brin
                          List of relations
 Schema |   Name   | Type  | Owner | Table  | Persistence | Size
--------+----------+-------+-------+--------+-------------+-------
 public | idx_brin | index | hs    | t_test | permanent   | 48 kB
(1 row)
In my example, the BRIN index is 2,000 times smaller than a standard B-tree. The question naturally arising now is, why don’t we always use BRIN indexes? To answer this question, it is important to reflect on the layout of BRIN; the minimum and maximum values for 1 MB are stored. If the data is sorted (high correlation), BRIN is pretty efficient because we can fetch 1 MB of data and scan it, and we are done. However, what if the data is shuffled? In this case, BRIN won’t be able to exclude chunks of data anymore because it is very likely that something close to the overall high and the overall low is within that 1 MB of data. Therefore, BRIN is mostly made for highly correlated data. In reality, correlated data is quite likely in data warehousing applications. Often, data is loaded every day and therefore dates can be highly correlated.
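Whether your data is correlated enough for BRIN can be checked in the pg_stats view after running ANALYZE; a correlation close to 1 (or -1) means the physical row order follows the column order. Since id in our test table is a serial column filled sequentially, a value near 1 is to be expected here:

test=# ANALYZE t_test;
test=# SELECT tablename, attname, correlation
         FROM pg_stats
         WHERE tablename = 't_test' AND attname = 'id';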
Extending BRIN indexes
BRIN supports the same strategies as a B-tree and therefore needs the same set of operators. The code can be reused nicely:
The support functions required by BRIN are as follows:
Adding additional indexes
Since PostgreSQL 9.6, there has been an easy way to deploy entirely new index types as extensions. This is pretty cool because if those index types provided by PostgreSQL are not enough, it is possible to add additional ones that serve your purpose. The instruction to do this is CREATE ACCESS METHOD:
test=# \h CREATE ACCESS METHOD
Command: CREATE ACCESS METHOD
Description: define a new access method
CREATE ACCESS METHOD name
    TYPE access_method_type
    HANDLER handler_function
Don’t worry too much about this command — if you ever deploy your own index type, it will come as a ready-to-use extension.
One of these extensions implements bloom filters, which are probabilistic data structures. They sometimes return too many rows, but never too few. Therefore, a bloom filter is a good way to pre-filter data.
How does it work? A bloom filter is defined on a couple of columns. A bitmask is calculated based on the input values, which is then compared to your query. The upside of a bloom filter is that you can index as many columns as you want. The downside is that the entire bloom filter has to be read. Of course, the bloom filter is smaller than the underlying data and so it is, in many cases, very beneficial.
To use bloom filters, just activate the extension, which is part of the PostgreSQL contrib package:
test=# CREATE EXTENSION bloom;
As we stated previously, the idea behind a bloom filter is that it allows you to index as many columns as you want. In many real-world applications, the challenge is to index many columns without knowing which combinations the user will actually need at runtime. In the case of a large table, it is totally impossible to create standard B-tree indexes on, say, 80 fields or more. A bloom filter might be an alternative in this case:
test=# CREATE TABLE t_bloom (x1 int, x2 int, x3 int, x4 int,
x5 int, x6 int, x7 int);
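So that the planner later has something realistic to work with, the table can first be filled with random data (a sketch; any value distribution will do, and the float-to-integer assignment cast happens implicitly):

test=# INSERT INTO t_bloom
         SELECT random() * 1000, random() * 1000, random() * 1000,
                random() * 1000, random() * 1000, random() * 1000,
                random() * 1000
         FROM generate_series(1, 1000000);
INSERT 0 1000000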
Creating the index is easy:
test=# CREATE INDEX idx_bloom
ON t_bloom USING bloom(x1, x2, x3, x4,
x5, x6, x7);
If sequential scans are turned off, the index can be seen in action:
test=# SET enable_seqscan TO off;
test=# explain SELECT * FROM t_bloom WHERE x5 = 9 AND x3 = 7;
Bitmap Heap Scan on t_bloom (cost=18.50..22.52 rows=1 width=28)
Recheck Cond: ((x3 = 7) AND (x5 = 9))
-> Bitmap Index Scan on idx_bloom (cost=0.00..18.50 rows=1 width=0)
Index Cond: ((x3 = 7) AND (x5 = 9))
Note that I have queried a combination of random columns; they are not related to the actual order in the index. The bloom filter will still be beneficial.
In this brief guide we’ve seen that PostgreSQL supports a variety of index types beyond the well-known B-tree, each of which lends itself to a particular set of circumstances or use case. Knowing which sort of index to use will help you create PostgreSQL-based solutions that are optimally performant with your data and for the problems you want to solve.