In the Postgres world, indexes are essential to efficiently navigate the table data storage (aka the “heap”). Postgres does not maintain a clustering for the heap, and the MVCC architecture leads to multiple versions of the same tuple lying around. Creating and maintaining effective and efficient indexes to support applications is an essential skill.
Read on to check out a few tips on optimizing and improving the use of indexes in your deployment.
Note: Queries shown below are run on an unmodified pagila sample database.
Use Covering Indexes
Consider a query to fetch the emails of all inactive customers. The customer table has an active column, and the query is straightforward:
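(A sketch against the pagila schema, where the integer column active is assumed to use 0 for inactive customers.)

    -- sketch: emails of all inactive customers (assumes active = 0 means inactive)
    SELECT email
      FROM customer
     WHERE active = 0;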
The query calls for a full sequential scan of the customer table. Let’s create an index on the active column:
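(A sketch; idx_cust1 is the index name referred to in the next paragraph.)

    -- plain B-Tree index on the active column
    CREATE INDEX idx_cust1 ON customer (active);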
This helps, and the sequential scan has become an “index scan”. This means Postgres will scan the index “idx_cust1”, and then look up the table’s heap to read the values of the other columns (in this case, the email column) that the query needs.
PostgreSQL 11 introduced covering indexes. This feature allows you to include one or more additional columns in the index itself – that is, the values of these extra columns are stored within the index data storage.
If we were to use this feature and include the value of email inside the index, Postgres will not need to look into the table’s heap to get the value of email. Let’s see if this works:
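(A sketch using the INCLUDE clause added in PostgreSQL 11; the index name idx_cust2 is hypothetical.)

    -- covering index: email is stored in the index, but is not a key column
    CREATE INDEX idx_cust2 ON customer (active) INCLUDE (email);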
The “Index Only Scan” tells us that the query is now completely satisfied by the index itself, thus potentially avoiding all the disk I/O for reading the table’s heap.
Covering indexes are available only for B-Tree indexes as of now. Also, the cost of maintaining a covering index is naturally higher than a regular one.
Use Partial Indexes
Partial indexes only index a subset of the rows in a table. This keeps the indexes smaller in size and faster to scan through.
Assume we need to get the list of emails of customers located in California. The query is:
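(A sketch, joining customer to address on address_id as in the pagila schema.)

    -- emails of customers whose address is in the district 'California'
    SELECT c.email
      FROM customer c
      JOIN address a ON c.address_id = a.address_id
     WHERE a.district = 'California';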
The query plan for this involves sequentially scanning both of the joined tables.
Let’s see what a regular index gets us:
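(A sketch; idx_address1 is the name referred to below.)

    -- plain B-Tree index on the district column
    CREATE INDEX idx_address1 ON address (district);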
The scan of address has been replaced with an index scan over idx_address1, and a scan of address’s heap.
Assuming this is a frequent query and needs to be optimized, we can use a partial index which only indexes those rows of address where the district is ‘California’:
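(A sketch; idx_address2 is the name referred to below. Indexing address_id here means the join can be answered from the index alone.)

    -- partial index covering only the 'California' rows
    CREATE INDEX idx_address2 ON address (address_id)
     WHERE district = 'California';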
The query now reads only the index idx_address2 and does not touch the table address.
Use Multi-Value Indexes
Some columns that need indexing may not have a scalar data type. Column types like jsonb, arrays and tsvector have composite or multiple values. If you need to index such columns, it is usually the case that you need to search through the individual values in those columns as well.
Let’s try to find all movie titles that include behind the scenes outtakes. The film table has a text array column called special_features, which includes the text array element Behind The Scenes if a movie has that feature. To find all such films, we need to select all rows that have “Behind The Scenes” in any of the values of the array special_features:
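(A sketch; the capitalization of the element must match what is actually stored in the column.)

    -- films whose special_features array contains the element 'Behind The Scenes'
    SELECT title
      FROM film
     WHERE special_features @> '{"Behind The Scenes"}';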
The containment operator @> checks if the left hand side is a superset of the right hand side.
The query plan calls for a full sequential scan of the heap, at a cost of 67.
Let’s see if a regular B-Tree index helps:
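(A sketch; idx_film1 is a hypothetical name.)

    -- B-Tree index over the whole array value
    CREATE INDEX idx_film1 ON film (special_features);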
The index is not even considered. The B-Tree index has no idea that there are individual elements in the value it indexed.
What we need is a GIN index.
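A GIN index over the array column might look like this (idx_film2 is a hypothetical name):

    -- GIN index, which indexes the individual array elements
    CREATE INDEX idx_film2 ON film USING GIN (special_features);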
The GIN index is able to support matching of the individual value against the indexed composite value, resulting in a query plan with less than half the cost of the original.
Eliminate Duplicate Indexes
Over time indexes accumulate, and sometimes one gets added which has the exact same definition as another one. You can use the catalog view pg_indexes to get the human-readable SQL definitions of indexes. You can also easily detect identical definitions:
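(A sketch: group the definitions from pg_indexes after stripping out the index name itself.)

    -- indexes that share the exact same definition, name aside
    SELECT array_agg(indexname) AS indexes,
           replace(indexdef, indexname, '') AS defn
      FROM pg_indexes
     GROUP BY defn
    HAVING count(*) > 1;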
Running this against the stock pagila database will show any indexes that share the exact same definition.
Superset Indexes
It is also possible that you end up with multiple indexes where one indexes a superset of the columns that another does. This may or may not be desirable: the superset index can enable index-only scans, which is a good thing, but it may take up too much space, or the query it was originally intended to optimize may no longer be used.
If you wish to automate the detection of such indexes, the pg_catalog table pg_index is a good starting point.
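As a rough sketch, one can compare the key column lists (indkey) of indexes on the same table and flag those where one list is a leading prefix of another; this ignores expression and partial indexes and index types, so treat any matches only as candidates for review:

    -- sketch: indexes whose key columns are a leading prefix of another index on the same table
    SELECT a.indexrelid::regclass AS index,
           b.indexrelid::regclass AS superset_index
      FROM pg_index a
      JOIN pg_index b
        ON a.indrelid = b.indrelid
       AND a.indexrelid <> b.indexrelid
       AND b.indkey::text LIKE a.indkey::text || ' %';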
Unused Indexes
As the applications that use the database evolve, so do the queries that they use. Indexes that were added earlier may no longer be used by any query. Every time an index is scanned, it is noted by the statistics manager, and a cumulative count is available in the system catalog view pg_stat_user_indexes as the value idx_scan. Monitoring this value over a period of time (say, a month) gives a good idea of which indexes are unused and can be removed.
Here is the query to get the current scan counts for all indexes in the ‘public’ schema:
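(A sketch built on the pg_stat_user_indexes view.)

    -- current scan counts for all indexes in the 'public' schema
    SELECT relname AS tablename, indexrelname AS indexname, idx_scan
      FROM pg_stat_user_indexes
     WHERE schemaname = 'public'
     ORDER BY idx_scan;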
The output lists each index along with its cumulative scan count since the statistics were last reset.
Rebuild Indexes With Less Locking
It is not uncommon for indexes to need recreation. Indexes can become bloated, and recreating the index fixes that, making it faster to scan. Indexes can also become corrupt. Altering index parameters may also require recreating the index.
Enable Parallel Index Creation
In PostgreSQL 11, B-Tree index creation can be parallelized: multiple parallel workers can be used to speed up the creation of the index. However, you need to make sure that these configuration settings are set suitably:
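(The values below are only illustrative; they should be scaled with the number of CPU cores available.)

    -- illustrative values; tune to your hardware (can also be set in postgresql.conf)
    SET max_parallel_workers = 32;
    SET max_parallel_maintenance_workers = 16;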
The default values are unreasonably small. Ideally, these numbers should increase with the number of CPU cores. See the docs for more information.
Create Indexes in the Background
You can also create an index in the background, using the CONCURRENTLY parameter of the CREATE INDEX command:
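(A sketch; the index, table and column names are only placeholders.)

    -- build the index without blocking writes to the table
    CREATE INDEX CONCURRENTLY idx_address3 ON address (phone);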
This is different from doing a regular CREATE INDEX in that it does not take a lock that blocks writes to the table, so inserts, updates and deletes can proceed while the index is being built. On the downside, it takes more time and resources to complete.
About pgDash
pgDash is a modern, in-depth monitoring solution designed specifically for PostgreSQL deployments. pgDash shows you information and metrics about every aspect of your PostgreSQL database server, collected using the open-source tool pgmetrics. pgDash provides core reporting and visualization functionality, including collecting and displaying PostgreSQL information and providing time-series graphs, detailed reports, alerting, teams and more.
pgDash can monitor replication, locks, queries and more. It can also provide quick diagnostics and extensive alerts. Learn more here or sign up today for a free trial.