What metrics of your PostgreSQL deployment should you be monitoring? This series of blog posts aims to provide a minimal, starting set of essential monitoring actions that you should implement to ensure the health and stability of your Postgres servers.
Typically, data in a database follows the 80-20 rule. 20% of the tables contain most of the data and are accessed or mutated the most. Setting up extra monitoring just for these tables can provide insights that are important yet low-volume.
Here are some table-level metrics worth looking at:
1. Table Size
The actual disk size used by the table has to be monitored. In most cases, the table keeps on growing, so it is the growth rate that is more interesting.
It’s often the case that the growth rate changes after the rollout of a new version of the application, or a baseline change in traffic/load/input patterns of the application itself. Such changes must be investigated, at least to verify that new rate is sustainable by the provisioned hardware.
Action: Monitor the increase in size of the table over each week/month, investigate abrupt changes.
2. Table Bloat
Because of Postgres’ MVCC architecture, older versions of rows lie around in the physical data files for every table, and is termed bloat. The operation to clear out obsolete row versions is called vacuum. Postgres runs a background process called autovacuum, which picks up candidate tables (based on configurable parameters) and vacuums them for you.
Bloat slows down table operations and wastes disk space, and can run away even with autovacuum. Monitoring of bloat, as an absolute number of bytes as well as a percentage (of dead data to total data), is required.
This metric can be monitored either at individual table level, or as an aggregate across selected tables, or at the database level.
Action: Continuously monitor table bloat as bytes and percentage, alert if values exceed a set threshold, VACUUM as necessary.
3. Sequential Scans
When queries are executed that do not optimally use the available indexes, or if the statistical information associated with a table is too out-of-date, Postgres might end up having to go through each row of the table. This is called a sequential scan, and is not very desirable in the general case. What would be better to have is an index scan, where the rows of a table are accessed indirectly via index lookup.
PostgreSQL can tell you how many times a table was scanned sequentially, and how many times an index was used. You can use this to monitor either the number of sequential scans (if you’d like to avoid them totally) or as a percentage of total scans.
Action: Continuously monitor the seq. scans count or percentage, alert if value exceeds a set threshold.
1. Index Size
Indexes can take up considerable disk space. Each index of a table can itself potentially have as much disk footprint as the table itself. It is useful to keep an eye on the total size of indexes in a database, or the indexes of important tables, especially in deployments where indexes can get created through automatic processes.
Unreasonably large indexes can be because of bloat, or just a badly designed index. In either case fixing the cause (either by rebuilding the index or by refactoring the query/index) can yield faster query times, so it is worth investigating large indexes.
Action: Monitor the total size of interesting indexes over each week/month, investigate when unreasonably large.
2. Index Bloat
Indexes can get bloated too. There are far too many factors, including table workload, index type, Postgres version and more, that decides how bloated an index becomes. Bloated indexes can slow down inserts and reduce lookup performance. Monitor the bloat of indexes as both an absolute value (number of bytes) and as a percentage. Indexes will have to be rebuilt when they become too bloated.
Action: Continuously monitor index bloat as bytes and percentage, alert if values exceed a set threshold.
3. Cache Hit Ratio
PostgreSQL caches frequently accessed regions of indexes (and also tables) in memory. If you have tuned your queries not to touch the tables except for retrieving rows, the next step is to ensure maximum cache residency for those important indexes that are really speeding up your queries.
The cache utilization of an index can be measured by the cache hit ratio, which is the percentage of blocks of the index that were read from the cache to the total number of blocks read.
Action: Continuously monitor the cache hit ratio percentage, alert if value falls below a set threshold. Investigate low values for important indexes.
Apart from PostgreSQL metrics, it is important to keep track of a few metrics of the machine or VM that you are running your Postgres on. You can use any popular monitoring solution for this, or even grab and track them yourself.
1. Used Memory
Modern Linux systems have complex memory accounting. We recommend monitoring the “used memory”, which is the memory left after accounting for memory marked as free, as buffers, as cached, and as slab. Buffers and cache will give way under pressure, and so will most (typically over 95%) of slab.
The used memory, however, must be measured over a suitably long period. If you have batch jobs, reports, ETL etc. that run on weekends, then the period would be one week. The real metric you’ll need to monitor is the maximum used memory over this period.
Typically, as your database size grows, this value tends to creep up. You’ll have to ensure the maximum memory usage is within a comfortable limit of available memory, like say 60-80%. Neglecting this would cause your analytics/OLAP workloads to fail for lack of memory.
Action: Monitor the maximum used memory over a week/fornight, alert if it exceeds a set threshold, reprovision.
The used memory is given by
MemUsed = MemTotal - MemFree - MemBuffers - MemCached - MemSlab,
Mem* fields are from
/proc/meminfo. For more information, see
this RedHat article.
2. Load Average
The simple load average value is still the easiest and quickest way to estimate the load on a server. This is especially true for Postgres servers, as each backend is a separate OS process, and having more of them in a runnable state will increase the load average value.
For a given Postgres server, the load average should stay within a reasonable range over a business cycle (like one week, including batch job runs).
Action: Monitor the maximum load average over each day/week, investigate increasing trends.
The 1 minute, 5 minute and 15 minute load averages are the first 3 fields of
the first line in the file
3. Free Disk Space
The last item in our list is the most obvious item to monitor: the amount of free disk space in each filesystem used by your PostgreSQL server, including tablespaces, WAL file directories, backup directories and server log files. In cases where too many (100s of millions) files get created in a single filesystem, ensure that the free inode count is also monitored. Lack of inodes are also reported as low disk space.
Action: Continuously monitor free disk space and inode usage on all relevant filesystems, alert if values fall below a set threshold.
Free disk space in a filesystem is not retrievable directly by reading any
/proc. You can use
stat -f /path/to/mount or even
df to find
out the used, free and reserved disk space for a specific, mounted filesystem.
Here is a full list of all the metrics we have discussed so far in this series. Remember that these are only the minimal, most essential, set of metrics that you must monitor in order to detect when things are about to go wonky with your PostgreSQL deployment.
- Transaction ID Range
- Number of Backends
- Inactive Replication Slots
- Backends Waiting on Locks
- Backends Idling in Transaction
- Replication Lag for Active Connections
- Replication Lag for Replication Slots
- WAL File Count
- Ready-to-archive WAL file count
- Connected Clients
- Table Bloat Across All Tables
- Index Bloat Across All Indexes
- Long Running Transactions
- Oldest Vacuum
- Oldest Analyze
- Table Size
- Table Bloat
- Sequential Scans
- Index Size
- Index Bloat
- Cache Hit Ratio
- Used Memory
- Load Average
- Free Disk Space
Collecting These Metrics
The sections above provide SQL statements to extract the needed metrics from a running Postgres server. If you’d rather not write the scripts yourself, check out the open source tool pgmetrics. It can collect the metrics above, and more, and report them in text and JSON formats.
You can directly send the pgmetrics reports to our commercial offering, pgDash, which stores and processes these reports to display graphs and perform alerting.
pgDash is a modern, in-depth monitoring solution designed specifically for PostgreSQL deployments. pgDash shows you information and metrics about every aspect of your PostgreSQL database server, collected using the open-source tool pgmetrics.
pgDash provides core reporting and visualization functionality, including collecting and displaying PostgreSQL information and providing time-series graphs, detailed reports, diagnostics, alerting, teams and more. Checkout the features here or signup today for a free trial.