PostgreSQL comes with various ways of replicating data between servers, along with configuration options to tweak and tune the replication process to fit your needs. Regardless of what type of replication you set up, it is important to monitor all endpoints of replication to ensure that your data is safe and sound.
Read on to learn more about monitoring replication in Postgres.
Types of Replication
Broadly, there are 2 types of replication in PostgreSQL – physical and logical. Here is a quick overview picture:
Let’s have a quick look at each before we see how to monitor them. In the text below, per convention, the term primary refers to the master or originating server, and standby refers to the slave or receiving server.
Physical Replication
Physical replication has been available in core Postgres for more than a decade. It allows you to replicate the data in a primary server byte for byte into a standby server.
Essentially, physical replication relies on sending WAL (write ahead log) files generated by the primary over to the secondary. When changes happen on the primary server, these are serialized and flushed to disk as a series of WAL files. These files can be “replayed” to get the familiar SQL-structured representation of data. Even the primary server itself writes the record first, then a bit later, replays the record and updates the SQL-structured representation.
Therefore, it stands to reason that you can send over these WAL files to another server and have it replay them instead. And that’s exactly how physical replication works.
The standby server is setup not to accept changes via the normal DDL/DML statements but only via accepting and replaying WAL files. A server configured this way is said to be in recovery mode.
A standby in recovery mode will not (cannot) perform changes to its data using normal SQL transactions. You can however allow read-only SQL transactions on it by making it a hot standby. Because the original changes at the primary have to be brought to the standby and replayed, the queries at the hot standby will potentially see an earlier version of the data. This is the replication lag. The replication lag at each standby is one of the things you should be monitoring.
Log Shipping
Since WAL files are actual, simple physical files typically 16MB in size and
living in the pg_wal (or pg_xlog prior to v10) subdirectory of $PGDATA, you
can simply copy them over to another server with scp
or via NFS or the like.
The standby server is setup to accept these WAL files using the restore_command or dropping them in the pg_wal directory.
Streaming Replication
Rather than copying WAL files through scripts, streaming replication allows the standby to establish a connection the primary and “stream” the contents of the WAL file over this connection.
Streaming replication is not only much cleaner, but also has the advantage of smaller replication lags for low-TPS servers. This is because the WAL files in turn contain smaller records arranged sequentially – in streaming replication each record can be streamed when created, whereas in log shipping you’ll have to wait for the whole WAL file to be assembled, which may take a while.
Streaming replication was introduced in Postgres 9.0.
Physical Replication Slots
In Postgres 9.4, replication slots were introduced. A process receiving changes via streaming replication can create a replication slot on the primary server. The primary will then use this slot to store the location up to where the changes have been sent to the receiver.
If the receiver were to disconnect for a long while and then reconnect, the primary would now retain those WAL files that are required for the receiver to catch up.
Streaming replication is possible with and without the use of replication slots. Using replication slots of course, makes it more reliable.
Logical Replication
Logical replication was introduced in core PostgreSQL in version 10. Unlike physical replication, logical replication brings over only the logical, or SQL-like changes. It aims to be the foundation for currently unavailable features like multi-master replication and zero-downtime major-version upgrades.
Logical replication works only via replication slots. A replication slot can be created for either physical replication or logical replication.
The logical replication data received by a process has to be “decoded” using a “plugin”. For more information see the docs and the wiki.
Subscriptions
The publication / subscription feature built into Postgres 10 is a means of using logical replication. In a way, this feature is a “UI” for logical replication.
Replication Topology
PostgreSQL allows sending one primary’s changes to multiple standbys and even cascading them further. This allows for complex replication topologies (check out the article Replication Topologies in PostgreSQL to learn more).
There are however, a couple of simple rules to keep the story straight:
- a server is either in recovery mode (that is, it is a standby) or not
- a standby will receive incoming WALs and replay them
- a server can send it’s own WALs to 0 or more other servers
These rules are primarily for physical replication. For logical replication, changes can be sent from anywhere to anywhere and they are monitored only via their replication slots.
Monitoring Standby Servers
A standby server has incoming replication, and remains in recovery mode so that it can replay the WALs/records as they come in.
Monitoring Incoming Replication
The stats related to the incoming replication can be monitored using pg_stat_wal_receiver view (available in v9.6 and above). On a standby, it contains values similar to:
This shows that there is a single “WAL receiver” process with PID 2084, which
is currently active, and has received data from 0/3000000
up to 0/11233E20
.
The timeline ID
was 1 when it started, and is 1 even now. The difference between the send and
receipt times of the last message indicates the latency of the connection,
which in this case is less than 1 millisecond. The difference between the two
LSNs (start and end) shows the amount of data received over the connection
(0x11233E20 - 0x03000000 = about 226 MiB). It also says that this connection
does not use a replication slot – if it did, the “slot_name” column would not
be null.
Here is another standby which does use a replication slot:
Monitoring Recovery Status
The pg_stat_wal_receiver only gives stats about the receiving of WAL records, not of it’s replay. For that, we need to use:
This shows that the last received WAL position is ahead of the last replayed WAL position, and there is a replay lag of 0/1A000000 - 0/19FFE928 = 5848 bytes. If there is a replay lag, the difference between the time the query was run and the last transaction replay timestamp is a measure of the replay lag in units of time.
Monitoring Primary Servers
There are two type of things to monitor on primary servers – outgoing replication connections and replication slots.
Monitoring Outgoing Replication
The system view pg_stat_replication returns one row for each open outgoing replication connection.
We can see the WAL locations corresponding to send, write, flush
and replay. In Postgres 10, the write, flush and replay lags are also
available. The value of async
for sync_state shows that the clients are
not synchronous standbys.
Monitoring Physical Replication Slots
The view pg_replication_slots gives one row for each replication slot in the primary. Unlike streaming replication connections, replication slots have to be created and deleted explicitly. When there is an associated client connection, slots are considered active. Inactive slots are dangerous, because the server will retain the WAL files required for that slot forever, at the risk of filling up the disk.
Replicaton slots can also be temporary, which will go away when the replication client connection goes away. Newer versions of pg_basebackup use temporary replication slots.
The physical replication slot “myreplslot1” (value of slot_type is physical
) is active,
not temporary and it’s client has received up to 0/3B44A5D8
.
Monitoring Logical Replication Slots
The logical replication slot “mylogslot” (value of slot_type is logical
)
is also active and permanent. It’s client has received up to 0/3B44A5A0
and has
confirmed that there is no flush lag. The plugin used is “test_decoding” (
because the client for this setup was pg_recvlogical).
Monitoring Using pgmetrics
The open-source tool pgmetrics can query Postgres
servers and produce detailed reports, including for replication. You can
invoke pgmetrics
pretty much like psql
, with the same command-line options
and environment variables.
Here is the pgmetrics output for a primary replicating to two physical standbys and one logical receiver. It also shows the physical and logical replication slots:
On the receiving end, it shows the recovery and the incoming replication status, including the lags in both bytes and time where relevant:
pgmetrics can also generate the results in JSON format, which can be used in scripts for monitoring or alerting.
Monitoring Using pgDash
pgDash is an in-depth monitoring solution designed specifically for PostgreSQL deployments. pgDash shows you information and metrics about every aspect of your PostgreSQL database server, collected using the open-source tool pgmetrics.
In addition displaying the data interactively, it also stores metrics extracted from the pgmetrics report as timeseries data and displays them graphically. Here is how it displays the replication information on the primary side:
And for the standby server:
pgDash provides core reporting and visualization functionality, including collecting and displaying PostgreSQL information and providing time-series graphs and detailed reports. We’re actively working to enhance and expand pgDash to include alerting, baselines, teams, and more.