Distributing Data With PostgreSQL

PostgreSQL comes with the tools needed to move parts of your database off to other disks, other PostgreSQL servers, or even on to other systems like Oracle or MySQL.

Read on to see how you can handle common data distribution use cases with PostgreSQL.

Moving Tables to Other Disks

Want to move a few tables or indexes off to another disk because they’re getting too big? How about moving some indexes to a higher-IOPS EBS volume to speed up some of those queries? Tablespaces provide a way to do this, without having to let the application (or even the application developers!) know.

Initially, all of a Postgres server’s data lies within a single directory (called the data directory). Any SQL objects you create, including databases, live here. This “space” is default tablespace pg_default. You can tell PostgreSQL that it can use another location in the filesystem, say “/mnt/disk2/pgdata” also for creating SQL objects, like this:

CREATE TABLESPACE disk2 LOCATION '/mnt/disk2/pgdata';

Having created the tablespace, you can move tables and indexes into this space:

ALTER TABLE my_big_table SET TABLESPACE disk2;
ALTER INDEX frequent_idx SET TABLESPACE disk2;

PostgreSQL will lock the table or index for the duration of the move, and will physically move over the data from the old tablespace to the new. How long the operation will take will depend mainly on the read/write rates of the source and destination disks.

All regular SQL statements work fine regardless of which tablespace the objects belong to. You can mix objects from any tablespace in a transaction, there are no restrictions or reduction in ACID guarantees.

You can read more about tablespaces in this article and in the docs.

Replicating an Entire Server

Need a read replica for your entire server? Basically another server that is in realtime sync with the main server but can only be read from? Be able to run heavy daily analytics jobs without impacting the app? You need streaming replication.

This feature allows you to maintain a hot standby (Postgres jargon for read replica) of a primary (master) server. Streaming replication is an all-or-nothing proposition though – you get the entire database cluster and cannot pick and choose parts of it. This also means its ideal for maintaining a failover candidate, which can be promoted to master on short notice.

Here’s how to configure the PostgreSQL (v10) of an Ubuntu 18.04 to be a hot standby:

First, let’s create a replication slot at the primary:

[alice@devbox ~]$ sudo -u postgres psql
psql (10.4 (Ubuntu 10.4-0ubuntu0.18.04))
Type "help" for help.

postgres=# SELECT pg_create_physical_replication_slot('repl_slot_1');
 pg_create_physical_replication_slot
-------------------------------------
 (repl_slot_1,)
(1 row)

postgres=#

And then on the slave server, delete the old data and take a fresh backup from the master:

[alice@devbox2 ~]$ cd /var/lib/postgresql/10
[alice@devbox2 10]$ sudo systemctl stop postgresql
[alice@devbox2 10]$ sudo mv main main.old
[alice@devbox2 10]$ sudo -u postgres pg_basebackup -D main -R -S repl_slot_1 -h 192.168.0.100 -U alice

Assuming the user called “alice” can login with the replication protocol on the primary, you now have a consistent snapshot of the master. If we now startup the server on the slave, it should start, catch up and remain in sync with the master:

[alice@devbox2 10]$ sudo systemctl start postgresql
[alice@devbox2 10]$ tail /var/log/postgresql/postgresql-10-main.log
2018-07-10 08:40:53.806 UTC [2346] LOG:  listening on IPv6 address "::1", port 5432
2018-07-10 08:40:53.806 UTC [2346] LOG:  listening on IPv4 address "127.0.0.1", port 5432
2018-07-10 08:40:53.808 UTC [2346] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2018-07-10 08:40:53.823 UTC [2347] LOG:  database system was interrupted; last known up at 2018-07-10 08:40:29 UTC
2018-07-10 08:40:54.689 UTC [2347] LOG:  entering standby mode
2018-07-10 08:40:54.708 UTC [2347] LOG:  redo starts at 0/44000028
2018-07-10 08:40:54.711 UTC [2347] LOG:  consistent recovery state reached at 0/440000F8
2018-07-10 08:40:54.712 UTC [2346] LOG:  database system is ready to accept read only connections
2018-07-10 08:40:54.788 UTC [2351] LOG:  started streaming WAL from primary at 0/45000000 on timeline 1

You can read more about streaming replication here.

Streaming replication is a heavyweight solution, suitable for cases when you definitely want the whole data replicated. If you want only a part of it though, there are other easier options.

Accessing Tables in Other Postgres Servers

Let’s say you need to move a few tables to another server because they have become too big. Or maybe all the objects in a certain schema should be moved to another server as part of a migration. Foreign Data Wrappers (FDW) can be a solution, or an intermediate step in a longer migration plan.

FDW allows you to create virtual tables on a PostgreSQL server, that are proxies of the real ones on another. For example if there is a database “demodb” with the table “demotable” on the server 192.168.0.100, we can do this in another server:

[alice@devbox2 ~]$ psql
psql (10.4 (Ubuntu 10.4-0ubuntu0.18.04))
Type "help" for help.

alice=# CREATE EXTENSION postgres_fdw;
CREATE EXTENSION
alice=# CREATE SERVER demoserver FOREIGN DATA WRAPPER postgres_fdw
alice-#   OPTIONS (host '192.168.0.100', dbname 'demodb');
CREATE SERVER
alice=# CREATE USER MAPPING FOR alice SERVER demoserver
alice-#    OPTIONS (user 'alice');
CREATE USER MAPPING
alice=# IMPORT FOREIGN SCHEMA public LIMIT TO (demotable)
alice-#   FROM SERVER demoserver INTO public;
IMPORT FOREIGN SCHEMA
alice=# SELECT * FROM demotable;
  name   | diameter
---------+----------
 mercury |     4879
 venus   |    12104
 earth   |    12756
(3 rows)

First, the extension “postgres_fdw” is created with the CREATE EXTENSION command (this extension is present by default). Then a foreign server, known locally as “demoserver” is setup with it’s connection options. A user mapping is created to map the local user “alice” to the remove user “alice”. Password and other connection options can also be specified here. Finally, the table “demotable” is imported, and data is queries from it.

No special changes are required on the “source” server for FDW. Applications running locally will be able to execute regular SQL transactions without being aware that “public.demotable” is in fact physically present in another server.

The FDW feature is a generic way of accessing data stored in other systems. (Fun fact: it’s based on a SQL standard). You can use FDW to push and pull data from other systems too.

Pushing Data to Other Systems

Let’s say you have triggers that create records in an audit trail table, but you’d like those records to be pushed to say, a Cassandra cluster. Or you have some source-of-truth data in an Oracle database but the app can talk only to your Postgres server.

There are many existing FDW extensions which you can use to interface with systems like Cassandra or Oracle. You can also write your own to suit your needs.

Replicating Selected Tables

What if you want to replicate only one table to a different server? Or maybe a few tables to different datacenters for improved latency for local apps? You can do this with logical replication.

Logical replication was introduced in v10. It’s a lightweight replication mechanism, and can be used independently of streaming replication. This allows tables in independent servers to be kept in sync logically. The “logically” implies that the effect of replication is similar to what you’d get if you replayed the SQL commands at the destination servers – this allows features like the destination table having more columns that the source table.

Here’s how it works. First, at the source end, create a publication:

demodb=# create table planets (name text, diameter int, moons int);
CREATE TABLE
demodb=# insert into planets values ('mercury', 4879, 0), ('venus', 12104, 0), ('earth', 12756, 1);
INSERT 0 3
demodb=# create publication planetfeed for table planets;
CREATE PUBLICATION

The publication can be subscribed to anyone, and they can receive updates to the SQL objects (in this case the table “planets”) being made at the publisher’s end.

Here’s how you can subscribe to the publication:

alice=# create table planets (name text, diameter int, moons int);                                                                         CREATE TABLE
alice=# create subscription planetsub connection 'host=192.168.0.100 dbname=demodb user=alice' publication planetfeed;
NOTICE:  created replication slot "planetsub" on publisher
CREATE SUBSCRIPTION
alice=# select * from planets;
  name   | diameter | moons
---------+----------+-------
 mercury |     4879 |     0
 venus   |    12104 |     0
 earth   |    12756 |     1
(3 rows)

Logical replication allows for simpler, lightweight replication, and can be used in addition to streaming replication if required. Read more about it in the docs.

About pgDash

pgDash is an in-depth monitoring solution designed specifically for PostgreSQL deployments. pgDash shows you information and metrics about every aspect of your PostgreSQL database server, collected using the open-source tool pgmetrics.

pgDash provides core reporting and visualization functionality, including collecting and displaying PostgreSQL information and providing time-series graphs, detailed reports, alerting and more.