back to listing index

Database Soup: PostgreSQL needs a new load balancer

[web search]
Original source (
Clipped on: 2013-02-10

Tuesday, March 27, 2012

PostgreSQL needs a new load balancer

We do a lot of high-availablity PostgreSQL deployments for clients.  As of 9.1, streaming replication (SR) is excellent for this, and can scale sufficiently to scale a client across and AWS node cluster with some simple tools to help manage it.  But where we're keenly feeling the lack is simple load balancing and failover.

We use pgPool a lot for this, and once you've set it up it works.  But pgPool suffers from runaway sporkism: it's a load balancer and a failover tool and multimaster replication and data partitioning and a cache and compatible with SR and Slony and Bucardo.  If you need all of those things, it's great, but if you only need one of them you still get the complexity of all of them.  It also suffers from having been designed around the needs of specific SRA customers, and not really for general use.  We've been tinkering with it for a while, and I just don't see a way to "fix" pgPool for the general use case; we need something new, designed to be simple and limited to the 80% use-case from scratch.

What we really need is simple tool, or rather a pair of tools, designed to only do failover and read load-balancing, and only for PostgreSQL streaming replication.  These tools should be stackable in any combination the user wants, like pgBouncer (and, for that matter, with pgBouncer).  They should provide information via a web service, and be configurable via a web service.

I'll call the first tool "pgFailover".  The purpose of this tool is to manage a master-replica group of servers, including managing both planned and unplanned failovers.  You would only have one active pgFailover node within a group in order to avoid "split-brain" issues.  It will not handle database connections at all.

pgFailover would track a master and several replicas.  The status of each server would be monitored by polling both the replication information on each server, and pg_stat_replication on the master.  This information would be provided to the network by pgFailover via a web service, and pgFailover would accept commands via the same webservice as well as on the local command line.

Based on user-configurable criterial, pgFailover would carry out any of the following operations: failover to a new master; remaster the other replicas; add a new replica, with or without data sync; resync a replica; or shut down a replica.  It would also handle some situations automatically.  In the event that user-configurable conditions of nonresponsiveness are met, it would fail over the master to the "best" replica. The failover replica would be decided based on either the configuration or based on which replica is most caught up according to replication timestamps.  Likewise, replicas would be dropped from the availability list if they stop replicating for a certain period or become nonresponsive. 

The second tool I'll call "pgBalancer", after the unreleased tool from Skype.  pgBalancer would just do load-balancing of database connections across the replicated servers.  It won't deal with failover or monitoring the servers at all; instead, it relies entirely on pgFailover for that.  This allows users to have several separate pgBalancer servers, supporting both high availabilty and complex load-balancing configurations.

Since automated separation of read and write queries is an impossible problem, we won't even try to solve it.  Instead, we'll rely on the idea that the application knows whether it's reading or writing, and provide two separate database connection ports, one for read-write (RW) connections, and one for read-only (RO) connections.  pgBalancer would obtain information on what servers to connect to by either a configuration file, or by querying a pgFailover server for information via its web service.  RO connections would be load-balanced across available servers, based on a simple "least active" plus "round-robin" algorithm.

pgBalancer would also accept a variety of commands via web service, including: suspend a service or all services; disconnect a specific connection or all connections; failover the write node to specific new server; drop a server from the load balancing list; add a server to the load balancing list; give a list of connections; give a list of servers.

If we could build two tools which more-or-less match the specification above, I think they would go a lot further towards supporting the kinds of high-availability, horizontally-scaled PostgreSQL server clusters we want to use for the applications of the future. Time permitting, I'll start work on pgFailover.  A geek can dream, can't he?


  1. Image (Asset 1/7) alt=
    About failover, I am currently working on a OCF resource agent for Pacemaker which would be able to track slaves status and particulary their lag with the master. In case of automatic failover, Pacemaker would then promote the best candidate.

    In PostgreSQL context, Pacemaker sounds really promising and powerfull. I hope I'll be able to have a PoC in one or two week and I'll probably write some docs and a conference at some point.

  2. Image (Asset 2/7) alt=
    Actually, the more sophisticated network load balancers do work for this, except that I have yet to find one which can work together, in an automated way, with a failover tool. Maybe there's something I haven't tried; suggestions?

  3. Image (Asset 3/7) alt=
    Yeah, I've been over this at several different production sites. I don't see any way out of having failing over to a new pgFailover node being manual. There's simply no way for pgFailover to tell the difference between an actual failure of the primary, and a network partition.

    Now, in pgFailover 2.0 (or 3.0), I can imagine having a passive secondary pgFailover node which constantly read status data from the primary node, so that it would be completely up-to-date when the manual failover is triggered.

  • Image (Asset 4/7) alt=
    In Japan, I've heard that many MySQL guys have been using Linux-LVS with keepalived to enable load balancing in MySQL slave nodes. Theoretically speaking, it should be working with PostgreSQL, but I've not finished PoC yet. I'm going to work for the PoC in the next or two months.

    I believe you also can learn something from MHA for MySQL to deal with failover stuff in PostgreSQL.

  • Image (Asset 5/7) alt=
    However, what I really want to move away from are hackish solutions which work some of the time and put a huge configuration/setup burden on admins. We really need something which is "install this RPM, edit this simple config file, and go".

  • Image (Asset 6/7) alt=
    I do think having a utility written in a popular interpreted language (e.g. python) and controllable via web service has a fair amount of value in addition to what repmgr currently provides. On the other hand, so does maximizing my uncompensated time ...

  • Image (Asset 7/7) alt=