• Tom Lane's avatar
    Fix race conditions in synchronous standby management. · f332241a
    Tom Lane authored
    We have repeatedly seen the buildfarm reach the Assert(false) in
    SyncRepGetSyncStandbysPriority.  This apparently is due to failing to
    consider the possibility that the sync_standby_priority values in
    shared memory might be inconsistent; but they will be whenever only
    some of the walsenders have updated their values after a change in
    the synchronous_standby_names setting.  That function is vastly too
    complex for what it does, anyway, so rewriting it seems better than
    trying to apply a band-aid fix.
    
    Furthermore, the API of SyncRepGetSyncStandbys is broken by design:
    it returns a list of WalSnd array indexes, but there is nothing
    guaranteeing that the contents of the WalSnd array remain stable.
    Thus, if some walsender exits and then a new walsender process
    takes over that WalSnd array slot, a caller might make use of
    WAL position data that it should not, potentially leading to
    incorrect decisions about whether to release transactions that
    are waiting for synchronous commit.
    
    To fix, replace SyncRepGetSyncStandbys with a new function
    SyncRepGetCandidateStandbys that copies all the required data
    from shared memory while holding the relevant mutexes.  If the
    associated walsender process then exits, this data is still safe to
    make release decisions with, since we know that that much WAL *was*
    sent to a valid standby server.  This incidentally means that we no
    longer need to treat sync_standby_priority as protected by the
    SyncRepLock rather than the per-walsender mutex.
    
    SyncRepGetSyncStandbys is no longer used by the core code, so remove
    it entirely in HEAD.  However, it seems possible that external code is
    relying on that function, so do not remove it from the back branches.
    Instead, just remove the known-incorrect Assert.  When the bug occurs,
    the function will return a too-short list, which callers should treat
    as meaning there are not enough sync standbys, which seems like a
    reasonably safe fallback until the inconsistent state is resolved.
    Moreover it's bug-compatible with what has been happening in non-assert
    builds.  We cannot do anything about the walsender-replacement race
    condition without an API/ABI break.
    
    The bogus assertion exists back to 9.6, but 9.6 is sufficiently
    different from the later branches that the patch doesn't apply at all.
    I chose to just remove the bogus assertion in 9.6, feeling that the
    probability of a bad outcome from the walsender-replacement race
    condition is too low to justify rewriting the whole patch for 9.6.
    
    Discussion: https://postgr.es/m/21519.1585272409@sss.pgh.pa.us
    f332241a
syncrep.h 3.47 KB