    Fix WAL replay in presence of an incomplete record · 64a8687a
    Alvaro Herrera authored
    Physical replication always ships WAL segment files to replicas once
    they are complete.  This is a problem if one WAL record is split across
    a segment boundary and the primary server crashes before writing down
    the segment with the next portion of the WAL record: WAL writing after
    crash recovery would happily resume at the point where the broken record
    started, overwriting that record ... but any standby or backup may have
    already received a copy of that segment, and they are not rewinding.
    This causes standbys to stop following the primary after the latter
    crashes:
      LOG:  invalid contrecord length 7262 at A8/D9FFFBC8
    because the standby is still trying to read the continuation record
    (contrecord) for the original long WAL record, but it is not there and
    it will never be.  A workaround is to stop the replica, delete the WAL
    file, and restart it -- at which point a fresh copy is brought over from
    the primary.  But that's pretty labor intensive, and I bet many users
    would just give up and re-clone the standby instead.
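
To make the boundary condition concrete, here is a minimal sketch that turns the LSN from the log line above into a byte position and computes how much room is left in its segment. It assumes the default 16 MB WAL segment size and that the LSN in the message is where the long record starts; it also ignores page headers. The helper names are made up for this illustration and are not PostgreSQL functions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: default WAL segment size of 16 MB (it can be changed at initdb). */
#define WAL_SEGMENT_SIZE ((uint64_t) 16 * 1024 * 1024)

/*
 * Parse an LSN printed as "HI/LO" (two hexadecimal halves) into a 64-bit
 * byte position.  Returns 0 on success, -1 if the text does not parse.
 */
static int
parse_lsn(const char *text, uint64_t *lsn)
{
    unsigned int hi;
    unsigned int lo;

    if (sscanf(text, "%X/%X", &hi, &lo) != 2)
        return -1;
    *lsn = ((uint64_t) hi << 32) | lo;
    return 0;
}

/* Bytes remaining in the segment that contains lsn, counted from lsn. */
static uint64_t
bytes_left_in_segment(uint64_t lsn)
{
    return WAL_SEGMENT_SIZE - (lsn % WAL_SEGMENT_SIZE);
}

/* Nonzero if a record of total_len bytes starting at lsn spills over. */
static int
crosses_segment_boundary(uint64_t lsn, uint64_t total_len)
{
    assert(total_len > 0);
    return total_len > bytes_left_in_segment(lsn);
}
```

For A8/D9FFFBC8 the offset within the segment is 0xFFFBC8, which leaves 0x438 = 1080 bytes before the segment ends. Under the assumptions above, any record longer than that must continue in the next segment, which is the one the crashed primary never wrote.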
    
    A fix for this problem was already attempted in commit 515e3d84a0b5, but
    it only addressed the WAL archiving case, so streaming replication would
    still be a problem (as would other things, such as taking a
    filesystem-level backup while the server is down after having crashed).
    It also had performance scalability problems, so it had to be reverted.
    
    This commit fixes the problem using an approach suggested by Andres
    Freund, whereby the initial portion(s) of the split-up WAL record are
    kept, and a special type of WAL record is written where the contrecord
    was lost, so that WAL replay in the replica knows to skip the broken
    parts.  With this approach, we can continue to stream/archive segment
    files as soon as they are complete, and replay of the broken records
    will proceed across the crash point without a hitch.
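
The replay-side idea can be sketched with a toy model. This is not the code from this commit: the record kinds, struct, and function below are hypothetical, and the "stream" is just an array. It only shows the rule described above, namely that a partial record may be skipped when a marker record naming it follows, and that replay gets stuck otherwise.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record kinds for the toy stream. */
typedef enum
{
    REC_COMPLETE,               /* ordinary, fully written record */
    REC_PARTIAL_HEAD,           /* first portion of a record whose rest was lost */
    REC_SKIP_MARKER             /* special record written where the rest was lost */
} RecKind;

typedef struct
{
    RecKind       kind;
    unsigned long start;        /* position of this entry in the toy stream */
    unsigned long broken_start; /* REC_SKIP_MARKER only: start of the broken record */
} ToyRecord;

/*
 * Replay entries in order.  Returns the number of complete records applied,
 * or -1 if replay cannot proceed (the situation a standby was in before).
 */
static int
toy_replay(const ToyRecord *recs, size_t n)
{
    int applied = 0;

    for (size_t i = 0; i < n; i++)
    {
        switch (recs[i].kind)
        {
            case REC_COMPLETE:
                applied++;
                break;

            case REC_PARTIAL_HEAD:
                /* Only safe to move on if the next entry names this record. */
                if (i + 1 >= n ||
                    recs[i + 1].kind != REC_SKIP_MARKER ||
                    recs[i + 1].broken_start != recs[i].start)
                    return -1;
                break;

            case REC_SKIP_MARKER:
                /* A marker is valid only right after the record it names. */
                if (i == 0 ||
                    recs[i - 1].kind != REC_PARTIAL_HEAD ||
                    recs[i - 1].start != recs[i].broken_start)
                    return -1;
                break;
        }
    }
    return applied;
}
```

In this model, a stream of complete, partial, marker, complete replays two records, while the same stream without the marker returns -1. A real standby would keep waiting for more WAL rather than fail immediately; the toy collapses that into an error for brevity.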
    
    Because a new type of WAL record is added, users should be careful to
    upgrade standbys first, primaries later. Otherwise they risk the standby
    being unable to start if the primary happens to write such a record.
    
    A new TAP test that exercises this is added, but its portability is yet
    to be seen.
    
    This has been wrong since the introduction of physical replication, so
    backpatch all the way back.  In stable branches, keep the new
    XLogReaderState members at the end of the struct, to avoid an ABI
    break.
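
Why appending matters can be checked with offsetof. The structs below are invented stand-ins, not the real XLogReaderState layout; the sketch only shows that members added at the end leave every existing offset alone, while members added in the middle shift the ones after them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout that already-compiled extensions were built against. */
typedef struct
{
    int   page_size;
    long  read_pos;
    char *errormsg_buf;
} ReaderV1;

/* New members appended at the end: existing members keep their offsets. */
typedef struct
{
    int   page_size;
    long  read_pos;
    char *errormsg_buf;
    long  aborted_pos;          /* new */
    long  missing_pos;          /* new */
} ReaderAppended;

/* New members inserted in the middle: later members move. */
typedef struct
{
    int   page_size;
    long  aborted_pos;          /* new */
    long  missing_pos;          /* new */
    long  read_pos;
    char *errormsg_buf;
} ReaderInserted;

#define SAME_OFFSET(old_t, new_t, member) \
    (offsetof(old_t, member) == offsetof(new_t, member))

/* Nonzero if every member of ReaderV1 sits at the same offset afterwards. */
static int
appended_layout_is_compatible(void)
{
    return SAME_OFFSET(ReaderV1, ReaderAppended, page_size) &&
           SAME_OFFSET(ReaderV1, ReaderAppended, read_pos) &&
           SAME_OFFSET(ReaderV1, ReaderAppended, errormsg_buf);
}

static int
inserted_layout_is_compatible(void)
{
    return SAME_OFFSET(ReaderV1, ReaderInserted, page_size) &&
           SAME_OFFSET(ReaderV1, ReaderInserted, read_pos) &&
           SAME_OFFSET(ReaderV1, ReaderInserted, errormsg_buf);
}
```

Appending still changes sizeof, so this only protects code that receives a pointer to a struct allocated by the server and reads the pre-existing members; code that allocates or embeds the struct itself would need a rebuild either way.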
    
    Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
    Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
    Reviewed-by: Nathan Bossart <bossartn@amazon.com>
    Discussion: https://postgr.es/m/202108232252.dh7uxf6oxwcy@alvherre.pgsql
xlogreader.h 11.1 KB