    Fix control file update done in restartpoints still running after promotion · 6dced63b
    Michael Paquier authored
    If a cluster is promoted (that is, the control file shows a state
    different from DB_IN_ARCHIVE_RECOVERY) while CreateRestartPoint() is
    still running, this function could skip the control file update for
    "checkPoint" and "checkPointCopy" but still recycle and/or remove past
    WAL segments, using as reference points for the cleanup the LSN values
    that should have been written to the control file.  This causes a
    follow-up restart attempting crash recovery to fail with a PANIC on a
    missing checkpoint record if the cluster stopped abruptly or crashed
    before the end-of-recovery checkpoint triggered by the promotion
    completed.  The PANIC is caused by the redo LSN referred to in the
    control file being located in a segment that is already gone, recycled
    by the previous restartpoint while "checkPoint" was out of sync in the
    control file.
    
    This commit fixes the update of the control file during restartpoints so
    that "checkPoint" and "checkPointCopy" are updated even if the cluster
    has been promoted while a restartpoint is running, keeping them in line
    with the set of WAL segments actually recycled at the end of
    CreateRestartPoint().
    
    7863ee4 already fixed this problem on master, but the release timing
    of the latest point versions did not leave me enough time to study and
    fix it on all the stable branches.
    
    Reported-by: Fujii Masao, Rui Zhao
    Author: Kyotaro Horiguchi
    Reviewed-by: Nathan Bossart, Michael Paquier
    Discussion: https://postgr.es/m/20220316.102444.2193181487576617583.horikyota.ntt@gmail.com
    Backpatch-through: 10