• Michael Paquier's avatar
    Fix a couple of bugs with replication slot advancing feature · f731cfa9
    Michael Paquier authored
    A review of the code has showed up a couple of issues fixed by this
    commit:
    - Physical slots have been using the confirmed LSN position as a start
    comparison point which is always 0/0, instead use the restart LSN
    position (logical slots need to use the confirmed LSN position, which
    was correct).
    - The actual slot update was incorrect for both physical and logical
    slots.  Physical slots need to use their restart_lsn as base comparison
    point (confirmed_flush was used because of previous point), and logical
    slots need to begin reading WAL from restart_lsn (confirmed_flush was
    used as well), while confirmed_flush is compiled depending on the
    decoding context and record read, and is the LSN position returned back
    to the caller.
    - Never return 0/0 if a slot cannot be advanced.  This way, if a slot is
    advanced while the activity is idle, then the same position is returned
    to the caller over and over without raising an error.  Instead return
    the LSN the slot has been advanced to.  With repetitive calls, the same
    position is returned hence caller can directly monitor the difference in
    progress in bytes by doing simply LSN difference calculations, which
    should be monotonic.
    
    Note that as the slot is owned by the backend advancing it, then the
    read of those fields is fine lock-less, while updates need to happen
    while the slot mutex is held, so fix that on the way as well.  Other
    locks for in-memory data of replication slots have been already fixed
    previously.
    
    Some of those issues have been pointed out by Petr and Simon during the
    patch, while I noticed some of them after looking at the code.  This
    also visibly takes of a recently-discovered bug causing assertion
    failures which can be triggered by a two-step slot forwarding which
    first advanced the slot to a WAL page boundary and secondly advanced it
    to the latest position, say 'FF/FFFFFFF' to make sure that the newest
    LSN is used as forward point.  It would have been nice to drop a test
    for that, but the set of operators working on pg_lsn limits it, so this
    is left for a future exercise.
    
    Author: Michael Paquier
    Reviewed-by: Petr Jelinek, Simon Riggs
    Discussion: https://postgr.es/m/CANP8+jLyS=X-CAk59BJnsxKQfjwrmKicHQykyn52Qj-Q=9GLCw@mail.gmail.com
    Discussion: https://www.postgresql.org/message-id/2840048a-1184-417a-9da8-3299d207a1d7%40postgrespro.ru
    f731cfa9
slotfuncs.c 14 KB