• Peter Geoghegan's avatar
    Fix replica backward scan race condition. · 9a9db08a
    Peter Geoghegan authored
    It was possible for the logic used by backward scans (which must reason
    about concurrent page splits/deletions in its own peculiar way) to
    become confused when running on a replica.  Concurrent replay of a WAL
    record that describes the second phase of page deletion could cause
    _bt_walk_left() to get confused.  btree_xlog_unlink_page() simply failed
    to adhere to the same locking protocol that we use on the primary, which
    is obviously wrong once you consider these two disparate functions
    together.  This bug is present in all stable branches.
    
    More concretely, the problem was that nothing stopped _bt_walk_left()
    from observing inconsistencies between the deletion's target page and
    its original sibling pages when running on a replica.  This is true even
    though the second phase of page deletion is supposed to work as a single
    atomic action.  Queries running on replicas raised "could not find left
    sibling of block %u in index %s" can't-happen errors when they went back
    to their scan's "original" page and observed that the page has not been
    marked deleted (even though it really was concurrently deleted).
    
    There is no evidence that this actually happened in the real world.  The
    issue came to light during unrelated feature development work.  Note
    that _bt_walk_left() is the only code that cares about the difference
    between a half-dead page and a fully deleted page that isn't also
    exclusively used by nbtree VACUUM (unless you include contrib/amcheck
    code).  It seems very likely that backward scans are the only thing that
    could become confused by the inconsistency.  Even amcheck's complex
    bt_right_page_check_scankey() dance was unaffected.
    
    To fix, teach btree_xlog_unlink_page() to lock the left sibling, target,
    and right sibling pages in that order before releasing any locks (just
    like _bt_unlink_halfdead_page()).  This is the simplest possible
    approach.  There doesn't seem to be any opportunity to be more clever
    about lock acquisition in the REDO routine, and it hardly seems worth
    the trouble in any case.
    
    This fix might enable contrib/amcheck verification of leaf page sibling
    links with only an AccessShareLock on the relation.  An amcheck patch
    from Andrey Borodin was rejected back in January because it clashed with
    btree_xlog_unlink_page()'s lax approach to locking pages.  It now seems
    likely that the real problem was with btree_xlog_unlink_page(), not the
    patch.
    
    This is a low severity, low likelihood bug, so no backpatch.
    
    Author: Michail Nikolaev
    Diagnosed-By: Michail Nikolaev
    Discussion: https://postgr.es/m/CANtu0ohkR-evAWbpzJu54V8eCOtqjJyYp3PQ_SGoBTRGXWhWRw@mail.gmail.com
    9a9db08a
nbtxlog.c 31 KB