    Fix possible HOT corruption when RECENTLY_DEAD changes to DEAD while pruning. · dad1539a
    Andres Freund authored
    Since dc7420c2 the horizon used for pruning is determined "lazily". A more
    accurate horizon is built on-demand, rather than in GetSnapshotData(). If a
    horizon computation is triggered between two HeapTupleSatisfiesVacuum() calls
    for the same tuple, the result can change from RECENTLY_DEAD to DEAD.
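
    The effect described above can be sketched with a toy model. Everything in
    the block below is hypothetical (the Toy* types, toy_htsv(),
    toy_refresh_horizon()); it is not PostgreSQL's actual horizon code, and it
    ignores xid wraparound. It only shows how an on-demand horizon update
    between two checks can flip the answer for the same tuple.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-ins, not PostgreSQL's real definitions. */
typedef uint32_t ToyXid;

typedef enum
{
    TOY_RECENTLY_DEAD,
    TOY_DEAD
} ToyStatus;

typedef struct ToyHorizon
{
    ToyXid horizon;   /* cheap, conservative bound currently in use */
    ToyXid accurate;  /* what an on-demand recomputation would yield */
} ToyHorizon;

/* Hypothetical "lazy" refresh: replace the bound with the accurate one. */
void
toy_refresh_horizon(ToyHorizon *h)
{
    /* In this model the accurate horizon never moves backwards. */
    assert(h->accurate >= h->horizon);
    h->horizon = h->accurate;
}

/*
 * Simplified visibility check for a deleted tuple: it is DEAD once its
 * deleting xid is older than the horizon, RECENTLY_DEAD otherwise.
 */
ToyStatus
toy_htsv(ToyXid xmax, const ToyHorizon *h)
{
    return (xmax < h->horizon) ? TOY_DEAD : TOY_RECENTLY_DEAD;
}
```

    With horizon = 100 and accurate = 200, a tuple deleted by xid 150 is
    RECENTLY_DEAD on the first check and DEAD on a second check made after
    toy_refresh_horizon() has run.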
    
    heap_page_prune() can process the same tid multiple times (once following an
    update chain, once "directly"). When the result of HeapTupleSatisfiesVacuum()
    of a tuple changes from RECENTLY_DEAD during the first access, to DEAD in the
    second, the "tuple is DEAD and doesn't chain to anything else" path in
    heap_prune_chain() can end up marking the target of a LP_REDIRECT ItemId
    unused.
    
    The corruption is initially not easily visible. Once the target of a
    LP_REDIRECT ItemId is marked unused, a new tuple version can reuse it. At
    that point the corruption may become visible, as index entries pointing to
    the "original" redirect item now point to an unrelated tuple.
    
    To fix, compute HTSV for all tuples on a page only once. This fixes the entire
    class of problems of HTSV changing inside heap_page_prune(). However,
    visibility changes can obviously still occur between HTSV checks inside
    heap_page_prune() and outside (e.g. in lazy_scan_prune()).
    
    The computation of HTSV is now done in bulk, in heap_page_prune(), rather than
    on-demand in heap_prune_chain(). Besides being a bit simpler, it also is
    faster: Memory accesses can happen sequentially, rather than in the order of
    HOT chains.
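
    A minimal sketch of the batching idea follows, under simplifying
    assumptions. The Sketch* types and sketch_* functions are made up for
    illustration and do not mirror the real structures in pruneheap.c. The
    status of every item is computed in one sequential pass, and later chain
    processing only reads the cached value.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_ITEMS 8

typedef enum
{
    SK_LIVE,
    SK_RECENTLY_DEAD,
    SK_DEAD
} SketchStatus;

/* Hypothetical page: one deleting xid per item, 0 meaning "not deleted". */
typedef struct SketchPage
{
    int      nitems;
    uint32_t xmax[SKETCH_MAX_ITEMS];
} SketchPage;

/* Per-page prune state holding the cached visibility results. */
typedef struct SketchPruneState
{
    SketchStatus htsv[SKETCH_MAX_ITEMS];
} SketchPruneState;

/* Simplified visibility check (ignores xid wraparound). */
SketchStatus
sketch_htsv(uint32_t xmax, uint32_t horizon)
{
    if (xmax == 0)
        return SK_LIVE;
    return (xmax < horizon) ? SK_DEAD : SK_RECENTLY_DEAD;
}

/* Compute the status of every item exactly once, in item order. */
void
sketch_compute_all(const SketchPage *page, uint32_t horizon,
                   SketchPruneState *st)
{
    assert(page->nitems <= SKETCH_MAX_ITEMS);
    for (int i = 0; i < page->nitems; i++)
        st->htsv[i] = sketch_htsv(page->xmax[i], horizon);
}

/* Chain processing consults the cache and never recomputes. */
SketchStatus
sketch_cached_status(const SketchPruneState *st, int item)
{
    return st->htsv[item];
}
```

    Because sketch_cached_status() never consults the horizon, an item visited
    twice (once via its chain, once directly) gets the same answer both times,
    even if the horizon advances after sketch_compute_all() has run.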
    
    There are other causes of HeapTupleSatisfiesVacuum() results changing between
    two visibility checks for the same tuple, even before dc7420c2. E.g.
    HEAPTUPLE_INSERT_IN_PROGRESS can change to HEAPTUPLE_DEAD when a transaction
    aborts between the two checks. None of these other visibility status
    changes are known to cause corruption, but heap_page_prune()'s approach makes
    it hard to be confident.
    
    A patch implementing a more fundamental redesign of heap_page_prune(), which
    fixes this bug and simplifies pruning substantially, has been proposed by
    Peter Geoghegan in
    https://postgr.es/m/CAH2-WzmNk6V6tqzuuabxoxM8HJRaWU6h12toaS-bqYcLiht16A@mail.gmail.com
    
    However, that redesign is a larger change than desirable for backpatching. As
    the new design still benefits from the batched visibility determination
    introduced in this commit, it makes sense to commit this narrower fix to 14
    and master, and then commit Peter's improvement in master.
    
    The precise sequence required to trigger the bug is complicated and hard to
    exercise in an isolation test (until we have wait points). Due to that the
    isolation test initially posted at
    https://postgr.es/m/20211119003623.d3jusiytzjqwb62p%40alap3.anarazel.de
    and updated in
    https://postgr.es/m/20211122175914.ayk6gg6nvdwuhrzb%40alap3.anarazel.de
    isn't committable.
    
    A followup commit will introduce additional assertions, to detect problems
    like this more easily.
    
    Bug: #17255
    Reported-By: Alexander Lakhin <exclusion@gmail.com>
    Debugged-By: Andres Freund <andres@anarazel.de>
    Debugged-By: Peter Geoghegan <pg@bowt.ie>
    Author: Andres Freund <andres@anarazel.de>
    Reviewed-By: Peter Geoghegan <pg@bowt.ie>
    Discussion: https://postgr.es/m/20211122175914.ayk6gg6nvdwuhrzb@alap3.anarazel.de
    Backpatch: 14-, the oldest branch containing dc7420c2
pruneheap.c 32 KB