• Tom Lane's avatar
    Fix multiple problems in WAL replay. · 3bbf668d
    Tom Lane authored
    Most of the replay functions for WAL record types that modify more than
    one page failed to ensure that those pages were locked correctly to ensure
    that concurrent queries could not see inconsistent page states.  This is
    a hangover from coding decisions made long before Hot Standby was added,
    when it was hardly necessary to acquire buffer locks during WAL replay
    at all, let alone hold them for carefully-chosen periods.
    
    The key problem was that RestoreBkpBlocks was written to hold lock on each
    page restored from a full-page image for only as long as it took to update
    that page.  This was guaranteed to break any WAL replay function in which
    there was any update-ordering constraint between pages, because even if the
    nominal order of the pages is the right one, any mixture of full-page and
    non-full-page updates in the same record would result in out-of-order
    updates.  Moreover, it wouldn't work for situations where there's a
    requirement to maintain lock on one page while updating another.  Failure
    to honor an update ordering constraint in this way is thought to be the
    cause of bug #7648 from Daniel Farina: what seems to have happened there
    is that a btree page being split was rewritten from a full-page image
    before the new right sibling page was written, and because lock on the
    original page was not maintained it was possible for hot standby queries to
    try to traverse the page's right-link to the not-yet-existing sibling page.
    
    To fix, get rid of RestoreBkpBlocks as such, and instead create a new
    function RestoreBackupBlock that restores just one full-page image at a
    time.  This function can be invoked by WAL replay functions at the points
    where they would otherwise perform non-full-page updates; in this way, the
    physical order of page updates remains the same no matter which pages are
    replaced by full-page images.  We can then further adjust the logic in
    individual replay functions if it is necessary to hold buffer locks
    for overlapping periods.  A side benefit is that we can simplify the
    handling of concurrency conflict resolution by moving that code into the
    record-type-specfic functions; there's no more need to contort the code
    layout to keep conflict resolution in front of the RestoreBkpBlocks call.
    
    In connection with that, standardize on zero-based numbering rather than
    one-based numbering for referencing the full-page images.  In HEAD, I
    removed the macros XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4.  They are
    still there in the header files in previous branches, but are no longer
    used by the code.
    
    In addition, fix some other bugs identified in the course of making these
    changes:
    
    spgRedoAddNode could fail to update the parent downlink at all, if the
    parent tuple is in the same page as either the old or new split tuple and
    we're not doing a full-page image: it would get fooled by the LSN having
    been advanced already.  This would result in permanent index corruption,
    not just transient failure of concurrent queries.
    
    Also, ginHeapTupleFastInsert's "merge lists" case failed to mark the old
    tail page as a candidate for a full-page image; in the worst case this
    could result in torn-page corruption.
    
    heap_xlog_freeze() was inconsistent about using a cleanup lock or plain
    exclusive lock: it did the former in the normal path but the latter for a
    full-page image.  A plain exclusive lock seems sufficient, so change to
    that.
    
    Also, remove gistRedoPageDeleteRecord(), which has been dead code since
    VACUUM FULL was rewritten.
    
    Back-patch to 9.0, where hot standby was introduced.  Note however that 9.0
    had a significantly different WAL-logging scheme for GIST index updates,
    and it doesn't appear possible to make that scheme safe for concurrent hot
    standby queries, because it can leave inconsistent states in the index even
    between WAL records.  Given the lack of complaints from the field, we won't
    work too hard on fixing that branch.
    3bbf668d
gistxlog.c 14.6 KB