    Update FSM on WAL replay of page all-visible/frozen · ab7dbd68
    Alvaro Herrera authored
    We aren't very strict about keeping FSM up to date on WAL replay,
    because per-page freespace values aren't critical in replicas (can't
    write to heap in a replica; and if the replica is promoted, the values
    would be updated by VACUUM anyway).  However, VACUUM since 9.6 can skip
    processing pages marked all-visible or all-frozen, and if such pages are
    recorded in FSM with wrong values, those values are blindly propagated
    to FSM's upper layers by VACUUM's FreeSpaceMapVacuum.  (This rationale
    assumes that crashes are not very frequent, because those would cause
    outdated FSM to occur in the primary.)
    
    Even when the FSM is outdated in standby, things are normally not too
    bad, because most per-page FSM values will be zero (other than those
    propagated with the base backup that created the standby); WAL replay
    of heap inserts, updates and deletes maintains the per-page value only
    once the remaining free space drops below 0.2*BLCKSZ.  However, if
    wal_log_hints=on causes complete FSM pages to be propagated to a standby
    via full-page images, many too-optimistic per-page values can end up
    being registered in the standby.
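As a rough illustration of why stale values survive replay, the sketch below models the 20% rule in C. It assumes the default 8 kB BLCKSZ and treats FSM values as exact byte counts (the real FSM stores coarser categories). The helpers `replay_should_record_fsm` and `replay_fsm_value` are hypothetical and are not part of heapam.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed default PostgreSQL block size; real builds may be configured differently. */
#define BLCKSZ 8192

/*
 * Hypothetical helper: replay of a heap insert/update/delete records the
 * page's free space in the FSM only once it drops below 20% of BLCKSZ.
 */
static bool replay_should_record_fsm(size_t freespace)
{
    return freespace < BLCKSZ / 5;
}

/*
 * Hypothetical helper: the per-page FSM value a standby holds after
 * replaying one such record.  old_fsm_value is whatever was there before:
 * zero, a value from the base backup, or one copied by a full-page image.
 */
static size_t replay_fsm_value(size_t old_fsm_value, size_t freespace)
{
    assert(freespace <= BLCKSZ);
    /* Above the threshold, the old (possibly too optimistic) value is kept. */
    return replay_should_record_fsm(freespace) ? freespace : old_fsm_value;
}
```

For example, suppose a full-page image copied a page's FSM entry as 6000 bytes. Replaying an insert that leaves 4000 bytes free does not touch that entry, because 4000 is above the 1638-byte threshold, so the 6000 survives.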
    
    Incorrect per-page values aren't critical in most cases, since an
    inserter that is given a page that doesn't actually contain the claimed
    free space will update FSM with the correct value, and retry until it
    finds a usable page.  However, if there are many such updates to do, an
    inserter can spend a long time doing them before a usable page is found;
    in a heavily trafficked insert-only table with many concurrent inserters
    this has been observed to cause stalls of several seconds and visible
    application malfunction.
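The retry behaviour can be sketched with a toy model in C. This is not PostgreSQL's FSM search, which walks a tree of maximum values. The `ToyHeap` type, `fsm_search`, and `find_page_for_insert` are hypothetical, and free space is again kept in plain bytes.

```c
#include <assert.h>
#include <stddef.h>

#define NPAGES 8

/* Hypothetical toy model: fsm[] is what the map claims, actual[] is reality. */
typedef struct {
    size_t fsm[NPAGES];
    size_t actual[NPAGES];
} ToyHeap;

/* Return the first page the FSM claims has room for `need` bytes, or -1. */
static int fsm_search(const ToyHeap *h, size_t need)
{
    for (int i = 0; i < NPAGES; i++)
        if (h->fsm[i] >= need)
            return i;
    return -1;
}

/*
 * Hypothetical insert path: keep asking the FSM for a page; whenever the
 * page turns out to be too full, correct its FSM entry and retry.  Each
 * correction drops that page below `need`, so the loop always terminates.
 * Returns the page used, or -1 if no page fits.
 */
static int find_page_for_insert(ToyHeap *h, size_t need, int *corrections)
{
    assert(corrections != NULL);
    *corrections = 0;
    for (;;) {
        int page = fsm_search(h, need);
        if (page < 0)
            return -1;                  /* real code would extend the relation */
        if (h->actual[page] >= need)
            return page;                /* the FSM claim was accurate */
        h->fsm[page] = h->actual[page]; /* fix the stale value, then retry */
        (*corrections)++;
    }
}
```

Suppose eight pages are all advertised at 4000 bytes, but only page 6 actually has room. A 500-byte request then needs six corrections before it succeeds. In the real system each miss involves visiting a heap page, which is where the multi-second stalls come from.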
    
    To fix this problem, it seems sufficient to have heap_xlog_visible
    (replay of setting all-visible and all-frozen VM bits for a heap page)
    update the FSM value for the page being processed.  This corrects the
    per-page counter at the same time the page becomes skippable by vacuum,
    so when vacuum runs FreeSpaceMapVacuum, the values propagated to the
    FSM's upper layers are the correct ones, avoiding the problem.
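A toy C model of the fix follows. It makes two simplifying assumptions. First, a single upper-level FSM node holds the maximum of the leaf values. Second, vacuum leaves the leaf values of all-visible pages untouched. `ToyRel`, `replay_set_visible`, and `toy_vacuum` are hypothetical names, not the heap_xlog_visible or FreeSpaceMapVacuum code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 4

/* Hypothetical toy relation with a one-level FSM "tree". */
typedef struct {
    size_t fsm_leaf[NPAGES];    /* per-page FSM values */
    size_t actual[NPAGES];      /* real free space on each heap page */
    bool   all_visible[NPAGES]; /* VM bit: vacuum may skip the page */
    size_t fsm_upper;           /* upper-level node: max of the leaves */
} ToyRel;

/* Hypothetical replay of "set all-visible"; with the fix, also refresh the FSM. */
static void replay_set_visible(ToyRel *r, int page, bool update_fsm)
{
    assert(page >= 0 && page < NPAGES);
    r->all_visible[page] = true;
    if (update_fsm)
        r->fsm_leaf[page] = r->actual[page];
}

/*
 * Hypothetical vacuum: skips all-visible pages (keeping their leaf values
 * as-is), recomputes the others, then propagates the maximum upward.
 */
static void toy_vacuum(ToyRel *r)
{
    size_t max = 0;
    for (int i = 0; i < NPAGES; i++) {
        if (!r->all_visible[i])
            r->fsm_leaf[i] = r->actual[i];
        if (r->fsm_leaf[i] > max)
            max = r->fsm_leaf[i];
    }
    r->fsm_upper = max;
}
```

Suppose every leaf starts at an optimistic 6000 bytes from a full-page image while the pages really hold 50. Without the FSM update, replaying the all-visible records leaves the upper node at 6000 after vacuum. With the update, it ends at 50.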
    
    While at it, apply the same fix to heap_xlog_clean (replay of tuple
    removal by HOT pruning and vacuum).  This makes any space freed by the
    cleaning available earlier than the next vacuum in the promoted replica.
    
    Backpatch to 9.6, where this problem was diagnosed on an insert-only
    table with all-frozen pages, which were introduced as a concept in that
    release.  Theoretically the same problem could occur with all-visible
    pages in older branches, but there have been no reports of that, and the
    fix doesn't backpatch cleanly anyway.
    
    Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
    Discussion: https://postgr.es/m/20180802172857.5skoexsilnjvgruk@alvherre.pgsql