Commit b0229f26 authored by Peter Geoghegan

Fix bug in nbtree VACUUM "skip full scan" feature.

Commit 857f9c36 (which taught nbtree VACUUM to skip a scan of the
index from btvacuumcleanup in situations where it doesn't seem worth it)
made VACUUM maintain the oldest btpo.xact among all deleted pages for
the index as a whole.  It failed to correctly handle all the details
surrounding pages that are deleted by the current VACUUM operation
(pages deleted by some previous VACUUM operation were processed
correctly).

The most immediate problem was that the special area of the page was
examined without a buffer pin at one point.  More fundamentally, the
handling failed to account for the full range of _bt_pagedel()
behaviors.  For example, _bt_pagedel() sometimes deletes internal pages
in passing, as part of deleting an entire subtree whose leaf-level page
is the page that the btvacuumpage() caller passed in.  In cases where
deletion can take place, the original leaf page passed to _bt_pagedel()
might not even be the page that it deletes first.
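
Concretely, the caller-side code removed by this commit had the following
shape (taken from the "-" lines of the last nbtree.c hunk below, with
explanatory comments added); 'opaque' points into the special area of the
leaf page that was passed to _bt_pagedel():

    ndel = _bt_pagedel(rel, buf);   /* _bt_pagedel() releases the pin on buf */

    /* count only this page, else may double-count parent */
    if (ndel)
    {
        stats->pages_deleted++;

        /*
         * BUG: the special area behind 'opaque' is read here without a
         * buffer pin, and the leaf page is not necessarily the page that
         * _bt_pagedel() deleted first in any case.
         */
        if (!TransactionIdIsValid(vstate->oldestBtpoXact) ||
            TransactionIdPrecedes(opaque->btpo.xact, vstate->oldestBtpoXact))
            vstate->oldestBtpoXact = opaque->btpo.xact;
    }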

It's unclear how disruptive this bug may have been, or what symptoms
users might want to look out for.  The issue was spotted during
unrelated code review.

To fix, push down the logic for maintaining the oldest btpo.xact to
_bt_pagedel().  btvacuumpage() is now responsible for pages that were
fully deleted by a previous VACUUM operation, while _bt_pagedel() is now
responsible for pages that were deleted by the current VACUUM operation
(this includes half-dead pages from a previous interrupted VACUUM
operation that become fully deleted in _bt_pagedel()).  Note that
_bt_pagedel() should never encounter an existing deleted page.
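
The caller-side half of the new contract appears in the last nbtree.c hunk
below: btvacuumpage() simply passes &vstate->oldestBtpoXact into
_bt_pagedel().  The other half lives in nbtpage.c, whose diff is not shown
on this page; in outline (a sketch, not the verbatim nbtpage.c change),
_bt_pagedel() now folds in the btpo.xact of each page at the point where it
marks that page fully deleted, while the buffer is still held:

    /* in _bt_pagedel(), once a page has been marked fully deleted */
    if (!TransactionIdIsValid(*oldestBtpoXact) ||
        TransactionIdPrecedes(opaque->btpo.xact, *oldestBtpoXact))
        *oldestBtpoXact = opaque->btpo.xact;    /* pin still held here */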

This commit theoretically breaks the ABI of a stable release by changing
the signature of _bt_pagedel().  However, if any third party extension
is actually affected by this, then it must already be completely broken
(since there are numerous assumptions made in _bt_pagedel() that cannot
be met outside of VACUUM).  It seems highly unlikely that such an
extension actually exists, in any case.

Author: Peter Geoghegan
Reviewed-By: Masahiko Sawada
Discussion: https://postgr.es/m/CAH2-WzkrXBcMQWAYUJMFTTvzx_r4q=pYSjDe07JnUXhe+OZnJA@mail.gmail.com
Backpatch: 11-, where the "skip full scan" feature was introduced.
Parent: 3c800ae0
src/backend/access/nbtree/nbtree.c

@@ -92,7 +92,7 @@ typedef struct BTParallelScanDescData *BTParallelScanDesc;
 static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
                          IndexBulkDeleteCallback callback, void *callback_state,
-                         BTCycleId cycleid, TransactionId *oldestBtpoXact);
+                         BTCycleId cycleid);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
                          BlockNumber orig_blkno);
 static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
@@ -787,8 +787,14 @@ _bt_parallel_advance_array_keys(IndexScanDesc scan)
 }
 
 /*
- * _bt_vacuum_needs_cleanup() -- Checks if index needs cleanup assuming that
- *			btbulkdelete() wasn't called.
+ * _bt_vacuum_needs_cleanup() -- Checks if index needs cleanup
+ *
+ * Called by btvacuumcleanup when btbulkdelete was never called because no
+ * tuples need to be deleted.
+ *
+ * When we return false, VACUUM can even skip the cleanup-only call to
+ * btvacuumscan (i.e. there will be no btvacuumscan call for this index at
+ * all).  Otherwise, a cleanup-only btvacuumscan call is required.
  */
 static bool
 _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
@@ -815,8 +821,15 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
                                       RecentGlobalXmin))
     {
         /*
-         * If oldest btpo.xact in the deleted pages is older than
-         * RecentGlobalXmin, then at least one deleted page can be recycled.
+         * If any oldest btpo.xact from a previously deleted page in the index
+         * is older than RecentGlobalXmin, then at least one deleted page can
+         * be recycled -- don't skip cleanup.
+         *
+         * Note that btvacuumpage currently doesn't make any effort to
+         * recognize when a recycled page is already in the FSM (i.e. put
+         * there by a previous VACUUM operation).  We have to be conservative
+         * because the FSM isn't crash safe.  Hopefully recycled pages get
+         * reused before too long.
          */
         result = true;
     }
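
For orientation: this comment sits in a branch of _bt_vacuum_needs_cleanup()
that tests the oldest btpo.xact stored in the metapage against
RecentGlobalXmin.  In outline (a sketch of the surrounding code, which this
hunk does not show):

    else if (TransactionIdIsValid(metad->btm_oldest_btpo_xact) &&
             TransactionIdPrecedes(metad->btm_oldest_btpo_xact,
                                   RecentGlobalXmin))
    {
        /* ...comment shown in the hunk above... */
        result = true;
    }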
@@ -873,20 +886,9 @@ btbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
     /* The ENSURE stuff ensures we clean up shared memory on failure */
     PG_ENSURE_ERROR_CLEANUP(_bt_end_vacuum_callback, PointerGetDatum(rel));
     {
-        TransactionId oldestBtpoXact;
-
         cycleid = _bt_start_vacuum(rel);
 
-        btvacuumscan(info, stats, callback, callback_state, cycleid,
-                     &oldestBtpoXact);
-
-        /*
-         * Update cleanup-related information in metapage. This information is
-         * used only for cleanup but keeping them up to date can avoid
-         * unnecessary cleanup even after bulkdelete.
-         */
-        _bt_update_meta_cleanup_info(info->index, oldestBtpoXact,
-                                     info->num_heap_tuples);
+        btvacuumscan(info, stats, callback, callback_state, cycleid);
     }
     PG_END_ENSURE_ERROR_CLEANUP(_bt_end_vacuum_callback, PointerGetDatum(rel));
     _bt_end_vacuum(rel);
@@ -918,18 +920,12 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
      */
     if (stats == NULL)
     {
-        TransactionId oldestBtpoXact;
-
         /* Check if we need a cleanup */
         if (!_bt_vacuum_needs_cleanup(info))
             return NULL;
 
         stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
-        btvacuumscan(info, stats, NULL, NULL, 0, &oldestBtpoXact);
-
-        /* Update cleanup-related information in the metapage */
-        _bt_update_meta_cleanup_info(info->index, oldestBtpoXact,
-                                     info->num_heap_tuples);
+        btvacuumscan(info, stats, NULL, NULL, 0);
     }
 
     /*
@@ -954,7 +950,9 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
  * according to the vacuum callback, looking for empty pages that can be
  * deleted, and looking for old deleted pages that can be recycled.  Both
  * btbulkdelete and btvacuumcleanup invoke this (the latter only if no
- * btbulkdelete call occurred).
+ * btbulkdelete call occurred and _bt_vacuum_needs_cleanup returned true).
+ * Note that this is also where the metadata used by _bt_vacuum_needs_cleanup
+ * is maintained.
  *
  * The caller is responsible for initially allocating/zeroing a stats struct
  * and for obtaining a vacuum cycle ID if necessary.
@@ -962,7 +960,7 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 static void
 btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
              IndexBulkDeleteCallback callback, void *callback_state,
-             BTCycleId cycleid, TransactionId *oldestBtpoXact)
+             BTCycleId cycleid)
 {
     Relation rel = info->index;
     BTVacState vstate;
@@ -1046,6 +1044,15 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
     MemoryContextDelete(vstate.pagedelcontext);
 
+    /*
+     * Maintain the oldest btpo.xact and a count of the current number of
+     * heap tuples in the metapage (for the benefit of
+     * _bt_vacuum_needs_cleanup).  The page with the oldest btpo.xact is
+     * typically a page deleted by a previous VACUUM operation.
+     */
+    _bt_update_meta_cleanup_info(rel, vstate.oldestBtpoXact,
+                                 info->num_heap_tuples);
+
     /*
      * If we found any recyclable pages (and recorded them in the FSM), then
      * forcibly update the upper-level FSM pages to ensure that searchers can
@@ -1064,9 +1071,6 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
     /* update statistics */
     stats->num_pages = num_pages;
     stats->pages_free = vstate.totFreePages;
-
-    if (oldestBtpoXact)
-        *oldestBtpoXact = vstate.oldestBtpoXact;
 }
 
 /*
@@ -1137,24 +1141,30 @@ restart:
     /* Page is valid, see what to do with it */
     if (_bt_page_recyclable(page))
     {
-        /* Okay to recycle this page */
+        /* Okay to recycle this page (which could be leaf or internal) */
         RecordFreeIndexPage(rel, blkno);
         vstate->totFreePages++;
         stats->pages_deleted++;
     }
     else if (P_ISDELETED(opaque))
     {
-        /* Already deleted, but can't recycle yet */
+        /*
+         * Already deleted page (which could be leaf or internal).  Can't
+         * recycle yet.
+         */
         stats->pages_deleted++;
 
-        /* Update the oldest btpo.xact */
+        /* Maintain the oldest btpo.xact */
         if (!TransactionIdIsValid(vstate->oldestBtpoXact) ||
             TransactionIdPrecedes(opaque->btpo.xact, vstate->oldestBtpoXact))
             vstate->oldestBtpoXact = opaque->btpo.xact;
     }
     else if (P_ISHALFDEAD(opaque))
     {
-        /* Half-dead, try to delete */
+        /*
+         * Half-dead leaf page.  Try to delete now.  Might update
+         * oldestBtpoXact and pages_deleted below.
+         */
         delete_now = true;
     }
     else if (P_ISLEAF(opaque))
@@ -1316,10 +1326,11 @@ restart:
     else
     {
         /*
-         * If the page has been split during this vacuum cycle, it seems
-         * worth expending a write to clear btpo_cycleid even if we don't
-         * have any deletions to do.  (If we do, _bt_delitems_vacuum takes
-         * care of this.)  This ensures we won't process the page again.
+         * If the leaf page has been split during this vacuum cycle, it
+         * seems worth expending a write to clear btpo_cycleid even if we
+         * don't have any deletions to do.  (If we do, _bt_delitems_vacuum
+         * takes care of this.)  This ensures we won't process the page
+         * again.
          *
          * We treat this like a hint-bit update because there's no need to
          * WAL-log it.
@@ -1334,11 +1345,11 @@ restart:
         }
 
         /*
-         * If it's now empty, try to delete; else count the live tuples (live
-         * table TIDs in posting lists are counted as separate live tuples).
-         * We don't delete when recursing, though, to avoid putting entries
-         * into freePages out-of-order (doesn't seem worth any extra code to
-         * handle the case).
+         * If the leaf page is now empty, try to delete it; else count the
+         * live tuples (live table TIDs in posting lists are counted as
+         * separate live tuples).  We don't delete when recursing, though, to
+         * avoid putting entries into freePages out-of-order (doesn't seem
+         * worth any extra code to handle the case).
          */
         if (minoff > maxoff)
             delete_now = (blkno == orig_blkno);
@@ -1357,16 +1368,11 @@ restart:
         MemoryContextReset(vstate->pagedelcontext);
         oldcontext = MemoryContextSwitchTo(vstate->pagedelcontext);
 
-        ndel = _bt_pagedel(rel, buf);
+        ndel = _bt_pagedel(rel, buf, &vstate->oldestBtpoXact);
 
         /* count only this page, else may double-count parent */
         if (ndel)
-        {
             stats->pages_deleted++;
 
-            if (!TransactionIdIsValid(vstate->oldestBtpoXact) ||
-                TransactionIdPrecedes(opaque->btpo.xact, vstate->oldestBtpoXact))
-                vstate->oldestBtpoXact = opaque->btpo.xact;
-        }
-
         MemoryContextSwitchTo(oldcontext);
 
         /* pagedel released buffer, so we shouldn't */
src/include/access/nbtree.h

@@ -1080,7 +1080,8 @@ extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
 extern void _bt_delitems_delete(Relation rel, Buffer buf,
                                 OffsetNumber *deletable, int ndeletable,
                                 Relation heapRel);
-extern int	_bt_pagedel(Relation rel, Buffer buf);
+extern int	_bt_pagedel(Relation rel, Buffer leafbuf,
+                        TransactionId *oldestBtpoXact);
 
 /*
  * prototypes for functions in nbtsearch.c