Commit b0229f26 authored by Peter Geoghegan

Fix bug in nbtree VACUUM "skip full scan" feature.

Commit 857f9c36 (which taught nbtree VACUUM to skip a scan of the
index from btvacuumcleanup in situations where it doesn't seem worth it) made
VACUUM maintain the oldest btpo.xact among all deleted pages for the
index as a whole.  It failed to handle all the details surrounding pages
that are deleted by the current VACUUM operation correctly (though pages
deleted by some previous VACUUM operation were processed correctly).

The most immediate problem was that the special area of the page was
examined without a buffer pin at one point.  More fundamentally, the
handling failed to account for the full range of _bt_pagedel()
behaviors.  For example, _bt_pagedel() sometimes deletes internal pages
in passing, as part of deleting an entire subtree with btvacuumpage()
caller's page as the leaf level page.  The original leaf page passed to
_bt_pagedel() might not be the page that it deletes first in cases where
deletion can take place.
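
A toy model of that second problem, not PostgreSQL code. The hypothetical delete_branch() below unlinks a branch from the top down and stamps each page with a fresh XID at deletion time. It assumes, purely for the sake of the example, that the XID counter advances between deletions. Under that assumption the caller's leaf page is deleted last and carries the newest value, so inspecting only the leaf would miss the oldest one. ToyPage, ToyXid and both functions are names made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ToyXid;        /* hypothetical stand-in for TransactionId */

typedef struct ToyPage
{
    int     is_leaf;            /* 1 for the caller's leaf page, 0 for internal pages */
    int     deleted;            /* set once the page has been unlinked */
    ToyXid  xact;               /* XID stamped on the page at deletion time */
} ToyPage;

/*
 * Delete a branch stored top-down in branch[0..n-1]; branch[n-1] is the leaf.
 * *next_xid plays the role of the XID counter (assumed to advance each time).
 * Returns the number of pages deleted.
 */
static int
delete_branch(ToyPage *branch, size_t n, ToyXid *next_xid)
{
    int     ndeleted = 0;

    assert(n > 0 && branch[n - 1].is_leaf);
    for (size_t i = 0; i < n; i++)
    {
        branch[i].deleted = 1;
        branch[i].xact = (*next_xid)++;     /* topmost page gets the oldest XID */
        ndeleted++;
    }
    return ndeleted;
}

/* Oldest XID among deleted pages; 0 means "none" (no wraparound handling). */
static ToyXid
oldest_deleted_xact(const ToyPage *branch, size_t n)
{
    ToyXid  oldest = 0;

    for (size_t i = 0; i < n; i++)
        if (branch[i].deleted && (oldest == 0 || branch[i].xact < oldest))
            oldest = branch[i].xact;
    return oldest;
}
```

With a three-page branch and a counter starting at 100, the leaf ends up stamped with 102 while the oldest value among the deleted pages is 100.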

It's unclear how disruptive this bug may have been, or what symptoms
users might want to look out for.  The issue was spotted during
unrelated code review.

To fix, push down the logic for maintaining the oldest btpo.xact to
_bt_pagedel().  btvacuumpage() is now responsible for pages that were
fully deleted by a previous VACUUM operation, while _bt_pagedel() is now
responsible for pages that were deleted by the current VACUUM operation
(this includes half-dead pages from a previous interrupted VACUUM
operation that become fully deleted in _bt_pagedel()).  Note that
_bt_pagedel() should never encounter an existing deleted page.
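
The split of responsibilities can be sketched in isolation. This is a simplified, standalone illustration rather than the patched code: Xid, VacState, toy_pagedel() and toy_vacuumpage() are hypothetical stand-ins, and xid_precedes() only approximates the modulo-2^32 comparison that TransactionIdPrecedes() applies to normal XIDs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Xid;                   /* hypothetical stand-in for TransactionId */
#define INVALID_XID ((Xid) 0)

/* Wraparound-aware "a is older than b" (simplified: normal XIDs only). */
static bool
xid_precedes(Xid a, Xid b)
{
    return (int32_t) (a - b) < 0;
}

/* The accumulator step that appears in both the caller and the callee. */
static void
maintain_oldest_xact(Xid *oldest, Xid page_xact)
{
    assert(oldest != NULL);
    if (*oldest == INVALID_XID || xid_precedes(page_xact, *oldest))
        *oldest = page_xact;
}

typedef struct VacState
{
    Xid     oldest_btpo_xact;
    int     pages_deleted;
} VacState;

/*
 * Callee role (stands in for _bt_pagedel): every page deleted by the current
 * operation updates the accumulator through the out-parameter.
 */
static int
toy_pagedel(const Xid *deleted_xacts, int n, Xid *oldest)
{
    for (int i = 0; i < n; i++)
        maintain_oldest_xact(oldest, deleted_xacts[i]);
    return n;
}

/*
 * Caller role (stands in for btvacuumpage): handles pages that were already
 * deleted by an earlier operation, and delegates everything else.
 */
static void
toy_vacuumpage(VacState *vs, bool already_deleted, Xid existing_xact,
               const Xid *new_xacts, int n)
{
    if (already_deleted)
    {
        vs->pages_deleted++;
        maintain_oldest_xact(&vs->oldest_btpo_xact, existing_xact);
    }
    else if (toy_pagedel(new_xacts, n, &vs->oldest_btpo_xact) > 0)
        vs->pages_deleted++;            /* count only the caller's own page */
}
```

Passing the accumulator by pointer lets the callee record the value for each page at the moment it is deleted, instead of having the caller look at a page after the fact.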

This commit theoretically breaks the ABI of a stable release by changing
the signature of _bt_pagedel().  However, if any third party extension
is actually affected by this, then it must already be completely broken
(since there are numerous assumptions made in _bt_pagedel() that cannot
be met outside of VACUUM).  It seems highly unlikely that such an
extension actually exists, in any case.

Author: Peter Geoghegan
Reviewed-By: Masahiko Sawada
Discussion: https://postgr.es/m/CAH2-WzkrXBcMQWAYUJMFTTvzx_r4q=pYSjDe07JnUXhe+OZnJA@mail.gmail.com
Backpatch: 11-, where the "skip full scan" feature was introduced.
parent 3c800ae0
@@ -35,9 +35,11 @@
 #include "utils/snapmgr.h"
 static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf);
-static bool _bt_mark_page_halfdead(Relation rel, Buffer buf, BTStack stack);
+static bool _bt_mark_page_halfdead(Relation rel, Buffer leafbuf,
+                                   BTStack stack);
 static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
-                                     bool *rightsib_empty);
+                                     bool *rightsib_empty,
+                                     TransactionId *oldestBtpoXact);
 static TransactionId _bt_xid_horizon(Relation rel, Relation heapRel, Page page,
                                      OffsetNumber *deletable, int ndeletable);
 static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
@@ -1470,27 +1472,35 @@ _bt_lock_branch_parent(Relation rel, BlockNumber child, BTStack stack,
 }
 /*
- * _bt_pagedel() -- Delete a page from the b-tree, if legal to do so.
+ * _bt_pagedel() -- Delete a leaf page from the b-tree, if legal to do so.
  *
- * This action unlinks the page from the b-tree structure, removing all
+ * This action unlinks the leaf page from the b-tree structure, removing all
  * pointers leading to it --- but not touching its own left and right links.
  * The page cannot be physically reclaimed right away, since other processes
  * may currently be trying to follow links leading to the page; they have to
  * be allowed to use its right-link to recover. See nbtree/README.
  *
  * On entry, the target buffer must be pinned and locked (either read or write
- * lock is OK). This lock and pin will be dropped before exiting.
+ * lock is OK). The page must be an empty leaf page, which may be half-dead
+ * already (a half-dead page should only be passed to us when an earlier
+ * VACUUM operation was interrupted, though). Note in particular that caller
+ * should never pass a buffer containing an existing deleted page here. The
+ * lock and pin on caller's buffer will be dropped before we return.
  *
  * Returns the number of pages successfully deleted (zero if page cannot
- * be deleted now; could be more than one if parent or sibling pages were
- * deleted too).
+ * be deleted now; could be more than one if parent or right sibling pages
+ * were deleted too).
+ *
+ * Maintains *oldestBtpoXact for any pages that get deleted. Caller is
+ * responsible for maintaining *oldestBtpoXact in the case of pages that were
+ * deleted by a previous VACUUM.
  *
  * NOTE: this leaks memory. Rather than trying to clean up everything
  * carefully, it's better to run it in a temp context that can be reset
  * frequently.
  */
 int
-_bt_pagedel(Relation rel, Buffer buf)
+_bt_pagedel(Relation rel, Buffer leafbuf, TransactionId *oldestBtpoXact)
 {
     int         ndeleted = 0;
     BlockNumber rightsib;
@@ -1511,14 +1521,21 @@ _bt_pagedel(Relation rel, Buffer buf)
     for (;;)
     {
-        page = BufferGetPage(buf);
+        page = BufferGetPage(leafbuf);
         opaque = (BTPageOpaque) PageGetSpecialPointer(page);
         /*
          * Internal pages are never deleted directly, only as part of deleting
          * the whole branch all the way down to leaf level.
+         *
+         * Also check for deleted pages here. Caller never passes us a fully
+         * deleted page. Only VACUUM can delete pages, so there can't have
+         * been a concurrent deletion. Assume that we reached any deleted
+         * page encountered here by following a sibling link, and that the
+         * index is corrupt.
          */
-        if (!P_ISLEAF(opaque))
+        Assert(!P_ISDELETED(opaque));
+        if (!P_ISLEAF(opaque) || P_ISDELETED(opaque))
         {
             /*
              * Pre-9.4 page deletion only marked internal pages as half-dead,
@@ -1537,13 +1554,22 @@ _bt_pagedel(Relation rel, Buffer buf)
                          errmsg("index \"%s\" contains a half-dead internal page",
                                 RelationGetRelationName(rel)),
                          errhint("This can be caused by an interrupted VACUUM in version 9.3 or older, before upgrade. Please REINDEX it.")));
-            _bt_relbuf(rel, buf);
+            if (P_ISDELETED(opaque))
+                ereport(LOG,
+                        (errcode(ERRCODE_INDEX_CORRUPTED),
+                         errmsg_internal("found deleted block %u while following right link in index \"%s\"",
+                                         BufferGetBlockNumber(leafbuf),
+                                         RelationGetRelationName(rel))));
+            _bt_relbuf(rel, leafbuf);
             return ndeleted;
         }
         /*
          * We can never delete rightmost pages nor root pages. While at it,
-         * check that page is not already deleted and is empty.
+         * check that page is empty, since it's possible that the leafbuf page
+         * was empty a moment ago, but has since had some inserts.
          *
          * To keep the algorithm simple, we also never delete an incompletely
          * split page (they should be rare enough that this doesn't make any
@@ -1558,14 +1584,14 @@ _bt_pagedel(Relation rel, Buffer buf)
          * to. On subsequent iterations, we know we stepped right from a page
          * that passed these tests, so it's OK.
          */
-        if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) || P_ISDELETED(opaque) ||
+        if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) ||
             P_FIRSTDATAKEY(opaque) <= PageGetMaxOffsetNumber(page) ||
             P_INCOMPLETE_SPLIT(opaque))
         {
             /* Should never fail to delete a half-dead page */
             Assert(!P_ISHALFDEAD(opaque));
-            _bt_relbuf(rel, buf);
+            _bt_relbuf(rel, leafbuf);
             return ndeleted;
         }
@@ -1603,7 +1629,7 @@ _bt_pagedel(Relation rel, Buffer buf)
                  * To avoid deadlocks, we'd better drop the leaf page lock
                  * before going further.
                  */
-                LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+                LockBuffer(leafbuf, BUFFER_LOCK_UNLOCK);
                 /*
                  * Fetch the left sibling, to check that it's not marked with
@@ -1627,10 +1653,10 @@ _bt_pagedel(Relation rel, Buffer buf)
                  * incompletely-split page to be split again. So we don't
                  * need to walk right here.
                  */
-                if (lopaque->btpo_next == BufferGetBlockNumber(buf) &&
+                if (lopaque->btpo_next == BufferGetBlockNumber(leafbuf) &&
                     P_INCOMPLETE_SPLIT(lopaque))
                 {
-                    ReleaseBuffer(buf);
+                    ReleaseBuffer(leafbuf);
                     _bt_relbuf(rel, lbuf);
                     return ndeleted;
                 }
@@ -1646,16 +1672,26 @@ _bt_pagedel(Relation rel, Buffer buf)
                 _bt_relbuf(rel, lbuf);
                 /*
-                 * Re-lock the leaf page, and start over, to re-check that the
-                 * page can still be deleted.
+                 * Re-lock the leaf page, and start over to use our stack
+                 * within _bt_mark_page_halfdead. We must do it that way
+                 * because it's possible that leafbuf can no longer be
+                 * deleted. We need to recheck.
                  */
-                LockBuffer(buf, BT_WRITE);
+                LockBuffer(leafbuf, BT_WRITE);
                 continue;
             }
-            if (!_bt_mark_page_halfdead(rel, buf, stack))
+            /*
+             * See if it's safe to delete the leaf page, and determine how
+             * many parent/internal pages above the leaf level will be
+             * deleted. If it's safe then _bt_mark_page_halfdead will also
+             * perform the first phase of deletion, which includes marking the
+             * leafbuf page half-dead.
+             */
+            Assert(P_ISLEAF(opaque) && !P_IGNORE(opaque));
+            if (!_bt_mark_page_halfdead(rel, leafbuf, stack))
             {
-                _bt_relbuf(rel, buf);
+                _bt_relbuf(rel, leafbuf);
                 return ndeleted;
             }
         }
@@ -1663,23 +1699,32 @@ _bt_pagedel(Relation rel, Buffer buf)
         /*
          * Then unlink it from its siblings. Each call to
          * _bt_unlink_halfdead_page unlinks the topmost page from the branch,
-         * making it shallower. Iterate until the leaf page is gone.
+         * making it shallower. Iterate until the leafbuf page is deleted.
+         *
+         * _bt_unlink_halfdead_page should never fail, since we established
+         * that deletion is generally safe in _bt_mark_page_halfdead.
          */
         rightsib_empty = false;
+        Assert(P_ISLEAF(opaque) && P_ISHALFDEAD(opaque));
         while (P_ISHALFDEAD(opaque))
         {
-            /* will check for interrupts, once lock is released */
-            if (!_bt_unlink_halfdead_page(rel, buf, &rightsib_empty))
+            /* Check for interrupts in _bt_unlink_halfdead_page */
+            if (!_bt_unlink_halfdead_page(rel, leafbuf, &rightsib_empty,
+                                          oldestBtpoXact))
             {
-                /* _bt_unlink_halfdead_page already released buffer */
+                /* _bt_unlink_halfdead_page failed, released buffer */
                 return ndeleted;
             }
             ndeleted++;
         }
+        Assert(P_ISLEAF(opaque) && P_ISDELETED(opaque));
+        Assert(TransactionIdFollowsOrEquals(opaque->btpo.xact,
+                                            *oldestBtpoXact));
         rightsib = opaque->btpo_next;
-        _bt_relbuf(rel, buf);
+        _bt_relbuf(rel, leafbuf);
         /*
          * Check here, as calling loops will have locks held, preventing
@@ -1705,7 +1750,7 @@ _bt_pagedel(Relation rel, Buffer buf)
         if (!rightsib_empty)
             break;
-        buf = _bt_getbuf(rel, rightsib, BT_WRITE);
+        leafbuf = _bt_getbuf(rel, rightsib, BT_WRITE);
     }
     return ndeleted;
@@ -1909,9 +1954,19 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
  * of the whole branch, including the leaf page itself, iterate until the
  * leaf page is deleted.
  *
- * Returns 'false' if the page could not be unlinked (shouldn't happen).
- * If the (current) right sibling of the page is empty, *rightsib_empty is
- * set to true.
+ * Returns 'false' if the page could not be unlinked (shouldn't happen). If
+ * the right sibling of the current target page is empty, *rightsib_empty is
+ * set to true, allowing caller to delete the target's right sibling page in
+ * passing. Note that *rightsib_empty is only actually used by caller when
+ * target page is leafbuf, following last call here for leafbuf/the subtree
+ * containing leafbuf. (We always set *rightsib_empty for caller, just to be
+ * consistent.)
+ *
+ * We maintain *oldestBtpoXact for pages that are deleted by the current
+ * VACUUM operation here. This must be handled here because we conservatively
+ * assume that there needs to be a new call to ReadNewTransactionId() each
+ * time a page gets deleted. See comments about the underlying assumption
+ * below.
  *
  * Must hold pin and lock on leafbuf at entry (read or write doesn't matter).
  * On success exit, we'll be holding pin and write lock. On failure exit,
@@ -1919,7 +1974,8 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
  * to avoid having to reacquire a lock we already released).
  */
 static bool
-_bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
+_bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty,
+                         TransactionId *oldestBtpoXact)
 {
     BlockNumber leafblkno = BufferGetBlockNumber(leafbuf);
     BlockNumber leafleftsib;
@@ -2057,9 +2113,9 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
         lbuf = InvalidBuffer;
     /*
-     * Next write-lock the target page itself. It should be okay to take just
-     * a write lock not a superexclusive lock, since no scans would stop on an
-     * empty page.
+     * Next write-lock the target page itself. It's okay to take a write lock
+     * rather than a superexclusive lock, since no scan will stop on an empty
+     * page.
      */
     LockBuffer(buf, BT_WRITE);
     page = BufferGetPage(buf);
@@ -2204,6 +2260,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
      */
     page = BufferGetPage(buf);
     opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+    Assert(P_ISHALFDEAD(opaque) || !P_ISLEAF(opaque));
     opaque->btpo_flags &= ~BTP_HALF_DEAD;
     opaque->btpo_flags |= BTP_DELETED;
     opaque->btpo.xact = ReadNewTransactionId();
@@ -2309,6 +2366,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
         _bt_relbuf(rel, lbuf);
     _bt_relbuf(rel, rbuf);
+    if (!TransactionIdIsValid(*oldestBtpoXact) ||
+        TransactionIdPrecedes(opaque->btpo.xact, *oldestBtpoXact))
+        *oldestBtpoXact = opaque->btpo.xact;
     /*
      * Release the target, if it was not the leaf block. The leaf is always
      * kept locked.
......
@@ -92,7 +92,7 @@ typedef struct BTParallelScanDescData *BTParallelScanDesc;
 static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
                          IndexBulkDeleteCallback callback, void *callback_state,
-                         BTCycleId cycleid, TransactionId *oldestBtpoXact);
+                         BTCycleId cycleid);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
                          BlockNumber orig_blkno);
 static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
@@ -787,8 +787,14 @@ _bt_parallel_advance_array_keys(IndexScanDesc scan)
 }
 /*
- * _bt_vacuum_needs_cleanup() -- Checks if index needs cleanup assuming that
- *          btbulkdelete() wasn't called.
+ * _bt_vacuum_needs_cleanup() -- Checks if index needs cleanup
+ *
+ * Called by btvacuumcleanup when btbulkdelete was never called because no
+ * tuples need to be deleted.
+ *
+ * When we return false, VACUUM can even skip the cleanup-only call to
+ * btvacuumscan (i.e. there will be no btvacuumscan call for this index at
+ * all). Otherwise, a cleanup-only btvacuumscan call is required.
  */
 static bool
 _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
@@ -815,8 +821,15 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
                                    RecentGlobalXmin))
     {
         /*
-         * If oldest btpo.xact in the deleted pages is older than
-         * RecentGlobalXmin, then at least one deleted page can be recycled.
+         * If any oldest btpo.xact from a previously deleted page in the index
+         * is older than RecentGlobalXmin, then at least one deleted page can
+         * be recycled -- don't skip cleanup.
+         *
+         * Note that btvacuumpage currently doesn't make any effort to
+         * recognize when a recycled page is already in the FSM (i.e. put
+         * there by a previous VACUUM operation). We have to be conservative
+         * because the FSM isn't crash safe. Hopefully recycled pages get
+         * reused before too long.
          */
         result = true;
     }
@@ -873,20 +886,9 @@ btbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
     /* The ENSURE stuff ensures we clean up shared memory on failure */
     PG_ENSURE_ERROR_CLEANUP(_bt_end_vacuum_callback, PointerGetDatum(rel));
     {
-        TransactionId oldestBtpoXact;
         cycleid = _bt_start_vacuum(rel);
-        btvacuumscan(info, stats, callback, callback_state, cycleid,
-                     &oldestBtpoXact);
-        /*
-         * Update cleanup-related information in metapage. This information is
-         * used only for cleanup but keeping them up to date can avoid
-         * unnecessary cleanup even after bulkdelete.
-         */
-        _bt_update_meta_cleanup_info(info->index, oldestBtpoXact,
-                                     info->num_heap_tuples);
+        btvacuumscan(info, stats, callback, callback_state, cycleid);
     }
     PG_END_ENSURE_ERROR_CLEANUP(_bt_end_vacuum_callback, PointerGetDatum(rel));
     _bt_end_vacuum(rel);
@@ -918,18 +920,12 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
      */
     if (stats == NULL)
     {
-        TransactionId oldestBtpoXact;
         /* Check if we need a cleanup */
         if (!_bt_vacuum_needs_cleanup(info))
             return NULL;
         stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
-        btvacuumscan(info, stats, NULL, NULL, 0, &oldestBtpoXact);
-        /* Update cleanup-related information in the metapage */
-        _bt_update_meta_cleanup_info(info->index, oldestBtpoXact,
-                                     info->num_heap_tuples);
+        btvacuumscan(info, stats, NULL, NULL, 0);
     }
     /*
@@ -954,7 +950,9 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
  * according to the vacuum callback, looking for empty pages that can be
  * deleted, and looking for old deleted pages that can be recycled. Both
  * btbulkdelete and btvacuumcleanup invoke this (the latter only if no
- * btbulkdelete call occurred).
+ * btbulkdelete call occurred and _bt_vacuum_needs_cleanup returned true).
+ * Note that this is also where the metadata used by _bt_vacuum_needs_cleanup
+ * is maintained.
  *
  * The caller is responsible for initially allocating/zeroing a stats struct
  * and for obtaining a vacuum cycle ID if necessary.
@@ -962,7 +960,7 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 static void
 btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
              IndexBulkDeleteCallback callback, void *callback_state,
-             BTCycleId cycleid, TransactionId *oldestBtpoXact)
+             BTCycleId cycleid)
 {
     Relation    rel = info->index;
     BTVacState  vstate;
@@ -1046,6 +1044,15 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
     MemoryContextDelete(vstate.pagedelcontext);
+    /*
+     * Maintain a count of the oldest btpo.xact and current number of heap
+     * tuples in the metapage (for the benefit of _bt_vacuum_needs_cleanup).
+     * The oldest page is typically a page deleted by a previous VACUUM
+     * operation.
+     */
+    _bt_update_meta_cleanup_info(rel, vstate.oldestBtpoXact,
+                                 info->num_heap_tuples);
     /*
      * If we found any recyclable pages (and recorded them in the FSM), then
      * forcibly update the upper-level FSM pages to ensure that searchers can
@@ -1064,9 +1071,6 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
     /* update statistics */
     stats->num_pages = num_pages;
     stats->pages_free = vstate.totFreePages;
-    if (oldestBtpoXact)
-        *oldestBtpoXact = vstate.oldestBtpoXact;
 }
 /*
@@ -1137,24 +1141,30 @@ restart:
     /* Page is valid, see what to do with it */
     if (_bt_page_recyclable(page))
     {
-        /* Okay to recycle this page */
+        /* Okay to recycle this page (which could be leaf or internal) */
         RecordFreeIndexPage(rel, blkno);
         vstate->totFreePages++;
         stats->pages_deleted++;
     }
     else if (P_ISDELETED(opaque))
     {
-        /* Already deleted, but can't recycle yet */
+        /*
+         * Already deleted page (which could be leaf or internal). Can't
+         * recycle yet.
+         */
         stats->pages_deleted++;
-        /* Update the oldest btpo.xact */
+        /* Maintain the oldest btpo.xact */
         if (!TransactionIdIsValid(vstate->oldestBtpoXact) ||
             TransactionIdPrecedes(opaque->btpo.xact, vstate->oldestBtpoXact))
             vstate->oldestBtpoXact = opaque->btpo.xact;
     }
     else if (P_ISHALFDEAD(opaque))
     {
-        /* Half-dead, try to delete */
+        /*
+         * Half-dead leaf page. Try to delete now. Might update
+         * oldestBtpoXact and pages_deleted below.
+         */
         delete_now = true;
     }
     else if (P_ISLEAF(opaque))
@@ -1316,10 +1326,11 @@ restart:
         else
         {
             /*
-             * If the page has been split during this vacuum cycle, it seems
-             * worth expending a write to clear btpo_cycleid even if we don't
-             * have any deletions to do. (If we do, _bt_delitems_vacuum takes
-             * care of this.) This ensures we won't process the page again.
+             * If the leaf page has been split during this vacuum cycle, it
+             * seems worth expending a write to clear btpo_cycleid even if we
+             * don't have any deletions to do. (If we do, _bt_delitems_vacuum
+             * takes care of this.) This ensures we won't process the page
+             * again.
              *
              * We treat this like a hint-bit update because there's no need to
              * WAL-log it.
@@ -1334,11 +1345,11 @@ restart:
         }
         /*
-         * If it's now empty, try to delete; else count the live tuples (live
-         * table TIDs in posting lists are counted as separate live tuples).
-         * We don't delete when recursing, though, to avoid putting entries
-         * into freePages out-of-order (doesn't seem worth any extra code to
-         * handle the case).
+         * If the leaf page is now empty, try to delete it; else count the
+         * live tuples (live table TIDs in posting lists are counted as
+         * separate live tuples). We don't delete when recursing, though, to
+         * avoid putting entries into freePages out-of-order (doesn't seem
+         * worth any extra code to handle the case).
          */
         if (minoff > maxoff)
             delete_now = (blkno == orig_blkno);
@@ -1357,16 +1368,11 @@ restart:
         MemoryContextReset(vstate->pagedelcontext);
         oldcontext = MemoryContextSwitchTo(vstate->pagedelcontext);
-        ndel = _bt_pagedel(rel, buf);
+        ndel = _bt_pagedel(rel, buf, &vstate->oldestBtpoXact);
         /* count only this page, else may double-count parent */
         if (ndel)
-        {
             stats->pages_deleted++;
-            if (!TransactionIdIsValid(vstate->oldestBtpoXact) ||
-                TransactionIdPrecedes(opaque->btpo.xact, vstate->oldestBtpoXact))
-                vstate->oldestBtpoXact = opaque->btpo.xact;
-        }
         MemoryContextSwitchTo(oldcontext);
         /* pagedel released buffer, so we shouldn't */
......
@@ -1080,7 +1080,8 @@ extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
 extern void _bt_delitems_delete(Relation rel, Buffer buf,
                                 OffsetNumber *deletable, int ndeletable,
                                 Relation heapRel);
-extern int  _bt_pagedel(Relation rel, Buffer buf);
+extern int  _bt_pagedel(Relation rel, Buffer leafbuf,
+                        TransactionId *oldestBtpoXact);
 /*
  * prototypes for functions in nbtsearch.c
......