Commit 5749f6ef authored by Tom Lane's avatar Tom Lane

Rewrite btree vacuuming to fold the former bulkdelete and cleanup operations

into a single mostly-physical-order scan of the index.  This requires some
ticklish interlocking considerations, but should create no material
performance impact on normal index operations (at least given the
already-committed changes to make scans work a page at a time).  VACUUM
itself should get significantly faster in any index that's degenerated to a
very nonlinear page order.  Also, we save one pass over the index entirely,
except in the case where there were no deletions to do and so only one pass
happened anyway.

Original patch by Heikki Linnakangas, rework by Tom Lane.
parent 09cb5c0e
$PostgreSQL: pgsql/src/backend/access/nbtree/README,v 1.11 2006/05/07 01:21:30 tgl Exp $
$PostgreSQL: pgsql/src/backend/access/nbtree/README,v 1.12 2006/05/08 00:00:09 tgl Exp $
This directory contains a correct implementation of Lehman and Yao's
high-concurrency B-tree management algorithm (P. Lehman and S. Yao,
......@@ -293,10 +293,32 @@ as part of the atomic update for the delete (either way, the metapage has
to be the last page locked in the update to avoid deadlock risks). This
avoids race conditions if two such operations are executing concurrently.
VACUUM needs to do a linear scan of an index to search for empty leaf
pages and half-dead parent pages that can be deleted, as well as deleted
pages that can be reclaimed because they are older than all open
transactions.
VACUUM needs to do a linear scan of an index to search for deleted pages
that can be reclaimed because they are older than all open transactions.
For efficiency's sake, we'd like to use the same linear scan to search for
deletable tuples. Before Postgres 8.2, btbulkdelete scanned the leaf pages
in index order, but it is possible to visit them in physical order instead.
The tricky part of this is to avoid missing any deletable tuples in the
presence of concurrent page splits: a page split could easily move some
tuples from a page not yet passed over by the sequential scan to a
lower-numbered page already passed over. (This wasn't a concern for the
index-order scan, because splits always split right.) To implement this,
we provide a "vacuum cycle ID" mechanism that makes it possible to
determine whether a page has been split since the current btbulkdelete
cycle started. If btbulkdelete finds a page that has been split since
it started, and has a right-link pointing to a lower page number, then
it temporarily suspends its sequential scan and visits that page instead.
It must continue to follow right-links and vacuum dead tuples until
reaching a page that either hasn't been split since btbulkdelete started,
or is above the location of the outer sequential scan. Then it can resume
the sequential scan. This ensures that all tuples are visited. It may be
that some tuples are visited twice, but that has no worse effect than an
inaccurate index tuple count (and we can't guarantee an accurate count
anyway in the face of concurrent activity). Note that this still works
if the has-been-recently-split test has a small probability of false
positives, so long as it never gives a false negative. This makes it
possible to implement the test with a small counter value stored on each
index page.
WAL considerations
------------------
......
......@@ -8,7 +8,7 @@
*
*
* IDENTIFICATION
* $PostgreSQL: pgsql/src/backend/access/nbtree/nbtinsert.c,v 1.136 2006/04/25 22:46:05 tgl Exp $
* $PostgreSQL: pgsql/src/backend/access/nbtree/nbtinsert.c,v 1.137 2006/05/08 00:00:09 tgl Exp $
*
*-------------------------------------------------------------------------
*/
......@@ -700,14 +700,18 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
ropaque = (BTPageOpaque) PageGetSpecialPointer(rightpage);
/* if we're splitting this page, it won't be the root when we're done */
/* also, clear the SPLIT_END flag in both pages */
lopaque->btpo_flags = oopaque->btpo_flags;
lopaque->btpo_flags &= ~BTP_ROOT;
lopaque->btpo_flags &= ~(BTP_ROOT | BTP_SPLIT_END);
ropaque->btpo_flags = lopaque->btpo_flags;
lopaque->btpo_prev = oopaque->btpo_prev;
lopaque->btpo_next = BufferGetBlockNumber(rbuf);
ropaque->btpo_prev = BufferGetBlockNumber(buf);
ropaque->btpo_next = oopaque->btpo_next;
lopaque->btpo.level = ropaque->btpo.level = oopaque->btpo.level;
/* Since we already have write-lock on both pages, ok to read cycleid */
lopaque->btpo_cycleid = _bt_vacuum_cycleid(rel);
ropaque->btpo_cycleid = lopaque->btpo_cycleid;
/*
* If the page we're splitting is not the rightmost page at its level in
......@@ -836,6 +840,21 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
sopaque = (BTPageOpaque) PageGetSpecialPointer(spage);
if (sopaque->btpo_prev != ropaque->btpo_prev)
elog(PANIC, "right sibling's left-link doesn't match");
/*
* Check to see if we can set the SPLIT_END flag in the right-hand
* split page; this can save some I/O for vacuum since it need not
* proceed to the right sibling. We can set the flag if the right
* sibling has a different cycleid: that means it could not be part
* of a group of pages that were all split off from the same ancestor
* page. If you're confused, imagine that page A splits to A B and
* then again, yielding A C B, while vacuum is in progress. Tuples
* originally in A could now be in either B or C, hence vacuum must
* examine both pages. But if D, our right sibling, has a different
* cycleid then it could not contain any tuples that were in A when
* the vacuum started.
*/
if (sopaque->btpo_cycleid != ropaque->btpo_cycleid)
ropaque->btpo_flags |= BTP_SPLIT_END;
}
/*
......@@ -1445,6 +1464,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
rootopaque->btpo_flags = BTP_ROOT;
rootopaque->btpo.level =
((BTPageOpaque) PageGetSpecialPointer(lpage))->btpo.level + 1;
rootopaque->btpo_cycleid = 0;
/* update metapage data */
metad->btm_root = rootblknum;
......
......@@ -9,7 +9,7 @@
*
*
* IDENTIFICATION
* $PostgreSQL: pgsql/src/backend/access/nbtree/nbtpage.c,v 1.96 2006/04/25 22:46:05 tgl Exp $
* $PostgreSQL: pgsql/src/backend/access/nbtree/nbtpage.c,v 1.97 2006/05/08 00:00:10 tgl Exp $
*
* NOTES
* Postgres btree pages look like ordinary relation pages. The opaque
......@@ -206,6 +206,7 @@ _bt_getroot(Relation rel, int access)
rootopaque->btpo_prev = rootopaque->btpo_next = P_NONE;
rootopaque->btpo_flags = (BTP_LEAF | BTP_ROOT);
rootopaque->btpo.level = 0;
rootopaque->btpo_cycleid = 0;
/* NO ELOG(ERROR) till meta is updated */
START_CRIT_SECTION();
......@@ -544,7 +545,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
* Release the file-extension lock; it's now OK for someone else to
* extend the relation some more. Note that we cannot release this
* lock before we have buffer lock on the new page, or we risk a race
* condition against btvacuumcleanup --- see comments therein.
* condition against btvacuumscan --- see comments therein.
*/
if (needLock)
UnlockRelationForExtension(rel, ExclusiveLock);
......@@ -608,7 +609,7 @@ _bt_pageinit(Page page, Size size)
/*
* _bt_page_recyclable() -- Is an existing page recyclable?
*
* This exists to make sure _bt_getbuf and btvacuumcleanup have the same
* This exists to make sure _bt_getbuf and btvacuumscan have the same
* policy about whether a page is safe to re-use.
*/
bool
......@@ -651,6 +652,7 @@ _bt_delitems(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems)
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
......@@ -658,6 +660,13 @@ _bt_delitems(Relation rel, Buffer buf,
/* Fix the page */
PageIndexMultiDelete(page, itemnos, nitems);
/*
* We can clear the vacuum cycle ID since this page has certainly
* been processed by the current vacuum scan.
*/
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
opaque->btpo_cycleid = 0;
MarkBufferDirty(buf);
/* XLOG stuff */
......
This diff is collapsed.
......@@ -56,7 +56,7 @@
* Portions Copyright (c) 1994, Regents of the University of California
*
* IDENTIFICATION
* $PostgreSQL: pgsql/src/backend/access/nbtree/nbtsort.c,v 1.100 2006/03/10 20:18:15 tgl Exp $
* $PostgreSQL: pgsql/src/backend/access/nbtree/nbtsort.c,v 1.101 2006/05/08 00:00:10 tgl Exp $
*
*-------------------------------------------------------------------------
*/
......@@ -244,6 +244,7 @@ _bt_blnewpage(uint32 level)
opaque->btpo_prev = opaque->btpo_next = P_NONE;
opaque->btpo.level = level;
opaque->btpo_flags = (level > 0) ? 0 : BTP_LEAF;
opaque->btpo_cycleid = 0;
/* Make the P_HIKEY line pointer appear allocated */
((PageHeader) page)->pd_lower += sizeof(ItemIdData);
......
......@@ -8,17 +8,20 @@
*
*
* IDENTIFICATION
* $PostgreSQL: pgsql/src/backend/access/nbtree/nbtutils.c,v 1.73 2006/05/07 01:21:30 tgl Exp $
* $PostgreSQL: pgsql/src/backend/access/nbtree/nbtutils.c,v 1.74 2006/05/08 00:00:10 tgl Exp $
*
*-------------------------------------------------------------------------
*/
#include "postgres.h"
#include <time.h>
#include "access/genam.h"
#include "access/nbtree.h"
#include "catalog/catalog.h"
#include "executor/execdebug.h"
#include "miscadmin.h"
static void _bt_mark_scankey_required(ScanKey skey);
......@@ -820,11 +823,13 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, TupleDesc tupdesc,
* delete. We cope with cases where items have moved right due to insertions.
* If an item has moved off the current page due to a split, we'll fail to
* find it and do nothing (this is not an error case --- we assume the item
* will eventually get marked in a future indexscan). Likewise, if the item
* has moved left due to deletions or disappeared itself, we'll not find it,
* but these cases are not worth optimizing. (Since deletions are only done
* by VACUUM, any deletion makes it highly likely that the dead item has been
* removed itself, and therefore searching left is not worthwhile.)
* will eventually get marked in a future indexscan). Note that because we
* hold pin on the target page continuously from initially reading the items
* until applying this function, VACUUM cannot have deleted any items from
* the page, and so there is no need to search left from the recorded offset.
* (This observation also guarantees that the item is still the right one
* to delete, which might otherwise be questionable since heap TIDs can get
* recycled.)
*/
void
_bt_killitems(IndexScanDesc scan, bool haveLock)
......@@ -890,3 +895,187 @@ _bt_killitems(IndexScanDesc scan, bool haveLock)
*/
so->numKilled = 0;
}
/*
* The following routines manage a shared-memory area in which we track
* assignment of "vacuum cycle IDs" to currently-active btree vacuuming
* operations. There is a single counter which increments each time we
* start a vacuum to assign it a cycle ID. Since multiple vacuums could
* be active concurrently, we have to track the cycle ID for each active
* vacuum; this requires at most MaxBackends entries (usually far fewer).
* We assume at most one vacuum can be active for a given index.
*
* Access to the shared memory area is controlled by BtreeVacuumLock.
* In principle we could use a separate lmgr locktag for each index,
* but a single LWLock is much cheaper, and given the short time that
* the lock is ever held, the concurrency hit should be minimal.
*/
typedef struct BTOneVacInfo
{
LockRelId relid; /* global identifier of an index */
BTCycleId cycleid; /* cycle ID for its active VACUUM */
} BTOneVacInfo;
typedef struct BTVacInfo
{
BTCycleId cycle_ctr; /* cycle ID most recently assigned */
int num_vacuums; /* number of currently active VACUUMs */
int max_vacuums; /* allocated length of vacuums[] array */
BTOneVacInfo vacuums[1]; /* VARIABLE LENGTH ARRAY */
} BTVacInfo;
static BTVacInfo *btvacinfo;
/*
* _bt_vacuum_cycleid --- get the active vacuum cycle ID for an index,
* or zero if there is no active VACUUM
*
* Note: for correct interlocking, the caller must already hold pin and
* exclusive lock on each buffer it will store the cycle ID into. This
* ensures that even if a VACUUM starts immediately afterwards, it cannot
* process those pages until the page split is complete.
*/
BTCycleId
_bt_vacuum_cycleid(Relation rel)
{
BTCycleId result = 0;
int i;
/* Share lock is enough since this is a read-only operation */
LWLockAcquire(BtreeVacuumLock, LW_SHARED);
for (i = 0; i < btvacinfo->num_vacuums; i++)
{
BTOneVacInfo *vac = &btvacinfo->vacuums[i];
if (vac->relid.relId == rel->rd_lockInfo.lockRelId.relId &&
vac->relid.dbId == rel->rd_lockInfo.lockRelId.dbId)
{
result = vac->cycleid;
break;
}
}
LWLockRelease(BtreeVacuumLock);
return result;
}
/*
* _bt_start_vacuum --- assign a cycle ID to a just-starting VACUUM operation
*
* Note: the caller must guarantee (via PG_TRY) that it will eventually call
* _bt_end_vacuum, else we'll permanently leak an array slot.
*/
BTCycleId
_bt_start_vacuum(Relation rel)
{
BTCycleId result;
int i;
BTOneVacInfo *vac;
LWLockAcquire(BtreeVacuumLock, LW_EXCLUSIVE);
/* Assign the next cycle ID, being careful to avoid zero */
do {
result = ++(btvacinfo->cycle_ctr);
} while (result == 0);
/* Let's just make sure there's no entry already for this index */
for (i = 0; i < btvacinfo->num_vacuums; i++)
{
vac = &btvacinfo->vacuums[i];
if (vac->relid.relId == rel->rd_lockInfo.lockRelId.relId &&
vac->relid.dbId == rel->rd_lockInfo.lockRelId.dbId)
elog(ERROR, "multiple active vacuums for index \"%s\"",
RelationGetRelationName(rel));
}
/* OK, add an entry */
if (btvacinfo->num_vacuums >= btvacinfo->max_vacuums)
elog(ERROR, "out of btvacinfo slots");
vac = &btvacinfo->vacuums[btvacinfo->num_vacuums];
vac->relid = rel->rd_lockInfo.lockRelId;
vac->cycleid = result;
btvacinfo->num_vacuums++;
LWLockRelease(BtreeVacuumLock);
return result;
}
/*
* _bt_end_vacuum --- mark a btree VACUUM operation as done
*
* Note: this is deliberately coded not to complain if no entry is found;
* this allows the caller to put PG_TRY around the start_vacuum operation.
*/
void
_bt_end_vacuum(Relation rel)
{
int i;
LWLockAcquire(BtreeVacuumLock, LW_EXCLUSIVE);
/* Find the array entry */
for (i = 0; i < btvacinfo->num_vacuums; i++)
{
BTOneVacInfo *vac = &btvacinfo->vacuums[i];
if (vac->relid.relId == rel->rd_lockInfo.lockRelId.relId &&
vac->relid.dbId == rel->rd_lockInfo.lockRelId.dbId)
{
/* Remove it by shifting down the last entry */
*vac = btvacinfo->vacuums[btvacinfo->num_vacuums - 1];
btvacinfo->num_vacuums--;
break;
}
}
LWLockRelease(BtreeVacuumLock);
}
/*
* BTreeShmemSize --- report amount of shared memory space needed
*/
Size
BTreeShmemSize(void)
{
Size size;
size = offsetof(BTVacInfo, vacuums[0]);
size = add_size(size, mul_size(MaxBackends, sizeof(BTOneVacInfo)));
return size;
}
/*
* BTreeShmemInit --- initialize this module's shared memory
*/
void
BTreeShmemInit(void)
{
bool found;
btvacinfo = (BTVacInfo *) ShmemInitStruct("BTree Vacuum State",
BTreeShmemSize(),
&found);
if (!IsUnderPostmaster)
{
/* Initialize shared memory area */
Assert(!found);
/*
* It doesn't really matter what the cycle counter starts at, but
* having it always start the same doesn't seem good. Seed with
* low-order bits of time() instead.
*/
btvacinfo->cycle_ctr = (BTCycleId) time(NULL);
btvacinfo->num_vacuums = 0;
btvacinfo->max_vacuums = MaxBackends;
}
else
Assert(found);
}
......@@ -8,7 +8,7 @@
* Portions Copyright (c) 1994, Regents of the University of California
*
* IDENTIFICATION
* $PostgreSQL: pgsql/src/backend/access/nbtree/nbtxlog.c,v 1.32 2006/04/13 03:53:05 tgl Exp $
* $PostgreSQL: pgsql/src/backend/access/nbtree/nbtxlog.c,v 1.33 2006/05/08 00:00:10 tgl Exp $
*
*-------------------------------------------------------------------------
*/
......@@ -249,6 +249,7 @@ btree_xlog_split(bool onleft, bool isroot,
pageop->btpo_next = rightsib;
pageop->btpo.level = xlrec->level;
pageop->btpo_flags = (xlrec->level == 0) ? BTP_LEAF : 0;
pageop->btpo_cycleid = 0;
_bt_restore_page(page,
(char *) xlrec + SizeOfBtreeSplit,
......@@ -281,6 +282,7 @@ btree_xlog_split(bool onleft, bool isroot,
pageop->btpo_next = xlrec->rightblk;
pageop->btpo.level = xlrec->level;
pageop->btpo_flags = (xlrec->level == 0) ? BTP_LEAF : 0;
pageop->btpo_cycleid = 0;
_bt_restore_page(page,
(char *) xlrec + SizeOfBtreeSplit + xlrec->leftlen,
......@@ -506,6 +508,7 @@ btree_xlog_delete_page(bool ismeta,
pageop->btpo_next = rightsib;
pageop->btpo.xact = FrozenTransactionId;
pageop->btpo_flags = BTP_DELETED;
pageop->btpo_cycleid = 0;
PageSetLSN(page, lsn);
PageSetTLI(page, ThisTimeLineID);
......@@ -548,6 +551,7 @@ btree_xlog_newroot(XLogRecPtr lsn, XLogRecord *record)
pageop->btpo.level = xlrec->level;
if (xlrec->level == 0)
pageop->btpo_flags |= BTP_LEAF;
pageop->btpo_cycleid = 0;
if (record->xl_len > SizeOfBtreeNewroot)
{
......
......@@ -8,7 +8,7 @@
*
*
* IDENTIFICATION
* $PostgreSQL: pgsql/src/backend/storage/ipc/ipci.c,v 1.82 2006/03/05 15:58:37 momjian Exp $
* $PostgreSQL: pgsql/src/backend/storage/ipc/ipci.c,v 1.83 2006/05/08 00:00:10 tgl Exp $
*
*-------------------------------------------------------------------------
*/
......@@ -16,6 +16,7 @@
#include "access/clog.h"
#include "access/multixact.h"
#include "access/nbtree.h"
#include "access/subtrans.h"
#include "access/twophase.h"
#include "access/xlog.h"
......@@ -88,6 +89,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
size = add_size(size, SInvalShmemSize());
size = add_size(size, FreeSpaceShmemSize());
size = add_size(size, BgWriterShmemSize());
size = add_size(size, BTreeShmemSize());
#ifdef EXEC_BACKEND
size = add_size(size, ShmemBackendArraySize());
#endif
......@@ -182,6 +184,11 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
PMSignalInit();
BgWriterShmemInit();
/*
* Set up other modules that need some shared memory space
*/
BTreeShmemInit();
#ifdef EXEC_BACKEND
/*
......
......@@ -7,7 +7,7 @@
* Portions Copyright (c) 1996-2006, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
* $PostgreSQL: pgsql/src/include/access/nbtree.h,v 1.97 2006/05/07 01:21:30 tgl Exp $
* $PostgreSQL: pgsql/src/include/access/nbtree.h,v 1.98 2006/05/08 00:00:10 tgl Exp $
*
*-------------------------------------------------------------------------
*/
......@@ -19,6 +19,10 @@
#include "access/sdir.h"
#include "access/xlogutils.h"
/* There's room for a 16-bit vacuum cycle ID in BTPageOpaqueData */
typedef uint16 BTCycleId;
/*
* BTPageOpaqueData -- At the end of every page, we store a pointer
* to both siblings in the tree. This is used to do forward/backward
......@@ -31,6 +35,16 @@
* and status. If the page is deleted, we replace the level with the
* next-transaction-ID value indicating when it is safe to reclaim the page.
*
* We also store a "vacuum cycle ID". When a page is split while VACUUM is
* processing the index, a nonzero value associated with the VACUUM run is
* stored into both halves of the split page. (If VACUUM is not running,
* both pages receive zero cycleids.) This allows VACUUM to detect whether
* a page was split since it started, with a small probability of false match
* if the page was last split some exact multiple of 65536 VACUUMs ago.
* Also, during a split, the BTP_SPLIT_END flag is cleared in the left
* (original) page, and set in the right page, but only if the next page
* to its right has a different cycleid.
*
* NOTE: the BTP_LEAF flag bit is redundant since level==0 could be tested
* instead.
*/
......@@ -45,6 +59,7 @@ typedef struct BTPageOpaqueData
TransactionId xact; /* next transaction ID, if deleted */
} btpo;
uint16 btpo_flags; /* flag bits, see below */
BTCycleId btpo_cycleid; /* vacuum cycle ID of latest split */
} BTPageOpaqueData;
typedef BTPageOpaqueData *BTPageOpaque;
......@@ -55,6 +70,7 @@ typedef BTPageOpaqueData *BTPageOpaque;
#define BTP_DELETED (1 << 2) /* page has been deleted from tree */
#define BTP_META (1 << 3) /* meta-page */
#define BTP_HALF_DEAD (1 << 4) /* empty, but still in tree */
#define BTP_SPLIT_END (1 << 5) /* rightmost page of split group */
/*
......@@ -492,6 +508,11 @@ extern bool _bt_checkkeys(IndexScanDesc scan,
Page page, OffsetNumber offnum,
ScanDirection dir, bool *continuescan);
extern void _bt_killitems(IndexScanDesc scan, bool haveLock);
extern BTCycleId _bt_vacuum_cycleid(Relation rel);
extern BTCycleId _bt_start_vacuum(Relation rel);
extern void _bt_end_vacuum(Relation rel);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
/*
* prototypes for functions in nbtsort.c
......
......@@ -7,7 +7,7 @@
* Portions Copyright (c) 1996-2006, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
* $PostgreSQL: pgsql/src/include/storage/lwlock.h,v 1.27 2006/03/05 15:58:59 momjian Exp $
* $PostgreSQL: pgsql/src/include/storage/lwlock.h,v 1.28 2006/05/08 00:00:17 tgl Exp $
*
*-------------------------------------------------------------------------
*/
......@@ -47,6 +47,7 @@ typedef enum LWLockId
BgWriterCommLock,
TwoPhaseStateLock,
TablespaceCreateLock,
BtreeVacuumLock,
FirstLockMgrLock, /* must be last except for MaxDynamicLWLock */
MaxDynamicLWLock = 1000000000
......
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment