Commit 70ce5c90 authored by Tom Lane

Fix "failed to re-find parent key" btree VACUUM failure by revising page
deletion code to avoid the case where an upper-level btree page remains "half
dead" for a significant period of time, and to block insertions into a key
range that is in process of being re-assigned to the right sibling of the
deleted page's parent.  This prevents the scenario reported by Ed L. wherein
index keys could become out-of-order in the grandparent index level.

Since this is a moderately invasive fix, I'm applying it only to HEAD.
The bug exists back to 7.4, but the back branches will get a different patch.
parent 19d0c46d
$PostgreSQL: pgsql/src/backend/access/nbtree/README,v 1.13 2006/07/25 19:13:00 tgl Exp $
$PostgreSQL: pgsql/src/backend/access/nbtree/README,v 1.14 2006/11/01 19:43:17 tgl Exp $
This directory contains a correct implementation of Lehman and Yao's
high-concurrency B-tree management algorithm (P. Lehman and S. Yao,
......@@ -201,26 +201,25 @@ When we delete the last remaining child of a parent page, we mark the
parent page "half-dead" as part of the atomic update that deletes the
child page. This implicitly transfers the parent's key space to its right
sibling (which it must have, since we never delete the overall-rightmost
page of a level). No future insertions into the parent level are allowed
to insert keys into the half-dead page --- they must move right to its
sibling, instead. The parent remains empty and can be deleted in a
separate atomic action. (However, if it's the rightmost child of its own
parent, it might have to stay half-dead for awhile, until it's also the
only child.)
Note that an empty leaf page is a valid tree state, but an empty interior
page is not legal (an interior page must have children to delegate its
key space to). So an interior page *must* be marked half-dead as soon
as its last child is deleted.
page of a level). Searches ignore the half-dead page and immediately move
right. We need not worry about insertions into a half-dead page --- insertions
into upper tree levels happen only as a result of splits of child pages, and
the half-dead page no longer has any children that could split. Therefore
the page stays empty even when we don't have lock on it, and we can complete
its deletion in a second atomic action.
The notion of a half-dead page means that the key space relationship between
the half-dead page's level and its parent's level may be a little out of
whack: key space that appears to belong to the half-dead page's parent on the
parent level may really belong to its right sibling. We can tolerate this,
however, because insertions and deletions on upper tree levels are always
done by reference to child page numbers, not keys. The only cost is that
searches may sometimes descend to the half-dead page and then have to move
right, rather than going directly to the sibling page.
parent level may really belong to its right sibling. To prevent any possible
problems, we hold lock on the deleted child page until we have finished
deleting any now-half-dead parent page(s). This prevents any insertions into
the transferred keyspace until the operation is complete. The reason for
doing this is that a sufficiently large number of insertions into the
transferred keyspace, resulting in multiple page splits, could propagate keys
from that keyspace into the parent level, resulting in transiently
out-of-order keys in that level. It is thought that that wouldn't cause any
serious problem, but it seems too risky to allow.
A deleted page cannot be reclaimed immediately, since there may be other
processes waiting to reference it (ie, search processes that just left the
......
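As a rough illustration of the README text above ("Searches ignore the half-dead page and immediately move right"), the following toy model walks right along a level until it reaches a live page whose key space covers the search key. It is only a sketch and not the nbtree code: `ToyPage`, `toy_move_right`, the `TOY_*` flag values and the integer `high_key` are all hypothetical stand-ins for the real page layout, and `TOY_IGNORE` merely mimics the shape of `P_IGNORE`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits; the real BTP_* values live in nbtree.h. */
#define TOY_DELETED   (1 << 0)
#define TOY_HALF_DEAD (1 << 1)

/* Same shape as P_IGNORE: true for deleted or half-dead pages. */
#define TOY_IGNORE(p) ((p)->flags & (TOY_DELETED | TOY_HALF_DEAD))

typedef struct ToyPage
{
    int             flags;     /* TOY_* bits */
    int             high_key;  /* upper bound of this page's key space */
    struct ToyPage *right;     /* right sibling, NULL if rightmost */
} ToyPage;

/*
 * Step right while the current page must be ignored or its key space
 * ends before the search key.  The rightmost page is never skipped,
 * matching the rule that the rightmost page of a level is never deleted.
 */
ToyPage *
toy_move_right(ToyPage *page, int key)
{
    assert(page != NULL);
    while (page->right != NULL &&
           (TOY_IGNORE(page) || key > page->high_key))
        page = page->right;
    return page;
}
```

With three pages a, b, c linked left to right and b marked half-dead, a search that would have landed on b ends on c instead, which is the implicit transfer of key space to the right sibling that the README describes.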
......@@ -8,7 +8,7 @@
*
*
* IDENTIFICATION
* $PostgreSQL: pgsql/src/backend/access/nbtree/nbtinsert.c,v 1.144 2006/10/04 00:29:48 momjian Exp $
* $PostgreSQL: pgsql/src/backend/access/nbtree/nbtinsert.c,v 1.145 2006/11/01 19:43:17 tgl Exp $
*
*-------------------------------------------------------------------------
*/
......@@ -1337,8 +1337,8 @@ _bt_insert_parent(Relation rel,
/* Check for error only after writing children */
if (pbuf == InvalidBuffer)
elog(ERROR, "failed to re-find parent key in \"%s\"",
RelationGetRelationName(rel));
elog(ERROR, "failed to re-find parent key in \"%s\" for split pages %u/%u",
RelationGetRelationName(rel), bknum, rbknum);
/* Recursively update the parent */
_bt_insertonpg(rel, pbuf, stack->bts_parent,
......
This diff is collapsed.
......@@ -12,7 +12,7 @@
* Portions Copyright (c) 1994, Regents of the University of California
*
* IDENTIFICATION
* $PostgreSQL: pgsql/src/backend/access/nbtree/nbtree.c,v 1.152 2006/10/04 00:29:49 momjian Exp $
* $PostgreSQL: pgsql/src/backend/access/nbtree/nbtree.c,v 1.153 2006/11/01 19:43:17 tgl Exp $
*
*-------------------------------------------------------------------------
*/
......@@ -804,8 +804,7 @@ restart:
if (blkno != orig_blkno)
{
if (_bt_page_recyclable(page) ||
P_ISDELETED(opaque) ||
(opaque->btpo_flags & BTP_HALF_DEAD) ||
P_IGNORE(opaque) ||
!P_ISLEAF(opaque) ||
opaque->btpo_cycleid != vstate->cycleid)
{
......@@ -828,7 +827,7 @@ restart:
/* Already deleted, but can't recycle yet */
stats->pages_deleted++;
}
else if (opaque->btpo_flags & BTP_HALF_DEAD)
else if (P_ISHALFDEAD(opaque))
{
/* Half-dead, try to delete */
delete_now = true;
......@@ -939,7 +938,7 @@ restart:
MemoryContextReset(vstate->pagedelcontext);
oldcontext = MemoryContextSwitchTo(vstate->pagedelcontext);
ndel = _bt_pagedel(rel, buf, info->vacuum_full);
ndel = _bt_pagedel(rel, buf, NULL, info->vacuum_full);
/* count only this page, else may double-count parent */
if (ndel)
......
......@@ -8,7 +8,7 @@
* Portions Copyright (c) 1994, Regents of the University of California
*
* IDENTIFICATION
* $PostgreSQL: pgsql/src/backend/access/nbtree/nbtxlog.c,v 1.38 2006/10/04 00:29:49 momjian Exp $
* $PostgreSQL: pgsql/src/backend/access/nbtree/nbtxlog.c,v 1.39 2006/11/01 19:43:17 tgl Exp $
*
*-------------------------------------------------------------------------
*/
......@@ -22,31 +22,41 @@
* them manually if they are not seen in the WAL log during replay. This
* makes it safe for page insertion to be a multiple-WAL-action process.
*
* Similarly, deletion of an only child page and deletion of its parent page
* form multiple WAL log entries, and we have to be prepared to follow through
* with the deletion if the log ends between.
*
* The data structure is a simple linked list --- this should be good enough,
* since we don't expect a page split to remain incomplete for long.
* since we don't expect a page split or multi deletion to remain incomplete
* for long. In any case we need to respect the order of operations.
*/
typedef struct bt_incomplete_split
typedef struct bt_incomplete_action
{
RelFileNode node; /* the index */
bool is_split; /* T = pending split, F = pending delete */
/* these fields are for a split: */
bool is_root; /* we split the root */
BlockNumber leftblk; /* left half of split */
BlockNumber rightblk; /* right half of split */
bool is_root; /* we split the root */
} bt_incomplete_split;
/* these fields are for a delete: */
BlockNumber delblk; /* parent block to be deleted */
} bt_incomplete_action;
static List *incomplete_splits;
static List *incomplete_actions;
static void
log_incomplete_split(RelFileNode node, BlockNumber leftblk,
BlockNumber rightblk, bool is_root)
{
bt_incomplete_split *split = palloc(sizeof(bt_incomplete_split));
split->node = node;
split->leftblk = leftblk;
split->rightblk = rightblk;
split->is_root = is_root;
incomplete_splits = lappend(incomplete_splits, split);
bt_incomplete_action *action = palloc(sizeof(bt_incomplete_action));
action->node = node;
action->is_split = true;
action->is_root = is_root;
action->leftblk = leftblk;
action->rightblk = rightblk;
incomplete_actions = lappend(incomplete_actions, action);
}
static void
......@@ -54,17 +64,50 @@ forget_matching_split(RelFileNode node, BlockNumber downlink, bool is_root)
{
ListCell *l;
foreach(l, incomplete_splits)
foreach(l, incomplete_actions)
{
bt_incomplete_split *split = (bt_incomplete_split *) lfirst(l);
bt_incomplete_action *action = (bt_incomplete_action *) lfirst(l);
if (RelFileNodeEquals(node, split->node) &&
downlink == split->rightblk)
if (RelFileNodeEquals(node, action->node) &&
action->is_split &&
downlink == action->rightblk)
{
if (is_root != split->is_root)
if (is_root != action->is_root)
elog(LOG, "forget_matching_split: fishy is_root data (expected %d, got %d)",
split->is_root, is_root);
incomplete_splits = list_delete_ptr(incomplete_splits, split);
action->is_root, is_root);
incomplete_actions = list_delete_ptr(incomplete_actions, action);
pfree(action);
break; /* need not look further */
}
}
}
static void
log_incomplete_deletion(RelFileNode node, BlockNumber delblk)
{
bt_incomplete_action *action = palloc(sizeof(bt_incomplete_action));
action->node = node;
action->is_split = false;
action->delblk = delblk;
incomplete_actions = lappend(incomplete_actions, action);
}
static void
forget_matching_deletion(RelFileNode node, BlockNumber delblk)
{
ListCell *l;
foreach(l, incomplete_actions)
{
bt_incomplete_action *action = (bt_incomplete_action *) lfirst(l);
if (RelFileNodeEquals(node, action->node) &&
!action->is_split &&
delblk == action->delblk)
{
incomplete_actions = list_delete_ptr(incomplete_actions, action);
pfree(action);
break; /* need not look further */
}
}
......@@ -389,8 +432,7 @@ btree_xlog_delete(XLogRecPtr lsn, XLogRecord *record)
}
static void
btree_xlog_delete_page(bool ismeta,
XLogRecPtr lsn, XLogRecord *record)
btree_xlog_delete_page(uint8 info, XLogRecPtr lsn, XLogRecord *record)
{
xl_btree_delete_page *xlrec = (xl_btree_delete_page *) XLogRecGetData(record);
Relation reln;
......@@ -427,6 +469,7 @@ btree_xlog_delete_page(bool ismeta,
poffset = ItemPointerGetOffsetNumber(&(xlrec->target.tid));
if (poffset >= PageGetMaxOffsetNumber(page))
{
Assert(info == XLOG_BTREE_DELETE_PAGE_HALF);
Assert(poffset == P_FIRSTDATAKEY(pageop));
PageIndexTupleDelete(page, poffset);
pageop->btpo_flags |= BTP_HALF_DEAD;
......@@ -437,6 +480,7 @@ btree_xlog_delete_page(bool ismeta,
IndexTuple itup;
OffsetNumber nextoffset;
Assert(info != XLOG_BTREE_DELETE_PAGE_HALF);
itemid = PageGetItemId(page, poffset);
itup = (IndexTuple) PageGetItem(page, itemid);
ItemPointerSet(&(itup->t_tid), rightsib, P_HIKEY);
......@@ -523,7 +567,7 @@ btree_xlog_delete_page(bool ismeta,
UnlockReleaseBuffer(buffer);
/* Update metapage if needed */
if (ismeta)
if (info == XLOG_BTREE_DELETE_PAGE_META)
{
xl_btree_metadata md;
......@@ -533,6 +577,13 @@ btree_xlog_delete_page(bool ismeta,
md.root, md.level,
md.fastroot, md.fastlevel);
}
/* Forget any completed deletion */
forget_matching_deletion(xlrec->target.node, target);
/* If parent became half-dead, remember it for deletion */
if (info == XLOG_BTREE_DELETE_PAGE_HALF)
log_incomplete_deletion(xlrec->target.node, parent);
}
static void
......@@ -620,10 +671,9 @@ btree_redo(XLogRecPtr lsn, XLogRecord *record)
btree_xlog_delete(lsn, record);
break;
case XLOG_BTREE_DELETE_PAGE:
btree_xlog_delete_page(false, lsn, record);
break;
case XLOG_BTREE_DELETE_PAGE_META:
btree_xlog_delete_page(true, lsn, record);
case XLOG_BTREE_DELETE_PAGE_HALF:
btree_xlog_delete_page(info, lsn, record);
break;
case XLOG_BTREE_NEWROOT:
btree_xlog_newroot(lsn, record);
......@@ -724,6 +774,7 @@ btree_desc(StringInfo buf, uint8 xl_info, char *rec)
}
case XLOG_BTREE_DELETE_PAGE:
case XLOG_BTREE_DELETE_PAGE_META:
case XLOG_BTREE_DELETE_PAGE_HALF:
{
xl_btree_delete_page *xlrec = (xl_btree_delete_page *) rec;
......@@ -752,7 +803,7 @@ btree_desc(StringInfo buf, uint8 xl_info, char *rec)
void
btree_xlog_startup(void)
{
incomplete_splits = NIL;
incomplete_actions = NIL;
}
void
......@@ -760,45 +811,60 @@ btree_xlog_cleanup(void)
{
ListCell *l;
foreach(l, incomplete_splits)
foreach(l, incomplete_actions)
{
bt_incomplete_split *split = (bt_incomplete_split *) lfirst(l);
bt_incomplete_action *action = (bt_incomplete_action *) lfirst(l);
Relation reln;
Buffer lbuf,
rbuf;
Page lpage,
rpage;
BTPageOpaque lpageop,
rpageop;
bool is_only;
reln = XLogOpenRelation(split->node);
lbuf = XLogReadBuffer(reln, split->leftblk, false);
/* failure should be impossible because we wrote this page earlier */
if (!BufferIsValid(lbuf))
elog(PANIC, "btree_xlog_cleanup: left block unfound");
lpage = (Page) BufferGetPage(lbuf);
lpageop = (BTPageOpaque) PageGetSpecialPointer(lpage);
rbuf = XLogReadBuffer(reln, split->rightblk, false);
/* failure should be impossible because we wrote this page earlier */
if (!BufferIsValid(rbuf))
elog(PANIC, "btree_xlog_cleanup: right block unfound");
rpage = (Page) BufferGetPage(rbuf);
rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
/* if the two pages are all of their level, it's a only-page split */
is_only = P_LEFTMOST(lpageop) && P_RIGHTMOST(rpageop);
_bt_insert_parent(reln, lbuf, rbuf, NULL,
split->is_root, is_only);
reln = XLogOpenRelation(action->node);
if (action->is_split)
{
/* finish an incomplete split */
Buffer lbuf,
rbuf;
Page lpage,
rpage;
BTPageOpaque lpageop,
rpageop;
bool is_only;
lbuf = XLogReadBuffer(reln, action->leftblk, false);
/* failure is impossible because we wrote this page earlier */
if (!BufferIsValid(lbuf))
elog(PANIC, "btree_xlog_cleanup: left block unfound");
lpage = (Page) BufferGetPage(lbuf);
lpageop = (BTPageOpaque) PageGetSpecialPointer(lpage);
rbuf = XLogReadBuffer(reln, action->rightblk, false);
/* failure is impossible because we wrote this page earlier */
if (!BufferIsValid(rbuf))
elog(PANIC, "btree_xlog_cleanup: right block unfound");
rpage = (Page) BufferGetPage(rbuf);
rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
/* if the pages are all of their level, it's a only-page split */
is_only = P_LEFTMOST(lpageop) && P_RIGHTMOST(rpageop);
_bt_insert_parent(reln, lbuf, rbuf, NULL,
action->is_root, is_only);
}
else
{
/* finish an incomplete deletion (of a half-dead page) */
Buffer buf;
buf = XLogReadBuffer(reln, action->delblk, false);
if (BufferIsValid(buf))
if (_bt_pagedel(reln, buf, NULL, true) == 0)
elog(PANIC, "btree_xlog_cleanup: _bt_pagdel failed");
}
}
incomplete_splits = NIL;
incomplete_actions = NIL;
}
bool
btree_safe_restartpoint(void)
{
if (incomplete_splits)
if (incomplete_actions)
return false;
return true;
}
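The bookkeeping in the nbtxlog.c changes above can be pictured with a small stand-alone model: replaying a page-deletion record first forgets any pending deletion of the target page, then remembers the parent if that record left it half-dead, and a restartpoint is safe only when nothing is pending. This is a simplified sketch under stated assumptions, not the patch itself: it uses a fixed-size array instead of a PostgreSQL `List`, omits the relation identifier, and every `toy_*` name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_ACTIONS 16

typedef struct ToyAction
{
    bool     is_split;  /* true = pending split, false = pending delete */
    unsigned blk;       /* right half of a split, or half-dead parent */
} ToyAction;

static ToyAction toy_actions[TOY_MAX_ACTIONS];
static int       toy_nactions = 0;

/* Append a pending action, keeping arrival order. */
void
toy_log_action(bool is_split, unsigned blk)
{
    assert(toy_nactions < TOY_MAX_ACTIONS);
    toy_actions[toy_nactions].is_split = is_split;
    toy_actions[toy_nactions].blk = blk;
    toy_nactions++;
}

/* Drop the first matching entry; later entries keep their order. */
void
toy_forget_action(bool is_split, unsigned blk)
{
    for (int i = 0; i < toy_nactions; i++)
    {
        if (toy_actions[i].is_split == is_split &&
            toy_actions[i].blk == blk)
        {
            for (int j = i + 1; j < toy_nactions; j++)
                toy_actions[j - 1] = toy_actions[j];
            toy_nactions--;
            break;              /* need not look further */
        }
    }
}

/* Model of replaying one page-deletion record. */
void
toy_replay_delete_page(unsigned target, bool parent_half_dead,
                       unsigned parent)
{
    /* The target is now gone, so any pending deletion of it is done. */
    toy_forget_action(false, target);
    /* A half-dead parent still needs its own deletion record. */
    if (parent_half_dead)
        toy_log_action(false, parent);
}

/* A restartpoint is safe only when no action is pending. */
bool
toy_safe_restartpoint(void)
{
    return toy_nactions == 0;
}
```

If the log ends after the first record but before the parent's own deletion, the entry for the parent is still in the list at cleanup time, which corresponds to the case where `btree_xlog_cleanup` calls `_bt_pagedel` to finish the job.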
......@@ -7,7 +7,7 @@
* Portions Copyright (c) 1996-2006, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
* $PostgreSQL: pgsql/src/include/access/nbtree.h,v 1.105 2006/10/04 00:30:07 momjian Exp $
* $PostgreSQL: pgsql/src/include/access/nbtree.h,v 1.106 2006/11/01 19:43:17 tgl Exp $
*
*-------------------------------------------------------------------------
*/
......@@ -163,6 +163,7 @@ typedef struct BTMetaPageData
#define P_ISLEAF(opaque) ((opaque)->btpo_flags & BTP_LEAF)
#define P_ISROOT(opaque) ((opaque)->btpo_flags & BTP_ROOT)
#define P_ISDELETED(opaque) ((opaque)->btpo_flags & BTP_DELETED)
#define P_ISHALFDEAD(opaque) ((opaque)->btpo_flags & BTP_HALF_DEAD)
#define P_IGNORE(opaque) ((opaque)->btpo_flags & (BTP_DELETED|BTP_HALF_DEAD))
#define P_HAS_GARBAGE(opaque) ((opaque)->btpo_flags & BTP_HAS_GARBAGE)
......@@ -203,8 +204,10 @@ typedef struct BTMetaPageData
#define XLOG_BTREE_SPLIT_R_ROOT 0x60 /* as above, new item on right */
#define XLOG_BTREE_DELETE 0x70 /* delete leaf index tuple */
#define XLOG_BTREE_DELETE_PAGE 0x80 /* delete an entire page */
#define XLOG_BTREE_DELETE_PAGE_META 0x90 /* same, plus update metapage */
#define XLOG_BTREE_DELETE_PAGE_META 0x90 /* same, and update metapage */
#define XLOG_BTREE_NEWROOT 0xA0 /* new root page */
#define XLOG_BTREE_DELETE_PAGE_HALF 0xB0 /* page deletion that makes
* parent half-dead */
/*
* All that we need to find changed index tuple
......@@ -501,7 +504,8 @@ extern void _bt_pageinit(Page page, Size size);
extern bool _bt_page_recyclable(Page page);
extern void _bt_delitems(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems);
extern int _bt_pagedel(Relation rel, Buffer buf, bool vacuum_full);
extern int _bt_pagedel(Relation rel, Buffer buf,
BTStack stack, bool vacuum_full);
/*
* prototypes for functions in nbtsearch.c
......