Commit efada2b8 authored by Heikki Linnakangas's avatar Heikki Linnakangas

Fix race condition in B-tree page deletion.

In short, we don't allow a page to be deleted if it's the rightmost child
of its parent, but that situation can change after we check for it.

Problem
-------

We check that the page to be deleted is not the rightmost child of its
parent, and then lock its left sibling, the page itself, its right sibling,
and the parent, in that order. However, if the parent page is split after
the check but before acquiring the locks, the target page might become the
rightmost child, if the split happens at the right place. That leads to an
error in vacuum (I reproduced this by setting a breakpoint in debugger):

ERROR:  failed to delete rightmost child 41 of block 3 in index "foo_pkey"

We currently re-check that the page is still the rightmost child, and throw
the above error if it's not. We could easily just give up rather than throw
an error, but that approach doesn't scale to half-dead pages. To recap,
although we don't normally allow deleting the rightmost child, if the page
is the *only* child of its parent, we delete the child page and mark the
parent page as half-dead in one atomic operation. But before we do that, we
check that the parent can later be deleted, by checking that it in turn is
not the rightmost child of the grandparent (potentially recursing all the
way up to the root). But the same situation can arise there - the
grandparent can be split while we're not holding the locks. We end up with
a half-dead page that we cannot delete.

To make things worse, the keyspace of the deleted page has already been
transferred to its right sibling. As the README points out, the keyspace at
the grandparent level is "out-of-whack" until the half-dead page is deleted,
and if enough tuples with keys in the transferred keyspace are inserted, the
page might get split and a downlink might be inserted into the grandparent
that is out-of-order. That might not cause any serious problem if it's
transient (as the README ponders), but is surely bad if it stays that way.

Solution
--------

This patch changes the page deletion algorithm to avoid that problem. After
checking that the topmost page in the chain of to-be-deleted pages is not
the rightmost child of its parent, and then deleting the pages from bottom
up, unlink the pages from top to bottom. This way, the intermediate stages
are similar to the intermediate stages in page splitting, and there is no
transient stage where the keyspace is "out-of-whack". The topmost page in
the to-be-deleted chain doesn't have a downlink pointing to it, like a page
split before the downlink has been inserted.

This also allows us to get rid of the cleanup step after WAL recovery, if we
crash during page deletion. The deletion will be continued at next VACUUM,
but the tree is consistent for searches and insertions at every step.

This bug is old, all supported versions are affected, but this patch is too
big to back-patch (and changes the WAL record formats of related records).
We have not heard any reports of the bug from users, so clearly it's not
easy to bump into. Maybe backpatch later, after this has had some field
testing.

Reviewed by Kevin Grittner and Peter Geoghegan.
parent 6c461cb9
......@@ -168,26 +168,33 @@ parent item does still exist and can't have been deleted. Also, because
we are matching downlink page numbers and not data keys, we don't have any
problem with possibly misidentifying the parent item.
Page Deletion
-------------
We consider deleting an entire page from the btree only when it's become
completely empty of items. (Merging partly-full pages would allow better
space reuse, but it seems impractical to move existing data items left or
right to make this happen --- a scan moving in the opposite direction
might miss the items if so.) Also, we *never* delete the rightmost page
on a tree level (this restriction simplifies the traversal algorithms, as
explained below).
To delete an empty page, we acquire write lock on its left sibling (if
any), the target page itself, the right sibling (there must be one), and
the parent page, in that order. The parent page must be found using the
same type of search as used to find the parent during an insertion split.
Then we update the side-links in the siblings, mark the target page
deleted, and remove the downlink from the parent, as well as the parent's
upper bounding key for the target (the one separating it from its right
sibling). This causes the target page's key space to effectively belong
to its right sibling. (Neither the left nor right sibling pages need to
explained below). Page deletion always begins from an empty leaf page. An
internal page can only be deleted as part of a branch leading to a leaf
page, where each internal page has only one child and that child is also to
be deleted.
Deleting a leaf page is a two-stage process. In the first stage, the page
is unlinked from its parent, and marked as half-dead. The parent page must
be found using the same type of search as used to find the parent during an
insertion split. We lock the target and the parent pages, change the
target's downlink to point to the right sibling, and remove its old
downlink. This causes the target page's key space to effectively belong to
its right sibling. (Neither the left nor right sibling pages need to
change their "high key" if any; so there is no problem with possibly not
having enough space to replace a high key.) The side-links in the target
page are not changed.
having enough space to replace a high key.) At the same time, we mark the
target page as half-dead, which causes any subsequent searches to ignore it
and move right (or left, in a backwards scan). This leaves the tree in a
similar state as during a page split: the page has no downlink pointing to
it, but it's still linked to its siblings.
(Note: Lanin and Shasha prefer to make the key space move left, but their
argument for doing so hinges on not having left-links, which we have
......@@ -199,31 +206,44 @@ the same parent --- otherwise, the parent's key space assignment changes
too, meaning we'd have to make bounding-key updates in its parent, and
perhaps all the way up the tree. Since we can't possibly do that
atomically, we forbid this case. That means that the rightmost child of a
parent node can't be deleted unless it's the only remaining child.
When we delete the last remaining child of a parent page, we mark the
parent page "half-dead" as part of the atomic update that deletes the
child page. This implicitly transfers the parent's key space to its right
sibling (which it must have, since we never delete the overall-rightmost
page of a level). Searches ignore the half-dead page and immediately move
right. We need not worry about insertions into a half-dead page --- insertions
into upper tree levels happen only as a result of splits of child pages, and
the half-dead page no longer has any children that could split. Therefore
the page stays empty even when we don't have lock on it, and we can complete
its deletion in a second atomic action.
The notion of a half-dead page means that the key space relationship between
the half-dead page's level and its parent's level may be a little out of
whack: key space that appears to belong to the half-dead page's parent on the
parent level may really belong to its right sibling. To prevent any possible
problems, we hold lock on the deleted child page until we have finished
deleting any now-half-dead parent page(s). This prevents any insertions into
the transferred keyspace until the operation is complete. The reason for
doing this is that a sufficiently large number of insertions into the
transferred keyspace, resulting in multiple page splits, could propagate keys
from that keyspace into the parent level, resulting in transiently
out-of-order keys in that level. It is thought that that wouldn't cause any
serious problem, but it seems too risky to allow.
parent node can't be deleted unless it's the only remaining child, in which
case we will delete the parent too (see below).
In the second-stage, the half-dead leaf page is unlinked from its siblings.
We first lock the left sibling (if any) of the target, the target page
itself, and its right sibling (there must be one) in that order. Then we
update the side-links in the siblings, and mark the target page deleted.
When we're about to delete the last remaining child of a parent page, things
are slightly more complicated. In the first stage, we leave the immediate
parent of the leaf page alone, and remove the downlink to the parent page
instead, from the grandparent. If it's the last child of the grandparent
too, we recurse up until we find a parent with more than one child, and
remove the downlink of that page. The leaf page is marked as half-dead, and
the block number of the page whose downlink was removed is stashed in the
half-dead leaf page. This leaves us with a chain of internal pages, with
one downlink each, leading to the half-dead leaf page, and no downlink
pointing to the topmost page in the chain.
While we recurse up to find the topmost parent in the chain, we keep the
leaf page locked, but don't need to hold locks on the intermediate pages
between the leaf and the topmost parent -- insertions into upper tree levels
happen only as a result of splits of child pages, and that can't happen as
long as we're keeping the leaf locked. The internal pages in the chain
cannot acquire new children afterwards either, because the leaf page is
marked as half-dead and won't be split.
Removing the downlink to the top of the to-be-deleted chain effectively
transfers the key space to the right sibling for all the intermediate levels
too, in one atomic operation. A concurrent search might still visit the
intermediate pages, but it will move right when it reaches the half-dead page
at the leaf level.
In the second stage, the topmost page in the chain is unlinked from its
siblings, and the half-dead leaf page is updated to point to the next page
down in the chain. This is repeated until there are no internal pages left
in the chain. Finally, the half-dead leaf page itself is unlinked from its
siblings.
A deleted page cannot be reclaimed immediately, since there may be other
processes waiting to reference it (ie, search processes that just left the
......@@ -396,12 +416,11 @@ metapage update (of the "fast root" link) is performed and logged as part
of the insertion into the parent level. When splitting the root page, the
metapage update is handled as part of the "new root" action.
A page deletion is logged as a single WAL entry covering all four
required page updates (target page, left and right siblings, and parent)
as an atomic event. (Any required fast-root link update is also part
of the WAL entry.) If the parent page becomes half-dead but is not
immediately deleted due to a subsequent crash, there is no loss of
consistency, and the empty page will be picked up by the next VACUUM.
Each step in page deletion are logged as separate WAL entries: marking the
leaf as half-dead and removing the downlink is one record, and unlinking a
page is a second record. If vacuum is interrupted for some reason, or the
system crashes, the tree is consistent for searches and insertions. The next
VACUUM will find the half-dead leaf page and continue the deletion.
Scans during Recovery
---------------------
......
This diff is collapsed.
......@@ -1082,7 +1082,7 @@ restart:
MemoryContextReset(vstate->pagedelcontext);
oldcontext = MemoryContextSwitchTo(vstate->pagedelcontext);
ndel = _bt_pagedel(rel, buf, NULL);
ndel = _bt_pagedel(rel, buf);
/* count only this page, else may double-count parent */
if (ndel)
......
This diff is collapsed.
......@@ -124,16 +124,27 @@ btree_desc(StringInfo buf, uint8 xl_info, char *rec)
xlrec->hnode.spcNode, xlrec->hnode.dbNode, xlrec->hnode.relNode);
break;
}
case XLOG_BTREE_DELETE_PAGE:
case XLOG_BTREE_DELETE_PAGE_META:
case XLOG_BTREE_DELETE_PAGE_HALF:
case XLOG_BTREE_MARK_PAGE_HALFDEAD:
{
xl_btree_delete_page *xlrec = (xl_btree_delete_page *) rec;
xl_btree_mark_page_halfdead *xlrec = (xl_btree_mark_page_halfdead *) rec;
appendStringInfoString(buf, "delete_page: ");
appendStringInfoString(buf, "mark_page_halfdead: ");
out_target(buf, &(xlrec->target));
appendStringInfo(buf, "; dead %u; left %u; right %u",
xlrec->deadblk, xlrec->leftblk, xlrec->rightblk);
appendStringInfo(buf, "; downlink %u; leaf %u; left %u; right %u",
xlrec->leafblk, xlrec->leftblk, xlrec->rightblk, xlrec->downlink);
break;
}
case XLOG_BTREE_UNLINK_PAGE_META:
case XLOG_BTREE_UNLINK_PAGE:
{
xl_btree_unlink_page *xlrec = (xl_btree_unlink_page *) rec;
appendStringInfo(buf, "unlink_page: rel %u/%u/%u; ",
xlrec->node.spcNode, xlrec->node.dbNode, xlrec->node.relNode);
appendStringInfo(buf, "dead %u; left %u; right %u; btpo_xact %u",
xlrec->deadblk, xlrec->leftsib, xlrec->rightsib, xlrec->btpo_xact);
appendStringInfo(buf, "leaf %u; leafleft %u; leafright %u; downlink %u",
xlrec->leafblk, xlrec->leafleftsib, xlrec->leafrightsib, xlrec->downlink);
break;
}
case XLOG_BTREE_NEWROOT:
......
......@@ -215,11 +215,10 @@ typedef struct BTMetaPageData
#define XLOG_BTREE_SPLIT_L_ROOT 0x50 /* add tuple with split of root */
#define XLOG_BTREE_SPLIT_R_ROOT 0x60 /* as above, new item on right */
#define XLOG_BTREE_DELETE 0x70 /* delete leaf index tuples for a page */
#define XLOG_BTREE_DELETE_PAGE 0x80 /* delete an entire page */
#define XLOG_BTREE_DELETE_PAGE_META 0x90 /* same, and update metapage */
#define XLOG_BTREE_UNLINK_PAGE 0x80 /* delete a half-dead page */
#define XLOG_BTREE_UNLINK_PAGE_META 0x90 /* same, and update metapage */
#define XLOG_BTREE_NEWROOT 0xA0 /* new root page */
#define XLOG_BTREE_DELETE_PAGE_HALF 0xB0 /* page deletion that makes
* parent half-dead */
#define XLOG_BTREE_MARK_PAGE_HALFDEAD 0xB0 /* mark a leaf as half-dead */
#define XLOG_BTREE_VACUUM 0xC0 /* delete entries on a page during
* vacuum */
#define XLOG_BTREE_REUSE_PAGE 0xD0 /* old page is about to be reused from
......@@ -370,23 +369,49 @@ typedef struct xl_btree_vacuum
#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
/*
* This is what we need to know about deletion of a btree page. The target
* identifies the tuple removed from the parent page (note that we remove
* this tuple's downlink and the *following* tuple's key). Note we do not
* store any content for the deleted page --- it is just rewritten as empty
* during recovery, apart from resetting the btpo.xact.
* This is what we need to know about marking an empty branch for deletion.
* The target identifies the tuple removed from the parent page (note that we
* remove this tuple's downlink and the *following* tuple's key). Note that
* the leaf page is empty, so we don't need to store its content --- it is
* just reinitialized during recovery using the rest of the fields.
*/
typedef struct xl_btree_delete_page
typedef struct xl_btree_mark_page_halfdead
{
xl_btreetid target; /* deleted tuple id in parent page */
BlockNumber deadblk; /* child block being deleted */
BlockNumber leftblk; /* child block's left sibling, if any */
BlockNumber rightblk; /* child block's right sibling */
BlockNumber leafblk; /* leaf block ultimately being deleted */
BlockNumber leftblk; /* leaf block's left sibling, if any */
BlockNumber rightblk; /* leaf block's right sibling */
BlockNumber downlink; /* next child down in the branch */
} xl_btree_mark_page_halfdead;
#define SizeOfBtreeMarkPageHalfDead (offsetof(xl_btree_mark_page_halfdead, downlink) + sizeof(BlockNumber))
/*
* This is what we need to know about deletion of a btree page. Note we do
* not store any content for the deleted page --- it is just rewritten as empty
* during recovery, apart from resetting the btpo.xact.
*/
typedef struct xl_btree_unlink_page
{
RelFileNode node;
BlockNumber deadblk; /* target block being deleted */
BlockNumber leftsib; /* taregt block's left sibling, if any */
BlockNumber rightsib; /* target block's right sibling */
/*
* Information needed to recreate the leaf page, when target is an internal
* page.
*/
BlockNumber leafblk;
BlockNumber leafleftsib;
BlockNumber leafrightsib;
BlockNumber downlink; /* next child down in the branch */
TransactionId btpo_xact; /* value of btpo.xact for use in recovery */
/* xl_btree_metadata FOLLOWS IF XLOG_BTREE_DELETE_PAGE_META */
} xl_btree_delete_page;
/* xl_btree_metadata FOLLOWS IF XLOG_BTREE_UNLINK_PAGE_META */
} xl_btree_unlink_page;
#define SizeOfBtreeDeletePage (offsetof(xl_btree_delete_page, btpo_xact) + sizeof(TransactionId))
#define SizeOfBtreeUnlinkPage (offsetof(xl_btree_unlink_page, btpo_xact) + sizeof(TransactionId))
/*
* New root log record. There are zero tuples if this is to establish an
......@@ -639,7 +664,7 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
BlockNumber lastBlockVacuumed);
extern int _bt_pagedel(Relation rel, Buffer buf, BTStack stack);
extern int _bt_pagedel(Relation rel, Buffer buf);
/*
* prototypes for functions in nbtsearch.c
......
......@@ -55,7 +55,7 @@ typedef struct BkpBlock
/*
* Each page of XLOG file has a header like this:
*/
#define XLOG_PAGE_MAGIC 0xD07B /* can be used as WAL version indicator */
#define XLOG_PAGE_MAGIC 0xD07C /* can be used as WAL version indicator */
typedef struct XLogPageHeaderData
{
......
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment