Commit efada2b8 authored by Heikki Linnakangas's avatar Heikki Linnakangas

Fix race condition in B-tree page deletion.

In short, we don't allow a page to be deleted if it's the rightmost child
of its parent, but that situation can change after we check for it.

Problem
-------

We check that the page to be deleted is not the rightmost child of its
parent, and then lock its left sibling, the page itself, its right sibling,
and the parent, in that order. However, if the parent page is split after
the check but before acquiring the locks, the target page might become the
rightmost child, if the split happens at the right place. That leads to an
error in vacuum (I reproduced this by setting a breakpoint in debugger):

ERROR:  failed to delete rightmost child 41 of block 3 in index "foo_pkey"

We currently re-check that the page is still the rightmost child, and throw
the above error if it's not. We could easily just give up rather than throw
an error, but that approach doesn't scale to half-dead pages. To recap,
although we don't normally allow deleting the rightmost child, if the page
is the *only* child of its parent, we delete the child page and mark the
parent page as half-dead in one atomic operation. But before we do that, we
check that the parent can later be deleted, by checking that it in turn is
not the rightmost child of the grandparent (potentially recursing all the
way up to the root). But the same situation can arise there - the
grandparent can be split while we're not holding the locks. We end up with
a half-dead page that we cannot delete.

To make things worse, the keyspace of the deleted page has already been
transferred to its right sibling. As the README points out, the keyspace at
the grandparent level is "out-of-whack" until the half-dead page is deleted,
and if enough tuples with keys in the transferred keyspace are inserted, the
page might get split and a downlink might be inserted into the grandparent
that is out-of-order. That might not cause any serious problem if it's
transient (as the README ponders), but is surely bad if it stays that way.

Solution
--------

This patch changes the page deletion algorithm to avoid that problem. After
checking that the topmost page in the chain of to-be-deleted pages is not
the rightmost child of its parent, and then deleting the pages from bottom
up, unlink the pages from top to bottom. This way, the intermediate stages
are similar to the intermediate stages in page splitting, and there is no
transient stage where the keyspace is "out-of-whack". The topmost page in
the to-be-deleted chain doesn't have a downlink pointing to it, like a page
split before the downlink has been inserted.

This also allows us to get rid of the cleanup step after WAL recovery, if we
crash during page deletion. The deletion will be continued at next VACUUM,
but the tree is consistent for searches and insertions at every step.

This bug is old, all supported versions are affected, but this patch is too
big to back-patch (and changes the WAL record formats of related records).
We have not heard any reports of the bug from users, so clearly it's not
easy to bump into. Maybe backpatch later, after this has had some field
testing.

Reviewed by Kevin Grittner and Peter Geoghegan.
parent 6c461cb9
...@@ -168,26 +168,33 @@ parent item does still exist and can't have been deleted. Also, because ...@@ -168,26 +168,33 @@ parent item does still exist and can't have been deleted. Also, because
we are matching downlink page numbers and not data keys, we don't have any we are matching downlink page numbers and not data keys, we don't have any
problem with possibly misidentifying the parent item. problem with possibly misidentifying the parent item.
Page Deletion
-------------
We consider deleting an entire page from the btree only when it's become We consider deleting an entire page from the btree only when it's become
completely empty of items. (Merging partly-full pages would allow better completely empty of items. (Merging partly-full pages would allow better
space reuse, but it seems impractical to move existing data items left or space reuse, but it seems impractical to move existing data items left or
right to make this happen --- a scan moving in the opposite direction right to make this happen --- a scan moving in the opposite direction
might miss the items if so.) Also, we *never* delete the rightmost page might miss the items if so.) Also, we *never* delete the rightmost page
on a tree level (this restriction simplifies the traversal algorithms, as on a tree level (this restriction simplifies the traversal algorithms, as
explained below). explained below). Page deletion always begins from an empty leaf page. An
internal page can only be deleted as part of a branch leading to a leaf
To delete an empty page, we acquire write lock on its left sibling (if page, where each internal page has only one child and that child is also to
any), the target page itself, the right sibling (there must be one), and be deleted.
the parent page, in that order. The parent page must be found using the
same type of search as used to find the parent during an insertion split. Deleting a leaf page is a two-stage process. In the first stage, the page
Then we update the side-links in the siblings, mark the target page is unlinked from its parent, and marked as half-dead. The parent page must
deleted, and remove the downlink from the parent, as well as the parent's be found using the same type of search as used to find the parent during an
upper bounding key for the target (the one separating it from its right insertion split. We lock the target and the parent pages, change the
sibling). This causes the target page's key space to effectively belong target's downlink to point to the right sibling, and remove its old
to its right sibling. (Neither the left nor right sibling pages need to downlink. This causes the target page's key space to effectively belong to
its right sibling. (Neither the left nor right sibling pages need to
change their "high key" if any; so there is no problem with possibly not change their "high key" if any; so there is no problem with possibly not
having enough space to replace a high key.) The side-links in the target having enough space to replace a high key.) At the same time, we mark the
page are not changed. target page as half-dead, which causes any subsequent searches to ignore it
and move right (or left, in a backwards scan). This leaves the tree in a
similar state as during a page split: the page has no downlink pointing to
it, but it's still linked to its siblings.
(Note: Lanin and Shasha prefer to make the key space move left, but their (Note: Lanin and Shasha prefer to make the key space move left, but their
argument for doing so hinges on not having left-links, which we have argument for doing so hinges on not having left-links, which we have
...@@ -199,31 +206,44 @@ the same parent --- otherwise, the parent's key space assignment changes ...@@ -199,31 +206,44 @@ the same parent --- otherwise, the parent's key space assignment changes
too, meaning we'd have to make bounding-key updates in its parent, and too, meaning we'd have to make bounding-key updates in its parent, and
perhaps all the way up the tree. Since we can't possibly do that perhaps all the way up the tree. Since we can't possibly do that
atomically, we forbid this case. That means that the rightmost child of a atomically, we forbid this case. That means that the rightmost child of a
parent node can't be deleted unless it's the only remaining child. parent node can't be deleted unless it's the only remaining child, in which
case we will delete the parent too (see below).
When we delete the last remaining child of a parent page, we mark the
parent page "half-dead" as part of the atomic update that deletes the In the second-stage, the half-dead leaf page is unlinked from its siblings.
child page. This implicitly transfers the parent's key space to its right We first lock the left sibling (if any) of the target, the target page
sibling (which it must have, since we never delete the overall-rightmost itself, and its right sibling (there must be one) in that order. Then we
page of a level). Searches ignore the half-dead page and immediately move update the side-links in the siblings, and mark the target page deleted.
right. We need not worry about insertions into a half-dead page --- insertions
into upper tree levels happen only as a result of splits of child pages, and When we're about to delete the last remaining child of a parent page, things
the half-dead page no longer has any children that could split. Therefore are slightly more complicated. In the first stage, we leave the immediate
the page stays empty even when we don't have lock on it, and we can complete parent of the leaf page alone, and remove the downlink to the parent page
its deletion in a second atomic action. instead, from the grandparent. If it's the last child of the grandparent
too, we recurse up until we find a parent with more than one child, and
The notion of a half-dead page means that the key space relationship between remove the downlink of that page. The leaf page is marked as half-dead, and
the half-dead page's level and its parent's level may be a little out of the block number of the page whose downlink was removed is stashed in the
whack: key space that appears to belong to the half-dead page's parent on the half-dead leaf page. This leaves us with a chain of internal pages, with
parent level may really belong to its right sibling. To prevent any possible one downlink each, leading to the half-dead leaf page, and no downlink
problems, we hold lock on the deleted child page until we have finished pointing to the topmost page in the chain.
deleting any now-half-dead parent page(s). This prevents any insertions into
the transferred keyspace until the operation is complete. The reason for While we recurse up to find the topmost parent in the chain, we keep the
doing this is that a sufficiently large number of insertions into the leaf page locked, but don't need to hold locks on the intermediate pages
transferred keyspace, resulting in multiple page splits, could propagate keys between the leaf and the topmost parent -- insertions into upper tree levels
from that keyspace into the parent level, resulting in transiently happen only as a result of splits of child pages, and that can't happen as
out-of-order keys in that level. It is thought that that wouldn't cause any long as we're keeping the leaf locked. The internal pages in the chain
serious problem, but it seems too risky to allow. cannot acquire new children afterwards either, because the leaf page is
marked as half-dead and won't be split.
Removing the downlink to the top of the to-be-deleted chain effectively
transfers the key space to the right sibling for all the intermediate levels
too, in one atomic operation. A concurrent search might still visit the
intermediate pages, but it will move right when it reaches the half-dead page
at the leaf level.
In the second stage, the topmost page in the chain is unlinked from its
siblings, and the half-dead leaf page is updated to point to the next page
down in the chain. This is repeated until there are no internal pages left
in the chain. Finally, the half-dead leaf page itself is unlinked from its
siblings.
A deleted page cannot be reclaimed immediately, since there may be other A deleted page cannot be reclaimed immediately, since there may be other
processes waiting to reference it (ie, search processes that just left the processes waiting to reference it (ie, search processes that just left the
...@@ -396,12 +416,11 @@ metapage update (of the "fast root" link) is performed and logged as part ...@@ -396,12 +416,11 @@ metapage update (of the "fast root" link) is performed and logged as part
of the insertion into the parent level. When splitting the root page, the of the insertion into the parent level. When splitting the root page, the
metapage update is handled as part of the "new root" action. metapage update is handled as part of the "new root" action.
A page deletion is logged as a single WAL entry covering all four Each step in page deletion are logged as separate WAL entries: marking the
required page updates (target page, left and right siblings, and parent) leaf as half-dead and removing the downlink is one record, and unlinking a
as an atomic event. (Any required fast-root link update is also part page is a second record. If vacuum is interrupted for some reason, or the
of the WAL entry.) If the parent page becomes half-dead but is not system crashes, the tree is consistent for searches and insertions. The next
immediately deleted due to a subsequent crash, there is no loss of VACUUM will find the half-dead leaf page and continue the deletion.
consistency, and the empty page will be picked up by the next VACUUM.
Scans during Recovery Scans during Recovery
--------------------- ---------------------
......
This diff is collapsed.
...@@ -1082,7 +1082,7 @@ restart: ...@@ -1082,7 +1082,7 @@ restart:
MemoryContextReset(vstate->pagedelcontext); MemoryContextReset(vstate->pagedelcontext);
oldcontext = MemoryContextSwitchTo(vstate->pagedelcontext); oldcontext = MemoryContextSwitchTo(vstate->pagedelcontext);
ndel = _bt_pagedel(rel, buf, NULL); ndel = _bt_pagedel(rel, buf);
/* count only this page, else may double-count parent */ /* count only this page, else may double-count parent */
if (ndel) if (ndel)
......
This diff is collapsed.
...@@ -124,16 +124,27 @@ btree_desc(StringInfo buf, uint8 xl_info, char *rec) ...@@ -124,16 +124,27 @@ btree_desc(StringInfo buf, uint8 xl_info, char *rec)
xlrec->hnode.spcNode, xlrec->hnode.dbNode, xlrec->hnode.relNode); xlrec->hnode.spcNode, xlrec->hnode.dbNode, xlrec->hnode.relNode);
break; break;
} }
case XLOG_BTREE_DELETE_PAGE: case XLOG_BTREE_MARK_PAGE_HALFDEAD:
case XLOG_BTREE_DELETE_PAGE_META:
case XLOG_BTREE_DELETE_PAGE_HALF:
{ {
xl_btree_delete_page *xlrec = (xl_btree_delete_page *) rec; xl_btree_mark_page_halfdead *xlrec = (xl_btree_mark_page_halfdead *) rec;
appendStringInfoString(buf, "delete_page: "); appendStringInfoString(buf, "mark_page_halfdead: ");
out_target(buf, &(xlrec->target)); out_target(buf, &(xlrec->target));
appendStringInfo(buf, "; dead %u; left %u; right %u", appendStringInfo(buf, "; downlink %u; leaf %u; left %u; right %u",
xlrec->deadblk, xlrec->leftblk, xlrec->rightblk); xlrec->leafblk, xlrec->leftblk, xlrec->rightblk, xlrec->downlink);
break;
}
case XLOG_BTREE_UNLINK_PAGE_META:
case XLOG_BTREE_UNLINK_PAGE:
{
xl_btree_unlink_page *xlrec = (xl_btree_unlink_page *) rec;
appendStringInfo(buf, "unlink_page: rel %u/%u/%u; ",
xlrec->node.spcNode, xlrec->node.dbNode, xlrec->node.relNode);
appendStringInfo(buf, "dead %u; left %u; right %u; btpo_xact %u",
xlrec->deadblk, xlrec->leftsib, xlrec->rightsib, xlrec->btpo_xact);
appendStringInfo(buf, "leaf %u; leafleft %u; leafright %u; downlink %u",
xlrec->leafblk, xlrec->leafleftsib, xlrec->leafrightsib, xlrec->downlink);
break; break;
} }
case XLOG_BTREE_NEWROOT: case XLOG_BTREE_NEWROOT:
......
...@@ -215,11 +215,10 @@ typedef struct BTMetaPageData ...@@ -215,11 +215,10 @@ typedef struct BTMetaPageData
#define XLOG_BTREE_SPLIT_L_ROOT 0x50 /* add tuple with split of root */ #define XLOG_BTREE_SPLIT_L_ROOT 0x50 /* add tuple with split of root */
#define XLOG_BTREE_SPLIT_R_ROOT 0x60 /* as above, new item on right */ #define XLOG_BTREE_SPLIT_R_ROOT 0x60 /* as above, new item on right */
#define XLOG_BTREE_DELETE 0x70 /* delete leaf index tuples for a page */ #define XLOG_BTREE_DELETE 0x70 /* delete leaf index tuples for a page */
#define XLOG_BTREE_DELETE_PAGE 0x80 /* delete an entire page */ #define XLOG_BTREE_UNLINK_PAGE 0x80 /* delete a half-dead page */
#define XLOG_BTREE_DELETE_PAGE_META 0x90 /* same, and update metapage */ #define XLOG_BTREE_UNLINK_PAGE_META 0x90 /* same, and update metapage */
#define XLOG_BTREE_NEWROOT 0xA0 /* new root page */ #define XLOG_BTREE_NEWROOT 0xA0 /* new root page */
#define XLOG_BTREE_DELETE_PAGE_HALF 0xB0 /* page deletion that makes #define XLOG_BTREE_MARK_PAGE_HALFDEAD 0xB0 /* mark a leaf as half-dead */
* parent half-dead */
#define XLOG_BTREE_VACUUM 0xC0 /* delete entries on a page during #define XLOG_BTREE_VACUUM 0xC0 /* delete entries on a page during
* vacuum */ * vacuum */
#define XLOG_BTREE_REUSE_PAGE 0xD0 /* old page is about to be reused from #define XLOG_BTREE_REUSE_PAGE 0xD0 /* old page is about to be reused from
...@@ -370,23 +369,49 @@ typedef struct xl_btree_vacuum ...@@ -370,23 +369,49 @@ typedef struct xl_btree_vacuum
#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber)) #define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
/* /*
* This is what we need to know about deletion of a btree page. The target * This is what we need to know about marking an empty branch for deletion.
* identifies the tuple removed from the parent page (note that we remove * The target identifies the tuple removed from the parent page (note that we
* this tuple's downlink and the *following* tuple's key). Note we do not * remove this tuple's downlink and the *following* tuple's key). Note that
* store any content for the deleted page --- it is just rewritten as empty * the leaf page is empty, so we don't need to store its content --- it is
* during recovery, apart from resetting the btpo.xact. * just reinitialized during recovery using the rest of the fields.
*/ */
typedef struct xl_btree_delete_page typedef struct xl_btree_mark_page_halfdead
{ {
xl_btreetid target; /* deleted tuple id in parent page */ xl_btreetid target; /* deleted tuple id in parent page */
BlockNumber deadblk; /* child block being deleted */ BlockNumber leafblk; /* leaf block ultimately being deleted */
BlockNumber leftblk; /* child block's left sibling, if any */ BlockNumber leftblk; /* leaf block's left sibling, if any */
BlockNumber rightblk; /* child block's right sibling */ BlockNumber rightblk; /* leaf block's right sibling */
BlockNumber downlink; /* next child down in the branch */
} xl_btree_mark_page_halfdead;
#define SizeOfBtreeMarkPageHalfDead (offsetof(xl_btree_mark_page_halfdead, downlink) + sizeof(BlockNumber))
/*
* This is what we need to know about deletion of a btree page. Note we do
* not store any content for the deleted page --- it is just rewritten as empty
* during recovery, apart from resetting the btpo.xact.
*/
typedef struct xl_btree_unlink_page
{
RelFileNode node;
BlockNumber deadblk; /* target block being deleted */
BlockNumber leftsib; /* taregt block's left sibling, if any */
BlockNumber rightsib; /* target block's right sibling */
/*
* Information needed to recreate the leaf page, when target is an internal
* page.
*/
BlockNumber leafblk;
BlockNumber leafleftsib;
BlockNumber leafrightsib;
BlockNumber downlink; /* next child down in the branch */
TransactionId btpo_xact; /* value of btpo.xact for use in recovery */ TransactionId btpo_xact; /* value of btpo.xact for use in recovery */
/* xl_btree_metadata FOLLOWS IF XLOG_BTREE_DELETE_PAGE_META */ /* xl_btree_metadata FOLLOWS IF XLOG_BTREE_UNLINK_PAGE_META */
} xl_btree_delete_page; } xl_btree_unlink_page;
#define SizeOfBtreeDeletePage (offsetof(xl_btree_delete_page, btpo_xact) + sizeof(TransactionId)) #define SizeOfBtreeUnlinkPage (offsetof(xl_btree_unlink_page, btpo_xact) + sizeof(TransactionId))
/* /*
* New root log record. There are zero tuples if this is to establish an * New root log record. There are zero tuples if this is to establish an
...@@ -639,7 +664,7 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf, ...@@ -639,7 +664,7 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
extern void _bt_delitems_vacuum(Relation rel, Buffer buf, extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems, OffsetNumber *itemnos, int nitems,
BlockNumber lastBlockVacuumed); BlockNumber lastBlockVacuumed);
extern int _bt_pagedel(Relation rel, Buffer buf, BTStack stack); extern int _bt_pagedel(Relation rel, Buffer buf);
/* /*
* prototypes for functions in nbtsearch.c * prototypes for functions in nbtsearch.c
......
...@@ -55,7 +55,7 @@ typedef struct BkpBlock ...@@ -55,7 +55,7 @@ typedef struct BkpBlock
/* /*
* Each page of XLOG file has a header like this: * Each page of XLOG file has a header like this:
*/ */
#define XLOG_PAGE_MAGIC 0xD07B /* can be used as WAL version indicator */ #define XLOG_PAGE_MAGIC 0xD07C /* can be used as WAL version indicator */
typedef struct XLogPageHeaderData typedef struct XLogPageHeaderData
{ {
......
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment