Fix race condition in B-tree page deletion.

In short, we don't allow a page to be deleted if it's the rightmost child of its parent, but that situation can change after we check for it. Problem ------- We check that the page to be deleted is not the rightmost child of its parent, and then lock its left sibling, the page itself, its right sibling, and the parent, in that order. However, if the parent page is split after the check but before acquiring the locks, the target page might become the rightmost child, if the split happens at the right place. That leads to an error in vacuum (I reproduced this by setting a breakpoint in debugger): ERROR: failed to delete rightmost child 41 of block 3 in index "foo_pkey" We currently re-check that the page is still the rightmost child, and throw the above error if it's not. We could easily just give up rather than throw an error, but that approach doesn't scale to half-dead pages. To recap, although we don't normally allow deleting the rightmost child, if the page is the *only* child of its parent, we delete the child page and mark the parent page as half-dead in one atomic operation. But before we do that, we check that the parent can later be deleted, by checking that it in turn is not the rightmost child of the grandparent (potentially recursing all the way up to the root). But the same situation can arise there - the grandparent can be split while we're not holding the locks. We end up with a half-dead page that we cannot delete. To make things worse, the keyspace of the deleted page has already been transferred to its right sibling. As the README points out, the keyspace at the grandparent level is "out-of-whack" until the half-dead page is deleted, and if enough tuples with keys in the transferred keyspace are inserted, the page might get split and a downlink might be inserted into the grandparent that is out-of-order. That might not cause any serious problem if it's transient (as the README ponders), but is surely bad if it stays that way. Solution -------- This patch changes the page deletion algorithm to avoid that problem. After checking that the topmost page in the chain of to-be-deleted pages is not the rightmost child of its parent, and then deleting the pages from bottom up, unlink the pages from top to bottom. This way, the intermediate stages are similar to the intermediate stages in page splitting, and there is no transient stage where the keyspace is "out-of-whack". The topmost page in the to-be-deleted chain doesn't have a downlink pointing to it, like a page split before the downlink has been inserted. This also allows us to get rid of the cleanup step after WAL recovery, if we crash during page deletion. The deletion will be continued at next VACUUM, but the tree is consistent for searches and insertions at every step. This bug is old, all supported versions are affected, but this patch is too big to back-patch (and changes the WAL record formats of related records). We have not heard any reports of the bug from users, so clearly it's not easy to bump into. Maybe backpatch later, after this has had some field testing. Reviewed by Kevin Grittner and Peter Geoghegan.

Fix race condition in B-tree page deletion.
In short, we don't allow a page to be deleted if it's the rightmost child of its parent, but that situation can change after we check for it. Problem ------- We check that the page to be deleted is not the rightmost child of its parent, and then lock its left sibling, the page itself, its right sibling, and the parent, in that order. However, if the parent page is split after the check but before acquiring the locks, the target page might become the rightmost child, if the split happens at the right place. That leads to an error in vacuum (I reproduced this by setting a breakpoint in debugger): ERROR: failed to delete rightmost child 41 of block 3 in index "foo_pkey" We currently re-check that the page is still the rightmost child, and throw the above error if it's not. We could easily just give up rather than throw an error, but that approach doesn't scale to half-dead pages. To recap, although we don't normally allow deleting the rightmost child, if the page is the *only* child of its parent, we delete the child page and mark the parent page as half-dead in one atomic operation. But before we do that, we check that the parent can later be deleted, by checking that it in turn is not the rightmost child of the grandparent (potentially recursing all the way up to the root). But the same situation can arise there - the grandparent can be split while we're not holding the locks. We end up with a half-dead page that we cannot delete. To make things worse, the keyspace of the deleted page has already been transferred to its right sibling. As the README points out, the keyspace at the grandparent level is "out-of-whack" until the half-dead page is deleted, and if enough tuples with keys in the transferred keyspace are inserted, the page might get split and a downlink might be inserted into the grandparent that is out-of-order. That might not cause any serious problem if it's transient (as the README ponders), but is surely bad if it stays that way. Solution -------- This patch changes the page deletion algorithm to avoid that problem. After checking that the topmost page in the chain of to-be-deleted pages is not the rightmost child of its parent, and then deleting the pages from bottom up, unlink the pages from top to bottom. This way, the intermediate stages are similar to the intermediate stages in page splitting, and there is no transient stage where the keyspace is "out-of-whack". The topmost page in the to-be-deleted chain doesn't have a downlink pointing to it, like a page split before the downlink has been inserted. This also allows us to get rid of the cleanup step after WAL recovery, if we crash during page deletion. The deletion will be continued at next VACUUM, but the tree is consistent for searches and insertions at every step. This bug is old, all supported versions are affected, but this patch is too big to back-patch (and changes the WAL record formats of related records). We have not heard any reports of the bug from users, so clearly it's not easy to bump into. Maybe backpatch later, after this has had some field testing. Reviewed by Kevin Grittner and Peter Geoghegan.
efada2b8 · Heikki Linnakangas · 6c461cb9 · efada2b8 · efada2b8 · efada2b8
Commit efada2b8 authored Mar 14, 2014 by Heikki Linnakangas
7 changed files
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -168,26 +168,33 @@ parent item does still exist and can't have been deleted.  Also, because
 we are matching downlink page numbers and not data keys, we don't have any
 problem with possibly misidentifying the parent item.

+Page Deletion
+-------------
+
 We consider deleting an entire page from the btree only when it's become
 completely empty of items.  (Merging partly-full pages would allow better
 space reuse, but it seems impractical to move existing data items left or
 right to make this happen --- a scan moving in the opposite direction
 might miss the items if so.)  Also, we *never* delete the rightmost page
 on a tree level (this restriction simplifies the traversal algorithms, as
-explained below).
-
-To delete an empty page, we acquire write lock on its left sibling (if
-any), the target page itself, the right sibling (there must be one), and
-the parent page, in that order.  The parent page must be found using the
-same type of search as used to find the parent during an insertion split.
-Then we update the side-links in the siblings, mark the target page
-deleted, and remove the downlink from the parent, as well as the parent's
-upper bounding key for the target (the one separating it from its right
-sibling).  This causes the target page's key space to effectively belong
-to its right sibling.  (Neither the left nor right sibling pages need to
+explained below).  Page deletion always begins from an empty leaf page.  An
+internal page can only be deleted as part of a branch leading to a leaf
+page, where each internal page has only one child and that child is also to
+be deleted.
+
+Deleting a leaf page is a two-stage process.  In the first stage, the page
+is unlinked from its parent, and marked as half-dead.  The parent page must
+be found using the same type of search as used to find the parent during an
+insertion split.  We lock the target and the parent pages, change the
+target's downlink to point to the right sibling, and remove its old
+downlink.  This causes the target page's key space to effectively belong to
+its right sibling.  (Neither the left nor right sibling pages need to
 change their "high key" if any; so there is no problem with possibly not
-having enough space to replace a high key.)  The side-links in the target
-page are not changed.
+having enough space to replace a high key.)  At the same time, we mark the
+target page as half-dead, which causes any subsequent searches to ignore it
+and move right (or left, in a backwards scan).  This leaves the tree in a
+similar state as during a page split: the page has no downlink pointing to
+it, but it's still linked to its siblings.

 (Note: Lanin and Shasha prefer to make the key space move left, but their
 argument for doing so hinges on not having left-links, which we have
@@ -199,31 +206,44 @@ the same parent --- otherwise, the parent's key space assignment changes
 too, meaning we'd have to make bounding-key updates in its parent, and
 perhaps all the way up the tree.  Since we can't possibly do that
 atomically, we forbid this case.  That means that the rightmost child of a
-parent node can't be deleted unless it's the only remaining child.
-
-When we delete the last remaining child of a parent page, we mark the
-parent page "half-dead" as part of the atomic update that deletes the
-child page.  This implicitly transfers the parent's key space to its right
-sibling (which it must have, since we never delete the overall-rightmost
-page of a level).  Searches ignore the half-dead page and immediately move
-right.  We need not worry about insertions into a half-dead page --- insertions
-into upper tree levels happen only as a result of splits of child pages, and
-the half-dead page no longer has any children that could split.  Therefore
-the page stays empty even when we don't have lock on it, and we can complete
-its deletion in a second atomic action.
-
-The notion of a half-dead page means that the key space relationship between
-the half-dead page's level and its parent's level may be a little out of
-whack: key space that appears to belong to the half-dead page's parent on the
-parent level may really belong to its right sibling.  To prevent any possible
-problems, we hold lock on the deleted child page until we have finished
-deleting any now-half-dead parent page(s).  This prevents any insertions into
-the transferred keyspace until the operation is complete.  The reason for
-doing this is that a sufficiently large number of insertions into the
-transferred keyspace, resulting in multiple page splits, could propagate keys
-from that keyspace into the parent level, resulting in transiently
-out-of-order keys in that level.  It is thought that that wouldn't cause any
-serious problem, but it seems too risky to allow.
+parent node can't be deleted unless it's the only remaining child, in which
+case we will delete the parent too (see below).
+
+In the second-stage, the half-dead leaf page is unlinked from its siblings.
+We first lock the left sibling (if any) of the target, the target page
+itself, and its right sibling (there must be one) in that order.  Then we
+update the side-links in the siblings, and mark the target page deleted.
+
+When we're about to delete the last remaining child of a parent page, things
+are slightly more complicated.  In the first stage, we leave the immediate
+parent of the leaf page alone, and remove the downlink to the parent page
+instead, from the grandparent.  If it's the last child of the grandparent
+too, we recurse up until we find a parent with more than one child, and
+remove the downlink of that page.  The leaf page is marked as half-dead, and
+the block number of the page whose downlink was removed is stashed in the
+half-dead leaf page.  This leaves us with a chain of internal pages, with
+one downlink each, leading to the half-dead leaf page, and no downlink
+pointing to the topmost page in the chain.
+
+While we recurse up to find the topmost parent in the chain, we keep the
+leaf page locked, but don't need to hold locks on the intermediate pages
+between the leaf and the topmost parent -- insertions into upper tree levels
+happen only as a result of splits of child pages, and that can't happen as
+long as we're keeping the leaf locked.  The internal pages in the chain
+cannot acquire new children afterwards either, because the leaf page is
+marked as half-dead and won't be split.
+
+Removing the downlink to the top of the to-be-deleted chain effectively
+transfers the key space to the right sibling for all the intermediate levels
+too, in one atomic operation.  A concurrent search might still visit the
+intermediate pages, but it will move right when it reaches the half-dead page
+at the leaf level.
+
+In the second stage, the topmost page in the chain is unlinked from its
+siblings, and the half-dead leaf page is updated to point to the next page
+down in the chain.  This is repeated until there are no internal pages left
+in the chain.  Finally, the half-dead leaf page itself is unlinked from its
+siblings.

 A deleted page cannot be reclaimed immediately, since there may be other
 processes waiting to reference it (ie, search processes that just left the
@@ -396,12 +416,11 @@ metapage update (of the "fast root" link) is performed and logged as part
 of the insertion into the parent level.  When splitting the root page, the
 metapage update is handled as part of the "new root" action.

-A page deletion is logged as a single WAL entry covering all four
-required page updates (target page, left and right siblings, and parent)
-as an atomic event.  (Any required fast-root link update is also part
-of the WAL entry.)  If the parent page becomes half-dead but is not
-immediately deleted due to a subsequent crash, there is no loss of
-consistency, and the empty page will be picked up by the next VACUUM.
+Each step in page deletion are logged as separate WAL entries: marking the
+leaf as half-dead and removing the downlink is one record, and unlinking a
+page is a second record.  If vacuum is interrupted for some reason, or the
+system crashes, the tree is consistent for searches and insertions.  The next
+VACUUM will find the half-dead leaf page and continue the deletion.

 Scans during Recovery
 ---------------------

--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -1082,7 +1082,7 @@ restart:
 		MemoryContextReset(vstate->pagedelcontext);
 		oldcontext = MemoryContextSwitchTo(vstate->pagedelcontext);

-		ndel = _bt_pagedel(rel, buf, NULL);
+		ndel = _bt_pagedel(rel, buf);

 		/* count only this page, else may double-count parent */
 		if (ndel)

--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -124,16 +124,27 @@ btree_desc(StringInfo buf, uint8 xl_info, char *rec)
 								 xlrec->hnode.spcNode, xlrec->hnode.dbNode, xlrec->hnode.relNode);
 				break;
 			}
-		case XLOG_BTREE_DELETE_PAGE:
-		case XLOG_BTREE_DELETE_PAGE_META:
-		case XLOG_BTREE_DELETE_PAGE_HALF:
+		case XLOG_BTREE_MARK_PAGE_HALFDEAD:
 			{
-				xl_btree_delete_page *xlrec = (xl_btree_delete_page *) rec;
+				xl_btree_mark_page_halfdead *xlrec = (xl_btree_mark_page_halfdead *) rec;

-				appendStringInfoString(buf, "delete_page: ");
+				appendStringInfoString(buf, "mark_page_halfdead: ");
 				out_target(buf, &(xlrec->target));
-				appendStringInfo(buf, "; dead %u; left %u; right %u",
-							xlrec->deadblk, xlrec->leftblk, xlrec->rightblk);
+				appendStringInfo(buf, "; downlink %u; leaf %u; left %u; right %u",
+								 xlrec->leafblk, xlrec->leftblk, xlrec->rightblk, xlrec->downlink);
+				break;
+			}
+		case XLOG_BTREE_UNLINK_PAGE_META:
+		case XLOG_BTREE_UNLINK_PAGE:
+			{
+				xl_btree_unlink_page *xlrec = (xl_btree_unlink_page *) rec;
+
+				appendStringInfo(buf, "unlink_page: rel %u/%u/%u; ",
+								 xlrec->node.spcNode, xlrec->node.dbNode, xlrec->node.relNode);
+				appendStringInfo(buf, "dead %u; left %u; right %u; btpo_xact %u",
+								 xlrec->deadblk, xlrec->leftsib, xlrec->rightsib, xlrec->btpo_xact);
+				appendStringInfo(buf, "leaf %u; leafleft %u; leafright %u; downlink %u",
+								 xlrec->leafblk, xlrec->leafleftsib, xlrec->leafrightsib, xlrec->downlink);
 				break;
 			}
 		case XLOG_BTREE_NEWROOT:

--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -215,11 +215,10 @@ typedef struct BTMetaPageData
 #define XLOG_BTREE_SPLIT_L_ROOT 0x50	/* add tuple with split of root */
 #define XLOG_BTREE_SPLIT_R_ROOT 0x60	/* as above, new item on right */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
-#define XLOG_BTREE_DELETE_PAGE	0x80	/* delete an entire page */
-#define XLOG_BTREE_DELETE_PAGE_META 0x90		/* same, and update metapage */
+#define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
+#define XLOG_BTREE_UNLINK_PAGE_META 0x90 /* same, and update metapage */
 #define XLOG_BTREE_NEWROOT		0xA0	/* new root page */
-#define XLOG_BTREE_DELETE_PAGE_HALF 0xB0		/* page deletion that makes
-												 * parent half-dead */
+#define XLOG_BTREE_MARK_PAGE_HALFDEAD 0xB0 /* mark a leaf as half-dead */
 #define XLOG_BTREE_VACUUM		0xC0	/* delete entries on a page during
 										 * vacuum */
 #define XLOG_BTREE_REUSE_PAGE	0xD0	/* old page is about to be reused from
@@ -370,23 +369,49 @@ typedef struct xl_btree_vacuum
 #define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))

 /*
- * This is what we need to know about deletion of a btree page.  The target
- * identifies the tuple removed from the parent page (note that we remove
- * this tuple's downlink and the *following* tuple's key).	Note we do not
- * store any content for the deleted page --- it is just rewritten as empty
- * during recovery, apart from resetting the btpo.xact.
+ * This is what we need to know about marking an empty branch for deletion.
+ * The target identifies the tuple removed from the parent page (note that we
+ * remove this tuple's downlink and the *following* tuple's key).  Note that
+ * the leaf page is empty, so we don't need to store its content --- it is
+ * just reinitialized during recovery using the rest of the fields.
 */
-typedef struct xl_btree_delete_page
+typedef struct xl_btree_mark_page_halfdead
 {
 	xl_btreetid target;			/* deleted tuple id in parent page */
-	BlockNumber deadblk;		/* child block being deleted */
-	BlockNumber leftblk;		/* child block's left sibling, if any */
-	BlockNumber rightblk;		/* child block's right sibling */
+	BlockNumber leafblk;		/* leaf block ultimately being deleted */
+	BlockNumber leftblk;		/* leaf block's left sibling, if any */
+	BlockNumber rightblk;		/* leaf block's right sibling */
+	BlockNumber downlink;		/* next child down in the branch */
+} xl_btree_mark_page_halfdead;
+
+#define SizeOfBtreeMarkPageHalfDead	(offsetof(xl_btree_mark_page_halfdead, downlink) + sizeof(BlockNumber))
+
+/*
+ * This is what we need to know about deletion of a btree page.  Note we do
+ * not store any content for the deleted page --- it is just rewritten as empty
+ * during recovery, apart from resetting the btpo.xact.
+ */
+typedef struct xl_btree_unlink_page
+{
+	RelFileNode	node;
+	BlockNumber deadblk;		/* target block being deleted */
+	BlockNumber leftsib;		/* taregt block's left sibling, if any */
+	BlockNumber rightsib;		/* target block's right sibling */
+
+	/*
+	 * Information needed to recreate the leaf page, when target is an internal
+	 * page.
+	 */
+	BlockNumber	leafblk;
+	BlockNumber	leafleftsib;
+	BlockNumber	leafrightsib;
+	BlockNumber	downlink;		/* next child down in the branch */
+
 	TransactionId btpo_xact;	/* value of btpo.xact for use in recovery */
-	/* xl_btree_metadata FOLLOWS IF XLOG_BTREE_DELETE_PAGE_META */
-} xl_btree_delete_page;
+	/* xl_btree_metadata FOLLOWS IF XLOG_BTREE_UNLINK_PAGE_META */
+} xl_btree_unlink_page;

-#define SizeOfBtreeDeletePage	(offsetof(xl_btree_delete_page, btpo_xact) + sizeof(TransactionId))
+#define SizeOfBtreeUnlinkPage	(offsetof(xl_btree_unlink_page, btpo_xact) + sizeof(TransactionId))

 /*
 * New root log record.  There are zero tuples if this is to establish an
@@ -639,7 +664,7 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
 					OffsetNumber *itemnos, int nitems,
 					BlockNumber lastBlockVacuumed);
-extern int	_bt_pagedel(Relation rel, Buffer buf, BTStack stack);
+extern int	_bt_pagedel(Relation rel, Buffer buf);

 /*
 * prototypes for functions in nbtsearch.c

--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -55,7 +55,7 @@ typedef struct BkpBlock
 /*
 * Each page of XLOG file has a header like this:
 */
-#define XLOG_PAGE_MAGIC 0xD07B	/* can be used as WAL version indicator */
+#define XLOG_PAGE_MAGIC 0xD07C	/* can be used as WAL version indicator */

 typedef struct XLogPageHeaderData
 {