Commit ac4ab97e authored by Heikki Linnakangas's avatar Heikki Linnakangas

Fix race condition in GIN posting tree page deletion.

If a page is deleted, and reused for something else, just as a search is
following a rightlink to it from its left sibling, the search would continue
scanning whatever the new contents of the page are. That could lead to
incorrect query results, or even something more curious if the page is
reused for a different kind of a page.

To fix, modify the search algorithm to lock the next page before releasing
the previous one, and refrain from deleting pages from the leftmost branch
of the tree.

Add a new Concurrency section to the README, explaining why this works.
There is a lot more one could say about concurrency in GIN, but that's for
another patch.

Backpatch to all supported versions.
parent 636b868f
...@@ -210,6 +210,56 @@ fit on one pending-list page must have those pages to itself, even if this ...@@ -210,6 +210,56 @@ fit on one pending-list page must have those pages to itself, even if this
results in wasting much of the space on the preceding page and the last results in wasting much of the space on the preceding page and the last
page for the tuple.) page for the tuple.)
Concurrency
-----------
The entry tree and each posting tree is a B-tree, with right-links connecting
sibling pages at the same level. This is the same structure that is used in
the regular B-tree indexam (invented by Lehman & Yao), but we don't support
scanning a GIN trees backwards, so we don't need left-links.
To avoid deadlocks, B-tree pages must always be locked in the same order:
left to right, and bottom to top. When searching, the tree is traversed from
top to bottom, so the lock on the parent page must be released before
descending to the next level. Concurrent page splits move the keyspace to
right, so after following a downlink, the page actually containing the key
we're looking for might be somewhere to the right of the page we landed on.
In that case, we follow the right-links until we find the page we're looking
for.
To delete a page, the page's left sibling, the target page, and its parent,
are locked in that order, and the page is marked as deleted. However, a
concurrent search might already have read a pointer to the page, and might be
just about to follow it. A page can be reached via the right-link of its left
sibling, or via its downlink in the parent.
To prevent a backend from reaching a deleted page via a right-link, when
following a right-link the lock on the previous page is not released until
the lock on next page has been acquired.
The downlink is more tricky. A search descending the tree must release the
lock on the parent page before locking the child, or it could deadlock with
a concurrent split of the child page; a page split locks the parent, while
already holding a lock on the child page. However, posting trees are only
fully searched from left to right, starting from the leftmost leaf. (The
tree-structure is only needed by insertions, to quickly find the correct
insert location). So as long as we don't delete the leftmost page on each
level, a search can never follow a downlink to page that's about to be
deleted.
The previous paragraph's reasoning only applies to searches, and only to
posting trees. To protect from inserters following a downlink to a deleted
page, vacuum simply locks out all concurrent insertions to the posting tree,
by holding a super-exclusive lock on the posting tree root. Inserters hold a
pin on the root page, but searches do not, so while new searches cannot begin
while root page is locked, any already-in-progress scans can continue
concurrently with vacuum. In the entry tree, we never delete pages.
(This is quite different from the mechanism the btree indexam uses to make
page-deletions safe; it stamps the deleted pages with an XID and keeps the
deleted pages around with the right-link intact until all concurrent scans
have finished.)
Limitations Limitations
----------- -----------
......
...@@ -112,10 +112,8 @@ ginFindLeafPage(GinBtree btree, GinBtreeStack *stack) ...@@ -112,10 +112,8 @@ ginFindLeafPage(GinBtree btree, GinBtreeStack *stack)
/* rightmost page */ /* rightmost page */
break; break;
stack->buffer = ginStepRight(stack->buffer, btree->index, access);
stack->blkno = rightlink; stack->blkno = rightlink;
LockBuffer(stack->buffer, GIN_UNLOCK);
stack->buffer = ReleaseAndReadBuffer(stack->buffer, btree->index, stack->blkno);
LockBuffer(stack->buffer, access);
page = BufferGetPage(stack->buffer); page = BufferGetPage(stack->buffer);
} }
...@@ -148,6 +146,41 @@ ginFindLeafPage(GinBtree btree, GinBtreeStack *stack) ...@@ -148,6 +146,41 @@ ginFindLeafPage(GinBtree btree, GinBtreeStack *stack)
} }
} }
/*
* Step right from current page.
*
* The next page is locked first, before releasing the current page. This is
* crucial to protect from concurrent page deletion (see comment in
* ginDeletePage).
*/
Buffer
ginStepRight(Buffer buffer, Relation index, int lockmode)
{
Buffer nextbuffer;
Page page = BufferGetPage(buffer);
bool isLeaf = GinPageIsLeaf(page);
bool isData = GinPageIsData(page);
BlockNumber blkno = GinPageGetOpaque(page)->rightlink;
nextbuffer = ReadBuffer(index, blkno);
LockBuffer(nextbuffer, lockmode);
UnlockReleaseBuffer(buffer);
/* Sanity check that the page we stepped to is of similar kind. */
page = BufferGetPage(nextbuffer);
if (isLeaf != GinPageIsLeaf(page) || isData != GinPageIsData(page))
elog(ERROR, "right sibling of GIN page is of different type");
/*
* Given the proper lock sequence above, we should never land on a
* deleted page.
*/
if (GinPageIsDeleted(page))
elog(ERROR, "right sibling of GIN page was deleted");
return nextbuffer;
}
void void
freeGinBtreeStack(GinBtreeStack *stack) freeGinBtreeStack(GinBtreeStack *stack)
{ {
...@@ -235,12 +268,12 @@ ginFindParents(GinBtree btree, GinBtreeStack *stack, ...@@ -235,12 +268,12 @@ ginFindParents(GinBtree btree, GinBtreeStack *stack,
while ((offset = btree->findChildPtr(btree, page, stack->blkno, InvalidOffsetNumber)) == InvalidOffsetNumber) while ((offset = btree->findChildPtr(btree, page, stack->blkno, InvalidOffsetNumber)) == InvalidOffsetNumber)
{ {
blkno = GinPageGetOpaque(page)->rightlink; blkno = GinPageGetOpaque(page)->rightlink;
LockBuffer(buffer, GIN_UNLOCK);
ReleaseBuffer(buffer);
if (blkno == InvalidBlockNumber) if (blkno == InvalidBlockNumber)
{
UnlockReleaseBuffer(buffer);
break; break;
buffer = ReadBuffer(btree->index, blkno); }
LockBuffer(buffer, GIN_EXCLUSIVE); buffer = ginStepRight(buffer, btree->index, GIN_EXCLUSIVE);
page = BufferGetPage(buffer); page = BufferGetPage(buffer);
} }
...@@ -444,23 +477,21 @@ ginInsertValue(GinBtree btree, GinBtreeStack *stack, GinStatsData *buildStats) ...@@ -444,23 +477,21 @@ ginInsertValue(GinBtree btree, GinBtreeStack *stack, GinStatsData *buildStats)
{ {
BlockNumber rightlink = GinPageGetOpaque(page)->rightlink; BlockNumber rightlink = GinPageGetOpaque(page)->rightlink;
LockBuffer(parent->buffer, GIN_UNLOCK);
if (rightlink == InvalidBlockNumber) if (rightlink == InvalidBlockNumber)
{ {
/* /*
* rightmost page, but we don't find parent, we should use * rightmost page, but we don't find parent, we should use
* plain search... * plain search...
*/ */
LockBuffer(parent->buffer, GIN_UNLOCK);
ginFindParents(btree, stack, rootBlkno); ginFindParents(btree, stack, rootBlkno);
parent = stack->parent; parent = stack->parent;
Assert(parent != NULL); Assert(parent != NULL);
break; break;
} }
parent->buffer = ginStepRight(parent->buffer, btree->index, GIN_EXCLUSIVE);
parent->blkno = rightlink; parent->blkno = rightlink;
parent->buffer = ReleaseAndReadBuffer(parent->buffer, btree->index, parent->blkno);
LockBuffer(parent->buffer, GIN_EXCLUSIVE);
page = BufferGetPage(parent->buffer); page = BufferGetPage(parent->buffer);
} }
......
...@@ -105,16 +105,11 @@ moveRightIfItNeeded(GinBtreeData *btree, GinBtreeStack *stack) ...@@ -105,16 +105,11 @@ moveRightIfItNeeded(GinBtreeData *btree, GinBtreeStack *stack)
/* /*
* We scanned the whole page, so we should take right page * We scanned the whole page, so we should take right page
*/ */
stack->blkno = GinPageGetOpaque(page)->rightlink;
if (GinPageRightMost(page)) if (GinPageRightMost(page))
return false; /* no more pages */ return false; /* no more pages */
LockBuffer(stack->buffer, GIN_UNLOCK); stack->buffer = ginStepRight(stack->buffer, btree->index, GIN_SHARE);
stack->buffer = ReleaseAndReadBuffer(stack->buffer, stack->blkno = BufferGetBlockNumber(stack->buffer);
btree->index,
stack->blkno);
LockBuffer(stack->buffer, GIN_SHARE);
stack->off = FirstOffsetNumber; stack->off = FirstOffsetNumber;
} }
...@@ -132,7 +127,6 @@ scanPostingTree(Relation index, GinScanEntry scanEntry, ...@@ -132,7 +127,6 @@ scanPostingTree(Relation index, GinScanEntry scanEntry,
GinPostingTreeScan *gdi; GinPostingTreeScan *gdi;
Buffer buffer; Buffer buffer;
Page page; Page page;
BlockNumber blkno;
/* Descend to the leftmost leaf page */ /* Descend to the leftmost leaf page */
gdi = ginPrepareScanPostingTree(index, rootPostingTree, TRUE); gdi = ginPrepareScanPostingTree(index, rootPostingTree, TRUE);
...@@ -162,10 +156,7 @@ scanPostingTree(Relation index, GinScanEntry scanEntry, ...@@ -162,10 +156,7 @@ scanPostingTree(Relation index, GinScanEntry scanEntry,
if (GinPageRightMost(page)) if (GinPageRightMost(page))
break; /* no more pages */ break; /* no more pages */
blkno = GinPageGetOpaque(page)->rightlink; buffer = ginStepRight(buffer, index, GIN_SHARE);
LockBuffer(buffer, GIN_UNLOCK);
buffer = ReleaseAndReadBuffer(buffer, index, blkno);
LockBuffer(buffer, GIN_SHARE);
} }
UnlockReleaseBuffer(buffer); UnlockReleaseBuffer(buffer);
...@@ -543,7 +534,6 @@ static void ...@@ -543,7 +534,6 @@ static void
entryGetNextItem(GinState *ginstate, GinScanEntry entry) entryGetNextItem(GinState *ginstate, GinScanEntry entry)
{ {
Page page; Page page;
BlockNumber blkno;
for (;;) for (;;)
{ {
...@@ -561,23 +551,18 @@ entryGetNextItem(GinState *ginstate, GinScanEntry entry) ...@@ -561,23 +551,18 @@ entryGetNextItem(GinState *ginstate, GinScanEntry entry)
* It's needed to go by right link. During that we should refind * It's needed to go by right link. During that we should refind
* first ItemPointer greater that stored * first ItemPointer greater that stored
*/ */
if (GinPageRightMost(page))
blkno = GinPageGetOpaque(page)->rightlink;
LockBuffer(entry->buffer, GIN_UNLOCK);
if (blkno == InvalidBlockNumber)
{ {
ReleaseBuffer(entry->buffer); UnlockReleaseBuffer(entry->buffer);
ItemPointerSetInvalid(&entry->curItem); ItemPointerSetInvalid(&entry->curItem);
entry->buffer = InvalidBuffer; entry->buffer = InvalidBuffer;
entry->isFinished = TRUE; entry->isFinished = TRUE;
return; return;
} }
entry->buffer = ReleaseAndReadBuffer(entry->buffer, entry->buffer = ginStepRight(entry->buffer,
ginstate->index, ginstate->index,
blkno); GIN_SHARE);
LockBuffer(entry->buffer, GIN_SHARE);
page = BufferGetPage(entry->buffer); page = BufferGetPage(entry->buffer);
entry->offset = InvalidOffsetNumber; entry->offset = InvalidOffsetNumber;
......
...@@ -240,6 +240,9 @@ ginVacuumPostingTreeLeaves(GinVacuumState *gvs, BlockNumber blkno, bool isRoot, ...@@ -240,6 +240,9 @@ ginVacuumPostingTreeLeaves(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
return hasVoidPage; return hasVoidPage;
} }
/*
* Delete a posting tree page.
*/
static void static void
ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkno, ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkno,
BlockNumber parentBlkno, OffsetNumber myoff, bool isParentRoot) BlockNumber parentBlkno, OffsetNumber myoff, bool isParentRoot)
...@@ -249,39 +252,35 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn ...@@ -249,39 +252,35 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
Buffer pBuffer; Buffer pBuffer;
Page page, Page page,
parentPage; parentPage;
BlockNumber rightlink;
dBuffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, deleteBlkno, /*
RBM_NORMAL, gvs->strategy); * Lock the pages in the same order as an insertion would, to avoid
* deadlocks: left, then right, then parent.
if (leftBlkno != InvalidBlockNumber) */
lBuffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, leftBlkno, lBuffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, leftBlkno,
RBM_NORMAL, gvs->strategy); RBM_NORMAL, gvs->strategy);
else dBuffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, deleteBlkno,
lBuffer = InvalidBuffer; RBM_NORMAL, gvs->strategy);
pBuffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, parentBlkno, pBuffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, parentBlkno,
RBM_NORMAL, gvs->strategy); RBM_NORMAL, gvs->strategy);
LockBuffer(lBuffer, GIN_EXCLUSIVE);
LockBuffer(dBuffer, GIN_EXCLUSIVE); LockBuffer(dBuffer, GIN_EXCLUSIVE);
if (!isParentRoot) /* parent is already locked by if (!isParentRoot) /* parent is already locked by
* LockBufferForCleanup() */ * LockBufferForCleanup() */
LockBuffer(pBuffer, GIN_EXCLUSIVE); LockBuffer(pBuffer, GIN_EXCLUSIVE);
if (leftBlkno != InvalidBlockNumber)
LockBuffer(lBuffer, GIN_EXCLUSIVE);
START_CRIT_SECTION(); START_CRIT_SECTION();
if (leftBlkno != InvalidBlockNumber) /* Unlink the page by changing left sibling's rightlink */
{
BlockNumber rightlink;
page = BufferGetPage(dBuffer); page = BufferGetPage(dBuffer);
rightlink = GinPageGetOpaque(page)->rightlink; rightlink = GinPageGetOpaque(page)->rightlink;
page = BufferGetPage(lBuffer); page = BufferGetPage(lBuffer);
GinPageGetOpaque(page)->rightlink = rightlink; GinPageGetOpaque(page)->rightlink = rightlink;
}
/* Delete downlink from parent */
parentPage = BufferGetPage(pBuffer); parentPage = BufferGetPage(pBuffer);
#ifdef USE_ASSERT_CHECKING #ifdef USE_ASSERT_CHECKING
do do
...@@ -363,10 +362,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn ...@@ -363,10 +362,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
if (!isParentRoot) if (!isParentRoot)
LockBuffer(pBuffer, GIN_UNLOCK); LockBuffer(pBuffer, GIN_UNLOCK);
ReleaseBuffer(pBuffer); ReleaseBuffer(pBuffer);
if (leftBlkno != InvalidBlockNumber)
UnlockReleaseBuffer(lBuffer); UnlockReleaseBuffer(lBuffer);
UnlockReleaseBuffer(dBuffer); UnlockReleaseBuffer(dBuffer);
END_CRIT_SECTION(); END_CRIT_SECTION();
...@@ -435,10 +431,9 @@ ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot, ...@@ -435,10 +431,9 @@ ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
if (GinPageGetOpaque(page)->maxoff < FirstOffsetNumber) if (GinPageGetOpaque(page)->maxoff < FirstOffsetNumber)
{ {
/* the page is empty */ /* we never delete the left- or rightmost branch */
if (!(me->leftBlkno == InvalidBlockNumber && GinPageRightMost(page))) if (me->leftBlkno != InvalidBlockNumber && !GinPageRightMost(page))
{ {
/* we never delete right most branch */
Assert(!isRoot); Assert(!isRoot);
ginDeletePage(gvs, blkno, me->leftBlkno, me->parent->blkno, myoff, me->parent->isRoot); ginDeletePage(gvs, blkno, me->leftBlkno, me->parent->blkno, myoff, me->parent->isRoot);
meDelete = TRUE; meDelete = TRUE;
......
...@@ -516,6 +516,7 @@ typedef struct GinBtreeData ...@@ -516,6 +516,7 @@ typedef struct GinBtreeData
extern GinBtreeStack *ginPrepareFindLeafPage(GinBtree btree, BlockNumber blkno); extern GinBtreeStack *ginPrepareFindLeafPage(GinBtree btree, BlockNumber blkno);
extern GinBtreeStack *ginFindLeafPage(GinBtree btree, GinBtreeStack *stack); extern GinBtreeStack *ginFindLeafPage(GinBtree btree, GinBtreeStack *stack);
extern Buffer ginStepRight(Buffer buffer, Relation index, int lockmode);
extern void freeGinBtreeStack(GinBtreeStack *stack); extern void freeGinBtreeStack(GinBtreeStack *stack);
extern void ginInsertValue(GinBtree btree, GinBtreeStack *stack, extern void ginInsertValue(GinBtree btree, GinBtreeStack *stack,
GinStatsData *buildStats); GinStatsData *buildStats);
......
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment