Commit 9de3aa65 authored by Heikki Linnakangas

Rewrite the GiST insertion logic so that we don't need the post-recovery

cleanup stage to finish incomplete inserts or splits anymore. There were two
reasons for the cleanup step:

1. When a new tuple was inserted to a leaf page, the downlink in the parent
needed to be updated to contain (i.e., to be consistent with) the new key.
Updating the parent in turn might require recursively updating the parent of
the parent. We now handle that by updating the parent while traversing down
the tree, so that when we insert the leaf tuple, all the parents are already
consistent with the new key, and the tree is consistent at every step.
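
A toy sketch of that idea (illustrative only, not code from this patch; integer
intervals stand in for GiST keys, and the struct and function names are made up):

	/* Toy illustration only -- not PostgreSQL code. */
	typedef struct ToyNode
	{
		int				lo, hi;			/* "downlink" key: range covered by this subtree */
		struct ToyNode *left, *right;	/* both NULL for a leaf */
	} ToyNode;

	static int
	toy_enlargement(const ToyNode *n, int key)	/* the "Penalty" of descending into n */
	{
		if (key < n->lo)
			return n->lo - key;
		if (key > n->hi)
			return key - n->hi;
		return 0;
	}

	static void
	toy_insert(ToyNode *node, int key)
	{
		while (node != NULL)
		{
			/* widen this node's key first, so the ancestors are always consistent */
			if (key < node->lo)
				node->lo = key;
			if (key > node->hi)
				node->hi = key;

			if (node->left == NULL)
				break;				/* leaf reached; the key is stored here */

			/* descend into the child that needs the least enlargement */
			node = (toy_enlargement(node->left, key) <=
					toy_enlargement(node->right, key)) ? node->left : node->right;
		}
	}

Page splits are ignored in this sketch; they are what the F_FOLLOW_RIGHT
mechanism described next takes care of.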

2. When a page is split, we need to insert the downlink for the new right
page(s), and update the downlink for the original page to not include keys
that moved to the right page(s). We now handle that by setting a new flag,
F_FOLLOW_RIGHT, on the non-rightmost pages in the split. When that flag is
set, scans always follow the rightlink, regardless of the NSN mechanism used
to detect concurrent page splits. That way the tree is consistent right after
split, even though the downlink is still missing. This is very similar to the
way B-tree splits are handled. When the downlink is inserted in the parent,
the flag is cleared. To keep the insertion algorithm simple, when an
insertion sees an incomplete split, indicated by the F_FOLLOW_RIGHT flag, it
finishes the split before doing anything else.
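
Illustrative sketch of that two-step split protocol, with made-up toy types
(the real code manipulates the page opaque data, WAL records and buffer locks):

	/* Toy illustration only -- not PostgreSQL code. */
	typedef struct ToyPageState
	{
		int		follow_right;	/* stands in for F_FOLLOW_RIGHT */
		long	nsn;			/* stands in for the page NSN */
	} ToyPageState;

	/* step 1: the page is split; the new right sibling has no downlink yet */
	static void
	toy_split_step1(ToyPageState *leftpage)
	{
		leftpage->follow_right = 1;
	}

	/* step 2: the downlink is inserted into the parent at WAL position "lsn" */
	static void
	toy_split_step2(ToyPageState *leftpage, long lsn_of_parent_insert)
	{
		leftpage->follow_right = 0;
		leftpage->nsn = lsn_of_parent_insert;
	}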

These changes allow removing the whole "invalid tuple" mechanism, but I
retained the scan code to still follow invalid tuples correctly. While we
don't create any such tuples anymore, we want to handle them gracefully in
case you pg_upgrade a GiST index that has them. If we encounter any on an
insert, though, we just throw an error saying that you need to REINDEX.

The issue that got me into doing this is that if you did a checkpoint while
an insert or split was in progress, and the checkpoint finishes quickly so
that there is no WAL record related to the insert between RedoRecPtr and the
checkpoint record, recovery from that checkpoint would not know to finish
the incomplete insert. IOW, the issue we solved with the rm_safe_restartpoint
mechanism for restartpoints exists during normal operation too. It's highly
unlikely to happen in practice, and this fix is far too large to backpatch,
so we're just going to live with it in previous versions, but this refactoring
fixes it going forward.

With this patch, you don't get the annoying
'index "FOO" needs VACUUM or REINDEX to finish crash recovery' notices
anymore if you crash at an unfortunate moment.
parent 7a1ca897
@@ -709,33 +709,4 @@ my_distance(PG_FUNCTION_ARGS)
</sect1>
<sect1 id="gist-recovery">
<title>Crash Recovery</title>
<para>
Usually, replay of the WAL log is sufficient to restore the integrity
of a GiST index following a database crash. However, there are some
corner cases in which the index state is not fully rebuilt. The index
will still be functionally correct, but there might be some performance
degradation. When this occurs, the index can be repaired by
<command>VACUUM</>ing its table, or by rebuilding the index using
<command>REINDEX</>. In some cases a plain <command>VACUUM</> is
not sufficient, and either <command>VACUUM FULL</> or <command>REINDEX</>
is needed. The need for one of these procedures is indicated by occurrence
of this log message during crash recovery:
<programlisting>
LOG: index NNN/NNN/NNN needs VACUUM or REINDEX to finish crash recovery
</programlisting>
or this log message during routine index insertions:
<programlisting>
LOG: index "FOO" needs VACUUM or REINDEX to finish crash recovery
</programlisting>
If a plain <command>VACUUM</> finds itself unable to complete recovery
fully, it will return a notice:
<programlisting>
NOTICE: index "FOO" needs VACUUM FULL or REINDEX to finish crash recovery
</programlisting>
</para>
</sect1>
</chapter>
@@ -108,43 +108,71 @@ Penalty is used for choosing a subtree to insert; method PickSplit is used for
the node splitting algorithm; method Union is used for propagating changes
upward to maintain the tree properties.
NOTICE: We modified original INSERT algorithm for performance reason. In
particularly, it is now a single-pass algorithm.

Function findLeaf is used to identify subtree for insertion. Page, in which
insertion is proceeded, is locked as well as its parent page. Functions
findParent and findPath are used to find parent pages, which could be changed
because of concurrent access. Function pageSplit is recurrent and could split
page by more than 2 pages, which could be necessary if keys have different
lengths or more than one key are inserted (in such situation, user defined
function pickSplit cannot guarantee free space on page).

findLeaf(new-key)
push(stack, [root, 0]) //page, LSN
while(true)
ptr = top of stack
latch( ptr->page, S-mode )
ptr->lsn = ptr->page->lsn
if ( exists ptr->parent AND ptr->parent->lsn < ptr->page->nsn )
unlatch( ptr->page )
pop stack
else if ( ptr->page is not leaf )
push( stack, [get_best_child(ptr->page, new-key), 0] )
unlatch( ptr->page )
else
unlatch( ptr->page )
latch( ptr->page, X-mode )
if ( ptr->page is not leaf )
//the only root page can become a non-leaf
unlatch( ptr->page )
else if ( ptr->parent->lsn < ptr->page->nsn )
unlatch( ptr->page )
pop stack
else
return stack
end
end
end

To insert a tuple, we first have to find a suitable leaf page to insert to.
The algorithm walks down the tree, starting from the root, along the path
of smallest Penalty. At each step:

1. Has this page been split since we looked at the parent? If so, it's
possible that we should be inserting to the other half instead, so retreat
back to the parent.
2. If this is a leaf node, we've found our target node.
3. Otherwise use Penalty to pick a new target subtree.
4. Check the key representing the target subtree. If it doesn't already cover
the key we're inserting, replace it with the Union of the old downlink key
and the key being inserted. (Actually, we always call Union, and just skip
the replacement if the Unioned key is the same as the existing key)
5. Replacing the key in step 4 might cause the page to be split. In that case,
propagate the change upwards and restart the algorithm from the first parent
that didn't need to be split.
6. Walk down to the target subtree, and goto 1.

This differs from the insertion algorithm in the original paper. In the
original paper, you first walk down the tree until you reach a leaf page, and
then you adjust the downlink in the parent, and propagate the adjustment up,
all the way up to the root in the worst case. But we adjust the downlinks to
cover the new key already when we walk down, so that when we reach the leaf
page, we don't need to update the parents anymore, except to insert the
downlinks if we have to split the page. This makes crash recovery simpler:
after inserting a key to the page, the tree is immediately self-consistent
without having to update the parents. Even if we split a page and crash before
inserting the downlink to the parent, the tree is self-consistent because the
right half of the split is accessible via the rightlink of the left page
(which replaced the original page).

Note that the algorithm can walk up and down the tree before reaching a leaf
page, if internal pages need to split while adjusting the downlinks for the
new key. Eventually, you should reach the bottom, and proceed with the
insertion of the new tuple.

Once we've found the target page to insert to, we check if there's room
for the new tuple. If there is, the tuple is inserted, and we're done.
If it doesn't fit, however, the page needs to be split. Note that it is
possible that a page needs to be split into more than two pages, if keys have
different lengths or more than one key is being inserted at a time (which can
happen when inserting downlinks for a page split that resulted in more than
two pages at the lower level). After splitting a page, the parent page needs
to be updated. The downlink for the new page needs to be inserted, and the
downlink for the old page, which became the left half of the split, needs to
be updated to only cover those tuples that stayed on the left page. Inserting
the downlink in the parent can again lead to a page split, recursing up to the
root page in the worst case.
gistplacetopage is the workhorse function that performs one step of the
insertion. If the tuple fits, it inserts it to the given page, otherwise
it splits the page, and constructs the new downlink tuples for the split
pages. The caller must then call gistplacetopage() on the parent page to
insert the downlink tuples. The parent page that holds the downlink to
the child might have migrated as a result of concurrent splits of the
parent, gistfindCorrectParent() is used to find the parent page.
Splitting the root page works slightly differently. At root split,
gistplacetopage() allocates the new child pages and replaces the old root
page with the new root containing downlinks to the new children, all in one
operation.
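
A hedged sketch of that loop follows; toy_place_to_page and
toy_find_correct_parent are made-up stand-ins for gistplacetopage and
gistfindCorrectParent, whose real signatures and buffer/WAL handling are more
involved:

	/* Hedged sketch with made-up helpers -- not the real function signatures. */
	typedef struct ToyStackItem
	{
		struct ToyStackItem *parent;	/* NULL for the root */
	} ToyStackItem;

	typedef struct ToySplitResult
	{
		int		nsplit;			/* 0 when the tuples simply fit on the page */
		int		downlinks[8];	/* downlinks for the pages created by a split */
	} ToySplitResult;

	/* stand-ins for gistplacetopage() and gistfindCorrectParent() */
	extern ToySplitResult toy_place_to_page(ToyStackItem *item, const int *keys, int nkeys);
	extern ToyStackItem *toy_find_correct_parent(ToyStackItem *parent);

	static void
	toy_insert_and_propagate(ToyStackItem *leaf, int key)
	{
		ToyStackItem   *item = leaf;
		ToySplitResult	r = toy_place_to_page(item, &key, 1);

		while (r.nsplit > 0 && item->parent != NULL)
		{
			/* the parent may have been split concurrently; re-locate it first */
			item = toy_find_correct_parent(item->parent);

			/* inserting the downlinks may split the parent in turn */
			r = toy_place_to_page(item, r.downlinks, r.nsplit);
		}
		/* if the root itself split, the real code builds a new root in one step */
	}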
findPath is a subroutine of findParent, used when the correct parent page
can't be found by following the rightlinks at the parent level:
findPath( stack item )
push stack, [root, 0, 0] // page, LSN, parent
@@ -165,9 +193,13 @@ findPath( stack item )
pop stack
end
gistFindCorrectParent is used to re-find the parent of a page during
insertion. It might have migrated to the right since we traversed down the
tree because of page splits.
findParent( stack item )
parent = item->parent
latch( parent->page, X-mode )
if ( parent->page->lsn != parent->lsn )
while(true)
search parent tuple on parent->page, if found the return
@@ -181,9 +213,13 @@ findParent( stack item )
end
newstack = findPath( item->parent )
replace part of stack to new one
latch( parent->page, X-mode )
return findParent( item )
end
pageSplit function decides how to distribute keys to the new pages after
page split:
pageSplit(page, allkeys)
(lkeys, rkeys) = pickSplit( allkeys )
if ( page is root )
@@ -204,39 +240,44 @@ pageSplit(page, allkeys)
return newkeys
placetopage(page, keysarray)
if ( no space left on page )
keysarray = pageSplit(page, [ extract_keys(page), keysarray])
last page in chain gets old NSN,
original and others - new NSN equals to LSN
if ( page is root )
make new root with keysarray
end
else
put keysarray on page
if ( length of keysarray > 1 )
keysarray = [ union(keysarray) ]
end
end
insert(new-key)
stack = findLeaf(new-key)
keysarray = [new-key]
ptr = top of stack
while(true)
findParent( ptr ) //findParent latches parent page
keysarray = placetopage(ptr->page, keysarray)
unlatch( ptr->page )
pop stack;
ptr = top of stack
if (length of keysarray == 1)
newboundingkey = union(oldboundingkey, keysarray)
if (newboundingkey == oldboundingkey)
unlatch ptr->page
break loop
end
end
end

Concurrency control
-------------------
As a rule of thumb, if you need to hold a lock on multiple pages at the
same time, the locks should be acquired in the following order: child page
before parent, and left-to-right at the same level. Always acquiring the
locks in the same order avoids deadlocks.

The search algorithm only looks at and locks one page at a time. Consequently
there's a race condition between a search and a page split. A page split
happens in two phases: 1. The page is split 2. The downlink is inserted to the
parent. If a search looks at the parent page between those steps, before the
downlink is inserted, it will still find the new right half by following the
rightlink on the left half. But it must not follow the rightlink if it saw the
downlink in the parent, or the page will be visited twice!

A split initially marks the left page with the F_FOLLOW_RIGHT flag. If a scan
sees that flag set, it knows that the right page is missing the downlink, and
should be visited too. When split inserts the downlink to the parent, it
clears the F_FOLLOW_RIGHT flag in the child, and sets the NSN field in the
child page header to match the LSN of the insertion on the parent. If the
F_FOLLOW_RIGHT flag is not set, a scan compares the NSN on the child and the
LSN it saw in the parent. If NSN < LSN, the scan looked at the parent page
before the downlink was inserted, so it should follow the rightlink. Otherwise
the scan saw the downlink in the parent page, and will/did follow that as
usual.
A scan can't normally see a page with the F_FOLLOW_RIGHT flag set, because
a page split keeps the child pages locked until the downlink has been inserted
to the parent and the flag cleared again. But if a crash happens in the middle
of a page split, before the downlinks are inserted into the parent, that will
leave a page with F_FOLLOW_RIGHT in the tree. Scans handle that just fine,
but we'll eventually want to fix that for performance reasons. And more
importantly, dealing with pages with missing downlink pointers in the parent
would complicate the insertion algorithm. So when an insertion sees a page
with F_FOLLOW_RIGHT set, it immediately tries to bring the split that
crashed in the middle to completion by adding the downlink in the parent.
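
The resulting scan-side test is small; here is a toy version with plain
integers standing in for LSN/NSN values (the real check is the
GistFollowRight/XLByteLT condition in gistScanPage and pushStackIfSplited
shown further down in this commit):

	/* Toy sketch of the scan-side check -- not the backend code. */
	typedef struct ToyPageOpaque
	{
		long	nsn;			/* LSN of the parent insertion that completed the split */
		int		follow_right;	/* F_FOLLOW_RIGHT: right sibling has no downlink yet */
		int		rightlink;		/* block number of the right sibling, or -1 */
	} ToyPageOpaque;

	static int
	toy_must_visit_right_sibling(const ToyPageOpaque *opaque, long parent_lsn)
	{
		if (opaque->rightlink < 0)
			return 0;					/* sanity check: nothing to the right */
		if (opaque->follow_right)
			return 1;					/* incomplete split, e.g. after a crash */
		return parent_lsn < opaque->nsn; /* parent was read before the split finished */
	}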
Authors:
Teodor Sigaev <teodor@sigaev.ru>
......
This diff is collapsed.
@@ -254,9 +254,15 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, double *myDistances,
page = BufferGetPage(buffer);
opaque = GistPageGetOpaque(page);
/* check if page split occurred since visit to parent */
/*
* Check if we need to follow the rightlink. We need to follow it if the
* page was concurrently split since we visited the parent (in which case
* parentlsn < nsn), or if the system crashed after a page split but
* before the downlink was inserted into the parent.
*/
if (!XLogRecPtrIsInvalid(pageItem->data.parentlsn) &&
XLByteLT(pageItem->data.parentlsn, opaque->nsn) &&
(GistFollowRight(page) ||
XLByteLT(pageItem->data.parentlsn, opaque->nsn)) &&
opaque->rightlink != InvalidBlockNumber /* sanity check */ )
{
/* There was a page split, follow right link to add pages */
......
@@ -499,58 +499,6 @@ gistSplitHalf(GIST_SPLITVEC *v, int len)
v->spl_left[v->spl_nleft++] = i;
}
/*
* if it was invalid tuple then we need special processing.
* We move all invalid tuples on right page.
*
* if there is no place on left page, gistSplit will be called one more
* time for left page.
*
* Normally, we never exec this code, but after crash replay it's possible
* to get 'invalid' tuples (probability is low enough)
*/
static void
gistSplitByInvalid(GISTSTATE *giststate, GistSplitVector *v, IndexTuple *itup, int len)
{
int i;
static OffsetNumber offInvTuples[MaxOffsetNumber];
int nOffInvTuples = 0;
for (i = 1; i <= len; i++)
if (GistTupleIsInvalid(itup[i - 1]))
offInvTuples[nOffInvTuples++] = i;
if (nOffInvTuples == len)
{
/* corner case, all tuples are invalid */
v->spl_rightvalid = v->spl_leftvalid = false;
gistSplitHalf(&v->splitVector, len);
}
else
{
GistSplitUnion gsvp;
v->splitVector.spl_right = offInvTuples;
v->splitVector.spl_nright = nOffInvTuples;
v->spl_rightvalid = false;
v->splitVector.spl_left = (OffsetNumber *) palloc(len * sizeof(OffsetNumber));
v->splitVector.spl_nleft = 0;
for (i = 1; i <= len; i++)
if (!GistTupleIsInvalid(itup[i - 1]))
v->splitVector.spl_left[v->splitVector.spl_nleft++] = i;
v->spl_leftvalid = true;
gsvp.equiv = NULL;
gsvp.attr = v->spl_lattr;
gsvp.len = v->splitVector.spl_nleft;
gsvp.entries = v->splitVector.spl_left;
gsvp.isnull = v->spl_lisnull;
gistunionsubkeyvec(giststate, itup, &gsvp, 0);
}
}
/*
* trys to split page by attno key, in a case of null
* values move its to separate page.
@@ -568,12 +516,6 @@ gistSplitByKey(Relation r, Page page, IndexTuple *itup, int len, GISTSTATE *gist
Datum datum;
bool IsNull;
if (!GistPageIsLeaf(page) && GistTupleIsInvalid(itup[i - 1]))
{
gistSplitByInvalid(giststate, v, itup, len);
return;
}
datum = index_getattr(itup[i - 1], attno + 1, giststate->tupdesc, &IsNull);
gistdentryinit(giststate, attno, &(entryvec->vector[i]),
datum, r, page, i,
@@ -582,8 +524,6 @@ gistSplitByKey(Relation r, Page page, IndexTuple *itup, int len, GISTSTATE *gist
offNullTuples[nOffNullTuples++] = i;
}
v->spl_leftvalid = v->spl_rightvalid = true;
if (nOffNullTuples == len)
{
/*
......
@@ -152,7 +152,7 @@ gistfillitupvec(IndexTuple *vec, int veclen, int *memlen)
* invalid tuple. Resulting Datums aren't compressed.
*/
bool
void
gistMakeUnionItVec(GISTSTATE *giststate, IndexTuple *itvec, int len, int startkey,
Datum *attr, bool *isnull)
{
@@ -180,10 +180,6 @@ gistMakeUnionItVec(GISTSTATE *giststate, IndexTuple *itvec, int len, int startke
Datum datum;
bool IsNull;
if (GistTupleIsInvalid(itvec[j]))
return FALSE; /* signals that union with invalid tuple =>
* result is invalid */
datum = index_getattr(itvec[j], i + 1, giststate->tupdesc, &IsNull);
if (IsNull)
continue;
@@ -218,8 +214,6 @@ gistMakeUnionItVec(GISTSTATE *giststate, IndexTuple *itvec, int len, int startke
isnull[i] = FALSE;
}
}
return TRUE;
}
/*
@@ -231,8 +225,7 @@ gistunion(Relation r, IndexTuple *itvec, int len, GISTSTATE *giststate)
{
memset(isnullS, TRUE, sizeof(bool) * giststate->tupdesc->natts);
if (!gistMakeUnionItVec(giststate, itvec, len, 0, attrS, isnullS))
return gist_form_invalid_tuple(InvalidBlockNumber);
gistMakeUnionItVec(giststate, itvec, len, 0, attrS, isnullS);
return gistFormTuple(giststate, r, attrS, isnullS, false);
}
@@ -328,9 +321,6 @@ gistgetadjusted(Relation r, IndexTuple oldtup, IndexTuple addtup, GISTSTATE *gis
IndexTuple newtup = NULL;
int i;
if (GistTupleIsInvalid(oldtup) || GistTupleIsInvalid(addtup))
return gist_form_invalid_tuple(ItemPointerGetBlockNumber(&(oldtup->t_tid)));
gistDeCompressAtt(giststate, r, oldtup, NULL,
(OffsetNumber) 0, oldentries, oldisnull);
@@ -401,14 +391,6 @@ gistchoose(Relation r, Page p, IndexTuple it, /* it has compressed entry */
int j;
IndexTuple itup = (IndexTuple) PageGetItem(p, PageGetItemId(p, i));
if (!GistPageIsLeaf(p) && GistTupleIsInvalid(itup))
{
ereport(LOG,
(errmsg("index \"%s\" needs VACUUM or REINDEX to finish crash recovery",
RelationGetRelationName(r))));
continue;
}
sum_grow = 0;
for (j = 0; j < r->rd_att->natts; j++)
{
@@ -521,7 +503,11 @@ gistFormTuple(GISTSTATE *giststate, Relation r,
}
res = index_form_tuple(giststate->tupdesc, compatt, isnull);
GistTupleSetValid(res);
/*
* The offset number on tuples on internal pages is unused. For historical
* reasons, it is set 0xffff.
*/
ItemPointerSetOffsetNumber( &(res->t_tid), 0xffff);
return res;
}
......
@@ -26,13 +26,6 @@
#include "utils/memutils.h"
typedef struct GistBulkDeleteResult
{
IndexBulkDeleteResult std; /* common state */
bool needReindex;
} GistBulkDeleteResult;
/*
* VACUUM cleanup: update FSM
*/
@@ -40,7 +33,7 @@ Datum
gistvacuumcleanup(PG_FUNCTION_ARGS)
{
IndexVacuumInfo *info = (IndexVacuumInfo *) PG_GETARG_POINTER(0);
GistBulkDeleteResult *stats = (GistBulkDeleteResult *) PG_GETARG_POINTER(1);
IndexBulkDeleteResult *stats = (IndexBulkDeleteResult *) PG_GETARG_POINTER(1);
Relation rel = info->index;
BlockNumber npages,
blkno;
@@ -56,10 +49,10 @@ gistvacuumcleanup(PG_FUNCTION_ARGS)
/* Set up all-zero stats if gistbulkdelete wasn't called */
if (stats == NULL)
{
stats = (GistBulkDeleteResult *) palloc0(sizeof(GistBulkDeleteResult));
stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
/* use heap's tuple count */
stats->std.num_index_tuples = info->num_heap_tuples;
stats->std.estimated_count = info->estimated_count;
stats->num_index_tuples = info->num_heap_tuples;
stats->estimated_count = info->estimated_count;
/*
* XXX the above is wrong if index is partial. Would it be OK to just
@@ -67,11 +60,6 @@ gistvacuumcleanup(PG_FUNCTION_ARGS)
*/
}
if (stats->needReindex)
ereport(NOTICE,
(errmsg("index \"%s\" needs VACUUM FULL or REINDEX to finish crash recovery",
RelationGetRelationName(rel))));
/*
* Need lock unless it's local to this backend.
*/
@@ -112,10 +100,10 @@ gistvacuumcleanup(PG_FUNCTION_ARGS)
IndexFreeSpaceMapVacuum(info->index);
/* return statistics */
stats->std.pages_free = totFreePages;
stats->pages_free = totFreePages;
if (needLock)
LockRelationForExtension(rel, ExclusiveLock);
stats->std.num_pages = RelationGetNumberOfBlocks(rel);
stats->num_pages = RelationGetNumberOfBlocks(rel);
if (needLock)
UnlockRelationForExtension(rel, ExclusiveLock);
@@ -135,7 +123,7 @@ pushStackIfSplited(Page page, GistBDItem *stack)
GISTPageOpaque opaque = GistPageGetOpaque(page);
if (stack->blkno != GIST_ROOT_BLKNO && !XLogRecPtrIsInvalid(stack->parentlsn) &&
XLByteLT(stack->parentlsn, opaque->nsn) &&
(GistFollowRight(page) || XLByteLT(stack->parentlsn, opaque->nsn)) &&
opaque->rightlink != InvalidBlockNumber /* sanity check */ )
{
/* split page detected, install right link to the stack */
@@ -162,7 +150,7 @@ Datum
gistbulkdelete(PG_FUNCTION_ARGS)
{
IndexVacuumInfo *info = (IndexVacuumInfo *) PG_GETARG_POINTER(0);
GistBulkDeleteResult *stats = (GistBulkDeleteResult *) PG_GETARG_POINTER(1);
IndexBulkDeleteResult *stats = (IndexBulkDeleteResult *) PG_GETARG_POINTER(1);
IndexBulkDeleteCallback callback = (IndexBulkDeleteCallback) PG_GETARG_POINTER(2);
void *callback_state = (void *) PG_GETARG_POINTER(3);
Relation rel = info->index;
@@ -171,10 +159,10 @@ gistbulkdelete(PG_FUNCTION_ARGS)
/* first time through? */
if (stats == NULL)
stats = (GistBulkDeleteResult *) palloc0(sizeof(GistBulkDeleteResult));
stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
/* we'll re-count the tuples each time */
stats->std.estimated_count = false;
stats->std.num_index_tuples = 0;
stats->estimated_count = false;
stats->num_index_tuples = 0;
stack = (GistBDItem *) palloc0(sizeof(GistBDItem));
stack->blkno = GIST_ROOT_BLKNO;
@@ -232,10 +220,10 @@ gistbulkdelete(PG_FUNCTION_ARGS)
{
todelete[ntodelete] = i - ntodelete;
ntodelete++;
stats->std.tuples_removed += 1;
stats->tuples_removed += 1;
}
else
stats->std.num_index_tuples += 1;
stats->num_index_tuples += 1;
}
if (ntodelete)
@@ -250,22 +238,13 @@ gistbulkdelete(PG_FUNCTION_ARGS)
if (RelationNeedsWAL(rel))
{
XLogRecData *rdata;
XLogRecPtr recptr;
gistxlogPageUpdate *xlinfo;
rdata = formUpdateRdata(rel->rd_node, buffer,
todelete, ntodelete,
NULL, 0,
NULL);
xlinfo = (gistxlogPageUpdate *) rdata->next->data;
recptr = XLogInsert(RM_GIST_ID, XLOG_GIST_PAGE_UPDATE, rdata);
recptr = gistXLogUpdate(rel->rd_node, buffer,
todelete, ntodelete,
NULL, 0, InvalidBuffer);
PageSetLSN(page, recptr);
PageSetTLI(page, ThisTimeLineID);
pfree(xlinfo);
pfree(rdata);
}
else
PageSetLSN(page, GetXLogRecPtrForTemp());
@@ -293,7 +272,11 @@ gistbulkdelete(PG_FUNCTION_ARGS)
stack->next = ptr;
if (GistTupleIsInvalid(idxtuple))
stats->needReindex = true;
ereport(LOG,
(errmsg("index \"%s\" contains an inner tuple marked as invalid",
RelationGetRelationName(rel)),
errdetail("This is caused by an incomplete page split at crash recovery before upgrading to 9.1."),
errhint("Please REINDEX it.")));
}
}
......
This diff is collapsed.
@@ -40,6 +40,6 @@ const RmgrData RmgrTable[RM_MAX_ID + 1] = {
{"Btree", btree_redo, btree_desc, btree_xlog_startup, btree_xlog_cleanup, btree_safe_restartpoint},
{"Hash", hash_redo, hash_desc, NULL, NULL, NULL},
{"Gin", gin_redo, gin_desc, gin_xlog_startup, gin_xlog_cleanup, gin_safe_restartpoint},
{"Gist", gist_redo, gist_desc, gist_xlog_startup, gist_xlog_cleanup, gist_safe_restartpoint},
{"Gist", gist_redo, gist_desc, gist_xlog_startup, gist_xlog_cleanup, NULL},
{"Sequence", seq_redo, seq_desc, NULL, NULL, NULL}
};
@@ -58,9 +58,10 @@
/*
* Page opaque data in a GiST index page.
*/
#define F_LEAF (1 << 0)
#define F_DELETED (1 << 1)
#define F_TUPLES_DELETED (1 << 2)
#define F_LEAF (1 << 0) /* leaf page */
#define F_DELETED (1 << 1) /* the page has been deleted */
#define F_TUPLES_DELETED (1 << 2) /* some tuples on the page are dead */
#define F_FOLLOW_RIGHT (1 << 3) /* page to the right has no downlink */
typedef XLogRecPtr GistNSN;
@@ -132,6 +133,10 @@ typedef struct GISTENTRY
#define GistMarkTuplesDeleted(page) ( GistPageGetOpaque(page)->flags |= F_TUPLES_DELETED)
#define GistClearTuplesDeleted(page) ( GistPageGetOpaque(page)->flags &= ~F_TUPLES_DELETED)
#define GistFollowRight(page) ( GistPageGetOpaque(page)->flags & F_FOLLOW_RIGHT)
#define GistMarkFollowRight(page) ( GistPageGetOpaque(page)->flags |= F_FOLLOW_RIGHT)
#define GistClearFollowRight(page) ( GistPageGetOpaque(page)->flags &= ~F_FOLLOW_RIGHT)
/*
* Vector of GISTENTRY structs; user-defined methods union and picksplit
* take it as one of their arguments
......
@@ -132,9 +132,9 @@ typedef GISTScanOpaqueData *GISTScanOpaque;
/* XLog stuff */
#define XLOG_GIST_PAGE_UPDATE 0x00
#define XLOG_GIST_NEW_ROOT 0x20
/* #define XLOG_GIST_NEW_ROOT 0x20 */ /* not used anymore */
#define XLOG_GIST_PAGE_SPLIT 0x30
#define XLOG_GIST_INSERT_COMPLETE 0x40
/* #define XLOG_GIST_INSERT_COMPLETE 0x40 */ /* not used anymore */
#define XLOG_GIST_CREATE_INDEX 0x50
#define XLOG_GIST_PAGE_DELETE 0x60
@@ -144,9 +144,10 @@ typedef struct gistxlogPageUpdate
BlockNumber blkno;
/*
* It used to identify completeness of insert. Sets to leaf itup
* If this operation completes a page split, by inserting a downlink for
* the split page, leftchild points to the left half of the split.
*/
ItemPointerData key;
BlockNumber leftchild;
/* number of deleted offsets */
uint16 ntodelete;
@@ -160,11 +161,12 @@ typedef struct gistxlogPageSplit
{
RelFileNode node;
BlockNumber origblkno; /* splitted page */
BlockNumber origrlink; /* rightlink of the page before split */
GistNSN orignsn; /* NSN of the page before split */
bool origleaf; /* was splitted page a leaf page? */
uint16 npage;
/* see comments on gistxlogPageUpdate */
ItemPointerData key;
BlockNumber leftchild; /* like in gistxlogPageUpdate */
uint16 npage; /* # of pages in the split */
/*
* follow: 1. gistxlogPage and array of IndexTupleData per page
@@ -177,12 +179,6 @@ typedef struct gistxlogPage
int num; /* number of index tuples following */
} gistxlogPage;
typedef struct gistxlogInsertComplete
{
RelFileNode node;
/* follows ItemPointerData key to clean */
} gistxlogInsertComplete;
typedef struct gistxlogPageDelete
{
RelFileNode node;
@@ -206,7 +202,6 @@ typedef struct SplitedPageLayout
* GISTInsertStack used for locking buffers and transfer arguments during
* insertion
*/
typedef struct GISTInsertStack
{
/* current page */
@@ -215,7 +210,7 @@ typedef struct GISTInsertStack
Page page;
/*
* log sequence number from page->lsn to recognize page update and
* compare it with page's nsn to recognize page split
*/
GistNSN lsn;
@@ -223,9 +218,8 @@ typedef struct GISTInsertStack
/* child's offset */
OffsetNumber childoffnum;
/* pointer to parent and child */
/* pointer to parent */
struct GISTInsertStack *parent;
struct GISTInsertStack *child;
/* for gistFindPath */
struct GISTInsertStack *next;
@@ -238,12 +232,10 @@ typedef struct GistSplitVector
Datum spl_lattr[INDEX_MAX_KEYS]; /* Union of subkeys in
* spl_left */
bool spl_lisnull[INDEX_MAX_KEYS];
bool spl_leftvalid;
Datum spl_rattr[INDEX_MAX_KEYS]; /* Union of subkeys in
* spl_right */
bool spl_risnull[INDEX_MAX_KEYS];
bool spl_rightvalid;
bool *spl_equiv; /* equivalent tuples which can be freely
* distributed between left and right pages */
@@ -252,28 +244,40 @@ typedef struct GistSplitVector
typedef struct
{
Relation r;
IndexTuple *itup; /* in/out, points to compressed entry */
int ituplen; /* length of itup */
Size freespace; /* free space to be left */
GISTInsertStack *stack;
bool needInsertComplete;
/* pointer to heap tuple */
ItemPointerData key;
GISTInsertStack *stack;
} GISTInsertState;
/* root page of a gist index */
#define GIST_ROOT_BLKNO 0
/*
* mark tuples on inner pages during recovery
* Before PostgreSQL 9.1, we used to rely on so-called "invalid tuples" on inner
* pages to finish crash recovery of incomplete page splits. If a crash
* happened in the middle of a page split, so that the downlink pointers were
* not yet inserted, crash recovery inserted a special downlink pointer. The
* semantics of an invalid tuple was that if you encounter one in a scan,
* it must always be followed, because we don't know if the tuples on the
* child page match or not.
*
* We no longer create such invalid tuples, we now mark the left-half of such
* an incomplete split with the F_FOLLOW_RIGHT flag instead, and finish the
* split properly the next time we need to insert on that page. To retain
* on-disk compatibility for the sake of pg_upgrade, we still store 0xffff as
* the offset number of all inner tuples. If we encounter any invalid tuples
* with 0xfffe during insertion, we throw an error, though scans still handle
* them. You should only encounter invalid tuples if you pg_upgrade a pre-9.1
* gist index which already has invalid tuples in it because of a crash. That
* should be rare, and you are recommended to REINDEX anyway if you have any
* invalid tuples in an index, so throwing an error is as far as we go with
* supporting that.
*/
#define TUPLE_IS_VALID 0xffff
#define TUPLE_IS_INVALID 0xfffe
#define GistTupleIsInvalid(itup) ( ItemPointerGetOffsetNumber( &((itup)->t_tid) ) == TUPLE_IS_INVALID )
#define GistTupleSetValid(itup) ItemPointerSetOffsetNumber( &((itup)->t_tid), TUPLE_IS_VALID )
#define GistTupleSetInvalid(itup) ItemPointerSetOffsetNumber( &((itup)->t_tid), TUPLE_IS_INVALID )
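
A toy sketch of that policy with made-up types (not the backend code): the
0xfffe marker forces a scan to visit the child unconditionally, while an
insertion refuses to continue and asks for a REINDEX.

	/* Toy illustration of the invalid-tuple policy -- not PostgreSQL code. */
	#include <stdbool.h>
	#include <stdio.h>

	#define TOY_OFFSET_VALID	0xffff
	#define TOY_OFFSET_INVALID	0xfffe

	typedef struct ToyInnerTuple
	{
		unsigned short	offsetnum;		/* 0xffff normally, 0xfffe if invalid */
		int				child_blkno;
	} ToyInnerTuple;

	/* scan side: an invalid downlink must always be followed */
	static bool
	toy_scan_must_visit(const ToyInnerTuple *tup, bool key_matches)
	{
		return tup->offsetnum == TOY_OFFSET_INVALID || key_matches;
	}

	/* insert side: refuse to work on an index that still has invalid tuples */
	static bool
	toy_insert_allowed(const ToyInnerTuple *tup, const char *indexname)
	{
		if (tup->offsetnum == TOY_OFFSET_INVALID)
		{
			fprintf(stderr, "index \"%s\" has a pre-9.1 invalid tuple: REINDEX it\n",
					indexname);
			return false;
		}
		return true;
	}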
/* gist.c */
extern Datum gistbuild(PG_FUNCTION_ARGS);
@@ -281,8 +285,6 @@ extern Datum gistinsert(PG_FUNCTION_ARGS);
extern MemoryContext createTempGistContext(void);
extern void initGISTstate(GISTSTATE *giststate, Relation index);
extern void freeGISTstate(GISTSTATE *giststate);
extern void gistmakedeal(GISTInsertState *state, GISTSTATE *giststate);
extern void gistnewroot(Relation r, Buffer buffer, IndexTuple *itup, int len, ItemPointer key);
extern SplitedPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
int len, GISTSTATE *giststate);
@@ -294,18 +296,17 @@ extern void gist_redo(XLogRecPtr lsn, XLogRecord *record);
extern void gist_desc(StringInfo buf, uint8 xl_info, char *rec);
extern void gist_xlog_startup(void);
extern void gist_xlog_cleanup(void);
extern bool gist_safe_restartpoint(void);
extern IndexTuple gist_form_invalid_tuple(BlockNumber blkno);
extern XLogRecData *formUpdateRdata(RelFileNode node, Buffer buffer,
OffsetNumber *todelete, int ntodelete,
IndexTuple *itup, int ituplen, ItemPointer key);
extern XLogRecData *formSplitRdata(RelFileNode node,
BlockNumber blkno, bool page_is_leaf,
ItemPointer key, SplitedPageLayout *dist);
extern XLogRecPtr gistxlogInsertCompletion(RelFileNode node, ItemPointerData *keys, int len);
extern XLogRecPtr gistXLogUpdate(RelFileNode node, Buffer buffer,
OffsetNumber *todelete, int ntodelete,
IndexTuple *itup, int ntup,
Buffer leftchild);
extern XLogRecPtr gistXLogSplit(RelFileNode node,
BlockNumber blkno, bool page_is_leaf,
SplitedPageLayout *dist,
BlockNumber origrlink, GistNSN oldnsn,
Buffer leftchild);
/* gistget.c */
extern Datum gistgettuple(PG_FUNCTION_ARGS);
@@ -357,7 +358,7 @@ extern void gistdentryinit(GISTSTATE *giststate, int nkey, GISTENTRY *e,
extern float gistpenalty(GISTSTATE *giststate, int attno,
GISTENTRY *key1, bool isNull1,
GISTENTRY *key2, bool isNull2);
extern bool gistMakeUnionItVec(GISTSTATE *giststate, IndexTuple *itvec, int len, int startkey,
extern void gistMakeUnionItVec(GISTSTATE *giststate, IndexTuple *itvec, int len, int startkey,
Datum *attr, bool *isnull);
extern bool gistKeyIsEQ(GISTSTATE *giststate, int attno, Datum a, Datum b);
extern void gistDeCompressAtt(GISTSTATE *giststate, Relation r, IndexTuple tuple, Page p,
......