Commit bc3087b6 authored by Peter Geoghegan

Harmonize nbtree page split point code.

An nbtree split point can be thought of as a point between two adjoining
tuples from an imaginary version of the page being split that includes
the incoming/new item (in addition to the items that really are on the
page).  These adjoining tuples are called the lastleft and firstright
tuples.
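To make the "imaginary page" concrete, here is a toy sketch in plain C (not nbtree code; the array, values, and index positions are invented for illustration) that builds that imaginary item list and prints the lastleft/firstright pair for every possible split point:

#include <stdio.h>

int
main(void)
{
    /* items already on the page, in key order */
    int     origitems[] = {10, 20, 30, 40};
    int     norigitems = 4;
    /* incoming item and the position it would occupy */
    int     newitem = 25;
    int     newitemidx = 2;
    int     imaginary[5];
    int     j = 0;

    /* build the imaginary version of the page that includes the new item */
    for (int i = 0; i <= norigitems; i++)
    {
        if (i == newitemidx)
            imaginary[i] = newitem;
        else
            imaginary[i] = origitems[j++];
    }

    /* every split point lies between two adjoining imaginary tuples */
    for (int split = 1; split <= norigitems; split++)
        printf("split %d: lastleft=%d, firstright=%d\n",
               split, imaginary[split - 1], imaginary[split]);

    return 0;
}

For split point 2 in this toy output the firstright tuple is the incoming item itself rather than an item from the original page, which is exactly the case the next paragraph describes.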

The variables that represent split points contained a field called
firstright, which is an offset number of the first data item from the
original page that goes on the new right page.  The corresponding tuple
from origpage was usually the same thing as the actual firstright tuple,
but not always: the firstright tuple is sometimes the new/incoming item
instead.  This situation seems unnecessarily confusing.
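Put differently, the offset alone does not identify the firstright tuple; the incoming item has to be taken into account. A hedged sketch of that relationship (hypothetical helper name; real bufpage.h calls, but not a verbatim excerpt from _bt_split()):

#include "postgres.h"
#include "access/itup.h"
#include "storage/bufpage.h"

/*
 * Sketch only: map a split point (firstrightoff/newitemoff/newitemonleft)
 * to the firstright tuple it implies.
 */
static IndexTuple
sketch_firstright_tuple(Page origpage, OffsetNumber firstrightoff,
                        OffsetNumber newitemoff, bool newitemonleft,
                        IndexTuple newitem)
{
    if (!newitemonleft && newitemoff == firstrightoff)
    {
        /* incoming item becomes the first item on the new right page */
        return newitem;
    }

    /* otherwise the origpage item at firstrightoff is the firstright tuple */
    return (IndexTuple) PageGetItem(origpage,
                                    PageGetItemId(origpage, firstrightoff));
}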

Make things clearer by renaming the origpage offset returned by
_bt_findsplitloc() to "firstrightoff".  We now have a firstright tuple
and a firstrightoff offset number which are comparable to the
newitem/lastleft tuples and the newitemoff/lastleftoff offset numbers
respectively.  Also make sure that we are consistent about how we
describe nbtree page split point state.

Push the responsibility for dealing with pg_upgrade'd !heapkeyspace
indexes down to lower level code, relieving _bt_split() from dealing
with it directly.  This means that we always have a palloc'd left page
high key on the leaf level, no matter what.  This enables simplifying
some of the code (and code comments) within _bt_split().
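A rough sketch of what the caller's side looks like under that arrangement (hypothetical helper name, simplified; the !heapkeyspace handling itself now lives in _bt_truncate()/_bt_keep_natts(), per the _bt_keep_natts() hunk below, which keeps every key attribute there while still truncating away non-key attributes):

#include "postgres.h"
#include "access/nbtree.h"

/*
 * Sketch only: on the leaf level the new left page high key is always a
 * freshly palloc'd pivot tuple from _bt_truncate(), even for pg_upgrade'd
 * !heapkeyspace indexes.
 */
static IndexTuple
sketch_leaf_left_highkey(Relation rel, IndexTuple lastleft,
                         IndexTuple firstright, BTScanInsert itup_key)
{
    return _bt_truncate(rel, lastleft, firstright, itup_key);
}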

Finally, restructure the page split code to make it clearer why suffix
truncation (which only takes place during leaf page splits) is
completely different to the first data item truncation that takes place
during internal page splits.  Tuples are marked as having fewer
attributes stored in both cases, and the firstright tuple is truncated
in both cases, so it's easy to imagine somebody missing the distinction.
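For contrast, a hedged sketch of the internal-page case only (hypothetical helper name; it mirrors the header-only truncation pattern visible in the _bt_sortaddtup() hunk below rather than quoting _bt_split()). The firstright tuple becomes the right page's first data item with just its header retained, so it stores no attributes and is treated as minus infinity, while the separator key itself survives as the left page's high key:

#include "postgres.h"
#include "access/itup.h"
#include "storage/bufpage.h"

/*
 * Sketch only: add firstright to the new right page as a "minus infinity"
 * first data item.  Compare leaf-level suffix truncation, which builds a
 * brand-new pivot tuple via _bt_truncate() and may keep several key
 * attributes.
 */
static void
sketch_add_minusinfinity_item(Page rightpage, IndexTuple firstright,
                              OffsetNumber itup_off)
{
    IndexTupleData trunctuple;

    /* keep only the tuple header; all attribute values are discarded */
    trunctuple = *firstright;
    trunctuple.t_info = sizeof(IndexTupleData);

    if (PageAddItem(rightpage, (Item) &trunctuple, sizeof(IndexTupleData),
                    itup_off, false, false) == InvalidOffsetNumber)
        elog(ERROR, "failed to add item to the index page");
}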
parent 8f00d84a
@@ -1121,7 +1121,7 @@ bt_target_page_check(BtreeCheckState *state)
          * designated purpose.  Enforce the lower limit for pivot tuples when
          * an explicit heap TID isn't actually present. (In all other cases
          * suffix truncation is guaranteed to generate a pivot tuple that's no
-         * larger than the first right tuple provided to it by its caller.)
+         * larger than the firstright tuple provided to it by its caller.)
          */
         lowersizelimit = skey->heapkeyspace &&
             (P_ISLEAF(topaque) || BTreeTupleGetHeapTID(itup) == NULL);
...
@@ -269,7 +269,8 @@ static Page _bt_blnewpage(uint32 level);
 static BTPageState *_bt_pagestate(BTWriteState *wstate, uint32 level);
 static void _bt_slideleft(Page page);
 static void _bt_sortaddtup(Page page, Size itemsize,
-                           IndexTuple itup, OffsetNumber itup_off);
+                           IndexTuple itup, OffsetNumber itup_off,
+                           bool newfirstdataitem);
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
                          IndexTuple itup, Size truncextra);
 static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
@@ -750,26 +751,24 @@ _bt_slideleft(Page page)
 /*
  * Add an item to a page being built.
  *
- * The main difference between this routine and a bare PageAddItem call
- * is that this code knows that the leftmost data item on a non-leaf btree
- * page has a key that must be treated as minus infinity.  Therefore, it
- * truncates away all attributes.
+ * This is very similar to nbtinsert.c's _bt_pgaddtup(), but this variant
+ * raises an error directly.
  *
- * This is almost like nbtinsert.c's _bt_pgaddtup(), but we can't use
- * that because it assumes that P_RIGHTMOST() will return the correct
- * answer for the page.  Here, we don't know yet if the page will be
- * rightmost.  Offset P_FIRSTKEY is always the first data key.
+ * Note that our nbtsort.c caller does not know yet if the page will be
+ * rightmost.  Offset P_FIRSTKEY is always assumed to be the first data key by
+ * caller.  Page that turns out to be the rightmost on its level is fixed by
+ * calling _bt_slideleft().
  */
 static void
 _bt_sortaddtup(Page page,
                Size itemsize,
                IndexTuple itup,
-               OffsetNumber itup_off)
+               OffsetNumber itup_off,
+               bool newfirstdataitem)
 {
-    BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
     IndexTupleData trunctuple;

-    if (!P_ISLEAF(opaque) && itup_off == P_FIRSTKEY)
+    if (newfirstdataitem)
     {
         trunctuple = *itup;
         trunctuple.t_info = sizeof(IndexTupleData);
@@ -867,12 +866,13 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
      * Every newly built index will treat heap TID as part of the keyspace,
      * which imposes the requirement that new high keys must occasionally have
      * a heap TID appended within _bt_truncate().  That may leave a new pivot
-     * tuple one or two MAXALIGN() quantums larger than the original first
-     * right tuple it's derived from.  v4 deals with the problem by decreasing
-     * the limit on the size of tuples inserted on the leaf level by the same
-     * small amount.  Enforce the new v4+ limit on the leaf level, and the old
-     * limit on internal levels, since pivot tuples may need to make use of
-     * the reserved space.  This should never fail on internal pages.
+     * tuple one or two MAXALIGN() quantums larger than the original
+     * firstright tuple it's derived from.  v4 deals with the problem by
+     * decreasing the limit on the size of tuples inserted on the leaf level
+     * by the same small amount.  Enforce the new v4+ limit on the leaf level,
+     * and the old limit on internal levels, since pivot tuples may need to
+     * make use of the reserved space.  This should never fail on internal
+     * pages.
      */
     if (unlikely(itupsz > BTMaxItemSize(npage)))
         _bt_check_third_page(wstate->index, wstate->heap, isleaf, npage,
@@ -925,7 +925,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
         Assert(last_off > P_FIRSTKEY);
         ii = PageGetItemId(opage, last_off);
         oitup = (IndexTuple) PageGetItem(opage, ii);
-        _bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
+        _bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY,
+                       !isleaf);

         /*
          * Move 'last' into the high key position on opage.  _bt_blnewpage()
@@ -1054,7 +1055,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
      * Add the new item into the current page.
      */
     last_off = OffsetNumberNext(last_off);
-    _bt_sortaddtup(npage, itupsz, itup, last_off);
+    _bt_sortaddtup(npage, itupsz, itup, last_off,
+                   !isleaf && last_off == P_FIRSTKEY);

     state->btps_page = npage;
     state->btps_blkno = nblkno;
...
@@ -2346,17 +2346,12 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
     ScanKey     scankey;

     /*
-     * Be consistent about the representation of BTREE_VERSION 2/3 tuples
-     * across Postgres versions; don't allow new pivot tuples to have
-     * truncated key attributes there.  _bt_compare() treats truncated key
-     * attributes as having the value minus infinity, which would break
-     * searches within !heapkeyspace indexes.
+     * _bt_compare() treats truncated key attributes as having the value minus
+     * infinity, which would break searches within !heapkeyspace indexes.  We
+     * must still truncate away non-key attribute values, though.
      */
     if (!itup_key->heapkeyspace)
-    {
-        Assert(nkeyatts != IndexRelationGetNumberOfAttributes(rel));
         return nkeyatts;
-    }

     scankey = itup_key->scankeys;
     keepnatts = 1;
...
@@ -251,7 +251,7 @@ btree_xlog_insert(bool isleaf, bool ismeta, bool posting,
 }

 static void
-btree_xlog_split(bool onleft, XLogReaderState *record)
+btree_xlog_split(bool newitemonleft, XLogReaderState *record)
 {
     XLogRecPtr  lsn = record->EndRecPtr;
     xl_btree_split *xlrec = (xl_btree_split *) XLogRecGetData(record);
@@ -323,7 +323,7 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
     datapos = XLogRecGetBlockData(record, 0, &datalen);

-    if (onleft || xlrec->postingoff != 0)
+    if (newitemonleft || xlrec->postingoff != 0)
     {
         newitem = (IndexTuple) datapos;
         newitemsz = MAXALIGN(IndexTupleSize(newitem));
@@ -368,7 +368,7 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
         elog(PANIC, "failed to add high key to left page after split");
     leftoff = OffsetNumberNext(leftoff);

-    for (off = P_FIRSTDATAKEY(lopaque); off < xlrec->firstright; off++)
+    for (off = P_FIRSTDATAKEY(lopaque); off < xlrec->firstrightoff; off++)
     {
         ItemId      itemid;
         Size        itemsz;
@@ -377,7 +377,8 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
         /* Add replacement posting list when required */
         if (off == replacepostingoff)
         {
-            Assert(onleft || xlrec->firstright == xlrec->newitemoff);
+            Assert(newitemonleft ||
+                   xlrec->firstrightoff == xlrec->newitemoff);
             if (PageAddItem(newlpage, (Item) nposting,
                             MAXALIGN(IndexTupleSize(nposting)), leftoff,
                             false, false) == InvalidOffsetNumber)
@@ -387,7 +388,7 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
         }
         /* add the new item if it was inserted on left page */
-        else if (onleft && off == xlrec->newitemoff)
+        else if (newitemonleft && off == xlrec->newitemoff)
         {
             if (PageAddItem(newlpage, (Item) newitem, newitemsz, leftoff,
                             false, false) == InvalidOffsetNumber)
@@ -405,7 +406,7 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
     }

     /* cope with possibility that newitem goes at the end */
-    if (onleft && off == xlrec->newitemoff)
+    if (newitemonleft && off == xlrec->newitemoff)
     {
         if (PageAddItem(newlpage, (Item) newitem, newitemsz, leftoff,
                         false, false) == InvalidOffsetNumber)
...
@@ -39,8 +39,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
             {
                 xl_btree_split *xlrec = (xl_btree_split *) rec;

-                appendStringInfo(buf, "level %u, firstright %d, newitemoff %d, postingoff %d",
-                                 xlrec->level, xlrec->firstright,
+                appendStringInfo(buf, "level %u, firstrightoff %d, newitemoff %d, postingoff %d",
+                                 xlrec->level, xlrec->firstrightoff,
                                  xlrec->newitemoff, xlrec->postingoff);
                 break;
             }
...
@@ -99,9 +99,9 @@ typedef struct xl_btree_insert
 * left or right split page (and thus, whether the new item is stored or not).
 * We always log the left page high key because suffix truncation can generate
 * a new leaf high key using user-defined code.  This is also necessary on
-* internal pages, since the first right item that the left page's high key
-* was based on will have been truncated to zero attributes in the right page
-* (the original is unavailable from the right page).
+* internal pages, since the firstright item that the left page's high key was
+* based on will have been truncated to zero attributes in the right page (the
+* separator key is unavailable from the right page).
 *
 * Backup Blk 0: original page / new left page
 *
@@ -153,7 +153,7 @@ typedef struct xl_btree_insert
 typedef struct xl_btree_split
 {
     uint32          level;          /* tree level of page being split */
-    OffsetNumber    firstright;     /* first item moved to right page */
+    OffsetNumber    firstrightoff;  /* first origpage item on rightpage */
     OffsetNumber    newitemoff;     /* new item's offset */
     uint16          postingoff;     /* offset inside orig posting tuple */
 } xl_btree_split;
...