Commit 9e85183b authored by Tom Lane

Major overhaul of btree index code. Eliminate special BTP_CHAIN logic for
duplicate keys by letting search go to the left rather than right when an
equal key is seen at an upper tree level.  Fix poor choice of page split
point (leading to insertion failures) that was forced by chaining logic.
Don't store leftmost key in non-leaf pages, since it's not necessary.
Don't create root page until something is first stored in the index, so an
unused index is now 8K not 16K.  (Doesn't seem to be as easy to get rid of
the metadata page, unfortunately.)  Massive cleanup of unreadable code,
fix poor, obsolete, and just plain wrong documentation and comments.
See src/backend/access/nbtree/README for the gory details.
parent c9537ca8
$Header: /cvsroot/pgsql/src/backend/access/nbtree/README,v 1.1.1.1 1996/07/09 06:21:12 scrappy Exp $ $Header: /cvsroot/pgsql/src/backend/access/nbtree/README,v 1.2 2000/07/21 06:42:32 tgl Exp $
This directory contains a correct implementation of Lehman and Yao's This directory contains a correct implementation of Lehman and Yao's
btree management algorithm that supports concurrent access for Postgres. high-concurrency B-tree management algorithm (P. Lehman and S. Yao,
Efficient Locking for Concurrent Operations on B-Trees, ACM Transactions
on Database Systems, Vol 6, No. 4, December 1981, pp 650-670).
We have made the following changes in order to incorporate their algorithm We have made the following changes in order to incorporate their algorithm
into Postgres: into Postgres:
+ The requirement that all btree keys be unique is too onerous, + The requirement that all btree keys be unique is too onerous,
but the algorithm won't work correctly without it. As a result, but the algorithm won't work correctly without it. Fortunately, it is
this implementation adds an OID (guaranteed to be unique) to only necessary that keys be unique on a single tree level, because L&Y
every key in the index. This guarantees uniqueness within a set only use the assumption of key uniqueness when re-finding a key in a
of duplicates. Space overhead is four bytes. parent node (to determine where to insert the key for a split page).
Therefore, we can use the link field to disambiguate multiple
For this reason, when we're passed an index tuple to store by the occurrences of the same user key: only one entry in the parent level
common access method code, we allocate a larger one and copy the will be pointing at the page we had split. (Indeed we need not look at
supplied tuple into it. No Postgres code outside of the btree the real "key" at all, just at the link field.) We can distinguish
access method knows about this xid or sequence number. items at the leaf level in the same way, by examining their links to
heap tuples; we'd never have two items for the same heap tuple.
+ Lehman and Yao don't require read locks, but assume that in-
memory copies of tree nodes are unshared. Postgres shares + Lehman and Yao assume that the key range for a subtree S is described
in-memory buffers among backends. As a result, we do page- by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
level read locking on btree nodes in order to guarantee that node. This does not work for nonunique keys (for example, if we have
no record is modified while we are examining it. This reduces enough equal keys to spread across several leaf pages, there *must* be
concurrency but guarantees correct behavior. some equal bounding keys in the first level up). Therefore we assume
Ki <= v <= Ki+1 instead. A search that finds exact equality to a
+ Read locks on a page are held for as long as a scan has a pointer bounding key in an upper tree level must descend to the left of that
to the page. However, locks are always surrendered before the key to ensure it finds any equal keys in the preceding page. An
sibling page lock is acquired (for readers), so we remain deadlock- insertion that sees the high key of its target page is equal to the key
free. I will do a formal proof if I get bored anytime soon. to be inserted has a choice whether or not to move right, since the new
key could go on either page. (Currently, we try to find a page where
there is room for the new key without a split.)
+ Lehman and Yao don't require read locks, but assume that in-memory
copies of tree nodes are unshared. Postgres shares in-memory buffers
among backends. As a result, we do page-level read locking on btree
nodes in order to guarantee that no record is modified while we are
examining it. This reduces concurrency but guarantees correct
behavior. An advantage is that when trading in a read lock for a
write lock, we need not re-read the page after getting the write lock.
Since we're also holding a pin on the shared buffer containing the
page, we know that buffer still contains the page and is up-to-date.
+ We support the notion of an ordered "scan" of an index as well as
insertions, deletions, and simple lookups. A scan in the forward
direction is no problem, we just use the right-sibling pointers that
L&Y require anyway. (Thus, once we have descended the tree to the
correct start point for the scan, the scan looks only at leaf pages
and never at higher tree levels.) To support scans in the backward
direction, we also store a "left sibling" link much like the "right
sibling". (This adds an extra step to the L&Y split algorithm: while
holding the write lock on the page being split, we also lock its former
right sibling to update that page's left-link. This is safe since no
writer of that page can be interested in acquiring a write lock on our
page.) A backwards scan has one additional bit of complexity: after
following the left-link we must account for the possibility that the
left sibling page got split before we could read it. So, we have to
move right until we find a page whose right-link matches the page we
came from.
+ Read locks on a page are held for as long as a scan has a pointer
to the page. However, locks are always surrendered before the
sibling page lock is acquired (for readers), so we remain deadlock-
free. I will do a formal proof if I get bored anytime soon.
NOTE: nbtree.c arranges to drop the read lock, but not the buffer pin,
on the current page of a scan before control leaves nbtree. When we
come back to resume the scan, we have to re-grab the read lock and
then move right if the current item moved (see _bt_restscan()).
+ Lehman and Yao fail to discuss what must happen when the root page
becomes full and must be split. Our implementation is to split the
root in the same way that any other page would be split, then construct
a new root page holding pointers to both of the resulting pages (which
now become siblings on level 2 of the tree). The new root page is then
installed by altering the root pointer in the meta-data page (see
below). This works because the root is not treated specially in any
other way --- in particular, searches will move right using its link
pointer if the link is set. Therefore, searches will find the data
that's been moved into the right sibling even if they read the metadata
page before it got updated. This is the same reasoning that makes a
split of a non-root page safe. The locking considerations are similar too.
+ Lehman and Yao assume fixed-size keys, but we must deal with
variable-size keys. Therefore there is not a fixed maximum number of
keys per page; we just stuff in as many as will fit. When we split a
page, we try to equalize the number of bytes, not items, assigned to
each of the resulting pages. Note we must include the incoming item in
this calculation, otherwise it is possible to find that the incoming
item doesn't fit on the split page where it needs to go!
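The byte-balancing idea in the last item above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the code in nbtinsert.c: `SplitChoice`, `size_at` and `choose_split` are hypothetical names, offsets are 0-based positions in the combined sequence of existing items plus the incoming one, and page capacity and the high key are ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type for this sketch only. */
typedef struct {
    size_t firstright;    /* combined-sequence index of first item going right */
    int    newitemonleft; /* nonzero if the incoming item lands on the left page */
    size_t delta;         /* |left bytes - right bytes| for the chosen split */
} SplitChoice;

/* Size of the i'th item once the new item is placed at position newpos. */
static size_t size_at(const size_t *sizes, size_t newpos, size_t newsz, size_t i)
{
    if (i < newpos)
        return sizes[i];
    if (i == newpos)
        return newsz;
    return sizes[i - 1];
}

/*
 * Try every split point that leaves at least one item on each side and
 * keep the one with the smallest byte imbalance.  The incoming item is
 * counted, so it is always known which side it ends up on.
 */
SplitChoice choose_split(const size_t *sizes, size_t n, size_t newpos, size_t newsz)
{
    size_t total = newsz, left = 0, i;
    SplitChoice best = {0, 0, (size_t) -1};

    assert(n >= 1 && newpos <= n);
    for (i = 0; i < n; i++)
        total += sizes[i];

    for (i = 1; i <= n; i++) {
        size_t right, delta;

        left += size_at(sizes, newpos, newsz, i - 1);
        right = total - left;
        delta = left > right ? left - right : right - left;
        if (delta < best.delta) {
            best.firstright = i;
            best.newitemonleft = newpos < i;
            best.delta = delta;
        }
    }
    return best;
}
```

With three 100-byte items and a 300-byte item arriving at the end, balancing by item count would put 200 bytes on the left and 400 on the right, whereas this sketch puts all three existing items on the left and the new item alone on the right.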
In addition, the following things are handy to know: In addition, the following things are handy to know:
+ Page zero of every btree is a meta-data page. This page stores + Page zero of every btree is a meta-data page. This page stores
the location of the root page, a pointer to a list of free the location of the root page, a pointer to a list of free
pages, and other stuff that's handy to know. pages, and other stuff that's handy to know. (Currently, we
never shrink btree indexes so there are never any free pages.)
+ This algorithm doesn't really work, since it requires ordered
writes, and UNIX doesn't support ordered writes. + The algorithm assumes we can fit at least three items per page
(a "high key" and two real data items). Therefore it's unsafe
+ There's one other case where we may screw up in this to accept items larger than 1/3rd page size. Larger items would
implementation. When we start a scan, we descend the tree work sometimes, but could cause failures later on depending on
to the key nearest the one in the qual, and once we get there, what else gets put on their page.
position ourselves correctly for the qual type (eg, <, >=, etc).
If we happen to step off a page, decide we want to get back to + This algorithm doesn't guarantee btree consistency after a kernel crash
it, and fetch the page again, and if some bad person has split or hardware failure. To do that, we'd need ordered writes, and UNIX
the page and moved the last tuple we saw off of it, then the doesn't support ordered writes (short of fsync'ing every update, which
code complains about botched concurrency in an elog(WARN, ...) is too high a price). Rebuilding corrupted indexes during restart
and gives up the ghost. This is the ONLY violation of Lehman seems more attractive.
and Yao's guarantee of correct behavior that I am aware of in
this code. + On deletions, we need to adjust the position of active scans on
the index. The code in nbtscan.c handles this. We don't need to
do this for insertions or splits because _bt_restscan can find the
new position of the previously-found item. NOTE that nbtscan.c
only copes with deletions issued by the current backend. This
essentially means that concurrent deletions are not supported, but
that's true already in the Lehman and Yao algorithm. nbtscan.c
exists only to support VACUUM and allow it to delete items while
it's scanning the index.
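The "three items per page" rule noted in this list translates into a simple size bound. The sketch below mirrors the shape of the limit computed in _bt_insertonpg (usable page space divided by three, minus one line pointer), but every name and every size passed to it is an assumption made for illustration; the real header, opaque-data and line-pointer sizes come from the Postgres headers.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed alignment boundary for this sketch; the real value is platform-specific. */
#define SKETCH_ALIGN 8
#define SKETCH_MAXALIGN(x) (((x) + SKETCH_ALIGN - 1) & ~((size_t) SKETCH_ALIGN - 1))

/*
 * Largest item that still lets three items share one page.
 * All four sizes are supplied by the caller.
 */
size_t max_item_size(size_t page_size, size_t header_size,
                     size_t opaque_size, size_t itemid_size)
{
    size_t usable;

    assert(page_size > header_size + SKETCH_MAXALIGN(opaque_size));
    usable = page_size - header_size - SKETCH_MAXALIGN(opaque_size);
    assert(usable / 3 > itemid_size);
    return usable / 3 - itemid_size;
}
```

For example, with an 8192-byte page and assumed sizes of 24 bytes of header, 16 bytes of opaque data and 4-byte line pointers, the bound comes out at 2713 bytes.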
Notes about data representation:
+ The right-sibling link required by L&Y is kept in the page "opaque
data" area, as is the left-sibling link and some flags.
+ We also keep a parent link in the opaque data, but this link is not
very trustworthy because it is not updated when the parent page splits.
Thus, it points to some page on the parent level, but possibly a page
well to the left of the page's actual current parent. In most cases
we do not need this link at all. Normally we return to a parent page
using a stack of entries that are made as we descend the tree, as in L&Y.
There is exactly one case where the stack will not help: concurrent
root splits. If an inserter process needs to split what had been the
root when it started its descent, but finds that that page is no longer
the root (because someone else split it meanwhile), then it uses the
parent link to move up to the next level. This is OK because we do fix
the parent link in a former root page when splitting it. This logic
will work even if the root is split multiple times (even up to creation
of multiple new levels) before an inserter returns to it. The same
could not be said of finding the new root via the metapage, since that
would work only for a single level of added root.
+ The Postgres disk block data format (an array of items) doesn't fit
Lehman and Yao's alternating-keys-and-pointers notion of a disk page,
so we have to play some games.
+ On a page that is not rightmost in its tree level, the "high key" is
kept in the page's first item, and real data items start at item 2.
The link portion of the "high key" item goes unused. A page that is
rightmost has no "high key", so data items start with the first item.
Putting the high key at the left, rather than the right, may seem odd,
but it avoids moving the high key as we add data items.
+ On a leaf page, the data items are simply links to (TIDs of) tuples
in the relation being indexed, with the associated key values.
+ On a non-leaf page, the data items are down-links to child pages with
bounding keys. The key in each data item is the *lower* bound for
keys on that child page, so logically the key is to the left of that
downlink. The high key (if present) is the upper bound for the last
downlink. The first data item on each such page has no lower bound
--- or lower bound of minus infinity, if you prefer. The comparison
routines must treat it accordingly. The actual key stored in the
item is irrelevant, and need not be stored at all. This arrangement
corresponds to the fact that an L&Y non-leaf page has one more pointer
than key.
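A small C sketch may help make the non-leaf layout concrete. It assumes integer keys and a fixed-size in-memory page, and `SketchItem`, `SketchPage`, `first_data_item` and `choose_downlink` are hypothetical names, not the nbtree ones. It shows three things described above: the high key occupying item 1 on pages that are not rightmost, the first data item acting as minus infinity, and a search that meets an equal bounding key descending to the left of it.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    int key;    /* lower bound for the child; in a high-key item, the page's upper bound */
    int child;  /* downlink page number; unused in a high-key item */
} SketchItem;

typedef struct {
    bool       rightmost; /* rightmost pages carry no high key */
    int        nitems;
    SketchItem items[8];  /* items[0] is item number 1 */
} SketchPage;

/* Item number (1-based) of the first real data item. */
int first_data_item(const SketchPage *p)
{
    return p->rightmost ? 1 : 2;
}

/*
 * Pick the item number of the downlink to follow for search key v.
 * The first data item's key is never examined (it counts as minus
 * infinity).  We return the last item whose key is strictly less than
 * v, so equality with a bounding key sends the search to the left.
 */
int choose_downlink(const SketchPage *p, int v)
{
    int first = first_data_item(p);
    int lo = first + 1;     /* first item that has a meaningful key */
    int hi = p->nitems + 1; /* one past the last item */

    assert(p->nitems >= first);
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;

        if (p->items[mid - 1].key < v)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo - 1;
}
```

On a page holding a high key of 50 followed by downlinks with keys (ignored), 20 and 20, a search for 20 follows the first downlink, while a search for 21 follows the last one.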
Notes to operator class implementors: Notes to operator class implementors:
With this implementation, we require the user to supply us with + With this implementation, we require the user to supply us with
a procedure for pg_amproc. This procedure should take two keys a procedure for pg_amproc. This procedure should take two keys
A and B and return < 0, 0, or > 0 if A < B, A = B, or A > B, A and B and return < 0, 0, or > 0 if A < B, A = B, or A > B,
respectively. See the contents of that relation for the btree respectively. See the contents of that relation for the btree
access method for some samples. access method for some samples.
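As an illustration of the contract described above, here is a minimal three-way comparison for 32-bit integers. `sketch_int4_cmp` and `keys_in_order` are hypothetical names, and the sketch ignores the function-call conventions a real pg_amproc entry has to follow. The result is built from two comparisons instead of `a - b`, since the subtraction can overflow for operands of opposite sign.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return <0, 0, or >0 according to whether a < b, a = b, or a > b. */
int32_t sketch_int4_cmp(int32_t a, int32_t b)
{
    return (a > b) - (a < b);
}

/* Example use: check that an array of keys is in non-decreasing order. */
int keys_in_order(const int32_t *keys, size_t n)
{
    size_t i;

    assert(keys != NULL || n == 0);
    for (i = 1; i < n; i++)
        if (sketch_int4_cmp(keys[i - 1], keys[i]) > 0)
            return 0;
    return 1;
}
```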
Notes to mao for implementation document:
On deletions, we need to adjust the position of active scans on
the index. The code in nbtscan.c handles this. We don't need to
do this for splits because of the way splits are handled; if they
happen behind us, we'll automatically go to the next page, and if
they happen in front of us, we're not affected by them. For
insertions, if we inserted a tuple behind the current scan location
on the current scan page, we move one space ahead.
@@ -8,7 +8,7 @@
* *
* *
* IDENTIFICATION * IDENTIFICATION
* $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtinsert.c,v 1.59 2000/06/08 22:36:52 momjian Exp $ * $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtinsert.c,v 1.60 2000/07/21 06:42:32 tgl Exp $
* *
*------------------------------------------------------------------------- *-------------------------------------------------------------------------
*/ */
@@ -19,53 +19,76 @@
#include "access/nbtree.h" #include "access/nbtree.h"
static InsertIndexResult _bt_insertonpg(Relation rel, Buffer buf, BTStack stack, int keysz, ScanKey scankey, BTItem btitem, BTItem afteritem); typedef struct
static Buffer _bt_split(Relation rel, Size keysz, ScanKey scankey, {
Buffer buf, OffsetNumber firstright); /* context data for _bt_checksplitloc */
static OffsetNumber _bt_findsplitloc(Relation rel, Size keysz, ScanKey scankey, Size newitemsz; /* size of new item to be inserted */
Page page, OffsetNumber start, bool non_leaf; /* T if splitting an internal node */
OffsetNumber maxoff, Size llimit);
bool have_split; /* found a valid split? */
/* these fields valid only if have_split is true */
bool newitemonleft; /* new item on left or right of best split */
OffsetNumber firstright; /* best split point */
int best_delta; /* best size delta so far */
} FindSplitData;
static TransactionId _bt_check_unique(Relation rel, BTItem btitem,
Relation heapRel, Buffer buf,
ScanKey itup_scankey);
static InsertIndexResult _bt_insertonpg(Relation rel, Buffer buf,
BTStack stack,
int keysz, ScanKey scankey,
BTItem btitem,
OffsetNumber afteritem);
static Buffer _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
OffsetNumber newitemoff, Size newitemsz,
BTItem newitem, bool newitemonleft,
OffsetNumber *itup_off, BlockNumber *itup_blkno);
static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
OffsetNumber newitemoff,
Size newitemsz,
bool *newitemonleft);
static void _bt_checksplitloc(FindSplitData *state, OffsetNumber firstright,
int leftfree, int rightfree,
bool newitemonleft, Size firstrightitemsz);
static Buffer _bt_getstackbuf(Relation rel, BTStack stack);
static void _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf); static void _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf);
static OffsetNumber _bt_pgaddtup(Relation rel, Buffer buf, int keysz, ScanKey itup_scankey, Size itemsize, BTItem btitem, BTItem afteritem); static void _bt_pgaddtup(Relation rel, Page page,
static bool _bt_goesonpg(Relation rel, Buffer buf, Size keysz, ScanKey scankey, BTItem afteritem); Size itemsize, BTItem btitem,
static void _bt_updateitem(Relation rel, Size keysz, Buffer buf, BTItem oldItem, BTItem newItem); OffsetNumber itup_off, const char *where);
static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum, int keysz, ScanKey scankey); static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
static int32 _bt_tuplecompare(Relation rel, Size keysz, ScanKey scankey, int keysz, ScanKey scankey);
IndexTuple tuple1, IndexTuple tuple2);
/* /*
* _bt_doinsert() -- Handle insertion of a single btitem in the tree. * _bt_doinsert() -- Handle insertion of a single btitem in the tree.
* *
* This routine is called by the public interface routines, btbuild * This routine is called by the public interface routines, btbuild
* and btinsert. By here, btitem is filled in, and has a unique * and btinsert. By here, btitem is filled in, including the TID.
* (xid, seqno) pair.
*/ */
InsertIndexResult InsertIndexResult
_bt_doinsert(Relation rel, BTItem btitem, bool index_is_unique, Relation heapRel) _bt_doinsert(Relation rel, BTItem btitem,
bool index_is_unique, Relation heapRel)
{ {
IndexTuple itup = &(btitem->bti_itup);
int natts = rel->rd_rel->relnatts;
ScanKey itup_scankey; ScanKey itup_scankey;
IndexTuple itup;
BTStack stack; BTStack stack;
Buffer buf; Buffer buf;
BlockNumber blkno;
int natts = rel->rd_rel->relnatts;
InsertIndexResult res; InsertIndexResult res;
Buffer buffer;
itup = &(btitem->bti_itup);
/* we need a scan key to do our search, so build one */ /* we need a scan key to do our search, so build one */
itup_scankey = _bt_mkscankey(rel, itup); itup_scankey = _bt_mkscankey(rel, itup);
top:
/* find the page containing this key */ /* find the page containing this key */
stack = _bt_search(rel, natts, itup_scankey, &buf); stack = _bt_search(rel, natts, itup_scankey, &buf, BT_WRITE);
/* trade in our read lock for a write lock */ /* trade in our read lock for a write lock */
LockBuffer(buf, BUFFER_LOCK_UNLOCK); LockBuffer(buf, BUFFER_LOCK_UNLOCK);
LockBuffer(buf, BT_WRITE); LockBuffer(buf, BT_WRITE);
l1:
/* /*
* If the page was split between the time that we surrendered our read * If the page was split between the time that we surrendered our read
* lock and acquired our write lock, then this page may no longer be * lock and acquired our write lock, then this page may no longer be
@@ -73,176 +96,212 @@ l1:
* need to move right in the tree. See Lehman and Yao for an * need to move right in the tree. See Lehman and Yao for an
* excruciatingly precise description. * excruciatingly precise description.
*/ */
buf = _bt_moveright(rel, buf, natts, itup_scankey, BT_WRITE); buf = _bt_moveright(rel, buf, natts, itup_scankey, BT_WRITE);
blkno = BufferGetBlockNumber(buf);
/* if we're not allowing duplicates, make sure the key isn't */ /*
/* already in the node */ * If we're not allowing duplicates, make sure the key isn't
* already in the index. XXX this belongs somewhere else, likely
*/
if (index_is_unique) if (index_is_unique)
{ {
OffsetNumber offset, TransactionId xwait;
maxoff;
Page page;
page = BufferGetPage(buf); xwait = _bt_check_unique(rel, btitem, heapRel, buf, itup_scankey);
maxoff = PageGetMaxOffsetNumber(page);
if (TransactionIdIsValid(xwait))
{
/* Have to wait for the other guy ... */
_bt_relbuf(rel, buf, BT_WRITE);
XactLockTableWait(xwait);
/* start over... */
_bt_freestack(stack);
goto top;
}
}
/* do the insertion */
res = _bt_insertonpg(rel, buf, stack, natts, itup_scankey, btitem, 0);
/* be tidy */
_bt_freestack(stack);
_bt_freeskey(itup_scankey);
return res;
}
/*
* _bt_check_unique() -- Check for violation of unique index constraint
*
* Returns NullTransactionId if there is no conflict, else an xact ID we
* must wait for to see if it commits a conflicting tuple. If an actual
* conflict is detected, no return --- just elog().
*/
static TransactionId
_bt_check_unique(Relation rel, BTItem btitem, Relation heapRel,
Buffer buf, ScanKey itup_scankey)
{
TupleDesc itupdesc = RelationGetDescr(rel);
int natts = rel->rd_rel->relnatts;
OffsetNumber offset,
maxoff;
Page page;
BTPageOpaque opaque;
Buffer nbuf = InvalidBuffer;
bool chtup = true;
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
maxoff = PageGetMaxOffsetNumber(page);
/*
* Find first item >= proposed new item. Note we could also get
* a pointer to end-of-page here.
*/
offset = _bt_binsrch(rel, buf, natts, itup_scankey);
offset = _bt_binsrch(rel, buf, natts, itup_scankey, BT_DESCENT); /*
* Scan over all equal tuples, looking for live conflicts.
*/
for (;;)
{
HeapTupleData htup;
Buffer buffer;
BTItem cbti;
BlockNumber nblkno;
/* make sure the offset we're given points to an actual */ /*
/* key on the page before trying to compare it */ * _bt_compare returns 0 for (1,NULL) and (1,NULL) - this's
if (!PageIsEmpty(page) && offset <= maxoff) * how we handling NULLs - and so we must not use _bt_compare
* in real comparison, but only for ordering/finding items on
* pages. - vadim 03/24/97
*
* make sure the offset points to an actual key
* before trying to compare it...
*/
if (offset <= maxoff)
{ {
TupleDesc itupdesc; if (! _bt_isequal(itupdesc, page, offset, natts, itup_scankey))
BTItem cbti; break; /* we're past all the equal tuples */
HeapTupleData htup;
BTPageOpaque opaque;
Buffer nbuf;
BlockNumber nblkno;
bool chtup = true;
itupdesc = RelationGetDescr(rel);
nbuf = InvalidBuffer;
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
/* /*
* _bt_compare returns 0 for (1,NULL) and (1,NULL) - this's * Have to check is inserted heap tuple deleted one (i.e.
* how we handling NULLs - and so we must not use _bt_compare * just moved to another place by vacuum)! We only need to
* in real comparison, but only for ordering/finding items on * do this once, but don't want to do it at all unless
* pages. - vadim 03/24/97 * we see equal tuples, so as not to slow down unequal case.
*
* while ( !_bt_compare (rel, itupdesc, page, natts,
* itup_scankey, offset) )
*/ */
while (_bt_isequal(itupdesc, page, offset, natts, itup_scankey)) if (chtup)
{ /* they're equal */ {
htup.t_self = btitem->bti_itup.t_tid;
/*
* Have to check is inserted heap tuple deleted one (i.e.
* just moved to another place by vacuum)!
*/
if (chtup)
{
htup.t_self = btitem->bti_itup.t_tid;
heap_fetch(heapRel, SnapshotDirty, &htup, &buffer);
if (htup.t_data == NULL) /* YES! */
break;
/* Live tuple was inserted */
ReleaseBuffer(buffer);
chtup = false;
}
cbti = (BTItem) PageGetItem(page, PageGetItemId(page, offset));
htup.t_self = cbti->bti_itup.t_tid;
heap_fetch(heapRel, SnapshotDirty, &htup, &buffer); heap_fetch(heapRel, SnapshotDirty, &htup, &buffer);
if (htup.t_data != NULL) /* it is a duplicate */ if (htup.t_data == NULL) /* YES! */
{ break;
TransactionId xwait = /* Live tuple is being inserted, so continue checking */
ReleaseBuffer(buffer);
chtup = false;
}
cbti = (BTItem) PageGetItem(page, PageGetItemId(page, offset));
htup.t_self = cbti->bti_itup.t_tid;
heap_fetch(heapRel, SnapshotDirty, &htup, &buffer);
if (htup.t_data != NULL) /* it is a duplicate */
{
TransactionId xwait =
(TransactionIdIsValid(SnapshotDirty->xmin)) ? (TransactionIdIsValid(SnapshotDirty->xmin)) ?
SnapshotDirty->xmin : SnapshotDirty->xmax; SnapshotDirty->xmin : SnapshotDirty->xmax;
/* /*
* If this tuple is being updated by other transaction * If this tuple is being updated by other transaction
* then we have to wait for its commit/abort. * then we have to wait for its commit/abort.
*/ */
ReleaseBuffer(buffer); ReleaseBuffer(buffer);
if (TransactionIdIsValid(xwait)) if (TransactionIdIsValid(xwait))
{ {
if (nbuf != InvalidBuffer)
_bt_relbuf(rel, nbuf, BT_READ);
_bt_relbuf(rel, buf, BT_WRITE);
XactLockTableWait(xwait);
buf = _bt_getbuf(rel, blkno, BT_WRITE);
goto l1;/* continue from the begin */
}
elog(ERROR, "Cannot insert a duplicate key into unique index %s", RelationGetRelationName(rel));
}
/* htup null so no buffer to release */
/* get next offnum */
if (offset < maxoff)
offset = OffsetNumberNext(offset);
else
{ /* move right ? */
if (P_RIGHTMOST(opaque))
break;
if (!_bt_isequal(itupdesc, page, P_HIKEY,
natts, itup_scankey))
break;
/*
* min key of the right page is the same, ooh - so
* many dead duplicates...
*/
nblkno = opaque->btpo_next;
if (nbuf != InvalidBuffer) if (nbuf != InvalidBuffer)
_bt_relbuf(rel, nbuf, BT_READ); _bt_relbuf(rel, nbuf, BT_READ);
for (nbuf = InvalidBuffer;;) /* Tell _bt_doinsert to wait... */
{ return xwait;
nbuf = _bt_getbuf(rel, nblkno, BT_READ);
page = BufferGetPage(nbuf);
maxoff = PageGetMaxOffsetNumber(page);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
offset = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
if (!PageIsEmpty(page) && offset <= maxoff)
{ /* Found some key */
break;
}
else
{ /* Empty or "pseudo"-empty page - get next */
nblkno = opaque->btpo_next;
_bt_relbuf(rel, nbuf, BT_READ);
nbuf = InvalidBuffer;
if (nblkno == P_NONE)
break;
}
}
if (nbuf == InvalidBuffer)
break;
} }
/*
* Otherwise we have a definite conflict.
*/
elog(ERROR, "Cannot insert a duplicate key into unique index %s",
RelationGetRelationName(rel));
} }
/* htup null so no buffer to release */
}
/*
* Advance to next tuple to continue checking.
*/
if (offset < maxoff)
offset = OffsetNumberNext(offset);
else
{
/* If scankey == hikey we gotta check the next page too */
if (P_RIGHTMOST(opaque))
break;
if (!_bt_isequal(itupdesc, page, P_HIKEY,
natts, itup_scankey))
break;
nblkno = opaque->btpo_next;
if (nbuf != InvalidBuffer) if (nbuf != InvalidBuffer)
_bt_relbuf(rel, nbuf, BT_READ); _bt_relbuf(rel, nbuf, BT_READ);
nbuf = _bt_getbuf(rel, nblkno, BT_READ);
page = BufferGetPage(nbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
maxoff = PageGetMaxOffsetNumber(page);
offset = P_FIRSTDATAKEY(opaque);
} }
} }
/* do the insertion */ if (nbuf != InvalidBuffer)
res = _bt_insertonpg(rel, buf, stack, natts, itup_scankey, _bt_relbuf(rel, nbuf, BT_READ);
btitem, (BTItem) NULL);
/* be tidy */ return NullTransactionId;
_bt_freestack(stack);
_bt_freeskey(itup_scankey);
return res;
} }
/* /*----------
* _bt_insertonpg() -- Insert a tuple on a particular page in the index. * _bt_insertonpg() -- Insert a tuple on a particular page in the index.
* *
* This recursive procedure does the following things: * This recursive procedure does the following things:
* *
* + if necessary, splits the target page. * + finds the right place to insert the tuple.
* + finds the right place to insert the tuple (taking into * + if necessary, splits the target page (making sure that the
* account any changes induced by a split). * split is equitable as far as post-insert free space goes).
* + inserts the tuple. * + inserts the tuple.
* + if the page was split, pops the parent stack, and finds the * + if the page was split, pops the parent stack, and finds the
* right place to insert the new child pointer (by walking * right place to insert the new child pointer (by walking
* right using information stored in the parent stack). * right using information stored in the parent stack).
* + invoking itself with the appropriate tuple for the right * + invokes itself with the appropriate tuple for the right
* child page on the parent. * child page on the parent.
* *
* On entry, we must have the right buffer on which to do the * On entry, we must have the right buffer on which to do the
* insertion, and the buffer must be pinned and locked. On return, * insertion, and the buffer must be pinned and locked. On return,
* we will have dropped both the pin and the write lock on the buffer. * we will have dropped both the pin and the write lock on the buffer.
* *
* If 'afteritem' is >0 then the new tuple must be inserted after the
* existing item of that number, noplace else. If 'afteritem' is 0
* then the procedure finds the exact spot to insert it by searching.
* (keysz and scankey parameters are used ONLY if afteritem == 0.)
*
* NOTE: if the new key is equal to one or more existing keys, we can
* legitimately place it anywhere in the series of equal keys --- in fact,
* if the new key is equal to the page's "high key" we can place it on
* the next page. If it is equal to the high key, and there's not room
* to insert the new tuple on the current page without splitting, then
* we move right hoping to find more free space and avoid a split.
* Ordinarily, though, we'll insert it before the existing equal keys
* because of the way _bt_binsrch() works.
*
* The locking interactions in this code are critical. You should * The locking interactions in this code are critical. You should
* grok Lehman and Yao's paper before making any changes. In addition, * grok Lehman and Yao's paper before making any changes. In addition,
* you need to understand how we disambiguate duplicate keys in this * you need to understand how we disambiguate duplicate keys in this
* implementation, in order to be able to find our location using * implementation, in order to be able to find our location using
* L&Y "move right" operations. Since we may insert duplicate user * L&Y "move right" operations. Since we may insert duplicate user
* keys, and since these dups may propogate up the tree, we use the * keys, and since these dups may propagate up the tree, we use the
* 'afteritem' parameter to position ourselves correctly for the * 'afteritem' parameter to position ourselves correctly for the
* insertion on internal pages. * insertion on internal pages.
*----------
*/ */
static InsertIndexResult static InsertIndexResult
_bt_insertonpg(Relation rel, _bt_insertonpg(Relation rel,
@@ -251,17 +310,16 @@ _bt_insertonpg(Relation rel,
int keysz, int keysz,
ScanKey scankey, ScanKey scankey,
BTItem btitem, BTItem btitem,
BTItem afteritem) OffsetNumber afteritem)
{ {
InsertIndexResult res; InsertIndexResult res;
Page page; Page page;
BTPageOpaque lpageop; BTPageOpaque lpageop;
BlockNumber itup_blkno;
OffsetNumber itup_off; OffsetNumber itup_off;
BlockNumber itup_blkno;
OffsetNumber newitemoff;
OffsetNumber firstright = InvalidOffsetNumber; OffsetNumber firstright = InvalidOffsetNumber;
Size itemsz; Size itemsz;
bool do_split = false;
bool keys_equal = false;
page = BufferGetPage(buf); page = BufferGetPage(buf);
lpageop = (BTPageOpaque) PageGetSpecialPointer(page); lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -285,355 +343,117 @@ _bt_insertonpg(Relation rel,
(PageGetPageSize(page) - sizeof(PageHeaderData) - MAXALIGN(sizeof(BTPageOpaqueData))) /3 - sizeof(ItemIdData)); (PageGetPageSize(page) - sizeof(PageHeaderData) - MAXALIGN(sizeof(BTPageOpaqueData))) /3 - sizeof(ItemIdData));
/* /*
* If we have to insert item on the leftmost page which is the first * Determine exactly where new item will go.
* page in the chain of duplicates then: 1. if scankey == hikey (i.e.
* - new duplicate item) then insert it here; 2. if scankey < hikey
* then: 2.a if there is duplicate key(s) here - we force splitting;
* 2.b else - we may "eat" this page from duplicates chain.
*/ */
if (lpageop->btpo_flags & BTP_CHAIN) if (afteritem > 0)
{ {
OffsetNumber maxoff = PageGetMaxOffsetNumber(page); newitemoff = afteritem + 1;
ItemId hitemid;
BTItem hitem;
Assert(!P_RIGHTMOST(lpageop));
hitemid = PageGetItemId(page, P_HIKEY);
hitem = (BTItem) PageGetItem(page, hitemid);
if (maxoff > P_HIKEY &&
!_bt_itemcmp(rel, keysz, scankey, hitem,
(BTItem) PageGetItem(page, PageGetItemId(page, P_FIRSTKEY)),
BTEqualStrategyNumber))
elog(FATAL, "btree: bad key on the page in the chain of duplicates");
if (!_bt_skeycmp(rel, keysz, scankey, page, hitemid,
BTEqualStrategyNumber))
{
if (!P_LEFTMOST(lpageop))
elog(FATAL, "btree: attempt to insert bad key on the non-leftmost page in the chain of duplicates");
if (!_bt_skeycmp(rel, keysz, scankey, page, hitemid,
BTLessStrategyNumber))
elog(FATAL, "btree: attempt to insert higher key on the leftmost page in the chain of duplicates");
if (maxoff > P_HIKEY) /* have duplicate(s) */
{
firstright = P_FIRSTKEY;
do_split = true;
}
else
/* "eat" page */
{
Buffer pbuf;
Page ppage;
itup_blkno = BufferGetBlockNumber(buf);
itup_off = PageAddItem(page, (Item) btitem, itemsz,
P_FIRSTKEY, LP_USED);
if (itup_off == InvalidOffsetNumber)
elog(FATAL, "btree: failed to add item");
lpageop->btpo_flags &= ~BTP_CHAIN;
pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
ppage = BufferGetPage(pbuf);
PageIndexTupleDelete(ppage, stack->bts_offset);
pfree(stack->bts_btitem);
stack->bts_btitem = _bt_formitem(&(btitem->bti_itup));
ItemPointerSet(&(stack->bts_btitem->bti_itup.t_tid),
itup_blkno, P_HIKEY);
_bt_wrtbuf(rel, buf);
res = _bt_insertonpg(rel, pbuf, stack->bts_parent,
keysz, scankey, stack->bts_btitem,
NULL);
ItemPointerSet(&(res->pointerData), itup_blkno, itup_off);
return res;
}
}
else
{
keys_equal = true;
if (PageGetFreeSpace(page) < itemsz)
do_split = true;
}
} }
else if (PageGetFreeSpace(page) < itemsz) else
do_split = true;
else if (PageGetFreeSpace(page) < 3 * itemsz + 2 * sizeof(ItemIdData))
{
OffsetNumber offnum = (P_RIGHTMOST(lpageop)) ? P_HIKEY : P_FIRSTKEY;
OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
if (offnum < maxoff) /* can't split unless at least 2 items... */
{
ItemId itid;
BTItem previtem,
chkitem;
Size maxsize;
Size currsize;
/* find largest group of identically-keyed items on page */
itid = PageGetItemId(page, offnum);
previtem = (BTItem) PageGetItem(page, itid);
maxsize = currsize = (ItemIdGetLength(itid) + sizeof(ItemIdData));
for (offnum = OffsetNumberNext(offnum);
offnum <= maxoff; offnum = OffsetNumberNext(offnum))
{
itid = PageGetItemId(page, offnum);
chkitem = (BTItem) PageGetItem(page, itid);
if (!_bt_itemcmp(rel, keysz, scankey,
previtem, chkitem,
BTEqualStrategyNumber))
{
if (currsize > maxsize)
maxsize = currsize;
currsize = 0;
previtem = chkitem;
}
currsize += (ItemIdGetLength(itid) + sizeof(ItemIdData));
}
if (currsize > maxsize)
maxsize = currsize;
/* Decide to split if largest group is > 1/2 page size */
maxsize += sizeof(PageHeaderData) +
MAXALIGN(sizeof(BTPageOpaqueData));
if (maxsize >= PageGetPageSize(page) / 2)
do_split = true;
}
}
if (do_split)
{ {
Buffer rbuf;
Page rpage;
BTItem ritem;
BlockNumber rbknum;
BTPageOpaque rpageop;
Buffer pbuf;
Page ppage;
BTPageOpaque ppageop;
BlockNumber bknum = BufferGetBlockNumber(buf);
BTItem lowLeftItem;
OffsetNumber maxoff;
bool shifted = false;
bool left_chained = (lpageop->btpo_flags & BTP_CHAIN) ? true : false;
bool is_root = lpageop->btpo_flags & BTP_ROOT;
/* /*
* Instead of splitting leaf page in the chain of duplicates by * If we will need to split the page to put the item here,
* new duplicate, insert it into some right page. * check whether we can put the tuple somewhere to the right,
* instead. Keep scanning until we find enough free space or
* reach the last page where the tuple can legally go.
*/ */
if ((lpageop->btpo_flags & BTP_CHAIN) && while (PageGetFreeSpace(page) < itemsz &&
(lpageop->btpo_flags & BTP_LEAF) && keys_equal) !P_RIGHTMOST(lpageop) &&
_bt_compare(rel, keysz, scankey, page, P_HIKEY) == 0)
{ {
rbuf = _bt_getbuf(rel, lpageop->btpo_next, BT_WRITE); /* step right one page */
rpage = BufferGetPage(rbuf); BlockNumber rblkno = lpageop->btpo_next;
rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
/*
* some checks
*/
if (!P_RIGHTMOST(rpageop)) /* non-rightmost page */
{ /* If we have the same hikey here then
* it's yet another page in chain. */
if (_bt_skeycmp(rel, keysz, scankey, rpage,
PageGetItemId(rpage, P_HIKEY),
BTEqualStrategyNumber))
{
if (!(rpageop->btpo_flags & BTP_CHAIN))
elog(FATAL, "btree: lost page in the chain of duplicates");
}
else if (_bt_skeycmp(rel, keysz, scankey, rpage,
PageGetItemId(rpage, P_HIKEY),
BTGreaterStrategyNumber))
elog(FATAL, "btree: hikey is out of order");
else if (rpageop->btpo_flags & BTP_CHAIN)
/*
* If hikey > scankey then it's last page in chain and
* BTP_CHAIN must be OFF
*/
elog(FATAL, "btree: lost last page in the chain of duplicates");
}
else
/* rightmost page */
Assert(!(rpageop->btpo_flags & BTP_CHAIN));
_bt_relbuf(rel, buf, BT_WRITE); _bt_relbuf(rel, buf, BT_WRITE);
return (_bt_insertonpg(rel, rbuf, stack, keysz, buf = _bt_getbuf(rel, rblkno, BT_WRITE);
scankey, btitem, afteritem)); page = BufferGetPage(buf);
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
} }
/* /*
* If after splitting un-chained page we'll got chain of pages * This is it, so find the position...
* with duplicates then we want to know 1. on which of two pages
* new btitem will go (current _bt_findsplitloc is quite bad); 2.
* what parent (if there's one) thinking about it (remember about
* deletions)
*/ */
else if (!(lpageop->btpo_flags & BTP_CHAIN)) newitemoff = _bt_binsrch(rel, buf, keysz, scankey);
{ }
OffsetNumber start = (P_RIGHTMOST(lpageop)) ? P_HIKEY : P_FIRSTKEY;
Size llimit;
maxoff = PageGetMaxOffsetNumber(page);
llimit = PageGetPageSize(page) - sizeof(PageHeaderData) -
MAXALIGN(sizeof(BTPageOpaqueData))
+sizeof(ItemIdData);
llimit /= 2;
firstright = _bt_findsplitloc(rel, keysz, scankey,
page, start, maxoff, llimit);
if (_bt_itemcmp(rel, keysz, scankey,
(BTItem) PageGetItem(page, PageGetItemId(page, start)),
(BTItem) PageGetItem(page, PageGetItemId(page, firstright)),
BTEqualStrategyNumber))
{
if (_bt_skeycmp(rel, keysz, scankey, page,
PageGetItemId(page, firstright),
BTLessStrategyNumber))
/*
* force moving current items to the new page: new
* item will go on the current page.
*/
firstright = start;
else
/*
* new btitem >= firstright, start item == firstright
* - new chain of duplicates: if this non-leftmost
* leaf page and parent item < start item then force
* moving all items to the new page - current page
* will be "empty" after it.
*/
{
if (!P_LEFTMOST(lpageop) &&
(lpageop->btpo_flags & BTP_LEAF))
{
ItemPointerSet(&(stack->bts_btitem->bti_itup.t_tid),
bknum, P_HIKEY);
pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
if (_bt_itemcmp(rel, keysz, scankey,
stack->bts_btitem,
(BTItem) PageGetItem(page,
PageGetItemId(page, start)),
BTLessStrategyNumber))
{
firstright = start;
shifted = true;
}
_bt_relbuf(rel, pbuf, BT_WRITE);
}
}
} /* else - no new chain if start item <
* firstright one */
}
/* split the buffer into left and right halves */ /*
rbuf = _bt_split(rel, keysz, scankey, buf, firstright); * Do we need to split the page to fit the item on it?
*/
if (PageGetFreeSpace(page) < itemsz)
{
Buffer rbuf;
BlockNumber bknum = BufferGetBlockNumber(buf);
BlockNumber rbknum;
bool is_root = P_ISROOT(lpageop);
bool newitemonleft;
/* which new page (left half or right half) gets the tuple? */ /* Choose the split point */
if (_bt_goesonpg(rel, buf, keysz, scankey, afteritem)) firstright = _bt_findsplitloc(rel, page,
{ newitemoff, itemsz,
/* left page */ &newitemonleft);
itup_off = _bt_pgaddtup(rel, buf, keysz, scankey,
itemsz, btitem, afteritem);
itup_blkno = BufferGetBlockNumber(buf);
}
else
{
/* right page */
itup_off = _bt_pgaddtup(rel, rbuf, keysz, scankey,
itemsz, btitem, afteritem);
itup_blkno = BufferGetBlockNumber(rbuf);
}
maxoff = PageGetMaxOffsetNumber(page); /* split the buffer into left and right halves */
if (shifted) rbuf = _bt_split(rel, buf, firstright,
{ newitemoff, itemsz, btitem, newitemonleft,
if (maxoff > P_FIRSTKEY) &itup_off, &itup_blkno);
elog(FATAL, "btree: shifted page is not empty");
lowLeftItem = (BTItem) NULL;
}
else
{
if (maxoff < P_FIRSTKEY)
elog(FATAL, "btree: un-shifted page is empty");
lowLeftItem = (BTItem) PageGetItem(page,
PageGetItemId(page, P_FIRSTKEY));
if (_bt_itemcmp(rel, keysz, scankey, lowLeftItem,
(BTItem) PageGetItem(page, PageGetItemId(page, P_HIKEY)),
BTEqualStrategyNumber))
lpageop->btpo_flags |= BTP_CHAIN;
}
/* /*----------
* By here, * By here,
* *
* + our target page has been split; + the original tuple has been * + our target page has been split;
* inserted; + we have write locks on both the old (left half) * + the original tuple has been inserted;
* and new (right half) buffers, after the split; and + we have * + we have write locks on both the old (left half)
* the key we want to insert into the parent. * and new (right half) buffers, after the split; and
* + we know the key we want to insert into the parent
* (it's the "high key" on the left child page).
*
* We're ready to do the parent insertion. We need to hold onto the
* locks for the child pages until we locate the parent, but we can
* release them before doing the actual insertion (see Lehman and Yao
* for the reasoning).
* *
* Do the parent insertion. We need to hold onto the locks for the * Here we have to do something Lehman and Yao don't talk about:
* child pages until we locate the parent, but we can release them * deal with a root split and construction of a new root. If our
* before doing the actual insertion (see Lehman and Yao for the * stack is empty then we have just split a node on what had been
* reasoning). * the root level when we descended the tree. If it is still the
* root then we perform a new-root construction. If it *wasn't*
* the root anymore, use the parent pointer to get up to the root
* level that someone constructed meanwhile, and find the right
* place to insert as for the normal case.
*----------
*/ */
l_spl: ; if (is_root)
if (stack == (BTStack) NULL)
{ {
if (!is_root) /* if this page was not root page */ Assert(stack == (BTStack) NULL);
{
elog(DEBUG, "btree: concurrent ROOT page split");
stack = (BTStack) palloc(sizeof(BTStackData));
stack->bts_blkno = lpageop->btpo_parent;
stack->bts_offset = InvalidOffsetNumber;
stack->bts_btitem = (BTItem) palloc(sizeof(BTItemData));
/* bts_btitem will be initialized below */
stack->bts_parent = NULL;
goto l_spl;
}
/* create a new root node and release the split buffers */ /* create a new root node and release the split buffers */
_bt_newroot(rel, buf, rbuf); _bt_newroot(rel, buf, rbuf);
} }
else else
{ {
ScanKey newskey;
InsertIndexResult newres; InsertIndexResult newres;
BTItem new_item; BTItem new_item;
OffsetNumber upditem_offset = P_HIKEY; BTStackData fakestack;
bool do_update = false; BTItem ritem;
bool update_in_place = true; Buffer pbuf;
bool parent_chained;
/* form a index tuple that points at the new right page */ /* Set up a phony stack entry if we haven't got a real one */
rbknum = BufferGetBlockNumber(rbuf); if (stack == (BTStack) NULL)
rpage = BufferGetPage(rbuf);
rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
/*
* By convention, the first entry (1) on every non-rightmost
* page is the high key for that page. In order to get the
* lowest key on the new right page, we actually look at its
* second (2) entry.
*/
if (!P_RIGHTMOST(rpageop))
{ {
ritem = (BTItem) PageGetItem(rpage, elog(DEBUG, "btree: concurrent ROOT page split");
PageGetItemId(rpage, P_FIRSTKEY)); stack = &fakestack;
if (_bt_itemcmp(rel, keysz, scankey, stack->bts_blkno = lpageop->btpo_parent;
ritem, stack->bts_offset = InvalidOffsetNumber;
(BTItem) PageGetItem(rpage, /* bts_btitem will be initialized below */
PageGetItemId(rpage, P_HIKEY)), stack->bts_parent = NULL;
BTEqualStrategyNumber))
rpageop->btpo_flags |= BTP_CHAIN;
} }
else
ritem = (BTItem) PageGetItem(rpage,
PageGetItemId(rpage, P_HIKEY));
/* get a unique btitem for this key */ /* get high key from left page == lowest key on new right page */
new_item = _bt_formitem(&(ritem->bti_itup)); ritem = (BTItem) PageGetItem(page,
PageGetItemId(page, P_HIKEY));
/* form an index tuple that points at the new right page */
new_item = _bt_formitem(&(ritem->bti_itup));
rbknum = BufferGetBlockNumber(rbuf);
ItemPointerSet(&(new_item->bti_itup.t_tid), rbknum, P_HIKEY); ItemPointerSet(&(new_item->bti_itup.t_tid), rbknum, P_HIKEY);
/* /*
...@@ -642,192 +462,39 @@ l_spl: ; ...@@ -642,192 +462,39 @@ l_spl: ;
* Oops - if we were moved right then we need to change stack * Oops - if we were moved right then we need to change stack
* item! We want to find parent pointing to where we are, * item! We want to find parent pointing to where we are,
* right ? - vadim 05/27/97 * right ? - vadim 05/27/97
*/
ItemPointerSet(&(stack->bts_btitem->bti_itup.t_tid),
bknum, P_HIKEY);
pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
ppage = BufferGetPage(pbuf);
ppageop = (BTPageOpaque) PageGetSpecialPointer(ppage);
parent_chained = ((ppageop->btpo_flags & BTP_CHAIN)) ? true : false;
if (parent_chained && !left_chained)
elog(FATAL, "nbtree: unexpected chained parent of unchained page");
/*
* If the key of new_item is < than the key of the item in the
* parent page pointing to the left page (stack->bts_btitem),
* we have to update the latter key; otherwise the keys on the
* parent page wouldn't be monotonically increasing after we
* inserted the new pointer to the right page (new_item). This
* only happens if our left page is the leftmost page and a
* new minimum key had been inserted before, which is not
* reflected in the parent page but didn't matter so far. If
* there are duplicate keys and this new minimum key spills
* over to our new right page, we get an inconsistency if we
* don't update the left key in the parent page.
* *
* Also, new duplicates handling code require us to update parent * Interestingly, this means we didn't *really* need to stack
* item if some smaller items left on the left page (which is * the parent key at all; all we really care about is the
* possible in splitting leftmost page) and current parent * saved block and offset as a starting point for our search...
* item == new_item. - vadim 05/27/97
*/ */
if (_bt_itemcmp(rel, keysz, scankey, ItemPointerSet(&(stack->bts_btitem.bti_itup.t_tid),
stack->bts_btitem, new_item, bknum, P_HIKEY);
BTGreaterStrategyNumber) ||
(!shifted &&
_bt_itemcmp(rel, keysz, scankey,
stack->bts_btitem, new_item,
BTEqualStrategyNumber) &&
_bt_itemcmp(rel, keysz, scankey,
lowLeftItem, new_item,
BTLessStrategyNumber)))
{
do_update = true;
/*
* figure out which key is leftmost (if the parent page is
* rightmost, too, it must be the root)
*/
if (P_RIGHTMOST(ppageop))
upditem_offset = P_HIKEY;
else
upditem_offset = P_FIRSTKEY;
if (!P_LEFTMOST(lpageop) ||
stack->bts_offset != upditem_offset)
elog(FATAL, "btree: items are out of order (leftmost %d, stack %u, update %u)",
P_LEFTMOST(lpageop), stack->bts_offset, upditem_offset);
}
if (do_update)
{
if (shifted)
elog(FATAL, "btree: attempt to update parent for shifted page");
/*
* Try to update in place. If out parent page is chained
* then we must forse insertion.
*/
if (!parent_chained &&
MAXALIGN(IndexTupleDSize(lowLeftItem->bti_itup)) ==
MAXALIGN(IndexTupleDSize(stack->bts_btitem->bti_itup)))
{
_bt_updateitem(rel, keysz, pbuf,
stack->bts_btitem, lowLeftItem);
_bt_wrtbuf(rel, buf);
_bt_wrtbuf(rel, rbuf);
}
else
{
update_in_place = false;
PageIndexTupleDelete(ppage, upditem_offset);
/*
* don't write anything out yet--we still have the
* write lock, and now we call another _bt_insertonpg
* to insert the correct key. First, make a new item,
* using the tuple data from lowLeftItem. Point it to
* the left child. Update it on the stack at the same
* time.
*/
pfree(stack->bts_btitem);
stack->bts_btitem = _bt_formitem(&(lowLeftItem->bti_itup));
ItemPointerSet(&(stack->bts_btitem->bti_itup.t_tid),
bknum, P_HIKEY);
/*
* Unlock the children before doing this
*/
_bt_wrtbuf(rel, buf);
_bt_wrtbuf(rel, rbuf);
/*
* A regular _bt_binsrch should find the right place
* to put the new entry, since it should be lower than
* any other key on the page. Therefore set afteritem
* to NULL.
*/
newskey = _bt_mkscankey(rel, &(stack->bts_btitem->bti_itup));
newres = _bt_insertonpg(rel, pbuf, stack->bts_parent,
keysz, newskey, stack->bts_btitem,
NULL);
pfree(newres);
pfree(newskey);
/*
* we have now lost our lock on the parent buffer, and
* need to get it back.
*/
pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
}
}
else
{
_bt_wrtbuf(rel, buf);
_bt_wrtbuf(rel, rbuf);
}
newskey = _bt_mkscankey(rel, &(new_item->bti_itup)); pbuf = _bt_getstackbuf(rel, stack);
afteritem = stack->bts_btitem; /* Now we can write and unlock the children */
if (parent_chained && !update_in_place) _bt_wrtbuf(rel, rbuf);
{ _bt_wrtbuf(rel, buf);
ppage = BufferGetPage(pbuf);
ppageop = (BTPageOpaque) PageGetSpecialPointer(ppage);
if (ppageop->btpo_flags & BTP_CHAIN)
elog(FATAL, "btree: unexpected BTP_CHAIN flag in parent after update");
if (P_RIGHTMOST(ppageop))
elog(FATAL, "btree: chained parent is RIGHTMOST after update");
maxoff = PageGetMaxOffsetNumber(ppage);
if (maxoff != P_FIRSTKEY)
elog(FATAL, "btree: FIRSTKEY was unexpected in parent after update");
if (_bt_skeycmp(rel, keysz, newskey, ppage,
PageGetItemId(ppage, P_FIRSTKEY),
BTLessEqualStrategyNumber))
elog(FATAL, "btree: parent FIRSTKEY is >= duplicate key after update");
if (!_bt_skeycmp(rel, keysz, newskey, ppage,
PageGetItemId(ppage, P_HIKEY),
BTEqualStrategyNumber))
elog(FATAL, "btree: parent HIGHKEY is not equal duplicate key after update");
afteritem = (BTItem) NULL;
}
else if (left_chained && !update_in_place)
{
ppage = BufferGetPage(pbuf);
ppageop = (BTPageOpaque) PageGetSpecialPointer(ppage);
if (!P_RIGHTMOST(ppageop) &&
_bt_skeycmp(rel, keysz, newskey, ppage,
PageGetItemId(ppage, P_HIKEY),
BTGreaterStrategyNumber))
afteritem = (BTItem) NULL;
}
if (afteritem == (BTItem) NULL)
{
rbuf = _bt_getbuf(rel, ppageop->btpo_next, BT_WRITE);
_bt_relbuf(rel, pbuf, BT_WRITE);
pbuf = rbuf;
}
/* Recursively update the parent */
newres = _bt_insertonpg(rel, pbuf, stack->bts_parent, newres = _bt_insertonpg(rel, pbuf, stack->bts_parent,
keysz, newskey, new_item, 0, NULL, new_item, stack->bts_offset);
afteritem);
/* be tidy */ /* be tidy */
pfree(newres); pfree(newres);
pfree(newskey);
pfree(new_item); pfree(new_item);
} }
} }
else else
{ {
itup_off = _bt_pgaddtup(rel, buf, keysz, scankey, _bt_pgaddtup(rel, page, itemsz, btitem, newitemoff, "page");
itemsz, btitem, afteritem); itup_off = newitemoff;
itup_blkno = BufferGetBlockNumber(buf); itup_blkno = BufferGetBlockNumber(buf);
/* Write out the updated page and release pin/lock */
_bt_relbuf(rel, buf, BT_WRITE); _bt_wrtbuf(rel, buf);
} }
/* by here, the new tuple is inserted */ /* by here, the new tuple is inserted at itup_blkno/itup_off */
res = (InsertIndexResult) palloc(sizeof(InsertIndexResultData)); res = (InsertIndexResult) palloc(sizeof(InsertIndexResultData));
ItemPointerSet(&(res->pointerData), itup_blkno, itup_off); ItemPointerSet(&(res->pointerData), itup_blkno, itup_off);
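The new "step right" loop in _bt_insertonpg can be illustrated with a toy model. This is only a sketch under stated assumptions: ToyPage and toy_choose_page are hypothetical names, keys are plain ints, and locking and buffer management are omitted entirely. It shows just the loop condition: keep moving right while the item does not fit, a right sibling exists, and the new key equals the current page's high key.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a btree page; not a Postgres type. */
typedef struct ToyPage
{
    size_t free_space;  /* bytes still available on the page */
    int    high_key;    /* upper bound on keys; ignored when rightmost */
    bool   rightmost;   /* true if the page has no right sibling */
} ToyPage;

/*
 * Starting at pages[start], step right while the item does not fit, a right
 * sibling exists, and the new key equals the page's high key (so the key may
 * legally live further right).  Returns the index of the chosen page; the
 * caller must still split that page if the item does not fit on it.
 */
static size_t
toy_choose_page(const ToyPage *pages, size_t npages, size_t start,
                int key, size_t itemsz)
{
    size_t cur = start;

    assert(start < npages);
    while (pages[cur].free_space < itemsz &&
           !pages[cur].rightmost &&
           key == pages[cur].high_key)
    {
        assert(cur + 1 < npages);
        cur++;                  /* step right one page */
    }
    return cur;
}
```

In this model a key smaller than the high key never moves right, so a full page in that case is simply returned and has to be split, matching the "Do we need to split the page" test that follows the loop in the patch.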
...@@ -838,12 +505,19 @@ l_spl: ; ...@@ -838,12 +505,19 @@ l_spl: ;
* _bt_split() -- split a page in the btree. * _bt_split() -- split a page in the btree.
* *
* On entry, buf is the page to split, and is write-locked and pinned. * On entry, buf is the page to split, and is write-locked and pinned.
* Returns the new right sibling of buf, pinned and write-locked. The * firstright is the item index of the first item to be moved to the
* pin and lock on buf are maintained. * new right page. newitemoff etc. tell us about the new item that
* must be inserted along with the data from the old page.
*
* Returns the new right sibling of buf, pinned and write-locked.
* The pin and lock on buf are maintained. *itup_off and *itup_blkno
* are set to the exact location where newitem was inserted.
*/ */
static Buffer static Buffer
_bt_split(Relation rel, Size keysz, ScanKey scankey, _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
Buffer buf, OffsetNumber firstright) OffsetNumber newitemoff, Size newitemsz, BTItem newitem,
bool newitemonleft,
OffsetNumber *itup_off, BlockNumber *itup_blkno)
{ {
Buffer rbuf; Buffer rbuf;
Page origpage; Page origpage;
...@@ -860,7 +534,6 @@ _bt_split(Relation rel, Size keysz, ScanKey scankey, ...@@ -860,7 +534,6 @@ _bt_split(Relation rel, Size keysz, ScanKey scankey,
BTItem item; BTItem item;
OffsetNumber leftoff, OffsetNumber leftoff,
rightoff; rightoff;
OffsetNumber start;
OffsetNumber maxoff; OffsetNumber maxoff;
OffsetNumber i; OffsetNumber i;
...@@ -869,8 +542,8 @@ _bt_split(Relation rel, Size keysz, ScanKey scankey, ...@@ -869,8 +542,8 @@ _bt_split(Relation rel, Size keysz, ScanKey scankey,
leftpage = PageGetTempPage(origpage, sizeof(BTPageOpaqueData)); leftpage = PageGetTempPage(origpage, sizeof(BTPageOpaqueData));
rightpage = BufferGetPage(rbuf); rightpage = BufferGetPage(rbuf);
_bt_pageinit(rightpage, BufferGetPageSize(rbuf));
_bt_pageinit(leftpage, BufferGetPageSize(buf)); _bt_pageinit(leftpage, BufferGetPageSize(buf));
_bt_pageinit(rightpage, BufferGetPageSize(rbuf));
/* init btree private data */ /* init btree private data */
oopaque = (BTPageOpaque) PageGetSpecialPointer(origpage); oopaque = (BTPageOpaque) PageGetSpecialPointer(origpage);
...@@ -879,106 +552,130 @@ _bt_split(Relation rel, Size keysz, ScanKey scankey, ...@@ -879,106 +552,130 @@ _bt_split(Relation rel, Size keysz, ScanKey scankey,
/* if we're splitting this page, it won't be the root when we're done */ /* if we're splitting this page, it won't be the root when we're done */
oopaque->btpo_flags &= ~BTP_ROOT; oopaque->btpo_flags &= ~BTP_ROOT;
oopaque->btpo_flags &= ~BTP_CHAIN;
lopaque->btpo_flags = ropaque->btpo_flags = oopaque->btpo_flags; lopaque->btpo_flags = ropaque->btpo_flags = oopaque->btpo_flags;
lopaque->btpo_prev = oopaque->btpo_prev; lopaque->btpo_prev = oopaque->btpo_prev;
ropaque->btpo_prev = BufferGetBlockNumber(buf);
lopaque->btpo_next = BufferGetBlockNumber(rbuf); lopaque->btpo_next = BufferGetBlockNumber(rbuf);
ropaque->btpo_prev = BufferGetBlockNumber(buf);
ropaque->btpo_next = oopaque->btpo_next; ropaque->btpo_next = oopaque->btpo_next;
/*
* Must copy the original parent link into both new pages, even though
* it might be quite obsolete by now. We might need it if this level
* is or recently was the root (see README).
*/
lopaque->btpo_parent = ropaque->btpo_parent = oopaque->btpo_parent; lopaque->btpo_parent = ropaque->btpo_parent = oopaque->btpo_parent;
/* /*
* If the page we're splitting is not the rightmost page at its level * If the page we're splitting is not the rightmost page at its level
* in the tree, then the first (0) entry on the page is the high key * in the tree, then the first entry on the page is the high key
* for the page. We need to copy that to the right half. Otherwise * for the page. We need to copy that to the right half. Otherwise
* (meaning the rightmost page case), we should treat the line * (meaning the rightmost page case), all the items on the right half
* pointers beginning at zero as user data. * will be user data.
*
* We leave a blank space at the start of the line table for the left
* page. We'll come back later and fill it in with the high key item
* we get from the right key.
*/ */
rightoff = P_HIKEY;
leftoff = P_FIRSTKEY;
ropaque->btpo_next = oopaque->btpo_next;
if (!P_RIGHTMOST(oopaque)) if (!P_RIGHTMOST(oopaque))
{ {
/* splitting a non-rightmost page, start at the first data item */
start = P_FIRSTKEY;
itemid = PageGetItemId(origpage, P_HIKEY); itemid = PageGetItemId(origpage, P_HIKEY);
itemsz = ItemIdGetLength(itemid); itemsz = ItemIdGetLength(itemid);
item = (BTItem) PageGetItem(origpage, itemid); item = (BTItem) PageGetItem(origpage, itemid);
if (PageAddItem(rightpage, (Item) item, itemsz, P_HIKEY, LP_USED) == InvalidOffsetNumber) if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
LP_USED) == InvalidOffsetNumber)
elog(FATAL, "btree: failed to add hikey to the right sibling"); elog(FATAL, "btree: failed to add hikey to the right sibling");
rightoff = P_FIRSTKEY; rightoff = OffsetNumberNext(rightoff);
} }
else
{
/* splitting a rightmost page, "high key" is the first data item */
start = P_HIKEY;
/* the new rightmost page will not have a high key */ /*
rightoff = P_HIKEY; * The "high key" for the new left page will be the first key that's
* going to go into the new right page. This might be either the
* existing data item at position firstright, or the incoming tuple.
*/
leftoff = P_HIKEY;
if (!newitemonleft && newitemoff == firstright)
{
/* incoming tuple will become first on right page */
itemsz = newitemsz;
item = newitem;
} }
maxoff = PageGetMaxOffsetNumber(origpage); else
if (firstright == InvalidOffsetNumber)
{ {
Size llimit = PageGetFreeSpace(leftpage) / 2; /* existing item at firstright will become first on right page */
itemid = PageGetItemId(origpage, firstright);
firstright = _bt_findsplitloc(rel, keysz, scankey, itemsz = ItemIdGetLength(itemid);
origpage, start, maxoff, llimit); item = (BTItem) PageGetItem(origpage, itemid);
} }
if (PageAddItem(leftpage, (Item) item, itemsz, leftoff,
LP_USED) == InvalidOffsetNumber)
elog(FATAL, "btree: failed to add hikey to the left sibling");
leftoff = OffsetNumberNext(leftoff);
for (i = start; i <= maxoff; i = OffsetNumberNext(i)) /*
* Now transfer all the data items to the appropriate page
*/
maxoff = PageGetMaxOffsetNumber(origpage);
for (i = P_FIRSTDATAKEY(oopaque); i <= maxoff; i = OffsetNumberNext(i))
{ {
itemid = PageGetItemId(origpage, i); itemid = PageGetItemId(origpage, i);
itemsz = ItemIdGetLength(itemid); itemsz = ItemIdGetLength(itemid);
item = (BTItem) PageGetItem(origpage, itemid); item = (BTItem) PageGetItem(origpage, itemid);
/* does new item belong before this one? */
if (i == newitemoff)
{
if (newitemonleft)
{
_bt_pgaddtup(rel, leftpage, newitemsz, newitem, leftoff,
"left sibling");
*itup_off = leftoff;
*itup_blkno = BufferGetBlockNumber(buf);
leftoff = OffsetNumberNext(leftoff);
}
else
{
_bt_pgaddtup(rel, rightpage, newitemsz, newitem, rightoff,
"right sibling");
*itup_off = rightoff;
*itup_blkno = BufferGetBlockNumber(rbuf);
rightoff = OffsetNumberNext(rightoff);
}
}
/* decide which page to put it on */ /* decide which page to put it on */
if (i < firstright) if (i < firstright)
{ {
if (PageAddItem(leftpage, (Item) item, itemsz, leftoff, _bt_pgaddtup(rel, leftpage, itemsz, item, leftoff,
LP_USED) == InvalidOffsetNumber) "left sibling");
elog(FATAL, "btree: failed to add item to the left sibling");
leftoff = OffsetNumberNext(leftoff); leftoff = OffsetNumberNext(leftoff);
} }
else else
{ {
if (PageAddItem(rightpage, (Item) item, itemsz, rightoff, _bt_pgaddtup(rel, rightpage, itemsz, item, rightoff,
LP_USED) == InvalidOffsetNumber) "right sibling");
elog(FATAL, "btree: failed to add item to the right sibling");
rightoff = OffsetNumberNext(rightoff); rightoff = OffsetNumberNext(rightoff);
} }
} }
/* /* cope with possibility that newitem goes at the end */
* Okay, page has been split, high key on right page is correct. Now if (i <= newitemoff)
* set the high key on the left page to be the min key on the right {
* page. if (newitemonleft)
*/ {
_bt_pgaddtup(rel, leftpage, newitemsz, newitem, leftoff,
if (P_RIGHTMOST(ropaque)) "left sibling");
itemid = PageGetItemId(rightpage, P_HIKEY); *itup_off = leftoff;
else *itup_blkno = BufferGetBlockNumber(buf);
itemid = PageGetItemId(rightpage, P_FIRSTKEY); leftoff = OffsetNumberNext(leftoff);
itemsz = ItemIdGetLength(itemid); }
item = (BTItem) PageGetItem(rightpage, itemid); else
{
/* _bt_pgaddtup(rel, rightpage, newitemsz, newitem, rightoff,
* We left a hole for the high key on the left page; fill it. The "right sibling");
* modal crap is to tell the page manager to put the new item on the *itup_off = rightoff;
* page and not screw around with anything else. Whoever designed *itup_blkno = BufferGetBlockNumber(rbuf);
* this interface has presumably crawled back into the dung heap they rightoff = OffsetNumberNext(rightoff);
* came from. No one here will admit to it. }
*/ }
PageManagerModeSet(OverwritePageManagerMode);
if (PageAddItem(leftpage, (Item) item, itemsz, P_HIKEY, LP_USED) == InvalidOffsetNumber)
elog(FATAL, "btree: failed to add hikey to the left sibling");
PageManagerModeSet(ShufflePageManagerMode);
/* /*
* By here, the original data page has been split into two new halves, * By here, the original data page has been split into two new halves,
...@@ -992,14 +689,10 @@ _bt_split(Relation rel, Size keysz, ScanKey scankey, ...@@ -992,14 +689,10 @@ _bt_split(Relation rel, Size keysz, ScanKey scankey,
PageRestoreTempPage(leftpage, origpage); PageRestoreTempPage(leftpage, origpage);
/* write these guys out */
_bt_wrtnorelbuf(rel, rbuf);
_bt_wrtnorelbuf(rel, buf);
/* /*
* Finally, we need to grab the right sibling (if any) and fix the * Finally, we need to grab the right sibling (if any) and fix the
* prev pointer there. We are guaranteed that this is deadlock-free * prev pointer there. We are guaranteed that this is deadlock-free
* since no other writer will be moving holding a lock on that page * since no other writer will be holding a lock on that page
* and trying to move left, and all readers release locks on a page * and trying to move left, and all readers release locks on a page
* before trying to fetch its neighbors. * before trying to fetch its neighbors.
*/ */
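The way the rewritten _bt_split distributes items, and picks the left page's high key, can be modeled on plain int arrays. This is a sketch only: ToyHalves and toy_split are hypothetical names, offsets are 0-based rather than OffsetNumbers, and the right page's own high key is ignored. It mirrors three things from the hunk above: the left high key is either the incoming item or the existing item at firstright, the new item is placed just before position newitemoff, and a new item that belongs at the very end is added after the loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result type for the toy split; capacity is arbitrary. */
typedef struct ToyHalves
{
    int    left[16];
    int    right[16];
    size_t nleft;
    size_t nright;
    int    left_high_key;   /* first key that ends up on the right half */
} ToyHalves;

static ToyHalves
toy_split(const int *orig, size_t n, size_t firstright,
          size_t newitemoff, int newitem, bool newitemonleft)
{
    ToyHalves h = {{0}, {0}, 0, 0, 0};
    size_t    i;

    assert(n < 16 && firstright < n && newitemoff <= n);

    /* High key of the left half: incoming item or existing firstright item */
    if (!newitemonleft && newitemoff == firstright)
        h.left_high_key = newitem;
    else
        h.left_high_key = orig[firstright];

    for (i = 0; i < n; i++)
    {
        /* does the new item belong before this one? */
        if (i == newitemoff)
        {
            if (newitemonleft)
                h.left[h.nleft++] = newitem;
            else
                h.right[h.nright++] = newitem;
        }
        /* decide which half gets the existing item */
        if (i < firstright)
            h.left[h.nleft++] = orig[i];
        else
            h.right[h.nright++] = orig[i];
    }

    /* cope with the possibility that the new item goes at the end */
    if (newitemoff == n)
    {
        if (newitemonleft)
            h.left[h.nleft++] = newitem;
        else
            h.right[h.nright++] = newitem;
    }
    return h;
}
```

For example, splitting {1, 3, 5, 7} at firstright = 2 with a new key 4 at newitemoff = 2 yields a left high key of 4 when the new item goes right, and 5 when it goes left; this is the ambiguity the newitemonleft flag resolves.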
...@@ -1020,87 +713,214 @@ _bt_split(Relation rel, Size keysz, ScanKey scankey, ...@@ -1020,87 +713,214 @@ _bt_split(Relation rel, Size keysz, ScanKey scankey,
} }
/* /*
* _bt_findsplitloc() -- find a safe place to split a page. * _bt_findsplitloc() -- find an appropriate place to split a page.
*
* The idea here is to equalize the free space that will be on each split
* page, *after accounting for the inserted tuple*. (If we fail to account
* for it, we might find ourselves with too little room on the page that
* it needs to go into!)
* *
* In order to guarantee the proper handling of searches for duplicate * We are passed the intended insert position of the new tuple, expressed as
* keys, the first duplicate in the chain must either be the first * the offsetnumber of the tuple it must go in front of. (This could be
* item on the page after the split, or the entire chain must be on * maxoff+1 if the tuple is to go at the end.)
* one of the two pages. That is, *
* [1 2 2 2 3 4 5] * We return the index of the first existing tuple that should go on the
* must become * righthand page, plus a boolean indicating whether the new tuple goes on
* [1] [2 2 2 3 4 5] * the left or right page. The bool is necessary to disambiguate the case
* or * where firstright == newitemoff.
* [1 2 2 2] [3 4 5]
* but not
* [1 2 2] [2 3 4 5].
* However,
* [2 2 2 2 2 3 4]
* may be split as
* [2 2 2 2] [2 3 4].
*/ */
static OffsetNumber static OffsetNumber
_bt_findsplitloc(Relation rel, _bt_findsplitloc(Relation rel,
Size keysz,
ScanKey scankey,
Page page, Page page,
OffsetNumber start, OffsetNumber newitemoff,
OffsetNumber maxoff, Size newitemsz,
Size llimit) bool *newitemonleft)
{ {
OffsetNumber i; BTPageOpaque opaque;
OffsetNumber saferight; OffsetNumber offnum;
ItemId nxtitemid, OffsetNumber maxoff;
safeitemid; ItemId itemid;
BTItem safeitem, FindSplitData state;
nxtitem; int leftspace,
Size nbytes; rightspace,
dataitemtotal,
if (start >= maxoff) dataitemstoleft;
elog(FATAL, "btree: cannot split if start (%d) >= maxoff (%d)",
start, maxoff); opaque = (BTPageOpaque) PageGetSpecialPointer(page);
saferight = start;
safeitemid = PageGetItemId(page, saferight); state.newitemsz = newitemsz;
nbytes = ItemIdGetLength(safeitemid) + sizeof(ItemIdData); state.non_leaf = ! P_ISLEAF(opaque);
safeitem = (BTItem) PageGetItem(page, safeitemid); state.have_split = false;
i = OffsetNumberNext(start); /* Total free space available on a btree page, after fixed overhead */
leftspace = rightspace =
while (nbytes < llimit) PageGetPageSize(page) - sizeof(PageHeaderData) -
MAXALIGN(sizeof(BTPageOpaqueData))
+ sizeof(ItemIdData);
/* The right page will have the same high key as the old page */
if (!P_RIGHTMOST(opaque))
{ {
/* check the next item on the page */ itemid = PageGetItemId(page, P_HIKEY);
nxtitemid = PageGetItemId(page, i); rightspace -= (int) (ItemIdGetLength(itemid) + sizeof(ItemIdData));
nbytes += (ItemIdGetLength(nxtitemid) + sizeof(ItemIdData)); }
nxtitem = (BTItem) PageGetItem(page, nxtitemid);
/* Count up total space in data items without actually scanning 'em */
dataitemtotal = rightspace - (int) PageGetFreeSpace(page);
/*
* Scan through the data items and calculate space usage for a split
* at each possible position. XXX we could probably stop somewhere
* near the middle...
*/
dataitemstoleft = 0;
maxoff = PageGetMaxOffsetNumber(page);
for (offnum = P_FIRSTDATAKEY(opaque);
offnum <= maxoff;
offnum = OffsetNumberNext(offnum))
{
Size itemsz;
int leftfree,
rightfree;
itemid = PageGetItemId(page, offnum);
itemsz = ItemIdGetLength(itemid) + sizeof(ItemIdData);
/* /*
* Test against last known safe item: if the tuple we're looking * We have to allow for the current item becoming the high key of
* at isn't equal to the last safe one we saw, then it's our new * the left page; therefore it counts against left space.
* safe tuple.
*/ */
if (!_bt_itemcmp(rel, keysz, scankey, leftfree = leftspace - dataitemstoleft - (int) itemsz;
safeitem, nxtitem, BTEqualStrategyNumber)) rightfree = rightspace - (dataitemtotal - dataitemstoleft);
if (offnum < newitemoff)
_bt_checksplitloc(&state, offnum, leftfree, rightfree,
false, itemsz);
else if (offnum > newitemoff)
_bt_checksplitloc(&state, offnum, leftfree, rightfree,
true, itemsz);
else
{ {
safeitem = nxtitem; /* need to try it both ways!! */
saferight = i; _bt_checksplitloc(&state, offnum, leftfree, rightfree,
false, newitemsz);
_bt_checksplitloc(&state, offnum, leftfree, rightfree,
true, itemsz);
} }
if (i < maxoff)
i = OffsetNumberNext(i); dataitemstoleft += itemsz;
else
break;
} }
if (! state.have_split)
elog(FATAL, "_bt_findsplitloc: can't find a feasible split point for %s",
RelationGetRelationName(rel));
*newitemonleft = state.newitemonleft;
return state.firstright;
}
static void
_bt_checksplitloc(FindSplitData *state, OffsetNumber firstright,
int leftfree, int rightfree,
bool newitemonleft, Size firstrightitemsz)
{
if (newitemonleft)
leftfree -= (int) state->newitemsz;
else
rightfree -= (int) state->newitemsz;
/*
* If we are not on the leaf level, we will be able to discard the
* key data from the first item that winds up on the right page.
*/
if (state->non_leaf)
rightfree += (int) firstrightitemsz -
(int) (sizeof(BTItemData) + sizeof(ItemIdData));
-    /*
-     * If the chain of dups starts at the beginning of the page and
-     * extends past the halfway mark, we can split it in the middle.
-     */
+    /*
+     * If feasible split point, remember best delta.
+     */
if (leftfree >= 0 && rightfree >= 0)
{
int delta = leftfree - rightfree;
if (delta < 0)
delta = -delta;
if (!state->have_split || delta < state->best_delta)
{
state->have_split = true;
state->newitemonleft = newitemonleft;
state->firstright = firstright;
state->best_delta = delta;
}
}
}
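The split-point search in _bt_findsplitloc/_bt_checksplitloc can be illustrated outside the backend. The following is a minimal standalone sketch, not the nbtree code: it assumes a simplified page model in which both halves have the same capacity and there is no high key, no incoming new item, and no non-leaf key truncation. The names SplitChoice and choose_split are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical result type; not part of the nbtree code. */
typedef struct SplitChoice
{
    bool   found;       /* was any feasible split point seen? */
    size_t firstright;  /* index of the first item that goes right */
    int    best_delta;  /* |leftfree - rightfree| at that point */
} SplitChoice;

/*
 * Try every split position and keep the feasible one with the smallest
 * imbalance in free space.  Items [0, firstright) go to the left page,
 * the rest go to the right page.  Starting at 1 keeps both pages non-empty.
 */
SplitChoice
choose_split(const int *itemsz, size_t nitems, int capacity)
{
    SplitChoice best = {false, 0, 0};
    int         total = 0;
    int         toleft = 0;
    size_t      i;

    assert(itemsz != NULL || nitems == 0);

    /* Total space used by all items */
    for (i = 0; i < nitems; i++)
        total += itemsz[i];

    for (i = 1; i < nitems; i++)
    {
        int leftfree;
        int rightfree;
        int delta;

        toleft += itemsz[i - 1];
        leftfree = capacity - toleft;
        rightfree = capacity - (total - toleft);

        /* A split is feasible only if both halves fit */
        if (leftfree < 0 || rightfree < 0)
            continue;

        delta = abs(leftfree - rightfree);
        if (!best.found || delta < best.best_delta)
        {
            best.found = true;
            best.firstright = i;
            best.best_delta = delta;
        }
    }
    return best;
}
```

As in the patched code, the caller has to treat "no feasible split point" as an error case; the sketch reports it through the found flag rather than calling elog.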
/*
* _bt_getstackbuf() -- Walk back up the tree one step, and find the item
* we last looked at in the parent.
*
* This is possible because we save a bit image of the last item
* we looked at in the parent, and the update algorithm guarantees
* that if items above us in the tree move, they only move right.
*
* Also, re-set bts_blkno & bts_offset if changed.
*/
static Buffer
_bt_getstackbuf(Relation rel, BTStack stack)
{
BlockNumber blkno;
Buffer buf;
OffsetNumber start,
offnum,
maxoff;
Page page;
ItemId itemid;
BTItem item;
BTPageOpaque opaque;
blkno = stack->bts_blkno;
buf = _bt_getbuf(rel, blkno, BT_WRITE);
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
maxoff = PageGetMaxOffsetNumber(page);
-    if (saferight == start)
-        saferight = i;
+    start = stack->bts_offset;
+    /*
+     * _bt_insertonpg set bts_offset to InvalidOffsetNumber in the
+     * case of concurrent ROOT page split. Also, watch out for
+     * possibility that page has a high key now when it didn't before.
+     */
+    if (start < P_FIRSTDATAKEY(opaque))
+        start = P_FIRSTDATAKEY(opaque);
-    if (saferight == maxoff && (maxoff - start) > 1)
-        saferight = start + (maxoff - start) / 2;
+    for (;;)
+    {
/* see if it's on this page */
for (offnum = start;
offnum <= maxoff;
offnum = OffsetNumberNext(offnum))
{
itemid = PageGetItemId(page, offnum);
item = (BTItem) PageGetItem(page, itemid);
if (BTItemSame(item, &stack->bts_btitem))
{
/* Return accurate pointer to where link is now */
stack->bts_blkno = blkno;
stack->bts_offset = offnum;
return buf;
}
}
/* by here, the item we're looking for moved right at least one page */
if (P_RIGHTMOST(opaque))
elog(FATAL, "_bt_getstackbuf: my bits moved right off the end of the world!"
"\n\tRecreate index %s.", RelationGetRelationName(rel));
-    return saferight;
+        blkno = opaque->btpo_next;
_bt_relbuf(rel, buf, BT_WRITE);
buf = _bt_getbuf(rel, blkno, BT_WRITE);
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
maxoff = PageGetMaxOffsetNumber(page);
start = P_FIRSTDATAKEY(opaque);
}
} }
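The "items only move right" guarantee that _bt_getstackbuf relies on can be shown with a toy model. This is a sketch under stated assumptions, not the backend routine: pages are array elements, an item is a plain int, and there is no locking. MoveRightPage and refind_moving_right are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define NO_PAGE   (-1)
#define MAX_ITEMS 4

/* Hypothetical page model: a few items plus a right-sibling link. */
typedef struct MoveRightPage
{
    int nitems;
    int items[MAX_ITEMS];   /* stand-ins for downlink items */
    int next;               /* index of right sibling, or NO_PAGE */
} MoveRightPage;

/*
 * Re-find `wanted`, starting at the remembered page and offset and
 * following right-links.  On success, update the remembered position
 * and return 1.  Return 0 if the search runs off the right end of the
 * level (the case the real code treats as a fatal error).
 */
int
refind_moving_right(const MoveRightPage *pages, int *blkno, int *offset,
                    int wanted)
{
    int blk = *blkno;
    int start = *offset;

    assert(pages != NULL);

    /* A stale or invalid remembered offset is clamped to the first item */
    if (start < 0)
        start = 0;

    while (blk != NO_PAGE)
    {
        const MoveRightPage *p = &pages[blk];
        int off;

        /* See if the item is on this page */
        for (off = start; off < p->nitems; off++)
        {
            if (p->items[off] == wanted)
            {
                *blkno = blk;
                *offset = off;
                return 1;
            }
        }

        /* Not here, so it must have moved right at least one page */
        blk = p->next;
        start = 0;
    }
    return 0;
}
```

The search never looks left of the remembered position, which is only sound because of the move-right guarantee described in the comment above.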
/* /*
@@ -1116,9 +936,9 @@ _bt_findsplitloc(Relation rel,
  *      graph.
  *
  *      On entry, lbuf (the old root) and rbuf (its new peer) are write-
- *      locked. We don't drop the locks in this routine; that's done by
- *      the caller. On exit, a new root page exists with entries for the
- *      two new children. The new root page is neither pinned nor locked.
+ *      locked. On exit, a new root page exists with entries for the
+ *      two new children. The new root page is neither pinned nor locked, and
+ *      we have also written out lbuf and rbuf and dropped their pins/locks.
  */
 static void
 _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
@@ -1140,52 +960,52 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE); rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
rootpage = BufferGetPage(rootbuf); rootpage = BufferGetPage(rootbuf);
rootbknum = BufferGetBlockNumber(rootbuf); rootbknum = BufferGetBlockNumber(rootbuf);
_bt_pageinit(rootpage, BufferGetPageSize(rootbuf));
/* set btree special data */ /* set btree special data */
rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpage); rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpage);
rootopaque->btpo_prev = rootopaque->btpo_next = P_NONE; rootopaque->btpo_prev = rootopaque->btpo_next = P_NONE;
rootopaque->btpo_flags |= BTP_ROOT; rootopaque->btpo_flags |= BTP_ROOT;
/*
* Insert the internal tuple pointers.
*/
lbkno = BufferGetBlockNumber(lbuf); lbkno = BufferGetBlockNumber(lbuf);
rbkno = BufferGetBlockNumber(rbuf); rbkno = BufferGetBlockNumber(rbuf);
lpage = BufferGetPage(lbuf); lpage = BufferGetPage(lbuf);
rpage = BufferGetPage(rbuf); rpage = BufferGetPage(rbuf);
/*
* Make sure pages in old root level have valid parent links --- we will
* need this in _bt_insertonpg() if a concurrent root split happens (see
* README).
*/
((BTPageOpaque) PageGetSpecialPointer(lpage))->btpo_parent = ((BTPageOpaque) PageGetSpecialPointer(lpage))->btpo_parent =
((BTPageOpaque) PageGetSpecialPointer(rpage))->btpo_parent = ((BTPageOpaque) PageGetSpecialPointer(rpage))->btpo_parent =
rootbknum; rootbknum;
     /*
-     * step over the high key on the left page while building the left
-     * page pointer.
+     * Create downlink item for left page (old root). Since this will be
+     * the first item in a non-leaf page, it implicitly has minus-infinity
+     * key value, so we need not store any actual key in it.
      */
-    itemid = PageGetItemId(lpage, P_FIRSTKEY);
-    itemsz = ItemIdGetLength(itemid);
-    item = (BTItem) PageGetItem(lpage, itemid);
-    new_item = _bt_formitem(&(item->bti_itup));
+    itemsz = sizeof(BTItemData);
+    new_item = (BTItem) palloc(itemsz);
+    new_item->bti_itup.t_info = itemsz;
     ItemPointerSet(&(new_item->bti_itup.t_tid), lbkno, P_HIKEY);
     /*
-     * insert the left page pointer into the new root page. the root page
-     * is the rightmost page on its level so the "high key" item is the
-     * first data item.
+     * Insert the left page pointer into the new root page. The root page
+     * is the rightmost page on its level so there is no "high key" in it;
+     * the two items will go into positions P_HIKEY and P_FIRSTKEY.
      */
     if (PageAddItem(rootpage, (Item) new_item, itemsz, P_HIKEY, LP_USED) == InvalidOffsetNumber)
         elog(FATAL, "btree: failed to add leftkey to new root page");
     pfree(new_item);
     /*
-     * the right page is the rightmost page on the second level, so the
-     * "high key" item is the first data item on that page as well.
+     * Create downlink item for right page. The key for it is obtained from
+     * the "high key" position in the left page.
      */
-    itemid = PageGetItemId(rpage, P_HIKEY);
+    itemid = PageGetItemId(lpage, P_HIKEY);
     itemsz = ItemIdGetLength(itemid);
-    item = (BTItem) PageGetItem(rpage, itemid);
+    item = (BTItem) PageGetItem(lpage, itemid);
     new_item = _bt_formitem(&(item->bti_itup));
     ItemPointerSet(&(new_item->bti_itup.t_tid), rbkno, P_HIKEY);
@@ -1196,497 +1016,101 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
elog(FATAL, "btree: failed to add rightkey to new root page"); elog(FATAL, "btree: failed to add rightkey to new root page");
pfree(new_item); pfree(new_item);
-    /* write and let go of the root buffer */
+    /* write and let go of the new root buffer */
     _bt_wrtbuf(rel, rootbuf);
     /* update metadata page with new root block number */
     _bt_metaproot(rel, rootbknum, 0);
-    _bt_wrtbuf(rel, lbuf);
+    /* update and release new sibling, and finally the old root */
     _bt_wrtbuf(rel, rbuf);
+    _bt_wrtbuf(rel, lbuf);
 }
/* /*
* _bt_pgaddtup() -- add a tuple to a particular page in the index. * _bt_pgaddtup() -- add a tuple to a particular page in the index.
* *
* This routine adds the tuple to the page as requested, and keeps the * This routine adds the tuple to the page as requested. It does
* write lock and reference associated with the page's buffer. It is * not affect pin/lock status, but you'd better have a write lock
* an error to call pgaddtup() without a write lock and reference. If * and pin on the target buffer! Don't forget to write and release
* afteritem is non-null, it's the item that we expect our new item * the buffer afterwards, either.
* to follow. Otherwise, we do a binary search for the correct place *
* and insert the new item there. * The main difference between this routine and a bare PageAddItem call
* is that this code knows that the leftmost data item on a non-leaf
* btree page doesn't need to have a key. Therefore, it strips such
* items down to just the item header. CAUTION: this works ONLY if
* we insert the items in order, so that the given itup_off does
* represent the final position of the item!
*/ */
static OffsetNumber static void
_bt_pgaddtup(Relation rel, _bt_pgaddtup(Relation rel,
Buffer buf, Page page,
int keysz,
ScanKey itup_scankey,
Size itemsize, Size itemsize,
BTItem btitem, BTItem btitem,
BTItem afteritem) OffsetNumber itup_off,
{ const char *where)
OffsetNumber itup_off;
OffsetNumber first;
Page page;
BTPageOpaque opaque;
BTItem chkitem;
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
first = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
if (afteritem == (BTItem) NULL)
itup_off = _bt_binsrch(rel, buf, keysz, itup_scankey, BT_INSERTION);
else
{
itup_off = first;
do
{
chkitem = (BTItem) PageGetItem(page, PageGetItemId(page, itup_off));
itup_off = OffsetNumberNext(itup_off);
} while (!BTItemSame(chkitem, afteritem));
}
if (PageAddItem(page, (Item) btitem, itemsize, itup_off, LP_USED) == InvalidOffsetNumber)
elog(FATAL, "btree: failed to add item to the page");
/* write the buffer, but hold our lock */
_bt_wrtnorelbuf(rel, buf);
return itup_off;
}
/*
* _bt_goesonpg() -- Does a new tuple belong on this page?
*
* This is part of the complexity introduced by allowing duplicate
* keys into the index. The tuple belongs on this page if:
*
* + there is no page to the right of this one; or
* + it is less than the high key on the page; or
* + the item it is to follow ("afteritem") appears on this
* page.
*/
static bool
_bt_goesonpg(Relation rel,
Buffer buf,
Size keysz,
ScanKey scankey,
BTItem afteritem)
{
Page page;
ItemId hikey;
BTPageOpaque opaque;
BTItem chkitem;
OffsetNumber offnum,
maxoff;
bool found;
page = BufferGetPage(buf);
/* no right neighbor? */
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
if (P_RIGHTMOST(opaque))
return true;
/*
* this is a non-rightmost page, so it must have a high key item.
*
* If the scan key is < the high key (the min key on the next page), then
* it for sure belongs here.
*/
hikey = PageGetItemId(page, P_HIKEY);
if (_bt_skeycmp(rel, keysz, scankey, page, hikey, BTLessStrategyNumber))
return true;
/*
* If the scan key is > the high key, then it for sure doesn't belong
* here.
*/
if (_bt_skeycmp(rel, keysz, scankey, page, hikey, BTGreaterStrategyNumber))
return false;
/*
* If we have no adjacency information, and the item is equal to the
* high key on the page (by here it is), then the item does not belong
* on this page.
*
* Now it's not true in all cases. - vadim 06/10/97
*/
if (afteritem == (BTItem) NULL)
{
if (opaque->btpo_flags & BTP_LEAF)
return false;
if (opaque->btpo_flags & BTP_CHAIN)
return true;
if (_bt_skeycmp(rel, keysz, scankey, page,
PageGetItemId(page, P_FIRSTKEY),
BTEqualStrategyNumber))
return true;
return false;
}
/* damn, have to work for it. i hate that. */
maxoff = PageGetMaxOffsetNumber(page);
/*
* Search the entire page for the afteroid. We need to do this,
* rather than doing a binary search and starting from there, because
* if the key we're searching for is the leftmost key in the tree at
* this level, then a binary search will do the wrong thing. Splits
* are pretty infrequent, so the cost isn't as bad as it could be.
*/
found = false;
for (offnum = P_FIRSTKEY;
offnum <= maxoff;
offnum = OffsetNumberNext(offnum))
{
chkitem = (BTItem) PageGetItem(page, PageGetItemId(page, offnum));
if (BTItemSame(chkitem, afteritem))
{
found = true;
break;
}
}
return found;
}
/*
* _bt_tuplecompare() -- compare two IndexTuples,
* return -1, 0, or +1
*
*/
static int32
_bt_tuplecompare(Relation rel,
Size keysz,
ScanKey scankey,
IndexTuple tuple1,
IndexTuple tuple2)
{ {
TupleDesc tupDes; BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
int i; BTItemData truncitem;
int32 compare = 0;
tupDes = RelationGetDescr(rel); if (! P_ISLEAF(opaque) && itup_off == P_FIRSTDATAKEY(opaque))
for (i = 1; i <= (int) keysz; i++)
{
ScanKey entry = &scankey[i - 1];
Datum attrDatum1,
attrDatum2;
bool isFirstNull,
isSecondNull;
attrDatum1 = index_getattr(tuple1, i, tupDes, &isFirstNull);
attrDatum2 = index_getattr(tuple2, i, tupDes, &isSecondNull);
/* see comments about NULLs handling in btbuild */
if (isFirstNull) /* attr in tuple1 is NULL */
{
if (isSecondNull) /* attr in tuple2 is NULL too */
compare = 0;
else
compare = 1; /* NULL ">" not-NULL */
}
else if (isSecondNull) /* attr in tuple1 is NOT_NULL and */
{ /* attr in tuple2 is NULL */
compare = -1; /* not-NULL "<" NULL */
}
else
{
compare = DatumGetInt32(FunctionCall2(&entry->sk_func,
attrDatum1, attrDatum2));
}
if (compare != 0)
break; /* done when we find unequal attributes */
}
return compare;
}
/*
* _bt_itemcmp() -- compare two BTItems using a requested
* strategy (<, <=, =, >=, >)
*
*/
bool
_bt_itemcmp(Relation rel,
Size keysz,
ScanKey scankey,
BTItem item1,
BTItem item2,
StrategyNumber strat)
{
int32 compare;
compare = _bt_tuplecompare(rel, keysz, scankey,
&(item1->bti_itup),
&(item2->bti_itup));
switch (strat)
{ {
case BTLessStrategyNumber: memcpy(&truncitem, btitem, sizeof(BTItemData));
return (bool) (compare < 0); truncitem.bti_itup.t_info = sizeof(BTItemData);
case BTLessEqualStrategyNumber: btitem = &truncitem;
return (bool) (compare <= 0); itemsize = sizeof(BTItemData);
case BTEqualStrategyNumber:
return (bool) (compare == 0);
case BTGreaterEqualStrategyNumber:
return (bool) (compare >= 0);
case BTGreaterStrategyNumber:
return (bool) (compare > 0);
} }
elog(ERROR, "_bt_itemcmp: bogus strategy %d", (int) strat); if (PageAddItem(page, (Item) btitem, itemsize, itup_off,
return false; LP_USED) == InvalidOffsetNumber)
} elog(FATAL, "btree: failed to add item to the %s for %s",
where, RelationGetRelationName(rel));
/*
* _bt_updateitem() -- updates the key of the item identified by the
* oid with the key of newItem (done in place if
* possible)
*
*/
static void
_bt_updateitem(Relation rel,
Size keysz,
Buffer buf,
BTItem oldItem,
BTItem newItem)
{
Page page;
OffsetNumber maxoff;
OffsetNumber i;
ItemPointerData itemPtrData;
BTItem item;
IndexTuple oldIndexTuple,
newIndexTuple;
int first;
page = BufferGetPage(buf);
maxoff = PageGetMaxOffsetNumber(page);
/* locate item on the page */
first = P_RIGHTMOST((BTPageOpaque) PageGetSpecialPointer(page))
? P_HIKEY : P_FIRSTKEY;
i = first;
do
{
item = (BTItem) PageGetItem(page, PageGetItemId(page, i));
i = OffsetNumberNext(i);
} while (i <= maxoff && !BTItemSame(item, oldItem));
/* this should never happen (in theory) */
if (!BTItemSame(item, oldItem))
elog(FATAL, "_bt_getstackbuf was lying!!");
/*
* It's defined by caller (_bt_insertonpg)
*/
/*
* if(IndexTupleDSize(newItem->bti_itup) >
* IndexTupleDSize(item->bti_itup)) { elog(NOTICE, "trying to
* overwrite a smaller value with a bigger one in _bt_updateitem");
* elog(ERROR, "this is not good."); }
*/
oldIndexTuple = &(item->bti_itup);
newIndexTuple = &(newItem->bti_itup);
/* keep the original item pointer */
ItemPointerCopy(&(oldIndexTuple->t_tid), &itemPtrData);
CopyIndexTuple(newIndexTuple, &oldIndexTuple);
ItemPointerCopy(&itemPtrData, &(oldIndexTuple->t_tid));
} }
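The key-stripping rule in the new _bt_pgaddtup can be sketched with a toy page. This is an illustration under simplifying assumptions, not the backend code: items have a fixed-size key buffer, a key length of 0 stands for the implicit minus-infinity key, and items may only be appended in order. TruncItem, TruncPage and trunc_pgaddtup are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TRUNC_MAX_ITEMS 8
#define TRUNC_KEY_LEN   16

/* Hypothetical item: a pointer stand-in plus an optional key. */
typedef struct TruncItem
{
    int  child;                 /* downlink or heap pointer stand-in */
    int  keylen;                /* 0 means "minus infinity", no key stored */
    char key[TRUNC_KEY_LEN];
} TruncItem;

/* Hypothetical page: leaf flag, first data offset, and an item array. */
typedef struct TruncPage
{
    bool      is_leaf;
    int       first_data;       /* offset of the first data item */
    int       nitems;
    TruncItem items[TRUNC_MAX_ITEMS];
} TruncPage;

/*
 * Append item at offset off.  On a non-leaf page the first data item
 * keeps only its header; its key is dropped.  This is only correct when
 * items arrive in order, so off must equal the current item count.
 */
bool
trunc_pgaddtup(TruncPage *page, const TruncItem *item, int off)
{
    TruncItem copy;

    assert(page != NULL && item != NULL);

    if (off != page->nitems || off >= TRUNC_MAX_ITEMS)
        return false;

    copy = *item;
    if (!page->is_leaf && off == page->first_data)
    {
        /* Strip the key: searches treat this item as less than anything */
        copy.keylen = 0;
        memset(copy.key, 0, sizeof(copy.key));
    }

    page->items[off] = copy;
    page->nitems++;
    return true;
}
```

Rejecting out-of-order offsets mirrors the CAUTION in the comment above: the truncation decision is made from the offset passed in, so that offset has to be the item's final position.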
/* /*
* _bt_isequal - used in _bt_doinsert in check for duplicates. * _bt_isequal - used in _bt_doinsert in check for duplicates.
* *
* This is very similar to _bt_compare, except for NULL handling.
* Rule is simple: NOT_NULL not equal NULL, NULL not_equal NULL too. * Rule is simple: NOT_NULL not equal NULL, NULL not_equal NULL too.
*/ */
static bool static bool
_bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum, _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
int keysz, ScanKey scankey) int keysz, ScanKey scankey)
{ {
Datum datum;
BTItem btitem; BTItem btitem;
IndexTuple itup; IndexTuple itup;
ScanKey entry;
AttrNumber attno;
int32 result;
int i; int i;
bool null;
/* Better be comparing to a leaf item */
Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
btitem = (BTItem) PageGetItem(page, PageGetItemId(page, offnum)); btitem = (BTItem) PageGetItem(page, PageGetItemId(page, offnum));
itup = &(btitem->bti_itup); itup = &(btitem->bti_itup);
for (i = 1; i <= keysz; i++) for (i = 1; i <= keysz; i++)
{ {
entry = &scankey[i - 1]; ScanKey entry = &scankey[i - 1];
AttrNumber attno;
Datum datum;
bool isNull;
int32 result;
attno = entry->sk_attno; attno = entry->sk_attno;
Assert(attno == i); Assert(attno == i);
datum = index_getattr(itup, attno, itupdesc, &null); datum = index_getattr(itup, attno, itupdesc, &isNull);
/* NULLs are not equal */ /* NULLs are never equal to anything */
if (entry->sk_flags & SK_ISNULL || null) if (entry->sk_flags & SK_ISNULL || isNull)
return false; return false;
result = DatumGetInt32(FunctionCall2(&entry->sk_func, result = DatumGetInt32(FunctionCall2(&entry->sk_func,
entry->sk_argument, datum)); entry->sk_argument,
datum));
if (result != 0) if (result != 0)
return false; return false;
} }
/* by here, the keys are equal */ /* if we get here, the keys are equal */
return true; return true;
} }
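The NULL rule used by _bt_isequal for the uniqueness check (a NULL is never equal to anything, including another NULL) can be sketched as follows. This is a simplified illustration, assuming integer attributes and plain == in place of the per-attribute comparison function; NullableDatum and keys_equal_for_unique are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical attribute value: an int that may be NULL. */
typedef struct NullableDatum
{
    bool isnull;
    int  value;
} NullableDatum;

/*
 * Return true only if every attribute is non-NULL on both sides and
 * compares equal.  Any NULL makes the keys "not equal", so rows with
 * NULLs never conflict in a unique index.
 */
bool
keys_equal_for_unique(const NullableDatum *scankey,
                      const NullableDatum *tuple,
                      size_t nkeys)
{
    size_t i;

    assert(scankey != NULL && tuple != NULL);

    for (i = 0; i < nkeys; i++)
    {
        /* NULLs are never equal to anything */
        if (scankey[i].isnull || tuple[i].isnull)
            return false;

        if (scankey[i].value != tuple[i].value)
            return false;
    }

    /* if we get here, the keys are equal */
    return true;
}
```

This differs from an ordering comparison, where NULLs need a defined sort position; here the only question is whether two keys count as duplicates.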
#ifdef NOT_USED
/*
* _bt_shift - insert btitem on the passed page after shifting page
* to the right in the tree.
*
* NOTE: tested for shifting leftmost page only, having btitem < hikey.
*/
static InsertIndexResult
_bt_shift(Relation rel, Buffer buf, BTStack stack, int keysz,
ScanKey scankey, BTItem btitem, BTItem hikey)
{
InsertIndexResult res;
int itemsz;
Page page;
BlockNumber bknum;
BTPageOpaque pageop;
Buffer rbuf;
Page rpage;
BTPageOpaque rpageop;
Buffer pbuf;
Page ppage;
BTPageOpaque ppageop;
Buffer nbuf;
Page npage;
BTPageOpaque npageop;
BlockNumber nbknum;
BTItem nitem;
OffsetNumber afteroff;
btitem = _bt_formitem(&(btitem->bti_itup));
hikey = _bt_formitem(&(hikey->bti_itup));
page = BufferGetPage(buf);
/* grab new page */
nbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
nbknum = BufferGetBlockNumber(nbuf);
npage = BufferGetPage(nbuf);
_bt_pageinit(npage, BufferGetPageSize(nbuf));
npageop = (BTPageOpaque) PageGetSpecialPointer(npage);
/* copy content of the passed page */
memmove((char *) npage, (char *) page, BufferGetPageSize(buf));
/* re-init old (passed) page */
_bt_pageinit(page, BufferGetPageSize(buf));
pageop = (BTPageOpaque) PageGetSpecialPointer(page);
/* init old page opaque */
pageop->btpo_flags = npageop->btpo_flags; /* restore flags */
pageop->btpo_flags &= ~BTP_CHAIN;
if (_bt_itemcmp(rel, keysz, scankey, hikey, btitem, BTEqualStrategyNumber))
pageop->btpo_flags |= BTP_CHAIN;
pageop->btpo_prev = npageop->btpo_prev; /* restore prev */
pageop->btpo_next = nbknum; /* next points to the new page */
pageop->btpo_parent = npageop->btpo_parent;
/* init shifted page opaque */
npageop->btpo_prev = bknum = BufferGetBlockNumber(buf);
/* shifted page is ok, populate old page */
/* add passed hikey */
itemsz = IndexTupleDSize(hikey->bti_itup)
+ (sizeof(BTItemData) - sizeof(IndexTupleData));
itemsz = MAXALIGN(itemsz);
if (PageAddItem(page, (Item) hikey, itemsz, P_HIKEY, LP_USED) == InvalidOffsetNumber)
elog(FATAL, "btree: failed to add hikey in _bt_shift");
pfree(hikey);
/* add btitem */
itemsz = IndexTupleDSize(btitem->bti_itup)
+ (sizeof(BTItemData) - sizeof(IndexTupleData));
itemsz = MAXALIGN(itemsz);
if (PageAddItem(page, (Item) btitem, itemsz, P_FIRSTKEY, LP_USED) == InvalidOffsetNumber)
elog(FATAL, "btree: failed to add firstkey in _bt_shift");
pfree(btitem);
nitem = (BTItem) PageGetItem(page, PageGetItemId(page, P_FIRSTKEY));
btitem = _bt_formitem(&(nitem->bti_itup));
ItemPointerSet(&(btitem->bti_itup.t_tid), bknum, P_HIKEY);
/* ok, write them out */
_bt_wrtnorelbuf(rel, nbuf);
_bt_wrtnorelbuf(rel, buf);
/* fix btpo_prev on right sibling of old page */
if (!P_RIGHTMOST(npageop))
{
rbuf = _bt_getbuf(rel, npageop->btpo_next, BT_WRITE);
rpage = BufferGetPage(rbuf);
rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
rpageop->btpo_prev = nbknum;
_bt_wrtbuf(rel, rbuf);
}
/* get parent pointing to the old page */
ItemPointerSet(&(stack->bts_btitem->bti_itup.t_tid),
bknum, P_HIKEY);
pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
ppage = BufferGetPage(pbuf);
ppageop = (BTPageOpaque) PageGetSpecialPointer(ppage);
_bt_relbuf(rel, nbuf, BT_WRITE);
_bt_relbuf(rel, buf, BT_WRITE);
/* re-set parent' pointer - we shifted our page to the right ! */
nitem = (BTItem) PageGetItem(ppage,
PageGetItemId(ppage, stack->bts_offset));
ItemPointerSet(&(nitem->bti_itup.t_tid), nbknum, P_HIKEY);
ItemPointerSet(&(stack->bts_btitem->bti_itup.t_tid), nbknum, P_HIKEY);
_bt_wrtnorelbuf(rel, pbuf);
/*
* Now we want insert into the parent pointer to our old page. It has
* to be inserted before the pointer to new page. You may get problems
* here (in the _bt_goesonpg and/or _bt_pgaddtup), but may be not - I
* don't know. It works if old page is leftmost (nitem is NULL) and
* btitem < hikey and it's all what we need currently. - vadim
* 05/30/97
*/
nitem = NULL;
afteroff = P_FIRSTKEY;
if (!P_RIGHTMOST(ppageop))
afteroff = OffsetNumberNext(afteroff);
if (stack->bts_offset >= afteroff)
{
afteroff = OffsetNumberPrev(stack->bts_offset);
nitem = (BTItem) PageGetItem(ppage, PageGetItemId(ppage, afteroff));
nitem = _bt_formitem(&(nitem->bti_itup));
}
res = _bt_insertonpg(rel, pbuf, stack->bts_parent,
keysz, scankey, btitem, nitem);
pfree(btitem);
ItemPointerSet(&(res->pointerData), nbknum, P_HIKEY);
return res;
}
#endif
@@ -9,7 +9,7 @@
  *
  *
  * IDENTIFICATION
- *    $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtpage.c,v 1.36 2000/04/12 17:14:49 momjian Exp $
+ *    $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtpage.c,v 1.37 2000/07/21 06:42:32 tgl Exp $
  *
  * NOTES
  *    Postgres btree pages look like ordinary relation pages. The opaque
@@ -90,7 +90,7 @@ _bt_metapinit(Relation rel)
     metad.btm_version = BTREE_VERSION;
     metad.btm_root = P_NONE;
     metad.btm_level = 0;
-    memmove((char *) BTPageGetMeta(pg), (char *) &metad, sizeof(metad));
+    memcpy((char *) BTPageGetMeta(pg), (char *) &metad, sizeof(metad));
     op = (BTPageOpaque) PageGetSpecialPointer(pg);
     op->btpo_flags = BTP_META;
@@ -102,52 +102,6 @@ _bt_metapinit(Relation rel)
UnlockRelation(rel, AccessExclusiveLock); UnlockRelation(rel, AccessExclusiveLock);
} }
#ifdef NOT_USED
/*
* _bt_checkmeta() -- Verify that the metadata stored in a btree are
* reasonable.
*/
void
_bt_checkmeta(Relation rel)
{
Buffer metabuf;
Page metap;
BTMetaPageData *metad;
BTPageOpaque op;
int nblocks;
/* if the relation is empty, this is init time; don't complain */
if ((nblocks = RelationGetNumberOfBlocks(rel)) == 0)
return;
metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
metap = BufferGetPage(metabuf);
op = (BTPageOpaque) PageGetSpecialPointer(metap);
if (!(op->btpo_flags & BTP_META))
{
elog(ERROR, "Invalid metapage for index %s",
RelationGetRelationName(rel));
}
metad = BTPageGetMeta(metap);
if (metad->btm_magic != BTREE_MAGIC)
{
elog(ERROR, "Index %s is not a btree",
RelationGetRelationName(rel));
}
if (metad->btm_version != BTREE_VERSION)
{
elog(ERROR, "Version mismatch on %s: version %d file, version %d code",
RelationGetRelationName(rel),
metad->btm_version, BTREE_VERSION);
}
_bt_relbuf(rel, metabuf, BT_READ);
}
#endif
/* /*
* _bt_getroot() -- Get the root page of the btree. * _bt_getroot() -- Get the root page of the btree.
* *
@@ -157,11 +111,15 @@ _bt_checkmeta(Relation rel)
* standard class of race conditions exists here; I think I covered * standard class of race conditions exists here; I think I covered
* them all in the Hopi Indian rain dance of lock requests below. * them all in the Hopi Indian rain dance of lock requests below.
* *
- *      We pass in the access type (BT_READ or BT_WRITE), and return the
- *      root page's buffer with the appropriate lock type set. Reference
- *      count on the root page gets bumped by ReadBuffer. The metadata
- *      page is unlocked and unreferenced by this process when this routine
- *      returns.
+ *      The access type parameter (BT_READ or BT_WRITE) controls whether
+ *      a new root page will be created or not. If access = BT_READ,
+ *      and no root page exists, we just return InvalidBuffer. For
+ *      BT_WRITE, we try to create the root page if it doesn't exist.
+ *      NOTE that the returned root page will have only a read lock set
+ *      on it even if access = BT_WRITE!
+ *
+ *      On successful return, the root page is pinned and read-locked.
+ *      The metadata page is not locked or pinned on exit.
  */
Buffer Buffer
_bt_getroot(Relation rel, int access) _bt_getroot(Relation rel, int access)
@@ -178,78 +136,71 @@ _bt_getroot(Relation rel, int access)
metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ); metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
metapg = BufferGetPage(metabuf); metapg = BufferGetPage(metabuf);
metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg); metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
Assert(metaopaque->btpo_flags & BTP_META);
metad = BTPageGetMeta(metapg); metad = BTPageGetMeta(metapg);
if (metad->btm_magic != BTREE_MAGIC) if (!(metaopaque->btpo_flags & BTP_META) ||
{ metad->btm_magic != BTREE_MAGIC)
elog(ERROR, "Index %s is not a btree", elog(ERROR, "Index %s is not a btree",
RelationGetRelationName(rel)); RelationGetRelationName(rel));
}
if (metad->btm_version != BTREE_VERSION) if (metad->btm_version != BTREE_VERSION)
{ elog(ERROR, "Version mismatch on %s: version %d file, version %d code",
elog(ERROR, "Version mismatch on %s: version %d file, version %d code",
RelationGetRelationName(rel), RelationGetRelationName(rel),
metad->btm_version, BTREE_VERSION); metad->btm_version, BTREE_VERSION);
}
/* if no root page initialized yet, do it */ /* if no root page initialized yet, do it */
if (metad->btm_root == P_NONE) if (metad->btm_root == P_NONE)
{ {
/* If access = BT_READ, caller doesn't want us to create root yet */
if (access == BT_READ)
{
_bt_relbuf(rel, metabuf, BT_READ);
return InvalidBuffer;
}
/* turn our read lock in for a write lock */ /* trade in our read lock for a write lock */
_bt_relbuf(rel, metabuf, BT_READ); LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE); LockBuffer(metabuf, BT_WRITE);
metapg = BufferGetPage(metabuf);
metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
Assert(metaopaque->btpo_flags & BTP_META);
metad = BTPageGetMeta(metapg);
/* /*
* Race condition: if someone else initialized the metadata * Race condition: if someone else initialized the metadata
* between the time we released the read lock and acquired the * between the time we released the read lock and acquired the
* write lock, above, we want to avoid doing it again. * write lock, above, we must avoid doing it again.
*/ */
if (metad->btm_root == P_NONE) if (metad->btm_root == P_NONE)
{ {
/* /*
* Get, initialize, write, and leave a lock of the appropriate * Get, initialize, write, and leave a lock of the appropriate
* type on the new root page. Since this is the first page in * type on the new root page. Since this is the first page in
* the tree, it's a leaf. * the tree, it's a leaf as well as the root.
*/ */
rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE); rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
rootblkno = BufferGetBlockNumber(rootbuf); rootblkno = BufferGetBlockNumber(rootbuf);
rootpg = BufferGetPage(rootbuf); rootpg = BufferGetPage(rootbuf);
metad->btm_root = rootblkno; metad->btm_root = rootblkno;
metad->btm_level = 1; metad->btm_level = 1;
_bt_pageinit(rootpg, BufferGetPageSize(rootbuf)); _bt_pageinit(rootpg, BufferGetPageSize(rootbuf));
rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpg); rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpg);
rootopaque->btpo_flags |= (BTP_LEAF | BTP_ROOT); rootopaque->btpo_flags |= (BTP_LEAF | BTP_ROOT);
_bt_wrtnorelbuf(rel, rootbuf); _bt_wrtnorelbuf(rel, rootbuf);
/* swap write lock for read lock, if appropriate */ /* swap write lock for read lock */
if (access != BT_WRITE) LockBuffer(rootbuf, BUFFER_LOCK_UNLOCK);
{ LockBuffer(rootbuf, BT_READ);
LockBuffer(rootbuf, BUFFER_LOCK_UNLOCK);
LockBuffer(rootbuf, BT_READ);
}
/* okay, metadata is correct */ /* okay, metadata is correct, write and release it */
_bt_wrtbuf(rel, metabuf); _bt_wrtbuf(rel, metabuf);
} }
else else
{ {
/* /*
* Metadata initialized by someone else. In order to * Metadata initialized by someone else. In order to
* guarantee no deadlocks, we have to release the metadata * guarantee no deadlocks, we have to release the metadata
* page and start all over again. * page and start all over again.
*/ */
_bt_relbuf(rel, metabuf, BT_WRITE); _bt_relbuf(rel, metabuf, BT_WRITE);
return _bt_getroot(rel, access); return _bt_getroot(rel, access);
} }
@@ -259,22 +210,21 @@ _bt_getroot(Relation rel, int access)
         rootblkno = metad->btm_root;
         _bt_relbuf(rel, metabuf, BT_READ);     /* done with the meta page */
-        rootbuf = _bt_getbuf(rel, rootblkno, access);
+        rootbuf = _bt_getbuf(rel, rootblkno, BT_READ);
     }
     /*
      * Race condition: If the root page split between the time we looked
      * at the metadata page and got the root buffer, then we got the wrong
-     * buffer.
+     * buffer. Release it and try again.
      */
     rootpg = BufferGetPage(rootbuf);
     rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpg);
-    if (!(rootopaque->btpo_flags & BTP_ROOT))
+    if (! P_ISROOT(rootopaque))
     {
         /* it happened, try again */
-        _bt_relbuf(rel, rootbuf, access);
+        _bt_relbuf(rel, rootbuf, BT_READ);
         return _bt_getroot(rel, access);
     }
...@@ -283,7 +233,6 @@ _bt_getroot(Relation rel, int access) ...@@ -283,7 +233,6 @@ _bt_getroot(Relation rel, int access)
* count is correct, and we have no lock set on the metadata page. * count is correct, and we have no lock set on the metadata page.
* Return the root block. * Return the root block.
*/ */
return rootbuf; return rootbuf;
} }
...@@ -291,33 +240,38 @@ _bt_getroot(Relation rel, int access) ...@@ -291,33 +240,38 @@ _bt_getroot(Relation rel, int access)
* _bt_getbuf() -- Get a buffer by block number for read or write. * _bt_getbuf() -- Get a buffer by block number for read or write.
* *
* When this routine returns, the appropriate lock is set on the * When this routine returns, the appropriate lock is set on the
* requested buffer its reference count is correct. * requested buffer and its reference count has been incremented
* (ie, the buffer is "locked and pinned").
*/ */
Buffer Buffer
_bt_getbuf(Relation rel, BlockNumber blkno, int access) _bt_getbuf(Relation rel, BlockNumber blkno, int access)
{ {
Buffer buf; Buffer buf;
Page page;
if (blkno != P_NEW) if (blkno != P_NEW)
{ {
/* Read an existing block of the relation */
buf = ReadBuffer(rel, blkno); buf = ReadBuffer(rel, blkno);
LockBuffer(buf, access); LockBuffer(buf, access);
} }
else else
{ {
Page page;
/* /*
* Extend bufmgr code is unclean and so we have to use locking * Extend the relation by one page.
*
* Extend bufmgr code is unclean and so we have to use extra locking
* here. * here.
*/ */
LockPage(rel, 0, ExclusiveLock); LockPage(rel, 0, ExclusiveLock);
buf = ReadBuffer(rel, blkno); buf = ReadBuffer(rel, blkno);
LockBuffer(buf, access);
UnlockPage(rel, 0, ExclusiveLock); UnlockPage(rel, 0, ExclusiveLock);
blkno = BufferGetBlockNumber(buf);
/* Initialize the new page before returning it */
page = BufferGetPage(buf); page = BufferGetPage(buf);
_bt_pageinit(page, BufferGetPageSize(buf)); _bt_pageinit(page, BufferGetPageSize(buf));
LockBuffer(buf, access);
} }
/* ref count and lock type are correct */ /* ref count and lock type are correct */
...@@ -326,6 +280,8 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access) ...@@ -326,6 +280,8 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
/* /*
* _bt_relbuf() -- release a locked buffer. * _bt_relbuf() -- release a locked buffer.
*
* Lock and pin (refcount) are both dropped.
*/ */
void void
_bt_relbuf(Relation rel, Buffer buf, int access) _bt_relbuf(Relation rel, Buffer buf, int access)
...@@ -337,9 +293,15 @@ _bt_relbuf(Relation rel, Buffer buf, int access) ...@@ -337,9 +293,15 @@ _bt_relbuf(Relation rel, Buffer buf, int access)
/* /*
* _bt_wrtbuf() -- write a btree page to disk. * _bt_wrtbuf() -- write a btree page to disk.
* *
* This routine releases the lock held on the buffer and our reference * This routine releases the lock held on the buffer and our refcount
* to it. It is an error to call _bt_wrtbuf() without a write lock * for it. It is an error to call _bt_wrtbuf() without a write lock
* or a reference to the buffer. * and a pin on the buffer.
*
* NOTE: actually, the buffer manager just marks the shared buffer page
* dirty here, the real I/O happens later. Since we can't persuade the
* Unix kernel to schedule disk writes in a particular order, there's not
* much point in worrying about this. The most we can say is that all the
* writes will occur before commit.
*/ */
void void
_bt_wrtbuf(Relation rel, Buffer buf) _bt_wrtbuf(Relation rel, Buffer buf)
...@@ -353,7 +315,9 @@ _bt_wrtbuf(Relation rel, Buffer buf) ...@@ -353,7 +315,9 @@ _bt_wrtbuf(Relation rel, Buffer buf)
* our reference or lock. * our reference or lock.
* *
* It is an error to call _bt_wrtnorelbuf() without a write lock * It is an error to call _bt_wrtnorelbuf() without a write lock
* or a reference to the buffer. * and a pin on the buffer.
*
* See above NOTE.
*/ */
void void
_bt_wrtnorelbuf(Relation rel, Buffer buf) _bt_wrtnorelbuf(Relation rel, Buffer buf)
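The revised comments above separate two things a backend can hold on a buffer: a lock and a pin (reference count). A minimal sketch of that contract follows, assuming a single-holder toy buffer. `ToyBuffer`, `ToyLock`, and the `toy_*` functions are hypothetical names for illustration, not PostgreSQL APIs, and the sketch ignores that real read locks are shared.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these types are not part of PostgreSQL. */
typedef enum { TOY_UNLOCKED, TOY_READ, TOY_WRITE } ToyLock;

typedef struct ToyBuffer
{
    int     pins;   /* reference count ("pin") */
    ToyLock lock;   /* current lock mode; single holder assumed */
    bool    dirty;  /* marked for a later write; no I/O happens here */
} ToyBuffer;

/* Like _bt_getbuf: on return the buffer is locked and pinned. */
void
toy_getbuf(ToyBuffer *buf, ToyLock mode)
{
    assert(mode != TOY_UNLOCKED);
    assert(buf->lock == TOY_UNLOCKED);
    buf->pins++;
    buf->lock = mode;
}

/* Like _bt_relbuf: lock and pin are both dropped. */
void
toy_relbuf(ToyBuffer *buf)
{
    assert(buf->pins > 0 && buf->lock != TOY_UNLOCKED);
    buf->lock = TOY_UNLOCKED;
    buf->pins--;
}

/* Like _bt_wrtnorelbuf: mark dirty, keep both lock and pin. */
void
toy_wrtnorelbuf(ToyBuffer *buf)
{
    /* calling this without a write lock and a pin is an error */
    assert(buf->pins > 0 && buf->lock == TOY_WRITE);
    buf->dirty = true;
}

/* Like _bt_wrtbuf: mark dirty, then drop lock and pin. */
void
toy_wrtbuf(ToyBuffer *buf)
{
    toy_wrtnorelbuf(buf);
    toy_relbuf(buf);
}
```

The asserts encode the "it is an error to call ... without a write lock and a pin" rule from the comments. In the toy, as in the NOTE above, "write" only sets a dirty flag.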
...@@ -389,10 +353,10 @@ _bt_pageinit(Page page, Size size) ...@@ -389,10 +353,10 @@ _bt_pageinit(Page page, Size size)
* we split the root page, we record the new parent in the metadata page * we split the root page, we record the new parent in the metadata page
* for the relation. This routine does the work. * for the relation. This routine does the work.
* *
* No direct preconditions, but if you don't have the a write lock on * No direct preconditions, but if you don't have the write lock on
* at least the old root page when you call this, you're making a big * at least the old root page when you call this, you're making a big
* mistake. On exit, metapage data is correct and we no longer have * mistake. On exit, metapage data is correct and we no longer have
* a reference to or lock on the metapage. * a pin or lock on the metapage.
*/ */
void void
_bt_metaproot(Relation rel, BlockNumber rootbknum, int level) _bt_metaproot(Relation rel, BlockNumber rootbknum, int level)
...@@ -416,127 +380,8 @@ _bt_metaproot(Relation rel, BlockNumber rootbknum, int level) ...@@ -416,127 +380,8 @@ _bt_metaproot(Relation rel, BlockNumber rootbknum, int level)
} }
/* /*
* _bt_getstackbuf() -- Walk back up the tree one step, and find the item * Delete an item from a btree. It had better be a leaf item...
* we last looked at in the parent.
*
* This is possible because we save a bit image of the last item
* we looked at in the parent, and the update algorithm guarantees
* that if items above us in the tree move, they only move right.
*
* Also, re-set bts_blkno & bts_offset if changed and
* bts_btitem (it may be changed - see _bt_insertonpg).
*/ */
Buffer
_bt_getstackbuf(Relation rel, BTStack stack, int access)
{
Buffer buf;
BlockNumber blkno;
OffsetNumber start,
offnum,
maxoff;
OffsetNumber i;
Page page;
ItemId itemid;
BTItem item;
BTPageOpaque opaque;
BTItem item_save;
int item_nbytes;
blkno = stack->bts_blkno;
buf = _bt_getbuf(rel, blkno, access);
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
maxoff = PageGetMaxOffsetNumber(page);
if (stack->bts_offset == InvalidOffsetNumber ||
maxoff >= stack->bts_offset)
{
/*
* _bt_insertonpg set bts_offset to InvalidOffsetNumber in the
* case of concurrent ROOT page split
*/
if (stack->bts_offset == InvalidOffsetNumber)
i = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
else
{
itemid = PageGetItemId(page, stack->bts_offset);
item = (BTItem) PageGetItem(page, itemid);
/* if the item is where we left it, we're done */
if (BTItemSame(item, stack->bts_btitem))
{
pfree(stack->bts_btitem);
item_nbytes = ItemIdGetLength(itemid);
item_save = (BTItem) palloc(item_nbytes);
memmove((char *) item_save, (char *) item, item_nbytes);
stack->bts_btitem = item_save;
return buf;
}
i = OffsetNumberNext(stack->bts_offset);
}
/* if the item has just moved right on this page, we're done */
for (;
i <= maxoff;
i = OffsetNumberNext(i))
{
itemid = PageGetItemId(page, i);
item = (BTItem) PageGetItem(page, itemid);
/* if the item is where we left it, we're done */
if (BTItemSame(item, stack->bts_btitem))
{
stack->bts_offset = i;
pfree(stack->bts_btitem);
item_nbytes = ItemIdGetLength(itemid);
item_save = (BTItem) palloc(item_nbytes);
memmove((char *) item_save, (char *) item, item_nbytes);
stack->bts_btitem = item_save;
return buf;
}
}
}
/* by here, the item we're looking for moved right at least one page */
for (;;)
{
blkno = opaque->btpo_next;
if (P_RIGHTMOST(opaque))
elog(FATAL, "my bits moved right off the end of the world!\
\n\tRecreate index %s.", RelationGetRelationName(rel));
_bt_relbuf(rel, buf, access);
buf = _bt_getbuf(rel, blkno, access);
page = BufferGetPage(buf);
maxoff = PageGetMaxOffsetNumber(page);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
/* if we have a right sibling, step over the high key */
start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
/* see if it's on this page */
for (offnum = start;
offnum <= maxoff;
offnum = OffsetNumberNext(offnum))
{
itemid = PageGetItemId(page, offnum);
item = (BTItem) PageGetItem(page, itemid);
if (BTItemSame(item, stack->bts_btitem))
{
stack->bts_offset = offnum;
stack->bts_blkno = blkno;
pfree(stack->bts_btitem);
item_nbytes = ItemIdGetLength(itemid);
item_save = (BTItem) palloc(item_nbytes);
memmove((char *) item_save, (char *) item, item_nbytes);
stack->bts_btitem = item_save;
return buf;
}
}
}
}
void void
_bt_pagedel(Relation rel, ItemPointer tid) _bt_pagedel(Relation rel, ItemPointer tid)
{ {
......
...@@ -12,7 +12,7 @@ ...@@ -12,7 +12,7 @@
* Portions Copyright (c) 1994, Regents of the University of California * Portions Copyright (c) 1994, Regents of the University of California
* *
* IDENTIFICATION * IDENTIFICATION
* $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtree.c,v 1.61 2000/07/14 22:17:33 tgl Exp $ * $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtree.c,v 1.62 2000/07/21 06:42:32 tgl Exp $
* *
*------------------------------------------------------------------------- *-------------------------------------------------------------------------
*/ */
...@@ -26,6 +26,7 @@ ...@@ -26,6 +26,7 @@
#include "executor/executor.h" #include "executor/executor.h"
#include "miscadmin.h" #include "miscadmin.h"
bool BuildingBtree = false; /* see comment in btbuild() */ bool BuildingBtree = false; /* see comment in btbuild() */
bool FastBuild = true; /* use sort/build instead of insertion bool FastBuild = true; /* use sort/build instead of insertion
* build */ * build */
...@@ -206,8 +207,8 @@ btbuild(PG_FUNCTION_ARGS) ...@@ -206,8 +207,8 @@ btbuild(PG_FUNCTION_ARGS)
* btree pages - NULLs greater NOT_NULLs and NULL = NULL is TRUE. * btree pages - NULLs greater NOT_NULLs and NULL = NULL is TRUE.
* Sure, it's just rule for placing/finding items and no more - * Sure, it's just rule for placing/finding items and no more -
* keytest'll return FALSE for a = 5 for items having 'a' isNULL. * keytest'll return FALSE for a = 5 for items having 'a' isNULL.
* Look at _bt_skeycmp, _bt_compare and _bt_itemcmp for how it * Look at _bt_compare for how it works.
* works. - vadim 03/23/97 * - vadim 03/23/97
* *
* if (itup->t_info & INDEX_NULL_MASK) { pfree(itup); continue; } * if (itup->t_info & INDEX_NULL_MASK) { pfree(itup); continue; }
*/ */
...@@ -321,14 +322,6 @@ btinsert(PG_FUNCTION_ARGS) ...@@ -321,14 +322,6 @@ btinsert(PG_FUNCTION_ARGS)
/* generate an index tuple */ /* generate an index tuple */
itup = index_formtuple(RelationGetDescr(rel), datum, nulls); itup = index_formtuple(RelationGetDescr(rel), datum, nulls);
itup->t_tid = *ht_ctid; itup->t_tid = *ht_ctid;
/*
* See comments in btbuild.
*
* if (itup->t_info & INDEX_NULL_MASK)
* PG_RETURN_POINTER((InsertIndexResult) NULL);
*/
btitem = _bt_formitem(itup); btitem = _bt_formitem(itup);
res = _bt_doinsert(rel, btitem, rel->rd_uniqueindex, heapRel); res = _bt_doinsert(rel, btitem, rel->rd_uniqueindex, heapRel);
...@@ -357,10 +350,10 @@ btgettuple(PG_FUNCTION_ARGS) ...@@ -357,10 +350,10 @@ btgettuple(PG_FUNCTION_ARGS)
if (ItemPointerIsValid(&(scan->currentItemData))) if (ItemPointerIsValid(&(scan->currentItemData)))
{ {
/* /*
* Restore scan position using heap TID returned by previous call * Restore scan position using heap TID returned by previous call
* to btgettuple(). _bt_restscan() locks buffer. * to btgettuple(). _bt_restscan() re-grabs the read lock on
* the buffer, too.
*/ */
_bt_restscan(scan); _bt_restscan(scan);
res = _bt_next(scan, dir); res = _bt_next(scan, dir);
...@@ -369,8 +362,9 @@ btgettuple(PG_FUNCTION_ARGS) ...@@ -369,8 +362,9 @@ btgettuple(PG_FUNCTION_ARGS)
res = _bt_first(scan, dir); res = _bt_first(scan, dir);
/* /*
* Save heap TID to use it in _bt_restscan. Unlock buffer before * Save heap TID to use it in _bt_restscan. Then release the read
* leaving index ! * lock on the buffer so that we aren't blocking other backends.
* NOTE: we do keep the pin on the buffer!
*/ */
if (res) if (res)
{ {
...@@ -419,7 +413,18 @@ btrescan(PG_FUNCTION_ARGS) ...@@ -419,7 +413,18 @@ btrescan(PG_FUNCTION_ARGS)
so = (BTScanOpaque) scan->opaque; so = (BTScanOpaque) scan->opaque;
/* we don't hold a read lock on the current page in the scan */ if (so == NULL) /* if called from btbeginscan */
{
so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));
so->btso_curbuf = so->btso_mrkbuf = InvalidBuffer;
so->keyData = (ScanKey) NULL;
if (scan->numberOfKeys > 0)
so->keyData = (ScanKey) palloc(scan->numberOfKeys * sizeof(ScanKeyData));
scan->opaque = so;
scan->flags = 0x0;
}
/* we aren't holding any read locks, but gotta drop the pins */
if (ItemPointerIsValid(iptr = &(scan->currentItemData))) if (ItemPointerIsValid(iptr = &(scan->currentItemData)))
{ {
ReleaseBuffer(so->btso_curbuf); ReleaseBuffer(so->btso_curbuf);
...@@ -427,7 +432,6 @@ btrescan(PG_FUNCTION_ARGS) ...@@ -427,7 +432,6 @@ btrescan(PG_FUNCTION_ARGS)
ItemPointerSetInvalid(iptr); ItemPointerSetInvalid(iptr);
} }
/* and we don't hold a read lock on the last marked item in the scan */
if (ItemPointerIsValid(iptr = &(scan->currentMarkData))) if (ItemPointerIsValid(iptr = &(scan->currentMarkData)))
{ {
ReleaseBuffer(so->btso_mrkbuf); ReleaseBuffer(so->btso_mrkbuf);
...@@ -435,17 +439,6 @@ btrescan(PG_FUNCTION_ARGS) ...@@ -435,17 +439,6 @@ btrescan(PG_FUNCTION_ARGS)
ItemPointerSetInvalid(iptr); ItemPointerSetInvalid(iptr);
} }
if (so == NULL) /* if called from btbeginscan */
{
so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));
so->btso_curbuf = so->btso_mrkbuf = InvalidBuffer;
so->keyData = (ScanKey) NULL;
if (scan->numberOfKeys > 0)
so->keyData = (ScanKey) palloc(scan->numberOfKeys * sizeof(ScanKeyData));
scan->opaque = so;
scan->flags = 0x0;
}
/* /*
* Reset the scan keys. Note that keys ordering stuff moved to * Reset the scan keys. Note that keys ordering stuff moved to
* _bt_first. - vadim 05/05/97 * _bt_first. - vadim 05/05/97
...@@ -472,7 +465,7 @@ btmovescan(IndexScanDesc scan, Datum v) ...@@ -472,7 +465,7 @@ btmovescan(IndexScanDesc scan, Datum v)
so = (BTScanOpaque) scan->opaque; so = (BTScanOpaque) scan->opaque;
/* we don't hold a read lock on the current page in the scan */ /* we aren't holding any read locks, but gotta drop the pin */
if (ItemPointerIsValid(iptr = &(scan->currentItemData))) if (ItemPointerIsValid(iptr = &(scan->currentItemData)))
{ {
ReleaseBuffer(so->btso_curbuf); ReleaseBuffer(so->btso_curbuf);
...@@ -480,7 +473,6 @@ btmovescan(IndexScanDesc scan, Datum v) ...@@ -480,7 +473,6 @@ btmovescan(IndexScanDesc scan, Datum v)
ItemPointerSetInvalid(iptr); ItemPointerSetInvalid(iptr);
} }
/* scan->keyData[0].sk_argument = v; */
so->keyData[0].sk_argument = v; so->keyData[0].sk_argument = v;
} }
...@@ -496,7 +488,7 @@ btendscan(PG_FUNCTION_ARGS) ...@@ -496,7 +488,7 @@ btendscan(PG_FUNCTION_ARGS)
so = (BTScanOpaque) scan->opaque; so = (BTScanOpaque) scan->opaque;
/* we don't hold any read locks */ /* we aren't holding any read locks, but gotta drop the pins */
if (ItemPointerIsValid(iptr = &(scan->currentItemData))) if (ItemPointerIsValid(iptr = &(scan->currentItemData)))
{ {
if (BufferIsValid(so->btso_curbuf)) if (BufferIsValid(so->btso_curbuf))
...@@ -534,7 +526,7 @@ btmarkpos(PG_FUNCTION_ARGS) ...@@ -534,7 +526,7 @@ btmarkpos(PG_FUNCTION_ARGS)
so = (BTScanOpaque) scan->opaque; so = (BTScanOpaque) scan->opaque;
/* we don't hold any read locks */ /* we aren't holding any read locks, but gotta drop the pin */
if (ItemPointerIsValid(iptr = &(scan->currentMarkData))) if (ItemPointerIsValid(iptr = &(scan->currentMarkData)))
{ {
ReleaseBuffer(so->btso_mrkbuf); ReleaseBuffer(so->btso_mrkbuf);
...@@ -542,7 +534,7 @@ btmarkpos(PG_FUNCTION_ARGS) ...@@ -542,7 +534,7 @@ btmarkpos(PG_FUNCTION_ARGS)
ItemPointerSetInvalid(iptr); ItemPointerSetInvalid(iptr);
} }
/* bump pin on current buffer */ /* bump pin on current buffer for assignment to mark buffer */
if (ItemPointerIsValid(&(scan->currentItemData))) if (ItemPointerIsValid(&(scan->currentItemData)))
{ {
so->btso_mrkbuf = ReadBuffer(scan->relation, so->btso_mrkbuf = ReadBuffer(scan->relation,
...@@ -566,7 +558,7 @@ btrestrpos(PG_FUNCTION_ARGS) ...@@ -566,7 +558,7 @@ btrestrpos(PG_FUNCTION_ARGS)
so = (BTScanOpaque) scan->opaque; so = (BTScanOpaque) scan->opaque;
/* we don't hold any read locks */ /* we aren't holding any read locks, but gotta drop the pin */
if (ItemPointerIsValid(iptr = &(scan->currentItemData))) if (ItemPointerIsValid(iptr = &(scan->currentItemData)))
{ {
ReleaseBuffer(so->btso_curbuf); ReleaseBuffer(so->btso_curbuf);
...@@ -579,7 +571,6 @@ btrestrpos(PG_FUNCTION_ARGS) ...@@ -579,7 +571,6 @@ btrestrpos(PG_FUNCTION_ARGS)
{ {
so->btso_curbuf = ReadBuffer(scan->relation, so->btso_curbuf = ReadBuffer(scan->relation,
BufferGetBlockNumber(so->btso_mrkbuf)); BufferGetBlockNumber(so->btso_mrkbuf));
scan->currentItemData = scan->currentMarkData; scan->currentItemData = scan->currentMarkData;
so->curHeapIptr = so->mrkHeapIptr; so->curHeapIptr = so->mrkHeapIptr;
} }
...@@ -603,6 +594,9 @@ btdelete(PG_FUNCTION_ARGS) ...@@ -603,6 +594,9 @@ btdelete(PG_FUNCTION_ARGS)
PG_RETURN_VOID(); PG_RETURN_VOID();
} }
/*
* Restore scan position when btgettuple is called to continue a scan.
*/
static void static void
_bt_restscan(IndexScanDesc scan) _bt_restscan(IndexScanDesc scan)
{ {
...@@ -618,7 +612,12 @@ _bt_restscan(IndexScanDesc scan) ...@@ -618,7 +612,12 @@ _bt_restscan(IndexScanDesc scan)
BTItem item; BTItem item;
BlockNumber blkno; BlockNumber blkno;
LockBuffer(buf, BT_READ); /* lock buffer first! */ /*
* Get back the read lock we were holding on the buffer.
* (We still have a reference-count pin on it, though.)
*/
LockBuffer(buf, BT_READ);
page = BufferGetPage(buf); page = BufferGetPage(buf);
maxoff = PageGetMaxOffsetNumber(page); maxoff = PageGetMaxOffsetNumber(page);
opaque = (BTPageOpaque) PageGetSpecialPointer(page); opaque = (BTPageOpaque) PageGetSpecialPointer(page);
...@@ -631,43 +630,40 @@ _bt_restscan(IndexScanDesc scan) ...@@ -631,43 +630,40 @@ _bt_restscan(IndexScanDesc scan)
*/ */
if (!ItemPointerIsValid(&target)) if (!ItemPointerIsValid(&target))
{ {
ItemPointerSetOffsetNumber(&(scan->currentItemData), ItemPointerSetOffsetNumber(current,
OffsetNumberPrev(P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY)); OffsetNumberPrev(P_FIRSTDATAKEY(opaque)));
return; return;
} }
if (maxoff >= offnum) /*
* The item we were on may have moved right due to insertions.
* Find it again.
*/
for (;;)
{ {
/* Check for item on this page */
/*
* if the item is where we left it or has just moved right on this
* page, we're done
*/
for (; for (;
offnum <= maxoff; offnum <= maxoff;
offnum = OffsetNumberNext(offnum)) offnum = OffsetNumberNext(offnum))
{ {
item = (BTItem) PageGetItem(page, PageGetItemId(page, offnum)); item = (BTItem) PageGetItem(page, PageGetItemId(page, offnum));
if (item->bti_itup.t_tid.ip_blkid.bi_hi == \ if (item->bti_itup.t_tid.ip_blkid.bi_hi ==
target.ip_blkid.bi_hi && \ target.ip_blkid.bi_hi &&
item->bti_itup.t_tid.ip_blkid.bi_lo == \ item->bti_itup.t_tid.ip_blkid.bi_lo ==
target.ip_blkid.bi_lo && \ target.ip_blkid.bi_lo &&
item->bti_itup.t_tid.ip_posid == target.ip_posid) item->bti_itup.t_tid.ip_posid == target.ip_posid)
{ {
current->ip_posid = offnum; current->ip_posid = offnum;
return; return;
} }
} }
}
/* /*
* By here, the item we're looking for moved right at least one page * By here, the item we're looking for moved right at least one page
*/ */
for (;;)
{
if (P_RIGHTMOST(opaque)) if (P_RIGHTMOST(opaque))
elog(FATAL, "_bt_restscan: my bits moved right off the end of the world!\ elog(FATAL, "_bt_restscan: my bits moved right off the end of the world!"
\n\tRecreate index %s.", RelationGetRelationName(rel)); "\n\tRecreate index %s.", RelationGetRelationName(rel));
blkno = opaque->btpo_next; blkno = opaque->btpo_next;
_bt_relbuf(rel, buf, BT_READ); _bt_relbuf(rel, buf, BT_READ);
...@@ -675,23 +671,8 @@ _bt_restscan(IndexScanDesc scan) ...@@ -675,23 +671,8 @@ _bt_restscan(IndexScanDesc scan)
page = BufferGetPage(buf); page = BufferGetPage(buf);
maxoff = PageGetMaxOffsetNumber(page); maxoff = PageGetMaxOffsetNumber(page);
opaque = (BTPageOpaque) PageGetSpecialPointer(page); opaque = (BTPageOpaque) PageGetSpecialPointer(page);
offnum = P_FIRSTDATAKEY(opaque);
/* see if it's on this page */ ItemPointerSet(current, blkno, offnum);
for (offnum = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY; so->btso_curbuf = buf;
offnum <= maxoff;
offnum = OffsetNumberNext(offnum))
{
item = (BTItem) PageGetItem(page, PageGetItemId(page, offnum));
if (item->bti_itup.t_tid.ip_blkid.bi_hi == \
target.ip_blkid.bi_hi && \
item->bti_itup.t_tid.ip_blkid.bi_lo == \
target.ip_blkid.bi_lo && \
item->bti_itup.t_tid.ip_posid == target.ip_posid)
{
ItemPointerSet(current, blkno, offnum);
so->btso_curbuf = buf;
return;
}
}
} }
} }
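The restructured loop in _bt_restscan scans forward on the current page from the remembered offset and, if the target heap TID is not found, follows the right-link and restarts at the first data key of the sibling. A minimal sketch of that control flow, assuming an in-memory array of toy pages with 0-based offsets. `ToyPage`, `ToyTid`, and `toy_refind` are hypothetical names, and the sketch returns false where the real code raises a FATAL error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for illustration only; none of these are PostgreSQL types. */
typedef struct ToyTid
{
    unsigned blk;
    unsigned pos;
} ToyTid;

#define TOY_MAX_ITEMS 8

typedef struct ToyPage
{
    int     next;        /* array index of right sibling, or -1 if rightmost */
    int     first_data;  /* 1 if slot 0 holds a high key, else 0 */
    int     nitems;      /* number of used slots */
    ToyTid  items[TOY_MAX_ITEMS];
} ToyPage;

static bool
toy_tid_equal(ToyTid a, ToyTid b)
{
    return a.blk == b.blk && a.pos == b.pos;
}

/*
 * Re-find 'target' starting at (*pageno, *offset), moving right as needed.
 * On success, update *pageno and *offset and return true.  Return false,
 * leaving both unchanged, if the search runs off the rightmost page.
 */
bool
toy_refind(const ToyPage *pages, int *pageno, int *offset, ToyTid target)
{
    int pg;
    int off;

    assert(pages != NULL && pageno != NULL && offset != NULL);
    pg = *pageno;
    off = *offset;

    for (;;)
    {
        const ToyPage *page = &pages[pg];

        /* check the rest of this page first */
        for (; off < page->nitems; off++)
        {
            if (toy_tid_equal(page->items[off], target))
            {
                *pageno = pg;
                *offset = off;
                return true;
            }
        }

        /* not here: the item must have moved right at least one page */
        if (page->next < 0)
            return false;
        pg = page->next;
        off = pages[pg].first_data;
    }
}
```

Using one outer loop for both the "same page" and "moved right" cases is what lets the commit drop the duplicated comparison code that the old version carried.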
...@@ -8,22 +8,25 @@ ...@@ -8,22 +8,25 @@
* *
* *
* IDENTIFICATION * IDENTIFICATION
* $Header: /cvsroot/pgsql/src/backend/access/nbtree/Attic/nbtscan.c,v 1.31 2000/04/12 17:14:49 momjian Exp $ * $Header: /cvsroot/pgsql/src/backend/access/nbtree/Attic/nbtscan.c,v 1.32 2000/07/21 06:42:32 tgl Exp $
* *
* *
* NOTES * NOTES
* Because we can be doing an index scan on a relation while we update * Because we can be doing an index scan on a relation while we update
* it, we need to avoid missing data that moves around in the index. * it, we need to avoid missing data that moves around in the index.
* The routines and global variables in this file guarantee that all * Insertions and page splits are no problem because _bt_restscan()
* scans in the local address space stay correctly positioned. This * can figure out where the current item moved to, but if a deletion
* is all we need to worry about, since write locking guarantees that * happens at or before the current scan position, we'd better do
* no one else will be on the same page at the same time as we are. * something to stay in sync.
*
* The routines in this file handle the problem for deletions issued
* by the current backend. Currently, that's all we need, since
* deletions are only done by VACUUM and it gets an exclusive lock.
* *
* The scheme is to manage a list of active scans in the current backend. * The scheme is to manage a list of active scans in the current backend.
* Whenever we add or remove records from an index, or whenever we * Whenever we remove a record from an index, we check the list of active
* split a leaf page, we check the list of active scans to see if any * scans to see if any has been affected. A scan is affected only if it
* has been affected. A scan is affected only if it is on the same * is on the same relation, and the same page, as the update.
* relation, and the same page, as the update.
* *
*------------------------------------------------------------------------- *-------------------------------------------------------------------------
*/ */
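The scheme described in the NOTES above (walk the backend's list of active scans after a deletion and fix up any scan on the same relation and page) can be sketched as follows. This is a simplified illustration, not the nbtscan.c code: `ToyScan` and `toy_adjust_scans` are hypothetical, offsets are 0-based, and an offset of -1 stands in for "before the first item", where the real code instead steps the scan backward with _bt_step.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of one active scan; not a PostgreSQL structure. */
typedef struct ToyScan
{
    unsigned        rel;   /* which index the scan is on */
    unsigned        blk;   /* page the scan is positioned on */
    int             off;   /* 0-based position; -1 = before first item */
    struct ToyScan *next;  /* next active scan in this backend */
} ToyScan;

/*
 * An item at (rel, blk, deleted_off) was removed, so later items on that
 * page slid down by one slot.  Step back every scan positioned at or after
 * the deleted slot so that it neither skips nor repeats an item.
 * Returns the number of scans adjusted.
 */
int
toy_adjust_scans(ToyScan *scans, unsigned rel, unsigned blk, int deleted_off)
{
    int      adjusted = 0;
    ToyScan *s;

    assert(deleted_off >= 0);

    for (s = scans; s != NULL; s = s->next)
    {
        /* a scan is affected only on the same relation and same page */
        if (s->rel == rel && s->blk == blk && s->off >= deleted_off)
        {
            s->off--;
            adjusted++;
        }
    }
    return adjusted;
}
```

Scans positioned before the deleted slot, on other pages, or on other relations are left alone, which matches the "affected only if" condition in the comment.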
...@@ -111,7 +114,7 @@ _bt_dropscan(IndexScanDesc scan) ...@@ -111,7 +114,7 @@ _bt_dropscan(IndexScanDesc scan)
/* /*
* _bt_adjscans() -- adjust all scans in the scan list to compensate * _bt_adjscans() -- adjust all scans in the scan list to compensate
* for a given deletion or insertion * for a given deletion
*/ */
void void
_bt_adjscans(Relation rel, ItemPointer tid) _bt_adjscans(Relation rel, ItemPointer tid)
...@@ -153,7 +156,7 @@ _bt_scandel(IndexScanDesc scan, BlockNumber blkno, OffsetNumber offno) ...@@ -153,7 +156,7 @@ _bt_scandel(IndexScanDesc scan, BlockNumber blkno, OffsetNumber offno)
{ {
page = BufferGetPage(buf); page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page); opaque = (BTPageOpaque) PageGetSpecialPointer(page);
start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY; start = P_FIRSTDATAKEY(opaque);
if (ItemPointerGetOffsetNumber(current) == start) if (ItemPointerGetOffsetNumber(current) == start)
ItemPointerSetInvalid(&(so->curHeapIptr)); ItemPointerSetInvalid(&(so->curHeapIptr));
else else
...@@ -165,7 +168,6 @@ _bt_scandel(IndexScanDesc scan, BlockNumber blkno, OffsetNumber offno) ...@@ -165,7 +168,6 @@ _bt_scandel(IndexScanDesc scan, BlockNumber blkno, OffsetNumber offno)
*/ */
LockBuffer(buf, BT_READ); LockBuffer(buf, BT_READ);
_bt_step(scan, &buf, BackwardScanDirection); _bt_step(scan, &buf, BackwardScanDirection);
so->btso_curbuf = buf;
if (ItemPointerIsValid(current)) if (ItemPointerIsValid(current))
{ {
Page pg = BufferGetPage(buf); Page pg = BufferGetPage(buf);
...@@ -183,10 +185,9 @@ _bt_scandel(IndexScanDesc scan, BlockNumber blkno, OffsetNumber offno) ...@@ -183,10 +185,9 @@ _bt_scandel(IndexScanDesc scan, BlockNumber blkno, OffsetNumber offno)
&& ItemPointerGetBlockNumber(current) == blkno && ItemPointerGetBlockNumber(current) == blkno
&& ItemPointerGetOffsetNumber(current) >= offno) && ItemPointerGetOffsetNumber(current) >= offno)
{ {
page = BufferGetPage(so->btso_mrkbuf); page = BufferGetPage(so->btso_mrkbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page); opaque = (BTPageOpaque) PageGetSpecialPointer(page);
start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY; start = P_FIRSTDATAKEY(opaque);
if (ItemPointerGetOffsetNumber(current) == start) if (ItemPointerGetOffsetNumber(current) == start)
ItemPointerSetInvalid(&(so->mrkHeapIptr)); ItemPointerSetInvalid(&(so->mrkHeapIptr));
......
/*------------------------------------------------------------------------- /*-------------------------------------------------------------------------
* *
* btsearch.c * nbtsearch.c
* search code for postgres btrees. * search code for postgres btrees.
* *
*
* Portions Copyright (c) 1996-2000, PostgreSQL, Inc * Portions Copyright (c) 1996-2000, PostgreSQL, Inc
* Portions Copyright (c) 1994, Regents of the University of California * Portions Copyright (c) 1994, Regents of the University of California
* *
*
* IDENTIFICATION * IDENTIFICATION
* $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtsearch.c,v 1.60 2000/05/30 04:24:33 tgl Exp $ * $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtsearch.c,v 1.61 2000/07/21 06:42:32 tgl Exp $
* *
*------------------------------------------------------------------------- *-------------------------------------------------------------------------
*/ */
...@@ -19,102 +19,96 @@ ...@@ -19,102 +19,96 @@
#include "access/nbtree.h" #include "access/nbtree.h"
static RetrieveIndexResult _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static BTStack _bt_searchr(Relation rel, int keysz, ScanKey scankey,
Buffer *bufP, BTStack stack_in);
static int32 _bt_compare(Relation rel, TupleDesc itupdesc, Page page,
int keysz, ScanKey scankey, OffsetNumber offnum);
static bool
_bt_twostep(IndexScanDesc scan, Buffer *bufP, ScanDirection dir);
static RetrieveIndexResult
_bt_endpoint(IndexScanDesc scan, ScanDirection dir);
/* /*
* _bt_search() -- Search for a scan key in the index. * _bt_search() -- Search the tree for a particular scankey,
* or more precisely for the first leaf page it could be on.
*
* Return value is a stack of parent-page pointers. *bufP is set to the
* address of the leaf-page buffer, which is read-locked and pinned.
* No locks are held on the parent pages, however!
* *
* This routine is actually just a helper that sets things up and * NOTE that the returned buffer is read-locked regardless of the access
* calls a recursive-descent search routine on the tree. * parameter. However, access = BT_WRITE will allow an empty root page
* to be created and returned. When access = BT_READ, an empty index
* will result in *bufP being set to InvalidBuffer.
*/ */
BTStack BTStack
_bt_search(Relation rel, int keysz, ScanKey scankey, Buffer *bufP) _bt_search(Relation rel, int keysz, ScanKey scankey,
{ Buffer *bufP, int access)
*bufP = _bt_getroot(rel, BT_READ);
return _bt_searchr(rel, keysz, scankey, bufP, (BTStack) NULL);
}
/*
* _bt_searchr() -- Search the tree recursively for a particular scankey.
*/
static BTStack
_bt_searchr(Relation rel,
int keysz,
ScanKey scankey,
Buffer *bufP,
BTStack stack_in)
{ {
BTStack stack; BTStack stack_in = NULL;
OffsetNumber offnum;
Page page;
BTPageOpaque opaque;
BlockNumber par_blkno;
BlockNumber blkno;
ItemId itemid;
BTItem btitem;
BTItem item_save;
int item_nbytes;
IndexTuple itup;
/* if this is a leaf page, we're done */ /* Get the root page to start with */
page = BufferGetPage(*bufP); *bufP = _bt_getroot(rel, access);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
if (opaque->btpo_flags & BTP_LEAF)
return stack_in;
/* /* If index is empty and access = BT_READ, no root page is created. */
* Find the appropriate item on the internal page, and get the child if (! BufferIsValid(*bufP))
* page that it points to. return (BTStack) NULL;
*/
par_blkno = BufferGetBlockNumber(*bufP); /* Loop iterates once per level descended in the tree */
offnum = _bt_binsrch(rel, *bufP, keysz, scankey, BT_DESCENT); for (;;)
itemid = PageGetItemId(page, offnum); {
btitem = (BTItem) PageGetItem(page, itemid); Page page;
itup = &(btitem->bti_itup); BTPageOpaque opaque;
blkno = ItemPointerGetBlockNumber(&(itup->t_tid)); OffsetNumber offnum;
ItemId itemid;
BTItem btitem;
IndexTuple itup;
BlockNumber blkno;
BlockNumber par_blkno;
BTStack new_stack;
/* if this is a leaf page, we're done */
page = BufferGetPage(*bufP);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
if (P_ISLEAF(opaque))
break;
/* /*
* We need to save the bit image of the index entry we chose in the * Find the appropriate item on the internal page, and get the
* parent page on a stack. In case we split the tree, we'll use this * child page that it points to.
* bit image to figure out what our real parent page is, in case the */
* parent splits while we're working lower in the tree. See the paper offnum = _bt_binsrch(rel, *bufP, keysz, scankey);
* by Lehman and Yao for how this is detected and handled. (We use itemid = PageGetItemId(page, offnum);
* unique OIDs to disambiguate duplicate keys in the index -- Lehman btitem = (BTItem) PageGetItem(page, itemid);
* and Yao disallow duplicate keys). itup = &(btitem->bti_itup);
*/ blkno = ItemPointerGetBlockNumber(&(itup->t_tid));
par_blkno = BufferGetBlockNumber(*bufP);
item_nbytes = ItemIdGetLength(itemid); /*
item_save = (BTItem) palloc(item_nbytes); * We need to save the bit image of the index entry we chose in the
memmove((char *) item_save, (char *) btitem, item_nbytes); * parent page on a stack. In case we split the tree, we'll use this
stack = (BTStack) palloc(sizeof(BTStackData)); * bit image to figure out what our real parent page is, in case the
stack->bts_blkno = par_blkno; * parent splits while we're working lower in the tree. See the paper
stack->bts_offset = offnum; * by Lehman and Yao for how this is detected and handled. (We use the
stack->bts_btitem = item_save; * child link to disambiguate duplicate keys in the index -- Lehman
stack->bts_parent = stack_in; * and Yao disallow duplicate keys.)
*/
new_stack = (BTStack) palloc(sizeof(BTStackData));
new_stack->bts_blkno = par_blkno;
new_stack->bts_offset = offnum;
memcpy(&new_stack->bts_btitem, btitem, sizeof(BTItemData));
new_stack->bts_parent = stack_in;
/* drop the read lock on the parent page and acquire one on the child */ /* drop the read lock on the parent page, acquire one on the child */
_bt_relbuf(rel, *bufP, BT_READ); _bt_relbuf(rel, *bufP, BT_READ);
*bufP = _bt_getbuf(rel, blkno, BT_READ); *bufP = _bt_getbuf(rel, blkno, BT_READ);
/* /*
* Race -- the page we just grabbed may have split since we read its * Race -- the page we just grabbed may have split since we read its
* pointer in the parent. If it has, we may need to move right to its * pointer in the parent. If it has, we may need to move right to its
* new sibling. Do that. * new sibling. Do that.
*/ */
*bufP = _bt_moveright(rel, *bufP, keysz, scankey, BT_READ);
*bufP = _bt_moveright(rel, *bufP, keysz, scankey, BT_READ); /* okay, all set to move down a level */
stack_in = new_stack;
}
/* okay, all set to move down a level */ return stack_in;
return _bt_searchr(rel, keysz, scankey, bufP, stack);
} }
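The new _bt_search() replaces recursion with a loop that pushes one stack entry per internal level and descends to the left child when the scan key equals a separator. A minimal sketch of that control flow, assuming a toy in-memory tree: SketchNode, SketchStack, sketch_search and sketch_freestack are hypothetical names for illustration, not part of nbtree, and no locking or move-right handling is modeled.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical in-memory node; the real code works on buffer pages. */
typedef struct SketchNode
{
    int                 is_leaf;
    int                 nchildren;
    int                 seps[4];        /* seps[i] bounds child i; seps[0] is unused */
    struct SketchNode  *children[4];
} SketchNode;

/* One remembered parent position per internal level visited. */
typedef struct SketchStack
{
    SketchNode         *parent;
    int                 slot;
    struct SketchStack *next;
} SketchStack;

/*
 * Descend from root to a leaf.  At each internal node pick the last
 * separator that is strictly less than the scan key; slot 0 acts as
 * "minus infinity", so an equal separator sends the search left.
 */
static SketchStack *
sketch_search(SketchNode *root, int scankey, SketchNode **leaf)
{
    SketchStack *stack = NULL;
    SketchNode  *node = root;

    while (!node->is_leaf)
    {
        int          slot = 0;
        SketchStack *entry;

        for (int i = 1; i < node->nchildren; i++)
            if (node->seps[i] < scankey)
                slot = i;

        entry = malloc(sizeof(SketchStack));
        assert(entry != NULL);
        entry->parent = node;
        entry->slot = slot;
        entry->next = stack;
        stack = entry;          /* becomes the parent link for the next level */

        node = node->children[slot];
    }
    *leaf = node;
    return stack;
}

/* Release every stack entry, newest first. */
static void
sketch_freestack(SketchStack *stack)
{
    while (stack != NULL)
    {
        SketchStack *next = stack->next;

        free(stack);
        stack = next;
    }
}
```

With separators 10 and 20 under a single root, a scan key of 10 lands on the leftmost leaf, which is the "go left on equal key" behavior the commit message describes.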
/* /*
@@ -133,7 +127,7 @@ _bt_searchr(Relation rel,
* *
* On entry, we have the buffer pinned and a lock of the proper type. * On entry, we have the buffer pinned and a lock of the proper type.
* If we move right, we release the buffer and lock and acquire the * If we move right, we release the buffer and lock and acquire the
* same on the right sibling. * same on the right sibling. Return value is the buffer we stop at.
*/ */
Buffer Buffer
_bt_moveright(Relation rel, _bt_moveright(Relation rel,
@@ -144,231 +138,81 @@ _bt_moveright(Relation rel,
{ {
Page page; Page page;
BTPageOpaque opaque; BTPageOpaque opaque;
ItemId hikey;
BlockNumber rblkno;
int natts = rel->rd_rel->relnatts;
page = BufferGetPage(buf); page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page); opaque = (BTPageOpaque) PageGetSpecialPointer(page);
/* if we're on a rightmost page, we don't need to move right */
if (P_RIGHTMOST(opaque))
return buf;
/* by convention, item 1 on non-rightmost pages is the high key */
hikey = PageGetItemId(page, P_HIKEY);
/* /*
* If the scan key that brought us to this page is >= the high key * If the scan key that brought us to this page is > the high key
* stored on the page, then the page has split and we need to move * stored on the page, then the page has split and we need to move
* right. * right. (If the scan key is equal to the high key, we might or
* might not need to move right; have to scan the page first anyway.)
* It could even have split more than once, so scan as far as needed.
*/ */
while (!P_RIGHTMOST(opaque) &&
if (_bt_skeycmp(rel, keysz, scankey, page, hikey, _bt_compare(rel, keysz, scankey, page, P_HIKEY) > 0)
BTGreaterEqualStrategyNumber))
{ {
/* move right as long as we need to */ /* step right one page */
do BlockNumber rblkno = opaque->btpo_next;
{
OffsetNumber offmax = PageGetMaxOffsetNumber(page);
/*
* If this page consists of all duplicate keys (hikey and
* first key on the page have the same value), then we don't
* need to step right.
*
* NOTE for multi-column indices: we may scan using keys for only
* some of the attrs. But we handle duplicates using all attrs in
* _bt_insert/_bt_spool code. So we have to compare the scankey
* with the _last_ item on this page so as not to lose "good" tuples
* if number of attrs > keysize. Example: (2,0) - last item
* on this page, (2,1) - first item on next page (hikey), our
* scankey is x = 2. Scankey == (2,1) because we compare the
* first attrs only, but we shouldn't move right from here. -
* vadim 04/15/97
*
* Also, if this page is not a LEAF page (and # of attrs > keysize)
* then we can't move right either. - vadim 10/22/97
*/
if (_bt_skeycmp(rel, keysz, scankey, page, hikey,
BTEqualStrategyNumber))
{
if (opaque->btpo_flags & BTP_CHAIN)
{
Assert((opaque->btpo_flags & BTP_LEAF) || offmax > P_HIKEY);
break;
}
if (offmax > P_HIKEY)
{
if (natts == keysz) /* sanity checks */
{
if (_bt_skeycmp(rel, keysz, scankey, page,
PageGetItemId(page, P_FIRSTKEY),
BTEqualStrategyNumber))
elog(FATAL, "btree: BTP_CHAIN flag was expected in %s (access = %s)",
RelationGetRelationName(rel), access ? "bt_write" : "bt_read");
if (_bt_skeycmp(rel, keysz, scankey, page,
PageGetItemId(page, offmax),
BTEqualStrategyNumber))
elog(FATAL, "btree: unexpected equal last item");
if (_bt_skeycmp(rel, keysz, scankey, page,
PageGetItemId(page, offmax),
BTLessStrategyNumber))
elog(FATAL, "btree: unexpected greater last item");
/* move right */
}
else if (!(opaque->btpo_flags & BTP_LEAF))
break;
else if (_bt_skeycmp(rel, keysz, scankey, page,
PageGetItemId(page, offmax),
BTLessEqualStrategyNumber))
break;
}
}
/* step right one page */ _bt_relbuf(rel, buf, access);
rblkno = opaque->btpo_next; buf = _bt_getbuf(rel, rblkno, access);
_bt_relbuf(rel, buf, access); page = BufferGetPage(buf);
buf = _bt_getbuf(rel, rblkno, access); opaque = (BTPageOpaque) PageGetSpecialPointer(page);
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
hikey = PageGetItemId(page, P_HIKEY);
} while (!P_RIGHTMOST(opaque)
&& _bt_skeycmp(rel, keysz, scankey, page, hikey,
BTGreaterEqualStrategyNumber));
} }
return buf; return buf;
} }
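The rewritten _bt_moveright() reduces to one loop: keep stepping right while the page is not rightmost and the scan key is strictly greater than the high key. A minimal sketch of that rule, assuming integer keys and an in-memory sibling chain: SketchPage and sketch_moveright are hypothetical names, and buffer locking is not modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified page: not the real BTPageOpaque layout. */
typedef struct SketchPage
{
    int                 high_key;   /* upper bound for keys on this page */
    struct SketchPage  *next;       /* right sibling, NULL if rightmost */
} SketchPage;

/*
 * Follow right-links while the scan key is strictly greater than the
 * page's high key.  A rightmost page has no high key, so we stop there.
 * On equality we stay put; the caller must scan the page to decide.
 */
static SketchPage *
sketch_moveright(SketchPage *page, int scankey)
{
    assert(page != NULL);
    while (page->next != NULL && scankey > page->high_key)
        page = page->next;
    return page;
}
```

Because the test is ">" rather than ">=", a scan key equal to the high key stays on the current page, and a page that split more than once is handled by the same loop.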
/* /*
* _bt_skeycmp() -- compare a scan key to a particular item on a page using * _bt_binsrch() -- Do a binary search for a key on a particular page.
* a requested strategy (<, <=, =, >=, >).
* *
* We ignore the unique OIDs stored in the btree item here. Those * The scankey we get has the compare function stored in the procedure
* numbers are intended for use internally only, in repositioning a * entry of each data struct. We invoke this regproc to do the
* scan after a page split. They do not impose any meaningful ordering. * comparison for every key in the scankey.
* *
* The comparison is A <op> B, where A is the scan key and B is the * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
* tuple pointed at by itemid on page. * key >= given scankey. (NOTE: in particular, this means it is possible
*/ * to return a value 1 greater than the number of keys on the page,
bool * if the scankey is > all keys on the page.)
_bt_skeycmp(Relation rel,
Size keysz,
ScanKey scankey,
Page page,
ItemId itemid,
StrategyNumber strat)
{
BTItem item;
IndexTuple indexTuple;
TupleDesc tupDes;
int i;
int32 compare = 0;
item = (BTItem) PageGetItem(page, itemid);
indexTuple = &(item->bti_itup);
tupDes = RelationGetDescr(rel);
for (i = 1; i <= (int) keysz; i++)
{
ScanKey entry = &scankey[i - 1];
Datum attrDatum;
bool isNull;
Assert(entry->sk_attno == i);
attrDatum = index_getattr(indexTuple,
entry->sk_attno,
tupDes,
&isNull);
/* see comments about NULLs handling in btbuild */
if (entry->sk_flags & SK_ISNULL) /* key is NULL */
{
if (isNull)
compare = 0; /* NULL key "=" NULL datum */
else
compare = 1; /* NULL key ">" not-NULL datum */
}
else if (isNull) /* key is NOT_NULL and item is NULL */
{
compare = -1; /* not-NULL key "<" NULL datum */
}
else
compare = DatumGetInt32(FunctionCall2(&entry->sk_func,
entry->sk_argument,
attrDatum));
if (compare != 0)
break; /* done when we find unequal attributes */
}
switch (strat)
{
case BTLessStrategyNumber:
return (bool) (compare < 0);
case BTLessEqualStrategyNumber:
return (bool) (compare <= 0);
case BTEqualStrategyNumber:
return (bool) (compare == 0);
case BTGreaterEqualStrategyNumber:
return (bool) (compare >= 0);
case BTGreaterStrategyNumber:
return (bool) (compare > 0);
}
elog(ERROR, "_bt_skeycmp: bogus strategy %d", (int) strat);
return false;
}
/*
* _bt_binsrch() -- Do a binary search for a key on a particular page.
* *
* The scankey we get has the compare function stored in the procedure * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
* entry of each data struct. We invoke this regproc to do the * of the last key < given scankey. (Since _bt_compare treats the first
* comparison for every key in the scankey. _bt_binsrch() returns * data key of such a page as minus infinity, there will be at least one
* the OffsetNumber of the first matching key on the page, or the * key < scankey, so the result always points at one of the keys on the
* OffsetNumber at which the matching key would appear if it were * page.) This key indicates the right place to descend to be sure we
* on this page. (NOTE: in particular, this means it is possible to * find all leaf keys >= given scankey.
* return a value 1 greater than the number of keys on the page, if
* the scankey is > all keys on the page.)
* *
* By the time this procedure is called, we're sure we're looking * This procedure is not responsible for walking right, it just examines
* at the right page -- don't need to walk right. _bt_binsrch() has * the given page. _bt_binsrch() has no lock or refcount side effects
* no lock or refcount side effects on the buffer. * on the buffer.
*/ */
OffsetNumber OffsetNumber
_bt_binsrch(Relation rel, _bt_binsrch(Relation rel,
Buffer buf, Buffer buf,
int keysz, int keysz,
ScanKey scankey, ScanKey scankey)
int srchtype)
{ {
TupleDesc itupdesc; TupleDesc itupdesc;
Page page; Page page;
BTPageOpaque opaque; BTPageOpaque opaque;
OffsetNumber low, OffsetNumber low,
high; high;
bool haveEq;
int natts = rel->rd_rel->relnatts;
int32 result; int32 result;
itupdesc = RelationGetDescr(rel); itupdesc = RelationGetDescr(rel);
page = BufferGetPage(buf); page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page); opaque = (BTPageOpaque) PageGetSpecialPointer(page);
/* by convention, item 1 on any non-rightmost page is the high key */ low = P_FIRSTDATAKEY(opaque);
low = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
high = PageGetMaxOffsetNumber(page); high = PageGetMaxOffsetNumber(page);
/* /*
* If there are no keys on the page, return the first available slot. * If there are no keys on the page, return the first available slot.
* Note this covers two cases: the page is really empty (no keys), or * Note this covers two cases: the page is really empty (no keys), or
* it contains only a high key. The latter case is possible after * it contains only a high key. The latter case is possible after
* vacuuming. * vacuuming. This can never happen on an internal page, however,
* since they are never empty (an internal page must have children).
*/ */
if (high < low) if (high < low)
return low; return low;
@@ -376,11 +220,9 @@ _bt_binsrch(Relation rel,
/* /*
* Binary search to find the first key on the page >= scan key. Loop * Binary search to find the first key on the page >= scan key. Loop
* invariant: all slots before 'low' are < scan key, all slots at or * invariant: all slots before 'low' are < scan key, all slots at or
* after 'high' are >= scan key. Also, haveEq is true if the tuple at * after 'high' are >= scan key. We can fall out when high == low.
* 'high' is == scan key. We can fall out when high == low.
*/ */
high++; /* establish the loop invariant for high */ high++; /* establish the loop invariant for high */
haveEq = false;
while (high > low) while (high > low)
{ {
@@ -388,175 +230,77 @@ _bt_binsrch(Relation rel,
/* We have low <= mid < high, so mid points at a real slot */ /* We have low <= mid < high, so mid points at a real slot */
result = _bt_compare(rel, itupdesc, page, keysz, scankey, mid); result = _bt_compare(rel, keysz, scankey, page, mid);
if (result > 0) if (result > 0)
low = mid + 1; low = mid + 1;
else else
{
high = mid; high = mid;
haveEq = (result == 0);
}
} }
/*-------------------- /*--------------------
* At this point we have high == low, but be careful: they could point * At this point we have high == low, but be careful: they could point
* past the last slot on the page. We also know that haveEq is true * past the last slot on the page.
* if and only if there is an equal key (in which case high&low point
* at the first equal key).
* *
* On a leaf page, we always return the first key >= scan key * On a leaf page, we always return the first key >= scan key
* (which could be the last slot + 1). * (which could be the last slot + 1).
*-------------------- *--------------------
*/ */
if (P_ISLEAF(opaque))
if (opaque->btpo_flags & BTP_LEAF)
return low; return low;
/*-------------------- /*--------------------
* On a non-leaf page, there are special cases: * On a non-leaf page, return the last key < scan key.
* * There must be one if _bt_compare() is playing by the rules.
* For an insertion (srchtype != BT_DESCENT and natts == keysz)
* always return first key >= scan key (which could be off the end).
*
* For a standard search (srchtype == BT_DESCENT and natts == keysz)
* return the first equal key if one exists, else the last lesser key
* if one exists, else the first slot on the page.
*
* For a partial-match search (srchtype == BT_DESCENT and natts > keysz)
* return the last lesser key if one exists, else the first slot.
*
* Old comments:
* For multi-column indices, we may scan using keys
* not for all attrs. But we handle duplicates using all attrs
* in _bt_insert/_bt_spool code. And so while searching on
* internal pages having number of attrs > keysize we want to
* point at the last item < the scankey, not at the first item
* = the scankey (!!!), and let _bt_moveright decide later
* whether to move right or not (see comments and example
* there). Note also that INSERTions are not affected by this
* code (since natts == keysz for inserts). - vadim 04/15/97
*-------------------- *--------------------
*/ */
Assert(low > P_FIRSTDATAKEY(opaque));
if (haveEq)
{
/*
* There is an equal key. We return either the first equal key
* (which we just found), or the last lesser key.
*
* We need not check srchtype != BT_DESCENT here, since if that is
* true then natts == keysz by assumption.
*/
if (natts == keysz)
return low; /* return first equal key */
}
else
{
/*
* There is no equal key. We return either the first greater key
* (which we just found), or the last lesser key.
*/
if (srchtype != BT_DESCENT)
return low; /* return first greater key */
}
if (low == (P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY))
return low; /* there is no prior item */
return OffsetNumberPrev(low); return OffsetNumberPrev(low);
} }
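The simplified _bt_binsrch() contract (first key >= scan key on a leaf, last key < scan key on an internal page) can be illustrated on a plain sorted array. A minimal sketch, assuming integer keys and 0-based slots rather than 1-based OffsetNumbers: sketch_binsrch is a hypothetical name, and the "minus infinity" rule is folded into the comparison instead of living in a separate compare routine.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * keys[0..nkeys-1] is sorted ascending.  On an internal page, keys[0]
 * stands for "minus infinity" and is never actually compared, so an
 * internal page must have at least one slot.
 * Returns a 0-based slot (the real code uses 1-based OffsetNumbers).
 */
static int
sketch_binsrch(const int *keys, int nkeys, int scankey, bool is_leaf)
{
    int low = 0;
    int high = nkeys;   /* one past the last slot: establishes the invariant */

    /* invariant: slots before low are < scankey, slots at/after high are >= */
    while (high > low)
    {
        int  mid = low + (high - low) / 2;
        /* first slot of an internal page is treated as less than anything */
        bool scankey_greater = (!is_leaf && mid == 0) || scankey > keys[mid];

        if (scankey_greater)
            low = mid + 1;
        else
            high = mid;
    }

    if (is_leaf)
        return low;     /* first key >= scankey, possibly nkeys */

    assert(low > 0);    /* guaranteed by the minus-infinity slot */
    return low - 1;     /* last key < scankey */
}
```

For the leaf keys {1, 3, 3, 7} a scan key of 3 yields slot 1 (the first equal key) and a scan key of 8 yields slot 4, one past the end; on an internal page the result always names a real slot.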
/* /*----------
* _bt_compare() -- Compare scankey to a particular tuple on the page. * _bt_compare() -- Compare scankey to a particular tuple on the page.
* *
* keysz: number of key conditions to be checked (might be less than the
* total length of the scan key!)
* page/offnum: location of btree item to be compared to.
*
* This routine returns: * This routine returns:
* <0 if scankey < tuple at offnum; * <0 if scankey < tuple at offnum;
* 0 if scankey == tuple at offnum; * 0 if scankey == tuple at offnum;
* >0 if scankey > tuple at offnum. * >0 if scankey > tuple at offnum.
* NULLs in the keys are treated as sortable values. Therefore
* "equality" does not necessarily mean that the item should be
* returned to the caller as a matching key!
* *
* -- Old comments: * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
* In order to avoid having to propagate changes up the tree any time * "minus infinity": this routine will always claim it is less than the
* a new minimal key is inserted, the leftmost entry on the leftmost * scankey. The actual key value stored (if any, which there probably isn't)
* page is less than all possible keys, by definition. * does not matter. This convention allows us to implement the Lehman and
* * Yao convention that the first down-link pointer is before the first key.
* -- New ones: * See backend/access/nbtree/README for details.
* New insertion code (fix against updating _in_place_ if new minimal *----------
* key has bigger size than old one) may delete P_HIKEY entry on the
* root page in order to insert new minimal key - and so this definition
* does not work properly in this case and breaks key order on the root
* page. BTW, this propagation occurs only during page splitting,
* but not "any time a new min key is inserted" (see _bt_insertonpg).
* - vadim 12/05/96
*/ */
static int32 int32
_bt_compare(Relation rel, _bt_compare(Relation rel,
TupleDesc itupdesc,
Page page,
int keysz, int keysz,
ScanKey scankey, ScanKey scankey,
Page page,
OffsetNumber offnum) OffsetNumber offnum)
{ {
Datum datum; TupleDesc itupdesc = RelationGetDescr(rel);
BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
BTItem btitem; BTItem btitem;
IndexTuple itup; IndexTuple itup;
BTPageOpaque opaque;
ScanKey entry;
AttrNumber attno;
int32 result;
int i; int i;
bool null;
/* /*
* If this is a leftmost internal page, and if our comparison is with * Force result ">" if target item is first data item on an internal
* the first key on the page, then the item at that position is by * page --- see NOTE above.
* definition less than the scan key.
*
* - see new comments above...
*/ */
if (! P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
if (!(opaque->btpo_flags & BTP_LEAF)
&& P_LEFTMOST(opaque)
&& offnum == P_HIKEY)
{
/*
* we just have to believe that this will only be called with
* offnum == P_HIKEY when P_HIKEY is the OffsetNumber of the first
* actual data key (i.e., this is also a rightmost page). there
* doesn't seem to be any code that implies that the leftmost page
* is normally missing a high key as well as the rightmost page.
* but that implies that this code path only applies to the root
* -- which seems unlikely..
*
* - see new comments above...
*/
if (!P_RIGHTMOST(opaque))
elog(ERROR, "_bt_compare: invalid comparison to high key");
#ifdef NOT_USED
/*
* We just have to believe that the right answer will not break
* anything. I've checked the code and all seems to be ok. See new
* comments above...
*
* -- Old comments If the item on the page is equal to the scankey,
* that's okay to admit. We just can't claim that the first key
* on the page is greater than anything.
*/
if (_bt_skeycmp(rel, keysz, scankey, page, PageGetItemId(page, offnum),
BTEqualStrategyNumber))
return 0;
return 1; return 1;
#endif
}
btitem = (BTItem) PageGetItem(page, PageGetItemId(page, offnum)); btitem = (BTItem) PageGetItem(page, PageGetItemId(page, offnum));
itup = &(btitem->bti_itup); itup = &(btitem->bti_itup);
@@ -568,37 +312,45 @@ _bt_compare(Relation rel,
* they be in order. If you think about how multi-key ordering works, * they be in order. If you think about how multi-key ordering works,
* you'll understand why this is. * you'll understand why this is.
* *
* We don't test for violation of this condition here. * We don't test for violation of this condition here, however. The
* initial setup for the index scan had better have gotten it right
* (see _bt_first).
*/ */
for (i = 1; i <= keysz; i++) for (i = 0; i < keysz; i++)
{ {
entry = &scankey[i - 1]; ScanKey entry = &scankey[i];
attno = entry->sk_attno; Datum datum;
datum = index_getattr(itup, attno, itupdesc, &null); bool isNull;
int32 result;
datum = index_getattr(itup, entry->sk_attno, itupdesc, &isNull);
/* see comments about NULLs handling in btbuild */ /* see comments about NULLs handling in btbuild */
if (entry->sk_flags & SK_ISNULL) /* key is NULL */ if (entry->sk_flags & SK_ISNULL) /* key is NULL */
{ {
if (null) if (isNull)
result = 0; /* NULL "=" NULL */ result = 0; /* NULL "=" NULL */
else else
result = 1; /* NULL ">" NOT_NULL */ result = 1; /* NULL ">" NOT_NULL */
} }
else if (null) /* key is NOT_NULL and item is NULL */ else if (isNull) /* key is NOT_NULL and item is NULL */
{ {
result = -1; /* NOT_NULL "<" NULL */ result = -1; /* NOT_NULL "<" NULL */
} }
else else
{
result = DatumGetInt32(FunctionCall2(&entry->sk_func, result = DatumGetInt32(FunctionCall2(&entry->sk_func,
entry->sk_argument, datum)); entry->sk_argument,
datum));
}
/* if the keys are unequal, return the difference */ /* if the keys are unequal, return the difference */
if (result != 0) if (result != 0)
return result; return result;
} }
/* by here, the keys are equal */ /* if we get here, the keys are equal */
return 0; return 0;
} }
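The per-attribute loop in _bt_compare() treats NULLs as sortable values that order after every non-NULL value, and it stops at the first unequal column. A minimal sketch of that ordering, assuming nullable integer columns: SketchDatum and sketch_compare are hypothetical names, and a plain integer comparison stands in for the sk_func call.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical nullable integer column value. */
typedef struct
{
    bool is_null;
    int  value;
} SketchDatum;

/*
 * Compare the first keysz columns of a scan key against a tuple.
 * Returns <0, 0, or >0.  NULLs sort after all non-NULL values and
 * compare equal to each other.
 */
static int
sketch_compare(const SketchDatum *scankey, const SketchDatum *tuple, int keysz)
{
    assert(keysz >= 0);
    for (int i = 0; i < keysz; i++)
    {
        int result;

        if (scankey[i].is_null)
            result = tuple[i].is_null ? 0 : 1;  /* NULL "=" NULL, NULL ">" NOT_NULL */
        else if (tuple[i].is_null)
            result = -1;                        /* NOT_NULL "<" NULL */
        else
            result = (scankey[i].value > tuple[i].value) -
                     (scankey[i].value < tuple[i].value);

        if (result != 0)
            return result;      /* first unequal column decides */
    }
    return 0;                   /* all compared columns are equal */
}
```

Passing a keysz smaller than the number of columns gives the partial-match case: a key (2, 5) compares equal to a tuple (2, 1) when only the first column is checked.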
@@ -606,10 +358,10 @@ _bt_compare(Relation rel,
* _bt_next() -- Get the next item in a scan. * _bt_next() -- Get the next item in a scan.
* *
* On entry, we have a valid currentItemData in the scan, and a * On entry, we have a valid currentItemData in the scan, and a
* read lock on the page that contains that item. We do not have * read lock and pin count on the page that contains that item.
* the page pinned. We return the next item in the scan. On * We return the next item in the scan, or NULL if no more.
* exit, we have the page containing the next item locked but not * On successful exit, the page containing the new item is locked
* pinned. * and pinned; on NULL exit, no lock or pin is held.
*/ */
RetrieveIndexResult RetrieveIndexResult
_bt_next(IndexScanDesc scan, ScanDirection dir) _bt_next(IndexScanDesc scan, ScanDirection dir)
@@ -618,7 +370,6 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
Buffer buf; Buffer buf;
Page page; Page page;
OffsetNumber offnum; OffsetNumber offnum;
RetrieveIndexResult res;
ItemPointer current; ItemPointer current;
BTItem btitem; BTItem btitem;
IndexTuple itup; IndexTuple itup;
@@ -629,10 +380,9 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
so = (BTScanOpaque) scan->opaque; so = (BTScanOpaque) scan->opaque;
current = &(scan->currentItemData); current = &(scan->currentItemData);
Assert(BufferIsValid(so->btso_curbuf));
/* we still have the buffer pinned and locked */ /* we still have the buffer pinned and locked */
buf = so->btso_curbuf; buf = so->btso_curbuf;
Assert(BufferIsValid(buf));
do do
{ {
@@ -640,7 +390,7 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
if (!_bt_step(scan, &buf, dir)) if (!_bt_step(scan, &buf, dir))
return (RetrieveIndexResult) NULL; return (RetrieveIndexResult) NULL;
/* by here, current is the tuple we want to return */ /* current is the next candidate tuple to return */
offnum = ItemPointerGetOffsetNumber(current); offnum = ItemPointerGetOffsetNumber(current);
page = BufferGetPage(buf); page = BufferGetPage(buf);
btitem = (BTItem) PageGetItem(page, PageGetItemId(page, offnum)); btitem = (BTItem) PageGetItem(page, PageGetItemId(page, offnum));
@@ -648,17 +398,16 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
if (_bt_checkkeys(scan, itup, &keysok)) if (_bt_checkkeys(scan, itup, &keysok))
{ {
/* tuple passes all scan key conditions, so return it */
Assert(keysok == so->numberOfKeys); Assert(keysok == so->numberOfKeys);
res = FormRetrieveIndexResult(current, &(itup->t_tid)); return FormRetrieveIndexResult(current, &(itup->t_tid));
/* remember which buffer we have pinned and locked */
so->btso_curbuf = buf;
return res;
} }
/* This tuple doesn't pass, but there might be more that do */
} while (keysok >= so->numberOfFirstKeys || } while (keysok >= so->numberOfFirstKeys ||
(keysok == ((Size) -1) && ScanDirectionIsBackward(dir))); (keysok == ((Size) -1) && ScanDirectionIsBackward(dir)));
/* No more items, so close down the current-item info */
ItemPointerSetInvalid(current); ItemPointerSetInvalid(current);
so->btso_curbuf = InvalidBuffer; so->btso_curbuf = InvalidBuffer;
_bt_relbuf(rel, buf, BT_READ); _bt_relbuf(rel, buf, BT_READ);
@@ -680,14 +429,10 @@ RetrieveIndexResult
_bt_first(IndexScanDesc scan, ScanDirection dir) _bt_first(IndexScanDesc scan, ScanDirection dir)
{ {
Relation rel; Relation rel;
TupleDesc itupdesc;
Buffer buf; Buffer buf;
Page page; Page page;
BTPageOpaque pop;
BTStack stack; BTStack stack;
OffsetNumber offnum, OffsetNumber offnum;
maxoff;
bool offGmax = false;
BTItem btitem; BTItem btitem;
IndexTuple itup; IndexTuple itup;
ItemPointer current; ItemPointer current;
@@ -698,7 +443,6 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
int32 result; int32 result;
BTScanOpaque so; BTScanOpaque so;
Size keysok; Size keysok;
bool strategyCheck; bool strategyCheck;
ScanKey scankeys = 0; ScanKey scankeys = 0;
int keysCount = 0; int keysCount = 0;
@@ -784,20 +528,17 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
return _bt_endpoint(scan, dir); return _bt_endpoint(scan, dir);
} }
itupdesc = RelationGetDescr(rel);
current = &(scan->currentItemData);
/* /*
* Okay, we want something more complicated. What we'll do is use the * Okay, we want something more complicated. What we'll do is use the
* first item in the scan key passed in (which has been correctly * first item in the scan key passed in (which has been correctly
* ordered to take advantage of index ordering) to position ourselves * ordered to take advantage of index ordering) to position ourselves
* at the right place in the scan. * at the right place in the scan.
*/ */
/* _bt_orderkeys disallows it, but this is the place to add some code later */
scankeys = (ScanKey) palloc(keysCount * sizeof(ScanKeyData)); scankeys = (ScanKey) palloc(keysCount * sizeof(ScanKeyData));
for (i = 0; i < keysCount; i++) for (i = 0; i < keysCount; i++)
{ {
j = nKeyIs[i]; j = nKeyIs[i];
/* _bt_orderkeys disallows it, but this is the place to add some code later */
if (so->keyData[j].sk_flags & SK_ISNULL) if (so->keyData[j].sk_flags & SK_ISNULL)
{ {
pfree(nKeyIs); pfree(nKeyIs);
@@ -812,234 +553,213 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
if (nKeyIs) if (nKeyIs)
pfree(nKeyIs); pfree(nKeyIs);
stack = _bt_search(rel, keysCount, scankeys, &buf); current = &(scan->currentItemData);
_bt_freestack(stack);
blkno = BufferGetBlockNumber(buf);
page = BufferGetPage(buf);
/* /*
* This will happen if the tree we're searching is entirely empty, or * Use the manufactured scan key to descend the tree and position
* if we're doing a search for a key that would appear on an entirely * ourselves on the target leaf page.
* empty internal page. In either case, there are no matching tuples
* in the index.
*/ */
stack = _bt_search(rel, keysCount, scankeys, &buf, BT_READ);
if (PageIsEmpty(page)) /* don't need to keep the stack around... */
_bt_freestack(stack);
if (! BufferIsValid(buf))
{ {
/* Only get here if index is completely empty */
ItemPointerSetInvalid(current); ItemPointerSetInvalid(current);
so->btso_curbuf = InvalidBuffer; so->btso_curbuf = InvalidBuffer;
_bt_relbuf(rel, buf, BT_READ);
pfree(scankeys); pfree(scankeys);
return (RetrieveIndexResult) NULL; return (RetrieveIndexResult) NULL;
} }
maxoff = PageGetMaxOffsetNumber(page);
pop = (BTPageOpaque) PageGetSpecialPointer(page);
/*
* Now _bt_moveright doesn't move from non-rightmost leaf page if
* scankey == hikey and there is only hikey there. It's good for
* insertion, but we need to do work for scan here. - vadim 05/27/97
*/
while (maxoff == P_HIKEY && !P_RIGHTMOST(pop) &&
_bt_skeycmp(rel, keysCount, scankeys, page,
PageGetItemId(page, P_HIKEY),
BTGreaterEqualStrategyNumber))
{
/* step right one page */
blkno = pop->btpo_next;
_bt_relbuf(rel, buf, BT_READ);
buf = _bt_getbuf(rel, blkno, BT_READ);
page = BufferGetPage(buf);
if (PageIsEmpty(page))
{
ItemPointerSetInvalid(current);
so->btso_curbuf = InvalidBuffer;
_bt_relbuf(rel, buf, BT_READ);
pfree(scankeys);
return (RetrieveIndexResult) NULL;
}
maxoff = PageGetMaxOffsetNumber(page);
pop = (BTPageOpaque) PageGetSpecialPointer(page);
}
/* find the nearest match to the manufactured scan key on the page */ /* remember which buffer we have pinned */
offnum = _bt_binsrch(rel, buf, keysCount, scankeys, BT_DESCENT); so->btso_curbuf = buf;
blkno = BufferGetBlockNumber(buf);
page = BufferGetPage(buf);
if (offnum > maxoff) offnum = _bt_binsrch(rel, buf, keysCount, scankeys);
{
offnum = maxoff;
offGmax = true;
}
ItemPointerSet(current, blkno, offnum); ItemPointerSet(current, blkno, offnum);
/* /*----------
* Now find the right place to start the scan. Result is the value * At this point we are positioned at the first item >= scan key,
* we're looking for minus the value we're looking at in the index. * or possibly at the end of a page on which all the existing items
* are < scan key and we know that everything on later pages is
* >= scan key. We could step forward in the latter case, but that'd
* be a waste of time if we want to scan backwards. So, it's now time to
* examine the scan strategy to find the exact place to start the scan.
*
* Note: if _bt_step fails (meaning we fell off the end of the index
* in one direction or the other), we either return NULL (no matches) or
* call _bt_endpoint() to set up a scan starting at that index endpoint,
* as appropriate for the desired scan type.
*
* This is yet another place to add some code later for is(not)null ...
*----------
*/ */
result = _bt_compare(rel, itupdesc, page, keysCount, scankeys, offnum); switch (strat_total)
/* this is yet another place to add some code later for is(not)null */
strat = strat_total;
switch (strat)
{ {
case BTLessStrategyNumber: case BTLessStrategyNumber:
if (result <= 0) /*
* Back up one to arrive at last item < scankey
*/
if (!_bt_step(scan, &buf, BackwardScanDirection))
{ {
do pfree(scankeys);
{ return (RetrieveIndexResult) NULL;
if (!_bt_twostep(scan, &buf, BackwardScanDirection))
break;
offnum = ItemPointerGetOffsetNumber(current);
page = BufferGetPage(buf);
result = _bt_compare(rel, itupdesc, page, keysCount, scankeys, offnum);
} while (result <= 0);
} }
break; break;
case BTLessEqualStrategyNumber: case BTLessEqualStrategyNumber:
if (result >= 0) /*
* We need to find the last item <= scankey, so step forward
* till we find one > scankey, then step back one.
*/
if (offnum > PageGetMaxOffsetNumber(page))
{ {
do if (!_bt_step(scan, &buf, ForwardScanDirection))
{ {
if (!_bt_twostep(scan, &buf, ForwardScanDirection)) pfree(scankeys);
break; return _bt_endpoint(scan, dir);
}
offnum = ItemPointerGetOffsetNumber(current); }
page = BufferGetPage(buf); for (;;)
result = _bt_compare(rel, itupdesc, page, keysCount, scankeys, offnum); {
} while (result >= 0); offnum = ItemPointerGetOffsetNumber(current);
page = BufferGetPage(buf);
result = _bt_compare(rel, keysCount, scankeys, page, offnum);
if (result < 0)
break;
if (!_bt_step(scan, &buf, ForwardScanDirection))
{
pfree(scankeys);
return _bt_endpoint(scan, dir);
}
}
if (!_bt_step(scan, &buf, BackwardScanDirection))
{
pfree(scankeys);
return (RetrieveIndexResult) NULL;
} }
if (result < 0)
_bt_twostep(scan, &buf, BackwardScanDirection);
break; break;
case BTEqualStrategyNumber: case BTEqualStrategyNumber:
if (result != 0) /*
* Make sure we are on the first equal item; might have to step
* forward if currently at end of page.
*/
if (offnum > PageGetMaxOffsetNumber(page))
{ {
_bt_relbuf(scan->relation, buf, BT_READ); if (!_bt_step(scan, &buf, ForwardScanDirection))
so->btso_curbuf = InvalidBuffer; {
ItemPointerSetInvalid(&(scan->currentItemData)); pfree(scankeys);
pfree(scankeys); return (RetrieveIndexResult) NULL;
return (RetrieveIndexResult) NULL; }
offnum = ItemPointerGetOffsetNumber(current);
page = BufferGetPage(buf);
} }
else if (ScanDirectionIsBackward(dir)) result = _bt_compare(rel, keysCount, scankeys, page, offnum);
if (result != 0)
goto nomatches; /* no equal items! */
/*
* If a backward scan was specified, need to start with last
* equal item not first one.
*/
if (ScanDirectionIsBackward(dir))
{ {
do do
{ {
if (!_bt_twostep(scan, &buf, ForwardScanDirection)) if (!_bt_step(scan, &buf, ForwardScanDirection))
break; {
pfree(scankeys);
return _bt_endpoint(scan, dir);
}
offnum = ItemPointerGetOffsetNumber(current); offnum = ItemPointerGetOffsetNumber(current);
page = BufferGetPage(buf); page = BufferGetPage(buf);
result = _bt_compare(rel, itupdesc, page, keysCount, scankeys, offnum); result = _bt_compare(rel, keysCount, scankeys, page, offnum);
} while (result == 0); } while (result == 0);
if (!_bt_step(scan, &buf, BackwardScanDirection))
if (result < 0) elog(ERROR, "_bt_first: equal items disappeared?");
_bt_twostep(scan, &buf, BackwardScanDirection);
} }
break; break;
case BTGreaterEqualStrategyNumber: case BTGreaterEqualStrategyNumber:
if (offGmax) /*
* We want the first item >= scankey, which is where we are...
* unless we're not anywhere at all...
*/
if (offnum > PageGetMaxOffsetNumber(page))
{ {
if (result < 0) if (!_bt_step(scan, &buf, ForwardScanDirection))
{ {
Assert(!P_RIGHTMOST(pop) && maxoff == P_HIKEY); pfree(scankeys);
if (!_bt_step(scan, &buf, ForwardScanDirection)) return (RetrieveIndexResult) NULL;
{
_bt_relbuf(scan->relation, buf, BT_READ);
so->btso_curbuf = InvalidBuffer;
ItemPointerSetInvalid(&(scan->currentItemData));
pfree(scankeys);
return (RetrieveIndexResult) NULL;
}
}
else if (result > 0)
{ /* Just remember: _bt_binsrch() returns
* the OffsetNumber of the first matching
* key on the page, or the OffsetNumber at
* which the matching key WOULD APPEAR IF
* IT WERE on this page. No key on this
* page, but offnum from _bt_binsrch()
* greater maxoff - have to move right. -
* vadim 12/06/96 */
_bt_twostep(scan, &buf, ForwardScanDirection);
} }
} }
else if (result < 0)
{
do
{
if (!_bt_twostep(scan, &buf, BackwardScanDirection))
break;
page = BufferGetPage(buf);
offnum = ItemPointerGetOffsetNumber(current);
result = _bt_compare(rel, itupdesc, page, keysCount, scankeys, offnum);
} while (result < 0);
if (result > 0)
_bt_twostep(scan, &buf, ForwardScanDirection);
}
break; break;
case BTGreaterStrategyNumber: case BTGreaterStrategyNumber:
/* offGmax helps as above */ /*
if (result >= 0 || offGmax) * We want the first item > scankey, so make sure we are on
* an item and then step over any equal items.
*/
if (offnum > PageGetMaxOffsetNumber(page))
{ {
do if (!_bt_step(scan, &buf, ForwardScanDirection))
{ {
if (!_bt_twostep(scan, &buf, ForwardScanDirection)) pfree(scankeys);
break; return (RetrieveIndexResult) NULL;
}
offnum = ItemPointerGetOffsetNumber(current); offnum = ItemPointerGetOffsetNumber(current);
page = BufferGetPage(buf); page = BufferGetPage(buf);
result = _bt_compare(rel, itupdesc, page, keysCount, scankeys, offnum); }
} while (result >= 0); result = _bt_compare(rel, keysCount, scankeys, page, offnum);
while (result == 0)
{
if (!_bt_step(scan, &buf, ForwardScanDirection))
{
pfree(scankeys);
return (RetrieveIndexResult) NULL;
}
offnum = ItemPointerGetOffsetNumber(current);
page = BufferGetPage(buf);
result = _bt_compare(rel, keysCount, scankeys, page, offnum);
} }
break; break;
} }
pfree(scankeys);
/* okay, current item pointer for the scan is right */ /* okay, current item pointer for the scan is right */
offnum = ItemPointerGetOffsetNumber(current); offnum = ItemPointerGetOffsetNumber(current);
page = BufferGetPage(buf); page = BufferGetPage(buf);
btitem = (BTItem) PageGetItem(page, PageGetItemId(page, offnum)); btitem = (BTItem) PageGetItem(page, PageGetItemId(page, offnum));
itup = &btitem->bti_itup; itup = &btitem->bti_itup;
/* is the first item actually acceptable? */
if (_bt_checkkeys(scan, itup, &keysok)) if (_bt_checkkeys(scan, itup, &keysok))
{ {
/* yes, return it */
res = FormRetrieveIndexResult(current, &(itup->t_tid)); res = FormRetrieveIndexResult(current, &(itup->t_tid));
/* remember which buffer we have pinned */
so->btso_curbuf = buf;
}
else if (keysok >= so->numberOfFirstKeys)
{
so->btso_curbuf = buf;
return _bt_next(scan, dir);
} }
else if (keysok == ((Size) -1) && ScanDirectionIsBackward(dir)) else if (keysok >= so->numberOfFirstKeys ||
(keysok == ((Size) -1) && ScanDirectionIsBackward(dir)))
{ {
so->btso_curbuf = buf; /* no, but there might be another one that is */
return _bt_next(scan, dir); res = _bt_next(scan, dir);
} }
else else
{ {
/* no tuples in the index match this scan key */
nomatches:
ItemPointerSetInvalid(current); ItemPointerSetInvalid(current);
so->btso_curbuf = InvalidBuffer; so->btso_curbuf = InvalidBuffer;
_bt_relbuf(rel, buf, BT_READ); _bt_relbuf(rel, buf, BT_READ);
res = (RetrieveIndexResult) NULL; res = (RetrieveIndexResult) NULL;
} }
pfree(scankeys);
return res; return res;
} }
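The rewritten strategy cases in _bt_first all follow one pattern: land on the first item >= scankey, then adjust. A minimal sketch of that pattern on a plain sorted array follows. It is an illustration only, not the nbtree code; `first_ge`, `first_gt` and `last_eq` are hypothetical names, and the array stands in for the leaf level with no page boundaries.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: index of the first element >= key in a sorted
 * array, or n if every element is smaller (a lower-bound search). */
static size_t
first_ge(const int *a, size_t n, int key)
{
	size_t		lo = 0,
				hi = n;

	assert(a != NULL || n == 0);
	while (lo < hi)
	{
		size_t		mid = lo + (hi - lo) / 2;

		if (a[mid] < key)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/* Start of a forward scan for "> key": land on the first item >= key,
 * then step over any equal items.  Returns n if nothing qualifies. */
static size_t
first_gt(const int *a, size_t n, int key)
{
	size_t		pos = first_ge(a, n, key);

	while (pos < n && a[pos] == key)
		pos++;
	return pos;
}

/* Start of a backward scan for "= key": the last equal item rather than
 * the first.  Returns n when there is no equal item at all. */
static size_t
last_eq(const int *a, size_t n, int key)
{
	size_t		pos = first_ge(a, n, key);

	if (pos == n || a[pos] != key)
		return n;				/* no equal items */
	while (pos + 1 < n && a[pos + 1] == key)
		pos++;
	return pos;
}
```

In the real code each "pos++" is a call to _bt_step, which may cross a page boundary and can fail at the end of the index; the sketch folds both cases into the bounds check against n.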
@@ -1047,276 +767,128 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* _bt_step() -- Step one item in the requested direction in a scan on * _bt_step() -- Step one item in the requested direction in a scan on
* the tree. * the tree.
* *
* If no adjacent record exists in the requested direction, return * *bufP is the current buffer (read-locked and pinned). If we change
* false. Else, return true and set the currentItemData for the * pages, it's updated appropriately.
* scan to the right thing. *
* If successful, update scan's currentItemData and return true.
* If no adjacent record exists in the requested direction,
* release buffer pin/locks and return false.
*/ */
bool bool
_bt_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir) _bt_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
{ {
Relation rel = scan->relation;
ItemPointer current = &(scan->currentItemData);
BTScanOpaque so = (BTScanOpaque) scan->opaque;
Page page; Page page;
BTPageOpaque opaque; BTPageOpaque opaque;
OffsetNumber offnum, OffsetNumber offnum,
maxoff; maxoff;
OffsetNumber start;
BlockNumber blkno; BlockNumber blkno;
BlockNumber obknum; BlockNumber obknum;
BTScanOpaque so;
ItemPointer current;
Relation rel;
rel = scan->relation;
current = &(scan->currentItemData);
/* /*
* Don't use ItemPointerGetOffsetNumber or you risk to get assertion * Don't use ItemPointerGetOffsetNumber or you risk to get assertion
* due to ability of ip_posid to be equal 0. * due to ability of ip_posid to be equal 0.
*/ */
offnum = current->ip_posid; offnum = current->ip_posid;
page = BufferGetPage(*bufP); page = BufferGetPage(*bufP);
opaque = (BTPageOpaque) PageGetSpecialPointer(page); opaque = (BTPageOpaque) PageGetSpecialPointer(page);
so = (BTScanOpaque) scan->opaque;
maxoff = PageGetMaxOffsetNumber(page); maxoff = PageGetMaxOffsetNumber(page);
/* get the next tuple */
if (ScanDirectionIsForward(dir)) if (ScanDirectionIsForward(dir))
{ {
if (!PageIsEmpty(page) && offnum < maxoff) if (!PageIsEmpty(page) && offnum < maxoff)
offnum = OffsetNumberNext(offnum); offnum = OffsetNumberNext(offnum);
else else
{ {
/* walk right to the next page with data */
/* if we're at end of scan, release the buffer and return */ for (;;)
blkno = opaque->btpo_next;
if (P_RIGHTMOST(opaque))
{
_bt_relbuf(rel, *bufP, BT_READ);
ItemPointerSetInvalid(current);
*bufP = so->btso_curbuf = InvalidBuffer;
return false;
}
else
{ {
/* if we're at end of scan, release the buffer and return */
/* walk right to the next page with data */ if (P_RIGHTMOST(opaque))
_bt_relbuf(rel, *bufP, BT_READ);
for (;;)
{ {
*bufP = _bt_getbuf(rel, blkno, BT_READ); _bt_relbuf(rel, *bufP, BT_READ);
page = BufferGetPage(*bufP); ItemPointerSetInvalid(current);
opaque = (BTPageOpaque) PageGetSpecialPointer(page); *bufP = so->btso_curbuf = InvalidBuffer;
maxoff = PageGetMaxOffsetNumber(page); return false;
start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
if (!PageIsEmpty(page) && start <= maxoff)
break;
else
{
blkno = opaque->btpo_next;
_bt_relbuf(rel, *bufP, BT_READ);
if (blkno == P_NONE)
{
*bufP = so->btso_curbuf = InvalidBuffer;
ItemPointerSetInvalid(current);
return false;
}
}
} }
offnum = start; /* step right one page */
blkno = opaque->btpo_next;
_bt_relbuf(rel, *bufP, BT_READ);
*bufP = _bt_getbuf(rel, blkno, BT_READ);
page = BufferGetPage(*bufP);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
maxoff = PageGetMaxOffsetNumber(page);
/* done if it's not empty */
offnum = P_FIRSTDATAKEY(opaque);
if (!PageIsEmpty(page) && offnum <= maxoff)
break;
} }
} }
} }
else if (ScanDirectionIsBackward(dir)) else
{ {
if (offnum > P_FIRSTDATAKEY(opaque))
/* remember that high key is item zero on non-rightmost pages */
start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
if (offnum > start)
offnum = OffsetNumberPrev(offnum); offnum = OffsetNumberPrev(offnum);
else else
{ {
/* walk left to the next page with data */
/* if we're at end of scan, release the buffer and return */ for (;;)
blkno = opaque->btpo_prev;
if (P_LEFTMOST(opaque))
{
_bt_relbuf(rel, *bufP, BT_READ);
*bufP = so->btso_curbuf = InvalidBuffer;
ItemPointerSetInvalid(current);
return false;
}
else
{ {
/* if we're at end of scan, release the buffer and return */
if (P_LEFTMOST(opaque))
{
_bt_relbuf(rel, *bufP, BT_READ);
ItemPointerSetInvalid(current);
*bufP = so->btso_curbuf = InvalidBuffer;
return false;
}
/* step left */
obknum = BufferGetBlockNumber(*bufP); obknum = BufferGetBlockNumber(*bufP);
blkno = opaque->btpo_prev;
/* walk right to the next page with data */
_bt_relbuf(rel, *bufP, BT_READ); _bt_relbuf(rel, *bufP, BT_READ);
for (;;) *bufP = _bt_getbuf(rel, blkno, BT_READ);
page = BufferGetPage(*bufP);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
/*
* If the adjacent page just split, then we have to walk
* right to find the block that's now adjacent to where
* we were. Because pages only split right, we don't have
* to worry about this failing to terminate.
*/
while (opaque->btpo_next != obknum)
{ {
blkno = opaque->btpo_next;
_bt_relbuf(rel, *bufP, BT_READ);
*bufP = _bt_getbuf(rel, blkno, BT_READ); *bufP = _bt_getbuf(rel, blkno, BT_READ);
page = BufferGetPage(*bufP); page = BufferGetPage(*bufP);
opaque = (BTPageOpaque) PageGetSpecialPointer(page); opaque = (BTPageOpaque) PageGetSpecialPointer(page);
maxoff = PageGetMaxOffsetNumber(page);
/*
* If the adjacent page just split, then we may have
* the wrong block. Handle this case. Because pages
* only split right, we don't have to worry about this
* failing to terminate.
*/
while (opaque->btpo_next != obknum)
{
blkno = opaque->btpo_next;
_bt_relbuf(rel, *bufP, BT_READ);
*bufP = _bt_getbuf(rel, blkno, BT_READ);
page = BufferGetPage(*bufP);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
maxoff = PageGetMaxOffsetNumber(page);
}
/* don't consider the high key */
start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
/* anything to look at here? */
if (!PageIsEmpty(page) && maxoff >= start)
break;
else
{
blkno = opaque->btpo_prev;
obknum = BufferGetBlockNumber(*bufP);
_bt_relbuf(rel, *bufP, BT_READ);
if (blkno == P_NONE)
{
*bufP = so->btso_curbuf = InvalidBuffer;
ItemPointerSetInvalid(current);
return false;
}
}
} }
offnum = maxoff;/* XXX PageIsEmpty? */ /* done if it's not empty */
maxoff = PageGetMaxOffsetNumber(page);
offnum = maxoff;
if (!PageIsEmpty(page) && maxoff >= P_FIRSTDATAKEY(opaque))
break;
} }
} }
} }
blkno = BufferGetBlockNumber(*bufP);
/* Update scan state */
so->btso_curbuf = *bufP; so->btso_curbuf = *bufP;
blkno = BufferGetBlockNumber(*bufP);
ItemPointerSet(current, blkno, offnum); ItemPointerSet(current, blkno, offnum);
return true; return true;
} }
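The backward branch of _bt_step relies on pages only splitting to the right: if the left sibling split after its link was read, the page now adjacent to the starting page is reachable by following right-links. A minimal sketch of that recovery follows, using a toy in-memory page array. `ToyPage`, `step_left` and `prev_nonempty` are hypothetical names, and there is no locking or buffer management here.

```c
#include <assert.h>
#include <stddef.h>

#define NO_PAGE (-1)

/* Toy page: sibling links and an item count only (hypothetical layout). */
typedef struct ToyPage
{
	int			prev;			/* left sibling index, or NO_PAGE */
	int			next;			/* right sibling index, or NO_PAGE */
	int			nitems;			/* number of data items on the page */
} ToyPage;

/*
 * Step left from page 'cur' using a possibly stale left-link.  If the
 * left sibling has split, walk right until we find the page whose
 * right-link points back at 'cur'.  Returns NO_PAGE if there is none.
 */
static int
step_left(const ToyPage *pages, int cur, int stale_prev)
{
	int			blk = stale_prev;

	assert(cur >= 0);
	while (blk != NO_PAGE && pages[blk].next != cur)
		blk = pages[blk].next;
	return blk;
}

/* Walk left from 'cur' to the nearest page that holds at least one item. */
static int
prev_nonempty(const ToyPage *pages, int cur)
{
	for (;;)
	{
		int			left = step_left(pages, cur, pages[cur].prev);

		if (left == NO_PAGE)
			return NO_PAGE;		/* ran off the left edge */
		if (pages[left].nitems > 0)
			return left;		/* done if it's not empty */
		cur = left;
	}
}
```

For example, with pages linked 0 <-> 2 <-> 1 and a stale link from page 1 to page 0, step_left(pages, 1, 0) walks right from page 0 and returns 2.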
/*
* _bt_twostep() -- Move to an adjacent record in a scan on the tree,
* if an adjacent record exists.
*
* This is like _bt_step, except that if no adjacent record exists
* it restores us to where we were before trying the step. This is
* only hairy when you cross page boundaries, since the page you cross
* from could have records inserted or deleted, or could even split.
* This is unlikely, but we try to handle it correctly here anyway.
*
* This routine contains the only case in which our changes to Lehman
* and Yao's algorithm.
*
* Like step, this routine leaves the scan's currentItemData in the
* proper state and acquires a lock and pin on *bufP. If the twostep
* succeeded, we return true; otherwise, we return false.
*/
static bool
_bt_twostep(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
{
Page page;
BTPageOpaque opaque;
OffsetNumber offnum,
maxoff;
OffsetNumber start;
ItemPointer current;
ItemId itemid;
int itemsz;
BTItem btitem;
BTItem svitem;
BlockNumber blkno;
blkno = BufferGetBlockNumber(*bufP);
page = BufferGetPage(*bufP);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
maxoff = PageGetMaxOffsetNumber(page);
current = &(scan->currentItemData);
offnum = ItemPointerGetOffsetNumber(current);
start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
/* if we're safe, just do it */
if (ScanDirectionIsForward(dir) && offnum < maxoff)
{ /* XXX PageIsEmpty? */
ItemPointerSet(current, blkno, OffsetNumberNext(offnum));
return true;
}
else if (ScanDirectionIsBackward(dir) && offnum > start)
{
ItemPointerSet(current, blkno, OffsetNumberPrev(offnum));
return true;
}
/* if we've hit end of scan we don't have to do any work */
if (ScanDirectionIsForward(dir) && P_RIGHTMOST(opaque))
return false;
else if (ScanDirectionIsBackward(dir) && P_LEFTMOST(opaque))
return false;
/*
* Okay, it's off the page; let _bt_step() do the hard work, and we'll
* try to remember where we were. This is not guaranteed to work;
* this is the only place in the code where concurrency can screw us
* up, and it's because we want to be able to move in two directions
* in the scan.
*/
itemid = PageGetItemId(page, offnum);
itemsz = ItemIdGetLength(itemid);
btitem = (BTItem) PageGetItem(page, itemid);
svitem = (BTItem) palloc(itemsz);
memmove((char *) svitem, (char *) btitem, itemsz);
if (_bt_step(scan, bufP, dir))
{
pfree(svitem);
return true;
}
/* try to find our place again */
*bufP = _bt_getbuf(scan->relation, blkno, BT_READ);
page = BufferGetPage(*bufP);
maxoff = PageGetMaxOffsetNumber(page);
while (offnum <= maxoff)
{
itemid = PageGetItemId(page, offnum);
btitem = (BTItem) PageGetItem(page, itemid);
if (BTItemSame(btitem, svitem))
{
pfree(svitem);
ItemPointerSet(current, blkno, offnum);
return false;
}
}
/*
* XXX crash and burn -- can't find our place. We can be a little
* smarter -- walk to the next page to the right, for example, since
* that's the only direction that splits happen in. Deletions screw
* us up less often since they're only done by the vacuum daemon.
*/
elog(ERROR, "btree synchronization error: concurrent update botched scan");
return false;
}
/* /*
* _bt_endpoint() -- Find the first or last key in the index. * _bt_endpoint() -- Find the first or last key in the index.
*
* This is used by _bt_first() to set up a scan when we've determined
* that the scan must start at the beginning or end of the index (for
* a forward or backward scan respectively).
*/ */
static RetrieveIndexResult static RetrieveIndexResult
_bt_endpoint(IndexScanDesc scan, ScanDirection dir) _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
@@ -1328,7 +900,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
ItemPointer current; ItemPointer current;
OffsetNumber offnum, OffsetNumber offnum,
maxoff; maxoff;
OffsetNumber start = 0; OffsetNumber start;
BlockNumber blkno; BlockNumber blkno;
BTItem btitem; BTItem btitem;
IndexTuple itup; IndexTuple itup;
@@ -1340,38 +912,50 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
current = &(scan->currentItemData); current = &(scan->currentItemData);
so = (BTScanOpaque) scan->opaque; so = (BTScanOpaque) scan->opaque;
/*
* Scan down to the leftmost or rightmost leaf page. This is a
* simplified version of _bt_search(). We don't maintain a stack
* since we know we won't need it.
*/
buf = _bt_getroot(rel, BT_READ); buf = _bt_getroot(rel, BT_READ);
if (! BufferIsValid(buf))
{
/* empty index... */
ItemPointerSetInvalid(current);
so->btso_curbuf = InvalidBuffer;
return (RetrieveIndexResult) NULL;
}
blkno = BufferGetBlockNumber(buf); blkno = BufferGetBlockNumber(buf);
page = BufferGetPage(buf); page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page); opaque = (BTPageOpaque) PageGetSpecialPointer(page);
for (;;) for (;;)
{ {
if (opaque->btpo_flags & BTP_LEAF) if (P_ISLEAF(opaque))
break; break;
if (ScanDirectionIsForward(dir)) if (ScanDirectionIsForward(dir))
offnum = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY; offnum = P_FIRSTDATAKEY(opaque);
else else
offnum = PageGetMaxOffsetNumber(page); offnum = PageGetMaxOffsetNumber(page);
btitem = (BTItem) PageGetItem(page, PageGetItemId(page, offnum)); btitem = (BTItem) PageGetItem(page, PageGetItemId(page, offnum));
itup = &(btitem->bti_itup); itup = &(btitem->bti_itup);
blkno = ItemPointerGetBlockNumber(&(itup->t_tid)); blkno = ItemPointerGetBlockNumber(&(itup->t_tid));
_bt_relbuf(rel, buf, BT_READ); _bt_relbuf(rel, buf, BT_READ);
buf = _bt_getbuf(rel, blkno, BT_READ); buf = _bt_getbuf(rel, blkno, BT_READ);
page = BufferGetPage(buf); page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page); opaque = (BTPageOpaque) PageGetSpecialPointer(page);
/* /*
* Race condition: If the child page we just stepped onto is in * Race condition: If the child page we just stepped onto was just
* the process of being split, we need to make sure we're all the * split, we need to make sure we're all the way at the right edge
* way at the right edge of the tree. See the paper by Lehman and * of the tree. See the paper by Lehman and Yao.
* Yao.
*/ */
if (ScanDirectionIsBackward(dir) && !P_RIGHTMOST(opaque)) if (ScanDirectionIsBackward(dir) && !P_RIGHTMOST(opaque))
{ {
do do
@@ -1390,101 +974,39 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
if (ScanDirectionIsForward(dir)) if (ScanDirectionIsForward(dir))
{ {
if (!P_LEFTMOST(opaque))/* non-leftmost page ? */ Assert(P_LEFTMOST(opaque));
elog(ERROR, "_bt_endpoint: leftmost page (%u) has not leftmost flag", blkno);
start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
/*
* I don't understand this stuff! It doesn't work for
* non-rightmost pages with only one element (P_HIKEY) which we
* have after deletion itups by vacuum (it's case of start >
* maxoff). Scanning in BackwardScanDirection is not
* understandable at all. Well - new stuff. - vadim 12/06/96
*/
#ifdef NOT_USED
if (PageIsEmpty(page) || start > maxoff)
{
ItemPointerSet(current, blkno, maxoff);
if (!_bt_step(scan, &buf, BackwardScanDirection))
return (RetrieveIndexResult) NULL;
start = ItemPointerGetOffsetNumber(current);
page = BufferGetPage(buf);
}
#endif
if (PageIsEmpty(page))
{
if (start != P_HIKEY) /* non-rightmost page */
elog(ERROR, "_bt_endpoint: non-rightmost page (%u) is empty", blkno);
/* start = P_FIRSTDATAKEY(opaque);
* It's left- & right- most page - root page, - and it's
* empty...
*/
_bt_relbuf(rel, buf, BT_READ);
ItemPointerSetInvalid(current);
so->btso_curbuf = InvalidBuffer;
return (RetrieveIndexResult) NULL;
}
if (start > maxoff) /* start == 2 && maxoff == 1 */
{
ItemPointerSet(current, blkno, maxoff);
if (!_bt_step(scan, &buf, ForwardScanDirection))
return (RetrieveIndexResult) NULL;
start = ItemPointerGetOffsetNumber(current);
page = BufferGetPage(buf);
}
/* new stuff ends here */
else
ItemPointerSet(current, blkno, start);
} }
else if (ScanDirectionIsBackward(dir)) else if (ScanDirectionIsBackward(dir))
{ {
Assert(P_RIGHTMOST(opaque));
/* start = PageGetMaxOffsetNumber(page);
* I don't understand this stuff too! If RIGHT-most leaf page is if (start < P_FIRSTDATAKEY(opaque)) /* watch out for empty page */
* empty why do scanning in ForwardScanDirection ??? Well - new start = P_FIRSTDATAKEY(opaque);
* stuff. - vadim 12/06/96
*/
#ifdef NOT_USED
if (PageIsEmpty(page))
{
ItemPointerSet(current, blkno, FirstOffsetNumber);
if (!_bt_step(scan, &buf, ForwardScanDirection))
return (RetrieveIndexResult) NULL;
start = ItemPointerGetOffsetNumber(current);
page = BufferGetPage(buf);
}
#endif
if (PageIsEmpty(page))
{
/* If it's leftmost page too - it's empty root page... */
if (P_LEFTMOST(opaque))
{
_bt_relbuf(rel, buf, BT_READ);
ItemPointerSetInvalid(current);
so->btso_curbuf = InvalidBuffer;
return (RetrieveIndexResult) NULL;
}
/* Go back ! */
ItemPointerSet(current, blkno, FirstOffsetNumber);
if (!_bt_step(scan, &buf, BackwardScanDirection))
return (RetrieveIndexResult) NULL;
start = ItemPointerGetOffsetNumber(current);
page = BufferGetPage(buf);
}
/* new stuff ends here */
else
{
start = PageGetMaxOffsetNumber(page);
ItemPointerSet(current, blkno, start);
}
} }
else else
{
elog(ERROR, "Illegal scan direction %d", dir); elog(ERROR, "Illegal scan direction %d", dir);
start = 0; /* keep compiler quiet */
}
ItemPointerSet(current, blkno, start);
/* remember which buffer we have pinned */
so->btso_curbuf = buf;
/*
* Left/rightmost page could be empty due to deletions,
* if so step till we find a nonempty page.
*/
if (start > maxoff)
{
if (!_bt_step(scan, &buf, dir))
return (RetrieveIndexResult) NULL;
start = ItemPointerGetOffsetNumber(current);
page = BufferGetPage(buf);
}
btitem = (BTItem) PageGetItem(page, PageGetItemId(page, start)); btitem = (BTItem) PageGetItem(page, PageGetItemId(page, start));
itup = &(btitem->bti_itup); itup = &(btitem->bti_itup);
@@ -1492,23 +1014,18 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
/* see if we picked a winner */ /* see if we picked a winner */
if (_bt_checkkeys(scan, itup, &keysok)) if (_bt_checkkeys(scan, itup, &keysok))
{ {
/* yes, return it */
res = FormRetrieveIndexResult(current, &(itup->t_tid)); res = FormRetrieveIndexResult(current, &(itup->t_tid));
/* remember which buffer we have pinned */
so->btso_curbuf = buf;
}
else if (keysok >= so->numberOfFirstKeys)
{
so->btso_curbuf = buf;
return _bt_next(scan, dir);
} }
else if (keysok == ((Size) -1) && ScanDirectionIsBackward(dir)) else if (keysok >= so->numberOfFirstKeys ||
(keysok == ((Size) -1) && ScanDirectionIsBackward(dir)))
{ {
so->btso_curbuf = buf; /* no, but there might be another one that is */
return _bt_next(scan, dir); res = _bt_next(scan, dir);
} }
else else
{ {
/* no tuples in the index match this scan key */
ItemPointerSetInvalid(current); ItemPointerSetInvalid(current);
so->btso_curbuf = InvalidBuffer; so->btso_curbuf = InvalidBuffer;
_bt_relbuf(rel, buf, BT_READ); _bt_relbuf(rel, buf, BT_READ);
@@ -6,8 +6,12 @@
* *
* We use tuplesort.c to sort the given index tuples into order. * We use tuplesort.c to sort the given index tuples into order.
* Then we scan the index tuples in order and build the btree pages * Then we scan the index tuples in order and build the btree pages
* for each level. When we have only one page on a level, it must be the * for each level. We load source tuples into leaf-level pages.
* root -- it can be attached to the btree metapage and we are done. * Whenever we fill a page at one level, we add a link to it to its
* parent level (starting a new parent level if necessary). When
* done, we write out each final page on each level, adding it to
* its parent level. When we have only one page on a level, it must be
* the root -- it can be attached to the btree metapage and we are done.
* *
* this code is moderately slow (~10% slower) compared to the regular * this code is moderately slow (~10% slower) compared to the regular
* btree (insertion) build code on sorted or well-clustered data. on * btree (insertion) build code on sorted or well-clustered data. on
@@ -23,12 +27,20 @@
* something like the standard 70% steady-state load factor for btrees * something like the standard 70% steady-state load factor for btrees
* would probably be better. * would probably be better.
* *
* Another limitation is that we currently load full copies of all keys
* into upper tree levels. The leftmost data key in each non-leaf node
* could be omitted as far as normal btree operations are concerned
* (see README for more info). However, because we build the tree from
* the bottom up, we need that data key to insert into the node's parent.
* This could be fixed by keeping a spare copy of the minimum key in the
* state stack, but I haven't time for that right now.
*
* *
* Portions Copyright (c) 1996-2000, PostgreSQL, Inc * Portions Copyright (c) 1996-2000, PostgreSQL, Inc
* Portions Copyright (c) 1994, Regents of the University of California * Portions Copyright (c) 1994, Regents of the University of California
* *
* IDENTIFICATION * IDENTIFICATION
* $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtsort.c,v 1.54 2000/06/15 04:09:36 momjian Exp $ * $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtsort.c,v 1.55 2000/07/21 06:42:33 tgl Exp $
* *
*------------------------------------------------------------------------- *-------------------------------------------------------------------------
*/ */
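The header comment above describes a bottom-up build: fill leaf pages, add one link per finished page to the parent level, and stop when a level has only one page, which is the root. A minimal sketch of the resulting tree shape follows. It counts pages only and ignores high keys, item sizes and fill factor; `toy_build_levels` and its parameters are hypothetical and not part of nbtsort.c.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Given 'nitems' sorted tuples and pages holding 'cap' entries each,
 * record how many pages each level needs (level 0 = leaf) and return
 * the number of levels.  Stops early if 'max_levels' is reached.
 */
static int
toy_build_levels(size_t nitems, size_t cap,
				 size_t *pages_per_level, int max_levels)
{
	int			level = 0;
	size_t		n = nitems;

	assert(cap >= 2 && max_levels >= 1);
	for (;;)
	{
		/* an empty input still gets a single (root) leaf page here */
		size_t		npages = (n == 0) ? 1 : (n + cap - 1) / cap;

		pages_per_level[level++] = npages;
		if (npages == 1 || level == max_levels)
			break;				/* one page on a level: it is the root */
		n = npages;				/* one link per page goes to the parent */
	}
	return level;
}
```

With 10 tuples and 3 entries per page this gives 4 leaf pages, 2 pages on the next level and a single root page, so three levels in total.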
@@ -57,6 +69,20 @@ struct BTSpool
bool isunique; bool isunique;
}; };
/*
* Status record for a btree page being built. We have one of these
* for each active tree level.
*/
typedef struct BTPageState
{
Buffer btps_buf; /* current buffer & page */
Page btps_page;
OffsetNumber btps_lastoff; /* last item offset loaded */
int btps_level;
struct BTPageState *btps_next; /* link to parent level, if any */
} BTPageState;
#define BTITEMSZ(btitem) \ #define BTITEMSZ(btitem) \
((btitem) ? \ ((btitem) ? \
(IndexTupleDSize((btitem)->bti_itup) + \ (IndexTupleDSize((btitem)->bti_itup) + \
@@ -65,13 +91,11 @@ struct BTSpool
static void _bt_load(Relation index, BTSpool *btspool); static void _bt_load(Relation index, BTSpool *btspool);
static BTItem _bt_buildadd(Relation index, Size keysz, ScanKey scankey, static void _bt_buildadd(Relation index, BTPageState *state,
BTPageState *state, BTItem bti, int flags); BTItem bti, int flags);
static BTItem _bt_minitem(Page opage, BlockNumber oblkno, int atend); static BTItem _bt_minitem(Page opage, BlockNumber oblkno, int atend);
static BTPageState *_bt_pagestate(Relation index, int flags, static BTPageState *_bt_pagestate(Relation index, int flags, int level);
int level, bool doupper); static void _bt_uppershutdown(Relation index, BTPageState *state);
static void _bt_uppershutdown(Relation index, Size keysz, ScanKey scankey,
BTPageState *state);
/* /*
@@ -159,9 +183,6 @@ _bt_blnewpage(Relation index, Buffer *buf, Page *page, int flags)
BTPageOpaque opaque; BTPageOpaque opaque;
*buf = _bt_getbuf(index, P_NEW, BT_WRITE); *buf = _bt_getbuf(index, P_NEW, BT_WRITE);
#ifdef NOT_USED
printf("\tblk=%d\n", BufferGetBlockNumber(*buf));
#endif
*page = BufferGetPage(*buf); *page = BufferGetPage(*buf);
_bt_pageinit(*page, BufferGetPageSize(*buf)); _bt_pageinit(*page, BufferGetPageSize(*buf));
opaque = (BTPageOpaque) PageGetSpecialPointer(*page); opaque = (BTPageOpaque) PageGetSpecialPointer(*page);
@@ -202,18 +223,15 @@ _bt_slideleft(Relation index, Buffer buf, Page page)
* is suitable for immediate use by _bt_buildadd. * is suitable for immediate use by _bt_buildadd.
*/ */
static BTPageState * static BTPageState *
_bt_pagestate(Relation index, int flags, int level, bool doupper) _bt_pagestate(Relation index, int flags, int level)
{ {
BTPageState *state = (BTPageState *) palloc(sizeof(BTPageState)); BTPageState *state = (BTPageState *) palloc(sizeof(BTPageState));
MemSet((char *) state, 0, sizeof(BTPageState)); MemSet((char *) state, 0, sizeof(BTPageState));
_bt_blnewpage(index, &(state->btps_buf), &(state->btps_page), flags); _bt_blnewpage(index, &(state->btps_buf), &(state->btps_page), flags);
state->btps_firstoff = InvalidOffsetNumber;
state->btps_lastoff = P_HIKEY; state->btps_lastoff = P_HIKEY;
state->btps_lastbti = (BTItem) NULL;
state->btps_next = (BTPageState *) NULL; state->btps_next = (BTPageState *) NULL;
state->btps_level = level; state->btps_level = level;
state->btps_doupper = doupper;
return state; return state;
} }
@@ -240,31 +258,27 @@ _bt_minitem(Page opage, BlockNumber oblkno, int atend)
} }
/* /*
* add an item to a disk page from a merge tape block. * add an item to a disk page from the sort output.
* *
* we must be careful to observe the following restrictions, placed * we must be careful to observe the following restrictions, placed
* upon us by the conventions in nbtsearch.c: * upon us by the conventions in nbtsearch.c:
* - rightmost pages start data items at P_HIKEY instead of at * - rightmost pages start data items at P_HIKEY instead of at
* P_FIRSTKEY. * P_FIRSTKEY.
* - duplicates cannot be split among pages unless the chain of
* duplicates starts at the first data item.
* *
* a leaf page being built looks like: * a leaf page being built looks like:
* *
* +----------------+---------------------------------+ * +----------------+---------------------------------+
* | PageHeaderData | linp0 linp1 linp2 ... | * | PageHeaderData | linp0 linp1 linp2 ... |
* +-----------+----+---------------------------------+ * +-----------+----+---------------------------------+
* | ... linpN | ^ first | * | ... linpN | |
* +-----------+--------------------------------------+ * +-----------+--------------------------------------+
* | ^ last | * | ^ last |
* | | * | |
* | v last |
* +-------------+------------------------------------+ * +-------------+------------------------------------+
* | | itemN ... | * | | itemN ... |
* +-------------+------------------+-----------------+ * +-------------+------------------+-----------------+
* | ... item3 item2 item1 | "special space" | * | ... item3 item2 item1 | "special space" |
* +--------------------------------+-----------------+ * +--------------------------------+-----------------+
* ^ first
* *
* contrast this with the diagram in bufpage.h; note the mismatch * contrast this with the diagram in bufpage.h; note the mismatch
* between linps and items. this is because we reserve linp0 as a * between linps and items. this is because we reserve linp0 as a
@@ -272,30 +286,20 @@ _bt_minitem(Page opage, BlockNumber oblkno, int atend)
* filled up the page, we will set linp0 to point to itemN and clear * filled up the page, we will set linp0 to point to itemN and clear
* linpN. * linpN.
* *
* 'last' pointers indicate the last offset/item added to the page. * 'last' pointer indicates the last offset added to the page.
* 'first' pointers indicate the first offset/item that is part of a
* chain of duplicates extending from 'first' to 'last'.
*
* if all keys are unique, 'first' will always be the same as 'last'.
*/ */
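The new page-full handling in _bt_buildadd copies the last item of the full page onto the fresh page and turns that same item into the old page's high key. A minimal sketch of that rearrangement on a toy page follows. `ToyLeaf`, `toy_finish_page` and `toy_buildadd` are hypothetical; the fixed item count stands in for the free-space check, and a separate field stands in for the reserved linp0 slot.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CAP 4				/* data items per page, high key excluded */

/* Toy leaf page with a separate high key slot (hypothetical layout). */
typedef struct ToyLeaf
{
	int			has_hikey;
	int			hikey;
	int			items[TOY_CAP];
	size_t		nitems;
} ToyLeaf;

/*
 * Finish 'old' and start 'fresh': the last item of 'old' becomes the
 * first data item of 'fresh' and also the high key of 'old'.
 */
static void
toy_finish_page(ToyLeaf *old, ToyLeaf *fresh)
{
	int			last;

	assert(old->nitems > 0);
	last = old->items[old->nitems - 1];

	fresh->has_hikey = 0;
	fresh->nitems = 0;
	fresh->items[fresh->nitems++] = last;

	old->nitems--;				/* 'last' is no longer a data item here */
	old->hikey = last;
	old->has_hikey = 1;
}

/* Add 'key' to *cur, spilling onto *spare if *cur is full.  Returns the
 * page that received the key. */
static ToyLeaf *
toy_buildadd(ToyLeaf *cur, ToyLeaf *spare, int key)
{
	if (cur->nitems == TOY_CAP)
	{
		toy_finish_page(cur, spare);
		cur = spare;
	}
	cur->items[cur->nitems++] = key;
	return cur;
}
```

Because one item is always carried over, the incoming key is never the first data item on the fresh page, which matches the note in the new comment inside _bt_buildadd.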
static BTItem static void
_bt_buildadd(Relation index, Size keysz, ScanKey scankey, _bt_buildadd(Relation index, BTPageState *state, BTItem bti, int flags)
BTPageState *state, BTItem bti, int flags)
{ {
Buffer nbuf; Buffer nbuf;
Page npage; Page npage;
BTItem last_bti;
OffsetNumber first_off;
OffsetNumber last_off; OffsetNumber last_off;
OffsetNumber off;
Size pgspc; Size pgspc;
Size btisz; Size btisz;
nbuf = state->btps_buf; nbuf = state->btps_buf;
npage = state->btps_page; npage = state->btps_page;
first_off = state->btps_firstoff;
last_off = state->btps_lastoff; last_off = state->btps_lastoff;
last_bti = state->btps_lastbti;
pgspc = PageGetFreeSpace(npage); pgspc = PageGetFreeSpace(npage);
btisz = BTITEMSZ(bti); btisz = BTITEMSZ(bti);
@@ -319,75 +323,55 @@ _bt_buildadd(Relation index, Size keysz, ScanKey scankey,
if (pgspc < btisz) if (pgspc < btisz)
{ {
/*
* Item won't fit on this page, so finish off the page and
* write it out.
*/
Buffer obuf = nbuf; Buffer obuf = nbuf;
Page opage = npage; Page opage = npage;
OffsetNumber o,
n;
ItemId ii; ItemId ii;
ItemId hii; ItemId hii;
BTItem nbti;
_bt_blnewpage(index, &nbuf, &npage, flags); _bt_blnewpage(index, &nbuf, &npage, flags);
/* /*
* if 'last' is part of a chain of duplicates that does not start * We copy the last item on the page into the new page, and then
* at the beginning of the old page, the entire chain is copied to * rearrange the old page so that the 'last item' becomes its high
* the new page; we delete all of the duplicates from the old page * key rather than a true data item.
* except the first, which becomes the high key item of the old
* page.
* *
* if the chain starts at the beginning of the page or there is no * note that since we always copy an item to the new page,
* chain ('first' == 'last'), we need only copy 'last' to the new * 'bti' will never be the first data item on the new page.
* page. again, 'first' (== 'last') becomes the high key of the
* old page.
*
* note that in either case, we copy at least one item to the new
* page, so 'last_bti' will always be valid. 'bti' will never be
* the first data item on the new page.
*/ */
if (first_off == P_FIRSTKEY) ii = PageGetItemId(opage, last_off);
if (PageAddItem(npage, PageGetItem(opage, ii), ii->lp_len,
P_FIRSTKEY, LP_USED) == InvalidOffsetNumber)
elog(FATAL, "btree: failed to add item to the page in _bt_sort (1)");
#ifdef FASTBUILD_DEBUG
{ {
Assert(last_off != P_FIRSTKEY); bool isnull;
first_off = last_off; BTItem tmpbti =
(BTItem) PageGetItem(npage, PageGetItemId(npage, P_FIRSTKEY));
Datum d = index_getattr(&(tmpbti->bti_itup), 1,
index->rd_att, &isnull);
printf("_bt_buildadd: moved <%x> to offset %d at level %d\n",
d, P_FIRSTKEY, state->btps_level);
} }
for (o = first_off, n = P_FIRSTKEY;
o <= last_off;
o = OffsetNumberNext(o), n = OffsetNumberNext(n))
{
ii = PageGetItemId(opage, o);
if (PageAddItem(npage, PageGetItem(opage, ii),
ii->lp_len, n, LP_USED) == InvalidOffsetNumber)
elog(FATAL, "btree: failed to add item to the page in _bt_sort (1)");
#ifdef FASTBUILD_DEBUG
{
bool isnull;
BTItem tmpbti =
(BTItem) PageGetItem(npage, PageGetItemId(npage, n));
Datum d = index_getattr(&(tmpbti->bti_itup), 1,
index->rd_att, &isnull);
printf("_bt_buildadd: moved <%x> to offset %d at level %d\n",
d, n, state->btps_level);
}
#endif #endif
}
/* /*
* this loop is backward because PageIndexTupleDelete shuffles the * Move 'last' into the high key position on opage
* tuples to fill holes in the page -- by starting at the end and
* working back, we won't create holes (and thereby avoid
* shuffling).
*/ */
for (o = last_off; o > first_off; o = OffsetNumberPrev(o))
PageIndexTupleDelete(opage, o);
hii = PageGetItemId(opage, P_HIKEY); hii = PageGetItemId(opage, P_HIKEY);
ii = PageGetItemId(opage, first_off);
*hii = *ii; *hii = *ii;
ii->lp_flags &= ~LP_USED; ii->lp_flags &= ~LP_USED;
((PageHeader) opage)->pd_lower -= sizeof(ItemIdData); ((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
first_off = P_FIRSTKEY; /*
* Reset last_off to point to new page
*/
last_off = PageGetMaxOffsetNumber(npage); last_off = PageGetMaxOffsetNumber(npage);
last_bti = (BTItem) PageGetItem(npage, PageGetItemId(npage, last_off));
/* /*
* set the page (side link) pointers. * set the page (side link) pointers.
...@@ -399,32 +383,21 @@ _bt_buildadd(Relation index, Size keysz, ScanKey scankey, ...@@ -399,32 +383,21 @@ _bt_buildadd(Relation index, Size keysz, ScanKey scankey,
oopaque->btpo_next = BufferGetBlockNumber(nbuf); oopaque->btpo_next = BufferGetBlockNumber(nbuf);
nopaque->btpo_prev = BufferGetBlockNumber(obuf); nopaque->btpo_prev = BufferGetBlockNumber(obuf);
nopaque->btpo_next = P_NONE; nopaque->btpo_next = P_NONE;
if (_bt_itemcmp(index, keysz, scankey,
(BTItem) PageGetItem(opage, PageGetItemId(opage, P_HIKEY)),
(BTItem) PageGetItem(opage, PageGetItemId(opage, P_FIRSTKEY)),
BTEqualStrategyNumber))
oopaque->btpo_flags |= BTP_CHAIN;
} }
/* /*
* copy the old buffer's minimum key to its parent. if we don't * Link the old buffer into its parent, using its minimum key.
* have a parent, we have to create one; this adds a new btree * If we don't have a parent, we have to create one;
* level. * this adds a new btree level.
*/ */
if (state->btps_doupper) if (state->btps_next == (BTPageState *) NULL)
{ {
BTItem nbti; state->btps_next =
_bt_pagestate(index, 0, state->btps_level + 1);
if (state->btps_next == (BTPageState *) NULL)
{
state->btps_next =
_bt_pagestate(index, 0, state->btps_level + 1, true);
}
nbti = _bt_minitem(opage, BufferGetBlockNumber(obuf), 0);
_bt_buildadd(index, keysz, scankey, state->btps_next, nbti, 0);
pfree((void *) nbti);
} }
nbti = _bt_minitem(opage, BufferGetBlockNumber(obuf), 0);
_bt_buildadd(index, state->btps_next, nbti, 0);
pfree((void *) nbti);
/* /*
* write out the old stuff. we never want to see it again, so we * write out the old stuff. we never want to see it again, so we
...@@ -435,11 +408,11 @@ _bt_buildadd(Relation index, Size keysz, ScanKey scankey, ...@@ -435,11 +408,11 @@ _bt_buildadd(Relation index, Size keysz, ScanKey scankey,
} }
/* /*
* if this item is different from the last item added, we start a new * Add the new item into the current page.
* chain of duplicates.
*/ */
off = OffsetNumberNext(last_off); last_off = OffsetNumberNext(last_off);
if (PageAddItem(npage, (Item) bti, btisz, off, LP_USED) == InvalidOffsetNumber) if (PageAddItem(npage, (Item) bti, btisz,
last_off, LP_USED) == InvalidOffsetNumber)
elog(FATAL, "btree: failed to add item to the page in _bt_sort (2)"); elog(FATAL, "btree: failed to add item to the page in _bt_sort (2)");
#ifdef FASTBUILD_DEBUG #ifdef FASTBUILD_DEBUG
{ {
...@@ -447,65 +420,57 @@ _bt_buildadd(Relation index, Size keysz, ScanKey scankey, ...@@ -447,65 +420,57 @@ _bt_buildadd(Relation index, Size keysz, ScanKey scankey,
Datum d = index_getattr(&(bti->bti_itup), 1, index->rd_att, &isnull); Datum d = index_getattr(&(bti->bti_itup), 1, index->rd_att, &isnull);
printf("_bt_buildadd: inserted <%x> at offset %d at level %d\n", printf("_bt_buildadd: inserted <%x> at offset %d at level %d\n",
d, off, state->btps_level); d, last_off, state->btps_level);
} }
#endif #endif
if (last_bti == (BTItem) NULL)
first_off = P_FIRSTKEY;
else if (!_bt_itemcmp(index, keysz, scankey,
bti, last_bti, BTEqualStrategyNumber))
first_off = off;
last_off = off;
last_bti = (BTItem) PageGetItem(npage, PageGetItemId(npage, off));
state->btps_buf = nbuf; state->btps_buf = nbuf;
state->btps_page = npage; state->btps_page = npage;
state->btps_lastbti = last_bti;
state->btps_lastoff = last_off; state->btps_lastoff = last_off;
state->btps_firstoff = first_off;
return last_bti;
} }
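The page-fill handling in the rewritten `_bt_buildadd` can be illustrated with a toy model. This is a sketch only: `ToyLevel`, `TOY_CAP`, `toy_buildadd` and `toy_free_upper` are hypothetical names, pages are plain int arrays, and fullness is counted in items rather than bytes. It shows that when a page fills, its last item moves to the new page and stays behind only as the old page's high key, while the old page's minimum key is added to the parent level, which is created on demand.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_CAP 4               /* items a toy page holds while being filled */

/* Hypothetical per-level build state, loosely modelled on BTPageState. */
typedef struct ToyLevel
{
    int         items[TOY_CAP]; /* keys on the page currently being filled */
    int         nitems;
    int         last_high_key;  /* high key of the most recently finished page */
    int         pages_done;     /* pages finished on this level */
    struct ToyLevel *next;      /* parent level, created on demand */
} ToyLevel;

void
toy_buildadd(ToyLevel *lvl, int key)
{
    if (lvl->nitems == TOY_CAP)
    {
        /* Page is full: its last item will move to the new page ... */
        int         last = lvl->items[lvl->nitems - 1];
        int         min = lvl->items[0];

        /* ... and the old page keeps it only as its high key. */
        lvl->last_high_key = last;
        lvl->pages_done++;

        /* Link the finished page into its parent using its minimum key. */
        if (lvl->next == NULL)
        {
            lvl->next = calloc(1, sizeof(ToyLevel));
            assert(lvl->next != NULL);
        }
        toy_buildadd(lvl->next, min);

        /* The moved item becomes the first data item of the new page. */
        lvl->items[0] = last;
        lvl->nitems = 1;
    }

    /* Add the new item into the current page. */
    lvl->items[lvl->nitems++] = key;
}

/* Release the parent levels that toy_buildadd allocated. */
void
toy_free_upper(ToyLevel *leaf)
{
    ToyLevel   *lvl = leaf->next;

    while (lvl != NULL)
    {
        ToyLevel   *up = lvl->next;

        free(lvl);
        lvl = up;
    }
    leaf->next = NULL;
}
```

Because an item is always carried over, the key passed to `toy_buildadd` is never the first data item on a fresh page, which matches the note in the comment above.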
/*
* Finish writing out the completed btree.
*/
static void static void
_bt_uppershutdown(Relation index, Size keysz, ScanKey scankey, _bt_uppershutdown(Relation index, BTPageState *state)
BTPageState *state)
{ {
BTPageState *s; BTPageState *s;
BlockNumber blkno; BlockNumber blkno;
BTPageOpaque opaque; BTPageOpaque opaque;
BTItem bti; BTItem bti;
/*
* Each iteration of this loop completes one more level of the tree.
*/
for (s = state; s != (BTPageState *) NULL; s = s->btps_next) for (s = state; s != (BTPageState *) NULL; s = s->btps_next)
{ {
blkno = BufferGetBlockNumber(s->btps_buf); blkno = BufferGetBlockNumber(s->btps_buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(s->btps_page); opaque = (BTPageOpaque) PageGetSpecialPointer(s->btps_page);
/* /*
* if this is the root, attach it to the metapage. otherwise, * We have to link the last page on this level to somewhere.
* stick the minimum key of the last page on this level (which has *
* not been split, or else it wouldn't be the last page) into its * If we're at the top, it's the root, so attach it to the metapage.
* parent. this may cause the last page of upper levels to split, * Otherwise, add an entry for it to its parent using its minimum
* but that's not a problem -- we haven't gotten to them yet. * key. This may cause the last page of the parent level to split,
* but that's not a problem -- we haven't gotten to it yet.
*/ */
if (s->btps_doupper) if (s->btps_next == (BTPageState *) NULL)
{ {
if (s->btps_next == (BTPageState *) NULL) opaque->btpo_flags |= BTP_ROOT;
{ _bt_metaproot(index, blkno, s->btps_level + 1);
opaque->btpo_flags |= BTP_ROOT; }
_bt_metaproot(index, blkno, s->btps_level + 1); else
} {
else bti = _bt_minitem(s->btps_page, blkno, 0);
{ _bt_buildadd(index, s->btps_next, bti, 0);
bti = _bt_minitem(s->btps_page, blkno, 0); pfree((void *) bti);
_bt_buildadd(index, keysz, scankey, s->btps_next, bti, 0);
pfree((void *) bti);
}
} }
/* /*
* this is the rightmost page, so the ItemId array needs to be * This is the rightmost page, so the ItemId array needs to be
* slid back one slot. * slid back one slot. Then we can dump out the page.
*/ */
_bt_slideleft(index, s->btps_buf, s->btps_page); _bt_slideleft(index, s->btps_buf, s->btps_page);
_bt_wrtbuf(index, s->btps_buf); _bt_wrtbuf(index, s->btps_buf);
...@@ -519,32 +484,27 @@ _bt_uppershutdown(Relation index, Size keysz, ScanKey scankey, ...@@ -519,32 +484,27 @@ _bt_uppershutdown(Relation index, Size keysz, ScanKey scankey,
static void static void
_bt_load(Relation index, BTSpool *btspool) _bt_load(Relation index, BTSpool *btspool)
{ {
BTPageState *state; BTPageState *state = NULL;
ScanKey skey;
int natts;
BTItem bti;
bool should_free;
/*
* initialize state needed for the merge into the btree leaf pages.
*/
state = _bt_pagestate(index, BTP_LEAF, 0, true);
skey = _bt_mkscankey_nodata(index);
natts = RelationGetNumberOfAttributes(index);
for (;;) for (;;)
{ {
BTItem bti;
bool should_free;
bti = (BTItem) tuplesort_getindextuple(btspool->sortstate, true, bti = (BTItem) tuplesort_getindextuple(btspool->sortstate, true,
&should_free); &should_free);
if (bti == (BTItem) NULL) if (bti == (BTItem) NULL)
break; break;
_bt_buildadd(index, natts, skey, state, bti, BTP_LEAF);
/* When we see first tuple, create first index page */
if (state == NULL)
state = _bt_pagestate(index, BTP_LEAF, 0);
_bt_buildadd(index, state, bti, BTP_LEAF);
if (should_free) if (should_free)
pfree((void *) bti); pfree((void *) bti);
} }
_bt_uppershutdown(index, natts, skey, state); if (state != NULL)
_bt_uppershutdown(index, state);
_bt_freeskey(skey);
} }
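The new `_bt_load` creates the first index page only when the first tuple arrives and skips the shutdown step when nothing was loaded, which is what lets an unused index stay at a single page. A minimal sketch of that control flow follows. `ToyState` and `toy_load` are hypothetical, and the "page" is just a caller-supplied struct.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the build state of the leaf level. */
typedef struct ToyState
{
    int         first_key;      /* first key stored on the page */
    size_t      nitems;         /* items added so far */
    int         finished;       /* set once the tree has been completed */
} ToyState;

/*
 * Load nkeys keys.  Returns the number of pages created: 0 for empty
 * input, 1 otherwise (this toy never splits).  'storage' is touched
 * only if at least one key is seen.
 */
int
toy_load(const int *keys, size_t nkeys, ToyState *storage)
{
    ToyState   *state = NULL;   /* no page exists yet */
    int         pages_created = 0;
    size_t      i;

    assert(storage != NULL);

    for (i = 0; i < nkeys; i++)
    {
        /* When we see the first key, create the first page. */
        if (state == NULL)
        {
            state = storage;
            state->first_key = keys[i];
            state->nitems = 0;
            state->finished = 0;
            pages_created = 1;
        }
        state->nitems++;
    }

    /* Finish the tree only if something was actually stored. */
    if (state != NULL)
        state->finished = 1;

    return pages_created;
}
```

Calling `toy_load(NULL, 0, &st)` returns 0 and leaves `st` untouched, mirroring the `if (state != NULL)` guard around `_bt_uppershutdown`.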
...@@ -8,7 +8,7 @@ ...@@ -8,7 +8,7 @@
* *
* *
* IDENTIFICATION * IDENTIFICATION
* $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtutils.c,v 1.37 2000/05/30 04:24:33 tgl Exp $ * $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtutils.c,v 1.38 2000/07/21 06:42:33 tgl Exp $
* *
*------------------------------------------------------------------------- *-------------------------------------------------------------------------
*/ */
...@@ -20,16 +20,13 @@ ...@@ -20,16 +20,13 @@
#include "access/nbtree.h" #include "access/nbtree.h"
#include "executor/execdebug.h" #include "executor/execdebug.h"
extern int NIndexTupleProcessed;
/* /*
* _bt_mkscankey * _bt_mkscankey
* Build a scan key that contains comparison data from itup * Build a scan key that contains comparison data from itup
* as well as comparator routines appropriate to the key datatypes. * as well as comparator routines appropriate to the key datatypes.
* *
* The result is intended for use with _bt_skeycmp() or _bt_compare(), * The result is intended for use with _bt_compare().
* although it could be used with _bt_itemcmp() or _bt_tuplecompare().
*/ */
ScanKey ScanKey
_bt_mkscankey(Relation rel, IndexTuple itup) _bt_mkscankey(Relation rel, IndexTuple itup)
...@@ -68,8 +65,9 @@ _bt_mkscankey(Relation rel, IndexTuple itup) ...@@ -68,8 +65,9 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
* Build a scan key that contains comparator routines appropriate to * Build a scan key that contains comparator routines appropriate to
* the key datatypes, but no comparison data. * the key datatypes, but no comparison data.
* *
* The result can be used with _bt_itemcmp() or _bt_tuplecompare(), * The result cannot be used with _bt_compare(). Currently this
* but not with _bt_skeycmp() or _bt_compare(). * routine is only called by utils/sort/tuplesort.c, which has its
* own comparison routine.
*/ */
ScanKey ScanKey
_bt_mkscankey_nodata(Relation rel) _bt_mkscankey_nodata(Relation rel)
...@@ -114,7 +112,6 @@ _bt_freestack(BTStack stack) ...@@ -114,7 +112,6 @@ _bt_freestack(BTStack stack)
{ {
ostack = stack; ostack = stack;
stack = stack->bts_parent; stack = stack->bts_parent;
pfree(ostack->bts_btitem);
pfree(ostack); pfree(ostack);
} }
} }
...@@ -331,55 +328,16 @@ _bt_formitem(IndexTuple itup) ...@@ -331,55 +328,16 @@ _bt_formitem(IndexTuple itup)
Size tuplen; Size tuplen;
extern Oid newoid(); extern Oid newoid();
/*
* see comments in btbuild
*
* if (itup->t_info & INDEX_NULL_MASK) elog(ERROR, "btree indices cannot
* include null keys");
*/
/* make a copy of the index tuple with room for the sequence number */ /* make a copy of the index tuple with room for the sequence number */
tuplen = IndexTupleSize(itup); tuplen = IndexTupleSize(itup);
nbytes_btitem = tuplen + (sizeof(BTItemData) - sizeof(IndexTupleData)); nbytes_btitem = tuplen + (sizeof(BTItemData) - sizeof(IndexTupleData));
btitem = (BTItem) palloc(nbytes_btitem); btitem = (BTItem) palloc(nbytes_btitem);
memmove((char *) &(btitem->bti_itup), (char *) itup, tuplen); memcpy((char *) &(btitem->bti_itup), (char *) itup, tuplen);
return btitem; return btitem;
} }
#ifdef NOT_USED
bool
_bt_checkqual(IndexScanDesc scan, IndexTuple itup)
{
BTScanOpaque so;
so = (BTScanOpaque) scan->opaque;
if (so->numberOfKeys > 0)
return (index_keytest(itup, RelationGetDescr(scan->relation),
so->numberOfKeys, so->keyData));
else
return true;
}
#endif
#ifdef NOT_USED
bool
_bt_checkforkeys(IndexScanDesc scan, IndexTuple itup, Size keysz)
{
BTScanOpaque so;
so = (BTScanOpaque) scan->opaque;
if (keysz > 0 && so->numberOfKeys >= keysz)
return (index_keytest(itup, RelationGetDescr(scan->relation),
keysz, so->keyData));
else
return true;
}
#endif
bool bool
_bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, Size *keysok) _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, Size *keysok)
{ {
...@@ -8,7 +8,7 @@ ...@@ -8,7 +8,7 @@
* *
* *
* IDENTIFICATION * IDENTIFICATION
* $Header: /cvsroot/pgsql/src/backend/storage/page/bufpage.c,v 1.30 2000/07/03 02:54:16 vadim Exp $ * $Header: /cvsroot/pgsql/src/backend/storage/page/bufpage.c,v 1.31 2000/07/21 06:42:33 tgl Exp $
* *
*------------------------------------------------------------------------- *-------------------------------------------------------------------------
*/ */
...@@ -19,10 +19,10 @@ ...@@ -19,10 +19,10 @@
#include "storage/bufpage.h" #include "storage/bufpage.h"
static void PageIndexTupleDeleteAdjustLinePointers(PageHeader phdr, static void PageIndexTupleDeleteAdjustLinePointers(PageHeader phdr,
char *location, Size size); char *location, Size size);
static bool PageManagerShuffle = true; /* default is shuffle mode */
/* ---------------------------------------------------------------- /* ----------------------------------------------------------------
* Page support functions * Page support functions
...@@ -53,21 +53,17 @@ PageInit(Page page, Size pageSize, Size specialSize) ...@@ -53,21 +53,17 @@ PageInit(Page page, Size pageSize, Size specialSize)
/* ---------------- /* ----------------
* PageAddItem * PageAddItem
* *
* add an item to a page. * Add an item to a page. Return value is offset at which it was
* * inserted, or InvalidOffsetNumber if there's not room to insert.
* !!! ELOG(ERROR) IS DISALLOWED HERE !!!
* *
* Notes on interface: * If offsetNumber is valid and <= current max offset in the page,
* If offsetNumber is valid, shuffle ItemId's down to make room * insert item into the array at that position by shuffling ItemId's
* to use it, if PageManagerShuffle is true. If PageManagerShuffle is * down to make room.
* false, then overwrite the specified ItemId. (PageManagerShuffle is
* true by default, and is modified by calling PageManagerModeSet.)
* If offsetNumber is not valid, then assign one by finding the first * If offsetNumber is not valid, then assign one by finding the first
* one that is both unused and deallocated. * one that is both unused and deallocated.
* *
* NOTE: If offsetNumber is valid, and PageManagerShuffle is true, it * !!! ELOG(ERROR) IS DISALLOWED HERE !!!
* is assumed that there is room on the page to shuffle the ItemId's *
* down by one.
* ---------------- * ----------------
*/ */
OffsetNumber OffsetNumber
...@@ -82,11 +78,8 @@ PageAddItem(Page page, ...@@ -82,11 +78,8 @@ PageAddItem(Page page,
Offset lower; Offset lower;
Offset upper; Offset upper;
ItemId itemId; ItemId itemId;
ItemId fromitemId,
toitemId;
OffsetNumber limit; OffsetNumber limit;
bool needshuffle = false;
bool shuffled = false;
/* /*
* Find first unallocated offsetNumber * Find first unallocated offsetNumber
...@@ -96,31 +89,12 @@ PageAddItem(Page page, ...@@ -96,31 +89,12 @@ PageAddItem(Page page,
/* was offsetNumber passed in? */ /* was offsetNumber passed in? */
if (OffsetNumberIsValid(offsetNumber)) if (OffsetNumberIsValid(offsetNumber))
{ {
if (PageManagerShuffle == true) needshuffle = true; /* need to increase "lower" */
{ /* don't actually do the shuffle till we've checked free space! */
/* shuffle ItemId's (Do the PageManager Shuffle...) */
for (i = (limit - 1); i >= offsetNumber; i--)
{
fromitemId = &((PageHeader) page)->pd_linp[i - 1];
toitemId = &((PageHeader) page)->pd_linp[i];
*toitemId = *fromitemId;
}
shuffled = true; /* need to increase "lower" */
}
else
{ /* overwrite mode */
itemId = &((PageHeader) page)->pd_linp[offsetNumber - 1];
if (((*itemId).lp_flags & LP_USED) ||
((*itemId).lp_len != 0))
{
elog(NOTICE, "PageAddItem: tried overwrite of used ItemId");
return InvalidOffsetNumber;
}
}
} }
else else
{ /* offsetNumber was not passed in, so find {
* one */ /* offsetNumber was not passed in, so find one */
/* look for "recyclable" (unused & deallocated) ItemId */ /* look for "recyclable" (unused & deallocated) ItemId */
for (offsetNumber = 1; offsetNumber < limit; offsetNumber++) for (offsetNumber = 1; offsetNumber < limit; offsetNumber++)
{ {
...@@ -130,9 +104,13 @@ PageAddItem(Page page, ...@@ -130,9 +104,13 @@ PageAddItem(Page page,
break; break;
} }
} }
/*
* Compute new lower and upper pointers for page, see if it'll fit
*/
if (offsetNumber > limit) if (offsetNumber > limit)
lower = (Offset) (((char *) (&((PageHeader) page)->pd_linp[offsetNumber])) - ((char *) page)); lower = (Offset) (((char *) (&((PageHeader) page)->pd_linp[offsetNumber])) - ((char *) page));
else if (offsetNumber == limit || shuffled == true) else if (offsetNumber == limit || needshuffle)
lower = ((PageHeader) page)->pd_lower + sizeof(ItemIdData); lower = ((PageHeader) page)->pd_lower + sizeof(ItemIdData);
else else
lower = ((PageHeader) page)->pd_lower; lower = ((PageHeader) page)->pd_lower;
...@@ -144,6 +122,23 @@ PageAddItem(Page page, ...@@ -144,6 +122,23 @@ PageAddItem(Page page,
if (lower > upper) if (lower > upper)
return InvalidOffsetNumber; return InvalidOffsetNumber;
/*
* OK to insert the item. First, shuffle the existing pointers if needed.
*/
if (needshuffle)
{
/* shuffle ItemId's (Do the PageManager Shuffle...) */
for (i = (limit - 1); i >= offsetNumber; i--)
{
ItemId fromitemId,
toitemId;
fromitemId = &((PageHeader) page)->pd_linp[i - 1];
toitemId = &((PageHeader) page)->pd_linp[i];
*toitemId = *fromitemId;
}
}
itemId = &((PageHeader) page)->pd_linp[offsetNumber - 1]; itemId = &((PageHeader) page)->pd_linp[offsetNumber - 1];
(*itemId).lp_off = upper; (*itemId).lp_off = upper;
(*itemId).lp_len = size; (*itemId).lp_len = size;
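The reordering in `PageAddItem` (check free space first, shuffle line pointers only afterwards) means a failed insert leaves the page unmodified. The same pattern on a toy slot array, as a sketch: `ToySlotPage`, `TOY_MAX_SLOTS` and `toy_add_item` are hypothetical names, and the real routine also manages the `pd_lower`/`pd_upper` bookkeeping that is omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_SLOTS 4

typedef struct ToySlotPage
{
    int         slot[TOY_MAX_SLOTS];
    size_t      nslots;         /* slots in use; offsets are 1-based */
} ToySlotPage;

/*
 * Insert 'value' at 1-based offset 'off', shifting later slots down.
 * Returns 'off' on success or 0 if the offset is out of range or the
 * page is full.  On failure the page is left untouched.
 */
size_t
toy_add_item(ToySlotPage *page, size_t off, int value)
{
    assert(page != NULL);

    if (off < 1 || off > page->nslots + 1)
        return 0;

    /* See if it'll fit before modifying anything. */
    if (page->nslots >= TOY_MAX_SLOTS)
        return 0;

    /* OK to insert: now shuffle the existing slots down by one. */
    memmove(&page->slot[off], &page->slot[off - 1],
            (page->nslots - (off - 1)) * sizeof(page->slot[0]));

    page->slot[off - 1] = value;
    page->nslots++;
    return off;
}
```

Had the shuffle been done before the capacity check, as in the old code path, a rejected insert would already have overwritten slots by the time the function returned.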
...@@ -168,9 +163,7 @@ PageGetTempPage(Page page, Size specialSize) ...@@ -168,9 +163,7 @@ PageGetTempPage(Page page, Size specialSize)
PageHeader thdr; PageHeader thdr;
pageSize = PageGetPageSize(page); pageSize = PageGetPageSize(page);
temp = (Page) palloc(pageSize);
if ((temp = (Page) palloc(pageSize)) == (Page) NULL)
elog(FATAL, "Cannot allocate %d bytes for temp page.", pageSize);
thdr = (PageHeader) temp; thdr = (PageHeader) temp;
/* copy old page in */ /* copy old page in */
...@@ -327,23 +320,6 @@ PageGetFreeSpace(Page page) ...@@ -327,23 +320,6 @@ PageGetFreeSpace(Page page)
return space; return space;
} }
/*
* PageManagerModeSet
*
* Sets mode to either: ShufflePageManagerMode (the default) or
* OverwritePageManagerMode. For use by access methods code
* for determining semantics of PageAddItem when the offsetNumber
* argument is passed in.
*/
void
PageManagerModeSet(PageManagerMode mode)
{
if (mode == ShufflePageManagerMode)
PageManagerShuffle = true;
else if (mode == OverwritePageManagerMode)
PageManagerShuffle = false;
}
/* /*
*---------------------------------------------------------------- *----------------------------------------------------------------
* PageIndexTupleDelete * PageIndexTupleDelete
...@@ -7,7 +7,7 @@ ...@@ -7,7 +7,7 @@
* Portions Copyright (c) 1996-2000, PostgreSQL, Inc * Portions Copyright (c) 1996-2000, PostgreSQL, Inc
* Portions Copyright (c) 1994, Regents of the University of California * Portions Copyright (c) 1994, Regents of the University of California
* *
* $Id: nbtree.h,v 1.38 2000/06/15 03:32:31 momjian Exp $ * $Id: nbtree.h,v 1.39 2000/07/21 06:42:35 tgl Exp $
* *
*------------------------------------------------------------------------- *-------------------------------------------------------------------------
*/ */
...@@ -24,14 +24,9 @@ ...@@ -24,14 +24,9 @@
* info. In addition, we need to know what sort of page this is * info. In addition, we need to know what sort of page this is
* (leaf or internal), and whether the page is available for reuse. * (leaf or internal), and whether the page is available for reuse.
* *
* Lehman and Yao's algorithm requires a ``high key'' on every page. * We also store a back-link to the parent page, but this cannot be trusted
* The high key on a page is guaranteed to be greater than or equal * very far since it does not get updated when the parent is split.
* to any key that appears on this page. Our insertion algorithm * See backend/access/nbtree/README for details.
* guarantees that we can use the initial least key on our right
* sibling as the high key. We allocate space for the line pointer
* to the high key in the opaque data at the end of the page.
*
* Rightmost pages in the tree have no high key.
*/ */
typedef struct BTPageOpaqueData typedef struct BTPageOpaqueData
...@@ -41,11 +36,11 @@ typedef struct BTPageOpaqueData ...@@ -41,11 +36,11 @@ typedef struct BTPageOpaqueData
BlockNumber btpo_parent; BlockNumber btpo_parent;
uint16 btpo_flags; uint16 btpo_flags;
#define BTP_LEAF (1 << 0) /* Bits defined in btpo_flags */
#define BTP_ROOT (1 << 1) #define BTP_LEAF (1 << 0) /* It's a leaf page */
#define BTP_FREE (1 << 2) #define BTP_ROOT (1 << 1) /* It's the root page (has no parent) */
#define BTP_META (1 << 3) #define BTP_FREE (1 << 2) /* not currently used... */
#define BTP_CHAIN (1 << 4) #define BTP_META (1 << 3) /* Set in the meta-page only */
} BTPageOpaqueData; } BTPageOpaqueData;
...@@ -84,21 +79,24 @@ typedef struct BTScanOpaqueData ...@@ -84,21 +79,24 @@ typedef struct BTScanOpaqueData
typedef BTScanOpaqueData *BTScanOpaque; typedef BTScanOpaqueData *BTScanOpaque;
/* /*
* BTItems are what we store in the btree. Each item has an index * BTItems are what we store in the btree. Each item is an index tuple,
* tuple, including key and pointer values. In addition, we must * including key and pointer values. (In some cases either the key or the
* guarantee that all tuples in the index are unique, in order to * pointer may go unused, see backend/access/nbtree/README for details.)
* satisfy some assumptions in Lehman and Yao. The way that we do *
* this is by generating a new OID for every insertion that we do in * Old comments:
* the tree. This adds eight bytes to the size of btree index * In addition, we must guarantee that all tuples in the index are unique,
* tuples. Note that we do not use the OID as part of a composite * in order to satisfy some assumptions in Lehman and Yao. The way that we
* key; the OID only serves as a unique identifier for a given index * do this is by generating a new OID for every insertion that we do in the
* tuple (logical position within a page). * tree. This adds eight bytes to the size of btree index tuples. Note
* that we do not use the OID as part of a composite key; the OID only
* serves as a unique identifier for a given index tuple (logical position
* within a page).
* *
* New comments: * New comments:
* actually, we must guarantee that all tuples in A LEVEL * actually, we must guarantee that all tuples in A LEVEL
* are unique, not in ALL INDEX. So, we can use bti_itup->t_tid * are unique, not in ALL INDEX. So, we can use bti_itup->t_tid
* as unique identifier for a given index tuple (logical position * as unique identifier for a given index tuple (logical position
* within a level). - vadim 04/09/97 * within a level). - vadim 04/09/97
*/ */
typedef struct BTItemData typedef struct BTItemData
...@@ -108,12 +106,13 @@ typedef struct BTItemData ...@@ -108,12 +106,13 @@ typedef struct BTItemData
typedef BTItemData *BTItem; typedef BTItemData *BTItem;
#define BTItemSame(i1, i2) ( i1->bti_itup.t_tid.ip_blkid.bi_hi == \ /* Test whether items are the "same" per the above notes */
i2->bti_itup.t_tid.ip_blkid.bi_hi && \ #define BTItemSame(i1, i2) ( (i1)->bti_itup.t_tid.ip_blkid.bi_hi == \
i1->bti_itup.t_tid.ip_blkid.bi_lo == \ (i2)->bti_itup.t_tid.ip_blkid.bi_hi && \
i2->bti_itup.t_tid.ip_blkid.bi_lo && \ (i1)->bti_itup.t_tid.ip_blkid.bi_lo == \
i1->bti_itup.t_tid.ip_posid == \ (i2)->bti_itup.t_tid.ip_blkid.bi_lo && \
i2->bti_itup.t_tid.ip_posid ) (i1)->bti_itup.t_tid.ip_posid == \
(i2)->bti_itup.t_tid.ip_posid )
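The added parentheses around `i1` and `i2` matter as soon as a macro argument is an expression rather than a bare identifier. A small sketch with hypothetical types (`ToyTid`, `ToyItem`) and a hypothetical helper `toy_find_same`: the macro is called with `items + i`, which the unparenthesized form would expand to `items + i->tid...` and fail to compile. As in `BTItemSame`, identity is decided by the tuple id, not by the key.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simplified tuple id: block number halves plus position. */
typedef struct ToyTid
{
    unsigned short blk_hi;
    unsigned short blk_lo;
    unsigned short posid;
} ToyTid;

typedef struct ToyItem
{
    ToyTid      tid;
    int         key;
} ToyItem;

/* Arguments are parenthesized so that pointer expressions work too. */
#define TOY_ITEM_SAME(i1, i2) ( (i1)->tid.blk_hi == (i2)->tid.blk_hi && \
                                (i1)->tid.blk_lo == (i2)->tid.blk_lo && \
                                (i1)->tid.posid == (i2)->tid.posid )

/* Return the index of the item with the same tid as 'target', or -1. */
int
toy_find_same(const ToyItem *items, int nitems, const ToyItem *target)
{
    int         i;

    assert(items != NULL && target != NULL);

    for (i = 0; i < nitems; i++)
    {
        if (TOY_ITEM_SAME(items + i, target))   /* expression argument */
            return i;
    }
    return -1;
}
```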
/* /*
* BTStackData -- As we descend a tree, we push the (key, pointer) * BTStackData -- As we descend a tree, we push the (key, pointer)
...@@ -129,24 +128,12 @@ typedef struct BTStackData ...@@ -129,24 +128,12 @@ typedef struct BTStackData
{ {
BlockNumber bts_blkno; BlockNumber bts_blkno;
OffsetNumber bts_offset; OffsetNumber bts_offset;
BTItem bts_btitem; BTItemData bts_btitem;
struct BTStackData *bts_parent; struct BTStackData *bts_parent;
} BTStackData; } BTStackData;
typedef BTStackData *BTStack; typedef BTStackData *BTStack;
typedef struct BTPageState
{
Buffer btps_buf;
Page btps_page;
BTItem btps_lastbti;
OffsetNumber btps_lastoff;
OffsetNumber btps_firstoff;
int btps_level;
bool btps_doupper;
struct BTPageState *btps_next;
} BTPageState;
/* /*
* We need to be able to tell the difference between read and write * We need to be able to tell the difference between read and write
* requests for pages, in order to do locking correctly. * requests for pages, in order to do locking correctly.
...@@ -155,31 +142,49 @@ typedef struct BTPageState ...@@ -155,31 +142,49 @@ typedef struct BTPageState
#define BT_READ BUFFER_LOCK_SHARE #define BT_READ BUFFER_LOCK_SHARE
#define BT_WRITE BUFFER_LOCK_EXCLUSIVE #define BT_WRITE BUFFER_LOCK_EXCLUSIVE
/*
* Similarly, the difference between insertion and non-insertion binary
* searches on a given page makes a difference when we're descending the
* tree.
*/
#define BT_INSERTION 0
#define BT_DESCENT 1
/* /*
* In general, the btree code tries to localize its knowledge about * In general, the btree code tries to localize its knowledge about
* page layout to a couple of routines. However, we need a special * page layout to a couple of routines. However, we need a special
* value to indicate "no page number" in those places where we expect * value to indicate "no page number" in those places where we expect
* page numbers. * page numbers. We can use zero for this because we never need to
* make a pointer to the metadata page.
*/ */
#define P_NONE 0 #define P_NONE 0
/*
* Macros to test whether a page is leftmost or rightmost on its tree level,
* as well as other state info kept in the opaque data.
*/
#define P_LEFTMOST(opaque) ((opaque)->btpo_prev == P_NONE) #define P_LEFTMOST(opaque) ((opaque)->btpo_prev == P_NONE)
#define P_RIGHTMOST(opaque) ((opaque)->btpo_next == P_NONE) #define P_RIGHTMOST(opaque) ((opaque)->btpo_next == P_NONE)
#define P_ISLEAF(opaque) ((opaque)->btpo_flags & BTP_LEAF)
#define P_ISROOT(opaque) ((opaque)->btpo_flags & BTP_ROOT)
/*
* Lehman and Yao's algorithm requires a ``high key'' on every non-rightmost
* page. The high key is not a data key, but gives info about what range of
* keys is supposed to be on this page. The high key on a page is required
* to be greater than or equal to any data key that appears on the page.
* If we find ourselves trying to insert a key > high key, we know we need
* to move right (this should only happen if the page was split since we
* examined the parent page).
*
* Our insertion algorithm guarantees that we can use the initial least key
* on our right sibling as the high key. Once a page is created, its high
* key changes only if the page is split.
*
* On a non-rightmost page, the high key lives in item 1 and data items
* start in item 2. Rightmost pages have no high key, so we store data
* items beginning in item 1.
*/
#define P_HIKEY ((OffsetNumber) 1) #define P_HIKEY ((OffsetNumber) 1)
#define P_FIRSTKEY ((OffsetNumber) 2) #define P_FIRSTKEY ((OffsetNumber) 2)
#define P_FIRSTDATAKEY(opaque) (P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY)
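The item layout described above (high key in item 1 on non-rightmost pages, data from item 1 on rightmost pages) can be sketched with a toy page. All names here (`ToyKeyPage`, the `TOY_*` macros, `toy_count_data_items`, `toy_must_move_right`) are hypothetical, and keys are plain ints instead of index tuples.

```c
#include <assert.h>

typedef unsigned short ToyOffset;

#define TOY_NONE        0
#define TOY_HIKEY       ((ToyOffset) 1)
#define TOY_FIRSTKEY    ((ToyOffset) 2)
#define TOY_MAXITEMS    8

typedef struct ToyKeyPage
{
    unsigned    next;               /* right sibling, TOY_NONE if rightmost */
    ToyOffset   nitems;             /* items on page, offsets 1..nitems */
    int         keys[TOY_MAXITEMS]; /* keys[off - 1] is the key at offset off */
} ToyKeyPage;

#define TOY_RIGHTMOST(p)    ((p)->next == TOY_NONE)
#define TOY_FIRSTDATAKEY(p) (TOY_RIGHTMOST(p) ? TOY_HIKEY : TOY_FIRSTKEY)

/* Number of true data items, not counting the high key if there is one. */
int
toy_count_data_items(const ToyKeyPage *p)
{
    assert(p->nitems <= TOY_MAXITEMS);
    return p->nitems - TOY_FIRSTDATAKEY(p) + 1;
}

/*
 * A key greater than the high key cannot belong on this page, so the
 * search must move right.  Rightmost pages have no high key and accept
 * any key.
 */
int
toy_must_move_right(const ToyKeyPage *p, int key)
{
    if (TOY_RIGHTMOST(p))
        return 0;
    return key > p->keys[TOY_HIKEY - 1];
}
```

For a non-rightmost page holding `{50, 10, 20}` the high key is 50 and the data items are 10 and 20. A key of exactly 50 does not move right in this sketch, since the high key is only required to be greater than or equal to every data key on the page.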
/* /*
* Strategy numbers -- ordering of these is <, <=, =, >=, > * Operator strategy numbers -- ordering of these is <, <=, =, >=, >
*/ */
#define BTLessStrategyNumber 1 #define BTLessStrategyNumber 1
...@@ -199,13 +204,27 @@ typedef struct BTPageState ...@@ -199,13 +204,27 @@ typedef struct BTPageState
#define BTORDER_PROC 1 #define BTORDER_PROC 1
/*
* prototypes for functions in nbtree.c (external entry points for btree)
*/
extern bool BuildingBtree; /* in nbtree.c */
extern Datum btbuild(PG_FUNCTION_ARGS);
extern Datum btinsert(PG_FUNCTION_ARGS);
extern Datum btgettuple(PG_FUNCTION_ARGS);
extern Datum btbeginscan(PG_FUNCTION_ARGS);
extern Datum btrescan(PG_FUNCTION_ARGS);
extern void btmovescan(IndexScanDesc scan, Datum v);
extern Datum btendscan(PG_FUNCTION_ARGS);
extern Datum btmarkpos(PG_FUNCTION_ARGS);
extern Datum btrestrpos(PG_FUNCTION_ARGS);
extern Datum btdelete(PG_FUNCTION_ARGS);
/* /*
* prototypes for functions in nbtinsert.c * prototypes for functions in nbtinsert.c
*/ */
extern InsertIndexResult _bt_doinsert(Relation rel, BTItem btitem, extern InsertIndexResult _bt_doinsert(Relation rel, BTItem btitem,
bool index_is_unique, Relation heapRel); bool index_is_unique, Relation heapRel);
extern bool _bt_itemcmp(Relation rel, Size keysz, ScanKey scankey,
BTItem item1, BTItem item2, StrategyNumber strat);
/* /*
* prototypes for functions in nbtpage.c * prototypes for functions in nbtpage.c
...@@ -218,25 +237,8 @@ extern void _bt_wrtbuf(Relation rel, Buffer buf); ...@@ -218,25 +237,8 @@ extern void _bt_wrtbuf(Relation rel, Buffer buf);
extern void _bt_wrtnorelbuf(Relation rel, Buffer buf); extern void _bt_wrtnorelbuf(Relation rel, Buffer buf);
extern void _bt_pageinit(Page page, Size size); extern void _bt_pageinit(Page page, Size size);
extern void _bt_metaproot(Relation rel, BlockNumber rootbknum, int level); extern void _bt_metaproot(Relation rel, BlockNumber rootbknum, int level);
extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, int access);
extern void _bt_pagedel(Relation rel, ItemPointer tid); extern void _bt_pagedel(Relation rel, ItemPointer tid);
/*
* prototypes for functions in nbtree.c
*/
extern bool BuildingBtree; /* in nbtree.c */
extern Datum btbuild(PG_FUNCTION_ARGS);
extern Datum btinsert(PG_FUNCTION_ARGS);
extern Datum btgettuple(PG_FUNCTION_ARGS);
extern Datum btbeginscan(PG_FUNCTION_ARGS);
extern Datum btrescan(PG_FUNCTION_ARGS);
extern void btmovescan(IndexScanDesc scan, Datum v);
extern Datum btendscan(PG_FUNCTION_ARGS);
extern Datum btmarkpos(PG_FUNCTION_ARGS);
extern Datum btrestrpos(PG_FUNCTION_ARGS);
extern Datum btdelete(PG_FUNCTION_ARGS);
/* /*
* prototypes for functions in nbtscan.c * prototypes for functions in nbtscan.c
*/ */
@@ -249,13 +251,13 @@ extern void AtEOXact_nbtree(void);
  * prototypes for functions in nbtsearch.c
  */
 extern BTStack _bt_search(Relation rel, int keysz, ScanKey scankey,
-Buffer *bufP);
+Buffer *bufP, int access);
 extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
 ScanKey scankey, int access);
-extern bool _bt_skeycmp(Relation rel, Size keysz, ScanKey scankey,
-Page page, ItemId itemid, StrategyNumber strat);
 extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
-ScanKey scankey, int srchtype);
+ScanKey scankey);
+extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
+Page page, OffsetNumber offnum);
 extern RetrieveIndexResult _bt_next(IndexScanDesc scan, ScanDirection dir);
 extern RetrieveIndexResult _bt_first(IndexScanDesc scan, ScanDirection dir);
 extern bool _bt_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir);
......
@@ -7,7 +7,7 @@
  * Portions Copyright (c) 1996-2000, PostgreSQL, Inc
  * Portions Copyright (c) 1994, Regents of the University of California
  *
- * $Id: bufpage.h,v 1.30 2000/07/03 02:54:21 vadim Exp $
+ * $Id: bufpage.h,v 1.31 2000/07/21 06:42:39 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -309,7 +309,6 @@ extern Page PageGetTempPage(Page page, Size specialSize);
 extern void PageRestoreTempPage(Page tempPage, Page oldPage);
 extern void PageRepairFragmentation(Page page);
 extern Size PageGetFreeSpace(Page page);
-extern void PageManagerModeSet(PageManagerMode mode);
 extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
......