Commit 2ed5b87f authored by Kevin Grittner

Reduce pinning and buffer content locking for btree scans.

Even though the main benefit of the Lehman and Yao algorithm for
btrees is that no locks need be held between page reads in an
index search, we were holding a buffer pin on each leaf page after
it was read until we were ready to read the next one.  The reason
was so that we could treat this as a weak lock to create an
"interlock" with vacuum's deletion of heap line pointers, even
though our README file pointed out that this was not necessary for
a scan using an MVCC snapshot.

The main goal of this patch is to reduce the blocking of vacuum
processes by in-progress btree index scans (including a cursor
which is idle), but the code rearrangement also allows for one
less buffer content lock to be taken when a forward scan steps from
one page to the next, which results in a small but consistent
performance improvement in many workloads.

This patch leaves behavior unchanged for some cases, which can be
addressed separately so that each case can be evaluated on its own
merits.  These unchanged cases are a scan using a non-MVCC snapshot,
an index-only scan, and a scan of a btree index whose modifications
are not WAL-logged.  If later patches allow all of
these cases to drop the buffer pin after reading a leaf page, then
the btree vacuum process can be simplified; it will no longer need
the "super-exclusive" lock to delete tuples from a page.

Reviewed by Heikki Linnakangas and Kyotaro Horiguchi
parent 8217fb14
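The test that decides whether a given scan may drop its leaf-page pin lives in the nbtsearch.c diff, which is collapsed further below; the commit message spells out the three conditions. What follows is only a minimal sketch of that test, with a hypothetical helper name (the committed code may factor it differently, and the header providing IsMVCCSnapshot varies by branch):

    #include "postgres.h"
    #include "access/relscan.h"
    #include "utils/rel.h"
    #include "utils/snapmgr.h"   /* IsMVCCSnapshot (utils/tqual.h on older branches) */

    /*
     * Hypothetical helper: a btree scan may release its leaf-page pin early
     * only when all three conditions from the commit message hold.
     */
    static bool
    _bt_scan_can_drop_pin(IndexScanDesc scan)
    {
        return IsMVCCSnapshot(scan->xs_snapshot) &&      /* MVCC snapshot rules apply */
               RelationNeedsWAL(scan->indexRelation) &&  /* index changes are WAL-logged */
               !scan->xs_want_itup;                      /* not an index-only scan */
    }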
src/backend/access/nbtree/README

@@ -92,17 +92,18 @@ To minimize lock/unlock traffic, an index scan always searches a leaf page
 to identify all the matching items at once, copying their heap tuple IDs
 into backend-local storage.  The heap tuple IDs are then processed while
 not holding any page lock within the index.  We do continue to hold a pin
-on the leaf page, to protect against concurrent deletions (see below).
-In this state the scan is effectively stopped "between" pages, either
-before or after the page it has pinned.  This is safe in the presence of
-concurrent insertions and even page splits, because items are never moved
-across pre-existing page boundaries --- so the scan cannot miss any items
-it should have seen, nor accidentally return the same item twice.  The scan
-must remember the page's right-link at the time it was scanned, since that
-is the page to move right to; if we move right to the current right-link
-then we'd re-scan any items moved by a page split.  We don't similarly
-remember the left-link, since it's best to use the most up-to-date
-left-link when trying to move left (see detailed move-left algorithm below).
+on the leaf page in some circumstances, to protect against concurrent
+deletions (see below).  In this state the scan is effectively stopped
+"between" pages, either before or after the page it has pinned.  This is
+safe in the presence of concurrent insertions and even page splits, because
+items are never moved across pre-existing page boundaries --- so the scan
+cannot miss any items it should have seen, nor accidentally return the same
+item twice.  The scan must remember the page's right-link at the time it
+was scanned, since that is the page to move right to; if we move right to
+the current right-link then we'd re-scan any items moved by a page split.
+We don't similarly remember the left-link, since it's best to use the most
+up-to-date left-link when trying to move left (see detailed move-left
+algorithm below).
 
 In most cases we release our lock and pin on a page before attempting
 to acquire pin and lock on the page we are moving to.  In a few places
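The README's point about remembering the right-link is what lets a forward scan step to the next page without overlapping locks or pins. The stepping code itself is in the collapsed nbtsearch.c diff; the following is only an illustrative sketch under that assumption (the helper name is invented, and _bt_readpage appears in a comment only):

    #include "postgres.h"
    #include "access/nbtree.h"
    #include "access/relscan.h"

    /*
     * Illustrative sketch: step the scan to the page right of the one whose
     * items are cached in so->currPos.  The old page's content lock was
     * already released after its items were copied; with this patch the pin
     * can usually be released as well before the next page is locked.
     */
    static bool
    step_right_sketch(IndexScanDesc scan)
    {
        BTScanOpaque so = (BTScanOpaque) scan->opaque;
        BlockNumber next = so->currPos.nextPage;   /* right-link remembered at read time */

        BTScanPosUnpinIfPinned(so->currPos);       /* done with the old page */

        if (next == P_NONE)
            return false;                          /* no page to the right */

        /* pin and read-lock the next page, then cache its matching items */
        so->currPos.buf = _bt_getbuf(scan->indexRelation, next, BT_READ);
        /* ... _bt_readpage(scan, ForwardScanDirection, ...) fills currPos ... */
        return true;
    }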
@@ -154,25 +155,37 @@ starts.  This is not necessary for correctness in terms of the btree index
 operations themselves; as explained above, index scans logically stop
 "between" pages and so can't lose their place.  The reason we do it is to
 provide an interlock between non-full VACUUM and indexscans.  Since VACUUM
-deletes index entries before deleting tuples, the super-exclusive lock
-guarantees that VACUUM can't delete any heap tuple that an indexscanning
-process might be about to visit.  (This guarantee works only for simple
-indexscans that visit the heap in sync with the index scan, not for bitmap
-scans.  We only need the guarantee when using non-MVCC snapshot rules; in
-an MVCC snapshot, it wouldn't matter if the heap tuple were replaced with
-an unrelated tuple at the same TID, because the new tuple wouldn't be
-visible to our scan anyway.)
+deletes index entries before reclaiming heap tuple line pointers, the
+super-exclusive lock guarantees that VACUUM can't reclaim for re-use a
+line pointer that an indexscanning process might be about to visit.  This
+guarantee works only for simple indexscans that visit the heap in sync
+with the index scan, not for bitmap scans.  We only need the guarantee
+when using non-MVCC snapshot rules; when using an MVCC snapshot, it
+doesn't matter if the heap tuple is replaced with an unrelated tuple at
+the same TID, because the new tuple won't be visible to our scan anyway.
+Therefore, a scan using an MVCC snapshot which has no other confounding
+factors will not hold the pin after the page contents are read.  The
+current reasons for exceptions, where a pin is still needed, are if the
+index is not WAL-logged or if the scan is an index-only scan.  If later
+work allows the pin to be dropped for all cases we will be able to
+simplify the vacuum code, since the concept of a super-exclusive lock
+for btree indexes will no longer be needed.
 
-Because a page can be split even while someone holds a pin on it, it is
-possible that an indexscan will return items that are no longer stored on
-the page it has a pin on, but rather somewhere to the right of that page.
-To ensure that VACUUM can't prematurely remove such heap tuples, we require
-btbulkdelete to obtain super-exclusive lock on every leaf page in the index,
-even pages that don't contain any deletable tuples.  This guarantees that
-the btbulkdelete call cannot return while any indexscan is still holding
-a copy of a deleted index tuple.  Note that this requirement does not say
-that btbulkdelete must visit the pages in any particular order.  (See also
-on-the-fly deletion, below.)
+Because a pin is not always held, and a page can be split even while
+someone does hold a pin on it, it is possible that an indexscan will
+return items that are no longer stored on the page it has a pin on, but
+rather somewhere to the right of that page.  To ensure that VACUUM can't
+prematurely remove such heap tuples, we require btbulkdelete to obtain a
+super-exclusive lock on every leaf page in the index, even pages that
+don't contain any deletable tuples.  Any scan which could yield incorrect
+results if the tuple at a TID matching the scan's range and filter
+conditions were replaced by a different tuple while the scan is in
+progress must hold the pin on each index page until all index entries read
+from the page have been processed.  This guarantees that the btbulkdelete
+call cannot return while any indexscan is still holding a copy of a
+deleted index tuple if the scan could be confused by that.  Note that this
+requirement does not say that btbulkdelete must visit the pages in any
+particular order.  (See also on-the-fly deletion, below.)
 
 There is no such interlocking for deletion of items in internal pages,
 since backends keep no lock nor pin on a page they have descended past.
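The "super-exclusive" lock this passage refers to is the buffer manager's cleanup lock: an exclusive content lock that is granted only once no other backend holds a pin on the buffer, which is exactly why a scan's leaf-page pin blocks VACUUM. A simplified, assumed sketch of the acquisition pattern follows (roughly what btvacuumpage() does; the function name below is hypothetical and the actual deletion work is omitted):

    #include "postgres.h"
    #include "storage/bufmgr.h"
    #include "utils/rel.h"

    /* Hypothetical sketch: take the cleanup ("super-exclusive") lock on one leaf page. */
    static void
    vacuum_one_leaf_page(Relation rel, BlockNumber blkno, BufferAccessStrategy strategy)
    {
        Buffer      buf;

        buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, strategy);
        LockBufferForCleanup(buf);      /* blocks until no other backend holds a pin */

        /* ... remove deletable index tuples, mark the buffer dirty, emit WAL ... */

        UnlockReleaseBuffer(buf);
    }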
@@ -396,8 +409,12 @@ that this breaks the interlock between VACUUM and indexscans, but that is
 not so: as long as an indexscanning process has a pin on the page where
 the index item used to be, VACUUM cannot complete its btbulkdelete scan
 and so cannot remove the heap tuple.  This is another reason why
-btbulkdelete has to get super-exclusive lock on every leaf page, not only
-the ones where it actually sees items to delete.
+btbulkdelete has to get a super-exclusive lock on every leaf page, not
+only the ones where it actually sees items to delete.  So that we can
+handle the cases where we attempt LP_DEAD flagging for a page after we
+have released its pin, we remember the LSN of the index page when we read
+the index tuples from it; we do not attempt to flag index tuples as dead
+if we didn't hold the pin the entire time and the LSN has changed.
 
 WAL Considerations
 ------------------
@@ -462,7 +479,7 @@ metapage update (of the "fast root" link) is performed and logged as part
 of the insertion into the parent level.  When splitting the root page, the
 metapage update is handled as part of the "new root" action.
 
-Each step in page deletion are logged as separate WAL entries: marking the
+Each step in page deletion is logged as a separate WAL entry: marking the
 leaf as half-dead and removing the downlink is one record, and unlinking a
 page is a second record.  If vacuum is interrupted for some reason, or the
 system crashes, the tree is consistent for searches and insertions.  The
......
src/backend/access/nbtree/nbtinsert.c

@@ -498,9 +498,9 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
  *		If there's not enough room in the space, we try to make room by
  *		removing any LP_DEAD tuples.
  *
- *		On entry, *buf and *offsetptr point to the first legal position
+ *		On entry, *bufptr and *offsetptr point to the first legal position
  *		where the new tuple could be inserted.  The caller should hold an
- *		exclusive lock on *buf.  *offsetptr can also be set to
+ *		exclusive lock on *bufptr.  *offsetptr can also be set to
  *		InvalidOffsetNumber, in which case the function will search for the
  *		right location within the page if needed.  On exit, they point to the
  *		chosen insert location.  If _bt_findinsertloc decides to move right,
......
src/backend/access/nbtree/nbtree.c

@@ -404,7 +404,8 @@ btbeginscan(PG_FUNCTION_ARGS)
 	/* allocate private workspace */
 	so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));
-	so->currPos.buf = so->markPos.buf = InvalidBuffer;
+	BTScanPosInvalidate(so->currPos);
+	BTScanPosInvalidate(so->markPos);
 	if (scan->numberOfKeys > 0)
 		so->keyData = (ScanKey) palloc(scan->numberOfKeys * sizeof(ScanKeyData));
 	else
@@ -424,8 +425,6 @@ btbeginscan(PG_FUNCTION_ARGS)
 	 * scan->xs_itupdesc whether we'll need it or not, since that's so cheap.
 	 */
 	so->currTuples = so->markTuples = NULL;
-	so->currPos.nextTupleOffset = 0;
-	so->markPos.nextTupleOffset = 0;
 
 	scan->xs_itupdesc = RelationGetDescr(rel);
@@ -451,17 +450,14 @@ btrescan(PG_FUNCTION_ARGS)
 	{
 		/* Before leaving current page, deal with any killed items */
 		if (so->numKilled > 0)
-			_bt_killitems(scan, false);
-		ReleaseBuffer(so->currPos.buf);
-		so->currPos.buf = InvalidBuffer;
+			_bt_killitems(scan);
+		BTScanPosUnpinIfPinned(so->currPos);
+		BTScanPosInvalidate(so->currPos);
 	}
 
-	if (BTScanPosIsValid(so->markPos))
-	{
-		ReleaseBuffer(so->markPos.buf);
-		so->markPos.buf = InvalidBuffer;
-	}
 	so->markItemIndex = -1;
+	BTScanPosUnpinIfPinned(so->markPos);
+	BTScanPosInvalidate(so->markPos);
 
 	/*
 	 * Allocate tuple workspace arrays, if needed for an index-only scan and
@@ -515,17 +511,14 @@ btendscan(PG_FUNCTION_ARGS)
 	{
 		/* Before leaving current page, deal with any killed items */
 		if (so->numKilled > 0)
-			_bt_killitems(scan, false);
-		ReleaseBuffer(so->currPos.buf);
-		so->currPos.buf = InvalidBuffer;
+			_bt_killitems(scan);
+		BTScanPosUnpinIfPinned(so->currPos);
 	}
 
-	if (BTScanPosIsValid(so->markPos))
-	{
-		ReleaseBuffer(so->markPos.buf);
-		so->markPos.buf = InvalidBuffer;
-	}
 	so->markItemIndex = -1;
+	BTScanPosUnpinIfPinned(so->markPos);
+
+	/* No need to invalidate positions, the RAM is about to be freed. */
 
 	/* Release storage */
 	if (so->keyData != NULL)
@@ -552,12 +545,8 @@ btmarkpos(PG_FUNCTION_ARGS)
 	IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 
-	/* we aren't holding any read locks, but gotta drop the pin */
-	if (BTScanPosIsValid(so->markPos))
-	{
-		ReleaseBuffer(so->markPos.buf);
-		so->markPos.buf = InvalidBuffer;
-	}
+	/* There may be an old mark with a pin (but no lock). */
+	BTScanPosUnpinIfPinned(so->markPos);
 
 	/*
 	 * Just record the current itemIndex.  If we later step to next page
@@ -568,7 +557,10 @@ btmarkpos(PG_FUNCTION_ARGS)
 	if (BTScanPosIsValid(so->currPos))
 		so->markItemIndex = so->currPos.itemIndex;
 	else
+	{
+		BTScanPosInvalidate(so->markPos);
 		so->markItemIndex = -1;
+	}
 
 	/* Also record the current positions of any array keys */
 	if (so->numArrayKeys)
@@ -593,28 +585,54 @@ btrestrpos(PG_FUNCTION_ARGS)
 	if (so->markItemIndex >= 0)
 	{
 		/*
-		 * The mark position is on the same page we are currently on.  Just
+		 * The scan has never moved to a new page since the last mark.  Just
 		 * restore the itemIndex.
+		 *
+		 * NB: In this case we can't count on anything in so->markPos to be
+		 * accurate.
 		 */
 		so->currPos.itemIndex = so->markItemIndex;
 	}
+	else if (so->currPos.currPage == so->markPos.currPage)
+	{
+		/*
+		 * so->markItemIndex < 0 but mark and current positions are on the
+		 * same page.  This would be an unusual case, where the scan moved to
+		 * a new index page after the mark, restored, and later restored again
+		 * without moving off the marked page.  It is not clear that this code
+		 * can currently be reached, but it seems better to make this function
+		 * robust for this case than to Assert() or elog() that it can't
+		 * happen.
+		 *
+		 * We neither want to set so->markItemIndex >= 0 (because that could
+		 * cause a later move to a new page to redo the memcpy() executions)
+		 * nor re-execute the memcpy() functions for a restore within the same
+		 * page.  The previous restore to this page already set everything
+		 * except markPos as it should be.
+		 */
+		so->currPos.itemIndex = so->markPos.itemIndex;
+	}
 	else
 	{
-		/* we aren't holding any read locks, but gotta drop the pin */
+		/*
+		 * The scan moved to a new page after last mark or restore, and we are
+		 * now restoring to the marked page.  We aren't holding any read
+		 * locks, but if we're still holding the pin for the current position,
+		 * we must drop it.
+		 */
 		if (BTScanPosIsValid(so->currPos))
 		{
 			/* Before leaving current page, deal with any killed items */
-			if (so->numKilled > 0 &&
-				so->currPos.buf != so->markPos.buf)
-				_bt_killitems(scan, false);
-			ReleaseBuffer(so->currPos.buf);
-			so->currPos.buf = InvalidBuffer;
+			if (so->numKilled > 0)
+				_bt_killitems(scan);
+			BTScanPosUnpinIfPinned(so->currPos);
 		}
 
 		if (BTScanPosIsValid(so->markPos))
 		{
 			/* bump pin on mark buffer for assignment to current buffer */
-			IncrBufferRefCount(so->markPos.buf);
+			if (BTScanPosIsPinned(so->markPos))
+				IncrBufferRefCount(so->markPos.buf);
 			memcpy(&so->currPos, &so->markPos,
 				   offsetof(BTScanPosData, items[1]) +
 				   so->markPos.lastItem * sizeof(BTScanPosItem));
@@ -622,6 +640,8 @@ btrestrpos(PG_FUNCTION_ARGS)
 			memcpy(so->currTuples, so->markTuples,
 				   so->markPos.nextTupleOffset);
 		}
+		else
+			BTScanPosInvalidate(so->currPos);
 	}
 
 	PG_RETURN_VOID();
......
This diff is collapsed.
src/backend/access/nbtree/nbtutils.c

@@ -1721,27 +1721,35 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, TupleDesc tupdesc,
  * _bt_killitems - set LP_DEAD state for items an indexscan caller has
  *		told us were killed
  *
- * scan->so contains information about the current page and killed tuples
- * thereon (generally, this should only be called if so->numKilled > 0).
+ * scan->opaque, referenced locally through so, contains information about the
+ * current page and killed tuples thereon (generally, this should only be
+ * called if so->numKilled > 0).
  *
- * The caller must have pin on so->currPos.buf, but may or may not have
- * read-lock, as indicated by haveLock.  Note that we assume read-lock
- * is sufficient for setting LP_DEAD status (which is only a hint).
+ * The caller does not have a lock on the page and may or may not have the
+ * page pinned in a buffer.  Note that read-lock is sufficient for setting
+ * LP_DEAD status (which is only a hint).
  *
  * We match items by heap TID before assuming they are the right ones to
  * delete.  We cope with cases where items have moved right due to insertions.
  * If an item has moved off the current page due to a split, we'll fail to
  * find it and do nothing (this is not an error case --- we assume the item
- * will eventually get marked in a future indexscan).  Note that because we
- * hold pin on the target page continuously from initially reading the items
- * until applying this function, VACUUM cannot have deleted any items from
- * the page, and so there is no need to search left from the recorded offset.
- * (This observation also guarantees that the item is still the right one
- * to delete, which might otherwise be questionable since heap TIDs can get
- * recycled.)
+ * will eventually get marked in a future indexscan).
+ *
+ * Note that if we hold a pin on the target page continuously from initially
+ * reading the items until applying this function, VACUUM cannot have deleted
+ * any items from the page, and so there is no need to search left from the
+ * recorded offset.  (This observation also guarantees that the item is still
+ * the right one to delete, which might otherwise be questionable since heap
+ * TIDs can get recycled.)  This holds true even if the page has been modified
+ * by inserts and page splits, so there is no need to consult the LSN.
+ *
+ * If the pin was released after reading the page, then we re-read it.  If it
+ * has been modified since we read it (as determined by the LSN), we dare not
+ * flag any entries because it is possible that the old entry was vacuumed
+ * away and the TID was re-used by a completely different heap tuple.
  */
 void
-_bt_killitems(IndexScanDesc scan, bool haveLock)
+_bt_killitems(IndexScanDesc scan)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 	Page		page;
@@ -1749,19 +1757,56 @@ _bt_killitems(IndexScanDesc scan, bool haveLock)
 	OffsetNumber minoff;
 	OffsetNumber maxoff;
 	int			i;
+	int			numKilled = so->numKilled;
 	bool		killedsomething = false;
 
-	Assert(BufferIsValid(so->currPos.buf));
+	Assert(BTScanPosIsValid(so->currPos));
 
-	if (!haveLock)
-		LockBuffer(so->currPos.buf, BT_READ);
+	/*
+	 * Always reset the scan state, so we don't look for same items on other
+	 * pages.
+	 */
+	so->numKilled = 0;
 
-	page = BufferGetPage(so->currPos.buf);
+	if (BTScanPosIsPinned(so->currPos))
+	{
+		/*
+		 * We have held the pin on this page since we read the index tuples,
+		 * so all we need to do is lock it.  The pin will have prevented
+		 * re-use of any TID on the page, so there is no need to check the
+		 * LSN.
+		 */
+		LockBuffer(so->currPos.buf, BT_READ);
+
+		page = BufferGetPage(so->currPos.buf);
+	}
+	else
+	{
+		Buffer		buf;
+
+		/* Attempt to re-read the buffer, getting pin and lock. */
+		buf = _bt_getbuf(scan->indexRelation, so->currPos.currPage, BT_READ);
+
+		/* It might not exist anymore; in which case we can't hint it. */
+		if (!BufferIsValid(buf))
+			return;
+
+		page = BufferGetPage(buf);
+		if (PageGetLSN(page) == so->currPos.lsn)
+			so->currPos.buf = buf;
+		else
+		{
+			/* Modified while not pinned means hinting is not safe. */
+			_bt_relbuf(scan->indexRelation, buf);
+			return;
+		}
+	}
+
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	minoff = P_FIRSTDATAKEY(opaque);
 	maxoff = PageGetMaxOffsetNumber(page);
 
-	for (i = 0; i < so->numKilled; i++)
+	for (i = 0; i < numKilled; i++)
 	{
 		int			itemIndex = so->killedItems[i];
 		BTScanPosItem *kitem = &so->currPos.items[itemIndex];
@@ -1799,14 +1844,7 @@ _bt_killitems(IndexScanDesc scan, bool haveLock)
 		MarkBufferDirtyHint(so->currPos.buf, true);
 	}
 
-	if (!haveLock)
-		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
-
-	/*
-	 * Always reset the scan state, so we don't look for same items on other
-	 * pages.
-	 */
-	so->numKilled = 0;
+	LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
 }
......
src/backend/access/nbtree/nbtxlog.c

@@ -395,7 +395,8 @@ btree_xlog_vacuum(XLogReaderState *record)
 	 * unpinned between the lastBlockVacuumed and the current block, if there
 	 * are any.  This prevents replay of the VACUUM from reaching the stage of
 	 * removing heap tuples while there could still be indexscans "in flight"
-	 * to those particular tuples (see nbtree/README).
+	 * to those particular tuples for those scans which could be confused by
+	 * finding new tuples at the old TID locations (see nbtree/README).
 	 *
 	 * It might be worth checking if there are actually any backends running;
 	 * if not, we could just skip this.
......
src/backend/storage/buffer/README

@@ -18,8 +18,8 @@ backend to pin a page more than once concurrently; the buffer manager
 handles this efficiently.  It is considered OK to hold a pin for long
 intervals --- for example, sequential scans hold a pin on the current page
 until done processing all the tuples on the page, which could be quite a
-while if the scan is the outer scan of a join.  Similarly, btree index
-scans hold a pin on the current index page.  This is OK because normal
+while if the scan is the outer scan of a join.  Similarly, a btree index
+scan may hold a pin on the current index page.  This is OK because normal
 operations never wait for a page's pin count to drop to zero.  (Anything
 that might need to do such a wait is instead handled by waiting to obtain
 the relation-level lock, which is why you'd better hold one first.)  Pins
......
src/include/access/nbtree.h

@@ -518,6 +518,8 @@ typedef struct BTScanPosData
 {
 	Buffer		buf;			/* if valid, the buffer is pinned */
 
+	XLogRecPtr	lsn;			/* pos in the WAL stream when page was read */
+	BlockNumber	currPage;		/* page referenced by items array */
 	BlockNumber nextPage;		/* page's right link when we scanned it */
 
 	/*
@@ -551,7 +553,37 @@ typedef struct BTScanPosData
 
 typedef BTScanPosData *BTScanPos;
 
-#define BTScanPosIsValid(scanpos)	BufferIsValid((scanpos).buf)
+#define BTScanPosIsPinned(scanpos) \
+( \
+	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
+				!BufferIsValid((scanpos).buf)), \
+	BufferIsValid((scanpos).buf) \
+)
+#define BTScanPosUnpin(scanpos) \
+	do { \
+		ReleaseBuffer((scanpos).buf); \
+		(scanpos).buf = InvalidBuffer; \
+	} while (0)
+#define BTScanPosUnpinIfPinned(scanpos) \
+	do { \
+		if (BTScanPosIsPinned(scanpos)) \
+			BTScanPosUnpin(scanpos); \
+	} while (0)
+
+#define BTScanPosIsValid(scanpos) \
+( \
+	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
+				!BufferIsValid((scanpos).buf)), \
+	BlockNumberIsValid((scanpos).currPage) \
+)
+#define BTScanPosInvalidate(scanpos) \
+	do { \
+		(scanpos).currPage = InvalidBlockNumber; \
+		(scanpos).nextPage = InvalidBlockNumber; \
+		(scanpos).buf = InvalidBuffer; \
+		(scanpos).lsn = InvalidXLogRecPtr; \
+		(scanpos).nextTupleOffset = 0; \
+	} while (0)
 
 /* We need one of these for each equality-type SK_SEARCHARRAY scan key */
 typedef struct BTArrayKeyInfo
@@ -697,7 +729,7 @@ extern void _bt_preprocess_keys(IndexScanDesc scan);
 extern IndexTuple _bt_checkkeys(IndexScanDesc scan,
 			  Page page, OffsetNumber offnum,
 			  ScanDirection dir, bool *continuescan);
-extern void _bt_killitems(IndexScanDesc scan, bool haveLock);
+extern void _bt_killitems(IndexScanDesc scan);
 extern BTCycleId _bt_vacuum_cycleid(Relation rel);
 extern BTCycleId _bt_start_vacuum(Relation rel);
 extern void _bt_end_vacuum(Relation rel);
......
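Taken together, these macros give a scan position three distinct states: invalid, valid but unpinned, and valid and pinned; BTScanPosIsValid() now tests currPage rather than the buffer. Below is a hypothetical walk-through of the intended lifecycle, not code from the patch:

    #include "postgres.h"
    #include "access/nbtree.h"

    /* Hypothetical illustration of the BTScanPosData states; not from the patch. */
    static void
    scanpos_lifecycle_sketch(void)
    {
        BTScanPosData pos;

        BTScanPosInvalidate(pos);               /* state 1: nothing referenced */
        Assert(!BTScanPosIsValid(pos) && !BTScanPosIsPinned(pos));

        /*
         * _bt_first()/_bt_readpage() would fill pos.items[] and set pos.buf,
         * pos.currPage and pos.lsn here: state 2, valid and pinned.
         *
         * For a plain MVCC scan of a WAL-logged index the pin may then be
         * dropped early: state 3, still valid (currPage and lsn continue to
         * describe the cached items) but no longer pinned.
         */
        BTScanPosUnpinIfPinned(pos);            /* no-op here, since never pinned */
        Assert(!BTScanPosIsPinned(pos));
    }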