Commit 6d46f478 authored by Robert Haas

Improve hash index bucket split behavior.

Previously, the right to split a bucket was represented by a
heavyweight lock on the page number of the primary bucket page.
Unfortunately, this meant that every scan needed to take a heavyweight
lock on that bucket also, which was bad for concurrency.  Instead, use
a cleanup lock on the primary bucket page to indicate the right to
begin a split, so that scans only need to retain a pin on that page,
which they would have to acquire anyway, and which is also much
cheaper.

In addition to reducing the locking cost, this also avoids locking out
scans and inserts for the entire lifetime of the split: while the new
bucket is being populated with copies of the appropriate tuples from
the old bucket, scans and inserts can happen in parallel.  There are
minor concurrency improvements for vacuum operations as well, though
the situation there is still far from ideal.

This patch also removes the unworldly assumption that a split will
never be interrupted.  With the new code, a split is done in a series
of small steps and the system can pick up where it left off if it is
interrupted prior to completion.  While this patch does not itself add
write-ahead logging for hash indexes, it is clearly a necessary first
step, since one of the things that could interrupt a split is the
removal of electrical power from the machine performing it.

Amit Kapila.  I wrote the original design on which this patch is
based, and did a good bit of work on the comments and README through
multiple rounds of review, but all of the code is Amit's.  Also
reviewed by Jesper Pedersen, Jeff Janes, and others.

Discussion: http://postgr.es/m/CAA4eK1LfzcZYxLoXS874Ad0+S-ZM60U9bwcyiUZx9mHZ-KCWhw@mail.gmail.com
parent 213c0f2d
@@ -12,7 +12,7 @@ subdir = src/backend/access/hash
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = hash.o hashfunc.o hashinsert.o hashovfl.o hashpage.o hashscan.o \
hashsearch.o hashsort.o hashutil.o hashvalidate.o
OBJS = hash.o hashfunc.o hashinsert.o hashovfl.o hashpage.o hashsearch.o \
hashsort.o hashutil.o hashvalidate.o
include $(top_srcdir)/src/backend/common.mk
@@ -126,53 +126,54 @@ the initially created buckets.
Lock Definitions
----------------
We use both lmgr locks ("heavyweight" locks) and buffer context locks
(LWLocks) to control access to a hash index. lmgr locks are needed for
long-term locking since there is a (small) risk of deadlock, which we must
be able to detect. Buffer context locks are used for short-term access
control to individual pages of the index.
LockPage(rel, page), where page is the page number of a hash bucket page,
represents the right to split or compact an individual bucket. A process
splitting a bucket must exclusive-lock both old and new halves of the
bucket until it is done. A process doing VACUUM must exclusive-lock the
bucket it is currently purging tuples from. Processes doing scans or
insertions must share-lock the bucket they are scanning or inserting into.
(It is okay to allow concurrent scans and insertions.)
The lmgr lock IDs corresponding to overflow pages are currently unused.
These are available for possible future refinements. LockPage(rel, 0)
is also currently undefined (it was previously used to represent the right
to modify the hash-code-to-bucket mapping, but it is no longer needed for
that purpose).
Note that these lock definitions are conceptually distinct from any sort
of lock on the pages whose numbers they share. A process must also obtain
read or write buffer lock on the metapage or bucket page before accessing
said page.
Processes performing hash index scans must hold share lock on the bucket
they are scanning throughout the scan. This seems to be essential, since
there is no reasonable way for a scan to cope with its bucket being split
underneath it. This creates a possibility of deadlock external to the
hash index code, since a process holding one of these locks could block
waiting for an unrelated lock held by another process. If that process
then does something that requires exclusive lock on the bucket, we have
deadlock. Therefore the bucket locks must be lmgr locks so that deadlock
can be detected and recovered from.
Processes must obtain read (share) buffer context lock on any hash index
page while reading it, and write (exclusive) lock while modifying it.
To prevent deadlock we enforce these coding rules: no buffer lock may be
held long term (across index AM calls), nor may any buffer lock be held
while waiting for an lmgr lock, nor may more than one buffer lock
be held at a time by any one process. (The third restriction is probably
stronger than necessary, but it makes the proof of no deadlock obvious.)
Concurrency control for hash indexes is provided using buffer content
locks, buffer pins, and cleanup locks. Here as elsewhere in PostgreSQL,
cleanup lock means that we hold an exclusive lock on the buffer and have
observed at some point after acquiring the lock that we hold the only pin
on that buffer. For hash indexes, a cleanup lock on a primary bucket page
represents the right to perform an arbitrary reorganization of the entire
bucket. Therefore, scans retain a pin on the primary bucket page for the
bucket they are currently scanning. Splitting a bucket requires a cleanup
lock on both the old and new primary bucket pages. VACUUM therefore takes
a cleanup lock on every bucket page in order to remove tuples. It can also
remove tuples copied to a new bucket by any previous split operation, because
the cleanup lock taken on the primary bucket page guarantees that no scans
which started prior to the most recent split can still be in progress. After
cleaning each page individually, it attempts to take a cleanup lock on the
primary bucket page in order to "squeeze" the bucket down to the minimum
possible number of pages.
To avoid deadlocks, we must be consistent about the lock order in which we
lock the buckets for operations that require locks on two different buckets.
We choose to always lock the lower-numbered bucket first. The metapage is
only ever locked after all bucket locks have been taken.
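The ordering rule above can be illustrated with a minimal sketch. The `Bucket` typedef and the `lock_order` helper are hypothetical names introduced only for this illustration; they are not part of the hash AM, and the sketch only decides the order rather than taking any real lock.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a bucket number; not the hash AM's own type. */
typedef uint32_t Bucket;

/*
 * Decide the order in which two different buckets should be locked:
 * the lower-numbered bucket always comes first, so two backends that
 * need the same pair can never wait on each other in a cycle.
 */
static void
lock_order(Bucket a, Bucket b, Bucket *first, Bucket *second)
{
    assert(a != b);             /* the rule concerns two different buckets */

    if (a < b)
    {
        *first = a;
        *second = b;
    }
    else
    {
        *first = b;
        *second = a;
    }
}
```

Whichever order the caller names the buckets in, `lock_order` yields the same pair, which is what makes the rule effective against deadlock.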
Pseudocode Algorithms
---------------------
Various flags that are used in hash index operations are described below:
The bucket-being-split and bucket-being-populated flags indicate that a split
operation is in progress for a bucket. During a split operation, the
bucket-being-split flag is set on the old bucket and the bucket-being-populated
flag is set on the new bucket. These flags are cleared once the split operation
is finished.
The split-cleanup flag indicates that a bucket which has been recently split
still contains tuples that were also copied to the new bucket; it essentially
marks the split as incomplete. Once we're certain that no scans which
started before the new bucket was fully populated are still in progress, we
can remove the copies from the old bucket and clear the flag. We insist that
this flag must be clear before splitting a bucket; thus, a bucket can't be
split again until the previous split is totally complete.
The moved-by-split flag on a tuple indicates that the tuple was moved from the
old to the new bucket. Concurrent scans skip such tuples until the split
operation is finished. Once a tuple is marked as moved-by-split, it will remain
so forever, but that does no harm. We intentionally do not clear the flag, since
doing so could generate additional I/O that is not necessary.
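The life cycle of the bucket-level flags can be sketched as a small state machine. This is only an illustration under assumptions: the `SK_*` bit values, the `SketchBucket` struct, and the helper names are hypothetical and do not reflect the real on-disk flag layout.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits, chosen for this sketch only. */
#define SK_BEING_SPLIT      0x1u    /* set on the old bucket during a split */
#define SK_BEING_POPULATED  0x2u    /* set on the new bucket during a split */
#define SK_NEEDS_CLEANUP    0x4u    /* old bucket still holds copied tuples */

typedef struct
{
    unsigned    flags;
} SketchBucket;

/* A bucket may be split only if no earlier split is pending in any form. */
static bool
can_start_split(const SketchBucket *oldb)
{
    return (oldb->flags & (SK_BEING_SPLIT | SK_NEEDS_CLEANUP)) == 0;
}

static void
begin_split(SketchBucket *oldb, SketchBucket *newb)
{
    assert(can_start_split(oldb));
    oldb->flags |= SK_BEING_SPLIT;
    newb->flags |= SK_BEING_POPULATED;
}

/* All tuples copied: clear the in-progress flags, request cleanup. */
static void
finish_split(SketchBucket *oldb, SketchBucket *newb)
{
    oldb->flags &= ~SK_BEING_SPLIT;
    newb->flags &= ~SK_BEING_POPULATED;
    oldb->flags |= SK_NEEDS_CLEANUP;
}

/* The copies have been removed from the old bucket. */
static void
finish_cleanup(SketchBucket *oldb)
{
    oldb->flags &= ~SK_NEEDS_CLEANUP;
}
```

Note that the old bucket cannot be split again between `finish_split` and `finish_cleanup`, which mirrors the rule that a split is not totally complete until the split-cleanup flag is cleared.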
The operations we need to support are: readers scanning the index for
entries of a particular hash code (which by definition are all in the same
bucket); insertion of a new tuple into the correct bucket; enlarging the
@@ -193,38 +194,48 @@ The reader algorithm is:
release meta page buffer content lock
if (correct bucket page is already locked)
break
release any existing bucket page lock (if a concurrent split happened)
take heavyweight bucket lock
release any existing bucket page buffer content lock (if a concurrent
split happened)
take the buffer content lock on bucket page in shared mode
retake meta page buffer content lock in shared mode
-- then, per read request:
release pin on metapage
read current page of bucket and take shared buffer content lock
step to next page if necessary (no chaining of locks)
if the target bucket is still being populated by a split:
release the buffer content lock on current bucket page
pin and acquire the buffer content lock on old bucket in shared mode
release the buffer content lock on old bucket, but not pin
retake the buffer content lock on new bucket
arrange to scan the old bucket normally and the new bucket for
tuples which are not moved-by-split
-- then, per read request:
reacquire content lock on current page
step to next page if necessary (no chaining of content locks, but keep
the pin on the primary bucket throughout the scan; we also maintain
a pin on the page currently being scanned)
get tuple
release buffer content lock and pin on current page
release content lock
-- at scan shutdown:
release bucket share-lock
We can't hold the metapage lock while acquiring a lock on the target bucket,
because that might result in an undetected deadlock (lwlocks do not participate
in deadlock detection). Instead, we relock the metapage after acquiring the
bucket page lock and check whether the bucket has been split. If not, we're
done. If so, we release our previously-acquired lock and repeat the process
using the new bucket number. Holding the bucket sharelock for
the remainder of the scan prevents the reader's current-tuple pointer from
being invalidated by splits or compactions. Notice that the reader's lock
does not prevent other buckets from being split or compacted.
release all pins still held
Holding the buffer pin on the primary bucket page for the whole scan prevents
the reader's current-tuple pointer from being invalidated by splits or
compactions. (Of course, other buckets can still be split or compacted.)
To keep concurrency reasonably good, we require readers to cope with
concurrent insertions, which means that they have to be able to re-find
their current scan position after re-acquiring the page sharelock. Since
deletion is not possible while a reader holds the bucket sharelock, and
we assume that heap tuple TIDs are unique, this can be implemented by
their current scan position after re-acquiring the buffer content lock on the
page. Since deletion is not possible while a reader holds the pin on the bucket,
and we assume that heap tuple TIDs are unique, this can be implemented by
searching for the same heap tuple TID previously returned. Insertion does
not move index entries across pages, so the previously-returned index entry
should always be on the same page, at the same or higher offset number,
as it was before.
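A minimal sketch of the re-find step, assuming a page can be modeled as an array of heap TIDs. `SketchTid` and `refind_position` are hypothetical names for this illustration, not the structures used by hashgettuple.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a heap tuple identifier. */
typedef struct
{
    uint32_t    block;
    uint16_t    offset;
} SketchTid;

/*
 * Re-find the previously returned TID on the same page.  Insertions may
 * have pushed the entry to a higher position, but never to a lower one
 * or to another page, so the search starts at the old position.
 * Returns the new 0-based position, or -1 if the TID is not found.
 */
static int
refind_position(const SketchTid *page_tids, int ntids,
                int old_pos, SketchTid target)
{
    assert(old_pos >= 0);

    for (int i = old_pos; i < ntids; i++)
    {
        if (page_tids[i].block == target.block &&
            page_tids[i].offset == target.offset)
            return i;
    }
    return -1;
}
```

Because deletions are excluded while the pin is held, a result of -1 would indicate a broken invariant rather than a normal outcome.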
To allow for scans during a bucket split, if at the start of the scan the
bucket is marked as bucket-being-populated, the scan reads all the tuples in
that bucket except for those that are marked as moved-by-split. Once it
finishes scanning the tuples in the current bucket, it scans the old bucket
from which this bucket was formed by the split.
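The following sketch shows why this two-phase scan returns each matching tuple exactly once. It assumes a bucket can be modeled as a flat array; `SketchTuple`, `count_matches`, and `scan_during_split` are hypothetical names for this illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical index tuple: a hash key plus the moved-by-split mark. */
typedef struct
{
    uint32_t    hashkey;
    bool        moved_by_split;
} SketchTuple;

/* Count tuples matching key, optionally skipping moved-by-split copies. */
static int
count_matches(const SketchTuple *tuples, int ntuples,
              uint32_t key, bool skip_moved)
{
    int         matches = 0;

    assert(ntuples >= 0);
    for (int i = 0; i < ntuples; i++)
    {
        if (skip_moved && tuples[i].moved_by_split)
            continue;           /* the original is still in the old bucket */
        if (tuples[i].hashkey == key)
            matches++;
    }
    return matches;
}

/*
 * Scan a bucket that is still being populated: first the new bucket,
 * ignoring copies made by the split, then the old bucket in full.
 */
static int
scan_during_split(const SketchTuple *new_bucket, int n_new,
                  const SketchTuple *old_bucket, int n_old,
                  uint32_t key)
{
    return count_matches(new_bucket, n_new, key, true) +
           count_matches(old_bucket, n_old, key, false);
}
```

Tuples inserted directly into the new bucket carry no moved-by-split mark and are found in the first phase; tuples that predate the split are found in the old bucket, whether or not they have been copied yet.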
The insertion algorithm is rather similar:
pin meta page and take buffer content lock in shared mode
@@ -233,18 +244,29 @@ The insertion algorithm is rather similar:
release meta page buffer content lock
if (correct bucket page is already locked)
break
release any existing bucket page lock (if a concurrent split happened)
take heavyweight bucket lock in shared mode
release any existing bucket page buffer content lock (if a concurrent
split happened)
take the buffer content lock on bucket page in exclusive mode
retake meta page buffer content lock in shared mode
-- (so far same as reader)
release pin on metapage
pin current page of bucket and take exclusive buffer content lock
if full, release, read/exclusive-lock next page; repeat as needed
-- (so far same as reader, except for acquisition of buffer content lock in
exclusive mode on primary bucket page)
if the bucket-being-split flag is set for a bucket and pin count on it is
one, then finish the split
release the buffer content lock on current bucket
get the "new" bucket which was being populated by the split
scan the new bucket and form the hash table of TIDs
conditionally get the cleanup lock on old and new buckets
if we get the lock on both the buckets
finish the split using algorithm mentioned below for split
release the pin on old bucket and restart the insert from beginning.
if current page is full, release lock but not pin, read/exclusive-lock
next page; repeat as needed
>> see below if no space in any page of bucket
insert tuple at appropriate place in page
mark current page dirty and release buffer content lock and pin
release heavyweight share-lock
pin meta page and take buffer content lock in shared mode
if the current page is not a bucket page, release the pin on bucket page
pin meta page and take buffer content lock in exclusive mode
increment tuple count, decide if split needed
mark meta page dirty and release buffer content lock and pin
done if no split needed, else enter Split algorithm below
@@ -256,11 +278,13 @@ bucket that is being actively scanned, because readers can cope with this
as explained above. We only need the short-term buffer locks to ensure
that readers do not see a partially-updated page.
It is clearly impossible for readers and inserters to deadlock, and in
fact this algorithm allows them a very high degree of concurrency.
(The exclusive metapage lock taken to update the tuple count is stronger
than necessary, since readers do not care about the tuple count, but the
lock is held for such a short time that this is probably not an issue.)
To avoid deadlock between readers and inserters, whenever there is a need
to lock multiple buckets, we always take the locks in the order given in Lock
Definitions above. This algorithm allows them a very high degree of
concurrency. (The exclusive metapage lock taken to update the tuple count
is stronger than necessary, since readers do not care about the tuple count,
but the lock is held for such a short time that this is probably not an
issue.)
When an inserter cannot find space in any existing page of a bucket, it
must obtain an overflow page and add that page to the bucket's chain.
@@ -271,46 +295,45 @@ index is overfull (has a higher-than-wanted ratio of tuples to buckets).
The algorithm attempts, but does not necessarily succeed, to split one
existing bucket in two, thereby lowering the fill ratio:
pin meta page and take buffer content lock in exclusive mode
check split still needed
if split not needed anymore, drop buffer content lock and pin and exit
decide which bucket to split
Attempt to X-lock old bucket number (definitely could fail)
Attempt to X-lock new bucket number (shouldn't fail, but...)
if above fail, drop locks and pin and exit
update meta page to reflect new number of buckets
mark meta page dirty and release buffer content lock and pin
-- now, accesses to all other buckets can proceed.
Perform actual split of bucket, moving tuples as needed
>> see below about acquiring needed extra space
Release X-locks of old and new buckets
Note the metapage lock is not held while the actual tuple rearrangement is
performed, so accesses to other buckets can proceed in parallel; in fact,
it's possible for multiple bucket splits to proceed in parallel.
Split's attempt to X-lock the old bucket number could fail if another
process holds S-lock on it. We do not want to wait if that happens, first
because we don't want to wait while holding the metapage exclusive-lock,
and second because it could very easily result in deadlock. (The other
process might be out of the hash AM altogether, and could do something
that blocks on another lock this process holds; so even if the hash
algorithm itself is deadlock-free, a user-induced deadlock could occur.)
So, this is a conditional LockAcquire operation, and if it fails we just
abandon the attempt to split. This is all right since the index is
overfull but perfectly functional. Every subsequent inserter will try to
split, and eventually one will succeed. If multiple inserters failed to
split, the index might still be overfull, but eventually, the index will
pin meta page and take buffer content lock in exclusive mode
check split still needed
if split not needed anymore, drop buffer content lock and pin and exit
decide which bucket to split
try to take a cleanup lock on that bucket; if fail, give up
if that bucket is still being split or has split-cleanup work:
try to finish the split and the cleanup work
if that succeeds, start over; if it fails, give up
mark the old and new buckets indicating split is in progress
copy the tuples that belong to the new bucket from the old bucket, marking
them as moved-by-split
release lock but not pin for primary bucket page of old bucket,
read/shared-lock next page; repeat as needed
clear the bucket-being-split and bucket-being-populated flags
mark the old bucket indicating split-cleanup
The split operation's attempt to acquire cleanup-lock on the old bucket number
could fail if another process holds any lock or pin on it. We do not want to
wait if that happens, because we don't want to wait while holding the metapage
exclusive-lock. So, this is a conditional LWLockAcquire operation, and if
it fails we just abandon the attempt to split. This is all right since the
index is overfull but perfectly functional. Every subsequent inserter will
try to split, and eventually one will succeed. If multiple inserters failed
to split, the index might still be overfull, but eventually, the index will
not be overfull and split attempts will stop. (We could make a successful
splitter loop to see if the index is still overfull, but it seems better to
distribute the split overhead across successive insertions.)
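The "try, and give up on failure" behavior can be sketched with a toy buffer model. This is an assumption-laden illustration: `SketchBuffer`, `try_cleanup_lock`, and `try_split` are hypothetical names, and the real buffer manager tracks pins and content locks quite differently.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical buffer state: pin count plus a content-lock indicator. */
typedef struct
{
    int         pin_count;
    bool        content_locked;
} SketchBuffer;

/*
 * Conditional cleanup lock: succeeds only if the caller's pin is the
 * only pin and nobody holds the content lock.  It never waits.
 */
static bool
try_cleanup_lock(SketchBuffer *buf)
{
    assert(buf->pin_count >= 1);    /* the caller must already hold a pin */

    if (buf->content_locked || buf->pin_count != 1)
        return false;
    buf->content_locked = true;
    return true;
}

/*
 * Attempt a split of the bucket whose primary page is old_primary.
 * On failure the index stays overfull but usable, and a later
 * insertion simply tries again.
 */
static bool
try_split(SketchBuffer *old_primary)
{
    if (!try_cleanup_lock(old_primary))
        return false;               /* abandon the attempt, do not wait */

    /* ... the actual tuple copying would happen here ... */

    old_primary->content_locked = false;
    return true;
}
```

A scan that keeps its pin on the primary bucket page is enough to make `try_split` return false, which is how scans hold off splits without any heavyweight lock.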
A problem is that if a split fails partway through (eg due to insufficient
disk space) the index is left corrupt. The probability of that could be
made quite low if we grab a free page or two before we update the meta
page, but the only real solution is to treat a split as a WAL-loggable,
must-complete action. I'm not planning to teach hash about WAL in this
go-round.
If a split fails partway through (e.g. due to insufficient disk space or an
interrupt), the index will not be corrupted. Instead, we'll retry the split
every time a tuple is inserted into the old bucket prior to inserting the new
tuple; eventually, we should succeed. The fact that a split is left
unfinished doesn't prevent subsequent buckets from being split, but we won't
try to split the bucket again until the prior split is finished. In other
words, a bucket can be in the middle of being split for some time, but it can't
be in the middle of two splits at the same time.
Although we can survive a failure to split a bucket, a crash is likely to
corrupt the index, since hash indexes are not yet WAL-logged.
The fourth operation is garbage collection (bulk deletion):
@@ -319,9 +342,17 @@ The fourth operation is garbage collection (bulk deletion):
fetch current max bucket number
release meta page buffer content lock and pin
while next bucket <= max bucket do
Acquire X lock on target bucket
Scan and remove tuples, compact free space as needed
Release X lock
acquire cleanup lock on primary bucket page
loop:
scan and remove tuples
if this is the last bucket page, break out of loop
pin and x-lock next page
release prior lock and pin (except keep pin on primary bucket page)
if the page we have locked is not the primary bucket page:
release lock and take exclusive lock on primary bucket page
if there are no other pins on the primary bucket page:
squeeze the bucket to remove free space
release the pin on primary bucket page
next bucket ++
end loop
pin metapage and take buffer content lock in exclusive mode
@@ -330,20 +361,24 @@ The fourth operation is garbage collection (bulk deletion):
else update metapage tuple count
mark meta page dirty and release buffer content lock and pin
Note that this is designed to allow concurrent splits. If a split occurs,
tuples relocated into the new bucket will be visited twice by the scan,
but that does no harm. (We must however be careful about the statistics
Note that this is designed to allow concurrent splits and scans. If a split
occurs, tuples relocated into the new bucket will be visited twice by the
scan, but that does no harm. Because we release the lock on a bucket page
during the cleanup scan of a bucket, a concurrent scan can start on the bucket,
but it will always remain behind the cleanup. It is essential to keep scans
behind cleanup; otherwise vacuum could decrease the TIDs that are required to
complete the scan. Since a scan that returns multiple tuples from the
same bucket page always expects the next valid TID to be greater than or equal
to the current TID, it might otherwise miss tuples. This holds true for
backward scans as well (backward scans first traverse each bucket starting from
the primary bucket page to the last overflow page in the chain). We must be careful about the statistics
reported by the VACUUM operation. What we can do is count the number of
tuples scanned, and believe this in preference to the stored tuple count
if the stored tuple count and number of buckets did *not* change at any
time during the scan. This provides a way of correcting the stored tuple
count if it gets out of sync for some reason. But if a split or insertion
does occur concurrently, the scan count is untrustworthy; instead,
subtract the number of tuples deleted from the stored tuple count and
use that.)
The exclusive lock request could deadlock in some strange scenarios, but
we can just error out without any great harm being done.
tuples scanned, and believe this in preference to the stored tuple count if
the stored tuple count and number of buckets did *not* change at any time
during the scan. This provides a way of correcting the stored tuple count if
it gets out of sync for some reason. But if a split or insertion does occur
concurrently, the scan count is untrustworthy; instead, subtract the number of
tuples deleted from the stored tuple count and use that.
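The statistics rule can be written as a small decision function. This is a sketch, not the code used by VACUUM: `vacuum_tuple_count` and its parameter names are hypothetical, and clamping the result at zero is a defensive assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Choose the tuple count to store after a bulk-delete pass.
 *
 * stored_before / stored_after:   stored tuple count at start and end
 * buckets_before / buckets_after: number of buckets at start and end
 * counted: tuples the cleanup scan saw and kept
 * removed: tuples the cleanup scan deleted
 */
static double
vacuum_tuple_count(double stored_before, double stored_after,
                   uint32_t buckets_before, uint32_t buckets_after,
                   double counted, double removed)
{
    double      result;

    assert(counted >= 0 && removed >= 0);

    /* Nothing changed concurrently: the scan's own count is trustworthy. */
    if (stored_before == stored_after && buckets_before == buckets_after)
        return counted;

    /* A split or insertion interfered: adjust the stored count instead. */
    result = stored_after - removed;
    return (result > 0) ? result : 0;   /* clamp: assumption of this sketch */
}
```

In the quiet case the function also corrects a stored count that has drifted, since the scan's count simply replaces it.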
Free Space Management
@@ -417,13 +452,11 @@ free page; there can be no other process holding lock on it.
Bucket splitting uses a similar algorithm if it has to extend the new
bucket, but it need not worry about concurrent extension since it has
exclusive lock on the new bucket.
buffer content lock in exclusive mode on the new bucket.
Freeing an overflow page is done by garbage collection and by bucket
splitting (the old bucket may contain no-longer-needed overflow pages).
In both cases, the process holds exclusive lock on the containing bucket,
so need not worry about other accessors of pages in the bucket. The
algorithm is:
Freeing an overflow page requires the process to hold buffer content lock in
exclusive mode on the containing bucket, so it need not worry about other
accessors of pages in the bucket. The algorithm is:
delink overflow page from bucket chain
(this requires read/update/write/release of fore and aft siblings)
@@ -454,14 +487,6 @@ locks. Since they need no lmgr locks, deadlock is not possible.
Other Notes
-----------
All the shenanigans with locking prevent a split occurring while *another*
process is stopped in a given bucket. They do not ensure that one of
our *own* backend's scans is not stopped in the bucket, because lmgr
doesn't consider a process's own locks to conflict. So the Split
algorithm must check for that case separately before deciding it can go
ahead with the split. VACUUM does not have this problem since nothing
else can be happening within the vacuuming backend.
Should we instead try to fix the state of any conflicting local scan?
Seems mighty ugly --- got to move the held bucket S-lock as well as lots
of other messiness. For now, just punt and don't split.
Cleanup locks prevent a split from occurring while *another* process is stopped
in a given bucket. They also ensure that one of our *own* backend's scans is not
stopped in the bucket.
@@ -287,10 +287,10 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
/*
* An insertion into the current index page could have happened while
* we didn't have read lock on it. Re-find our position by looking
* for the TID we previously returned. (Because we hold share lock on
* the bucket, no deletions or splits could have occurred; therefore
* we can expect that the TID still exists in the current index page,
* at an offset >= where we were.)
* for the TID we previously returned. (Because we hold a pin on the
* primary bucket page, no deletions or splits could have occurred;
* therefore we can expect that the TID still exists in the current
* index page, at an offset >= where we were.)
*/
OffsetNumber maxoffnum;
@@ -424,17 +424,17 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
scan = RelationGetIndexScan(rel, nkeys, norderbys);
so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
so->hashso_bucket_valid = false;
so->hashso_bucket_blkno = 0;
so->hashso_curbuf = InvalidBuffer;
so->hashso_bucket_buf = InvalidBuffer;
so->hashso_split_bucket_buf = InvalidBuffer;
/* set position invalid (this will cause _hash_first call) */
ItemPointerSetInvalid(&(so->hashso_curpos));
ItemPointerSetInvalid(&(so->hashso_heappos));
scan->opaque = so;
so->hashso_buc_populated = false;
so->hashso_buc_split = false;
/* register scan in case we change pages it's using */
_hash_regscan(scan);
scan->opaque = so;
return scan;
}
@@ -449,15 +449,7 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
HashScanOpaque so = (HashScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
/* release any pin we still hold */
if (BufferIsValid(so->hashso_curbuf))
_hash_dropbuf(rel, so->hashso_curbuf);
so->hashso_curbuf = InvalidBuffer;
/* release lock on bucket, too */
if (so->hashso_bucket_blkno)
_hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
so->hashso_bucket_blkno = 0;
_hash_dropscanbuf(rel, so);
/* set position invalid (this will cause _hash_first call) */
ItemPointerSetInvalid(&(so->hashso_curpos));
@@ -469,8 +461,10 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
memmove(scan->keyData,
scankey,
scan->numberOfKeys * sizeof(ScanKeyData));
so->hashso_bucket_valid = false;
}
so->hashso_buc_populated = false;
so->hashso_buc_split = false;
}
/*
@@ -482,18 +476,7 @@ hashendscan(IndexScanDesc scan)
HashScanOpaque so = (HashScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
/* don't need scan registered anymore */
_hash_dropscan(scan);
/* release any pin we still hold */
if (BufferIsValid(so->hashso_curbuf))
_hash_dropbuf(rel, so->hashso_curbuf);
so->hashso_curbuf = InvalidBuffer;
/* release lock on bucket, too */
if (so->hashso_bucket_blkno)
_hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
so->hashso_bucket_blkno = 0;
_hash_dropscanbuf(rel, so);
pfree(so);
scan->opaque = NULL;
@@ -504,6 +487,9 @@ hashendscan(IndexScanDesc scan)
* The set of target tuples is specified via a callback routine that tells
* whether any given heap tuple (identified by ItemPointer) is being deleted.
*
* This function also deletes the tuples that are moved by split to other
* bucket.
*
* Result: a palloc'd struct containing statistical info for VACUUM displays.
*/
IndexBulkDeleteResult *
@@ -548,83 +534,47 @@ loop_top:
{
BlockNumber bucket_blkno;
BlockNumber blkno;
bool bucket_dirty = false;
Buffer bucket_buf;
Buffer buf;
HashPageOpaque bucket_opaque;
Page page;
bool split_cleanup = false;
/* Get address of bucket's start page */
bucket_blkno = BUCKET_TO_BLKNO(&local_metapage, cur_bucket);
/* Exclusive-lock the bucket so we can shrink it */
_hash_getlock(rel, bucket_blkno, HASH_EXCLUSIVE);
/* Shouldn't have any active scans locally, either */
if (_hash_has_active_scan(rel, cur_bucket))
elog(ERROR, "hash index has active scan during VACUUM");
/* Scan each page in bucket */
blkno = bucket_blkno;
while (BlockNumberIsValid(blkno))
{
Buffer buf;
Page page;
HashPageOpaque opaque;
OffsetNumber offno;
OffsetNumber maxoffno;
OffsetNumber deletable[MaxOffsetNumber];
int ndeletable = 0;
vacuum_delay_point();
buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
info->strategy);
page = BufferGetPage(buf);
opaque = (HashPageOpaque) PageGetSpecialPointer(page);
Assert(opaque->hasho_bucket == cur_bucket);
/* Scan each tuple in page */
maxoffno = PageGetMaxOffsetNumber(page);
for (offno = FirstOffsetNumber;
offno <= maxoffno;
offno = OffsetNumberNext(offno))
{
IndexTuple itup;
ItemPointer htup;
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offno));
htup = &(itup->t_tid);
if (callback(htup, callback_state))
{
/* mark the item for deletion */
deletable[ndeletable++] = offno;
tuples_removed += 1;
}
else
num_index_tuples += 1;
}
/*
* We need to acquire a cleanup lock on the primary bucket page to wait
* out concurrent scans before deleting the dead tuples.
*/
buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, info->strategy);
LockBufferForCleanup(buf);
_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
/*
* Apply deletions and write page if needed, advance to next page.
*/
blkno = opaque->hasho_nextblkno;
page = BufferGetPage(buf);
bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
if (ndeletable > 0)
{
PageIndexMultiDelete(page, deletable, ndeletable);
_hash_wrtbuf(rel, buf);
bucket_dirty = true;
}
else
_hash_relbuf(rel, buf);
}
/*
* If the bucket contains tuples that are moved by split, then we need
* to delete such tuples. We can't delete such tuples if the split
* operation on bucket is not finished as those are needed by scans.
*/
if (!H_BUCKET_BEING_SPLIT(bucket_opaque) &&
H_NEEDS_SPLIT_CLEANUP(bucket_opaque))
split_cleanup = true;
bucket_buf = buf;
/* If we deleted anything, try to compact free space */
if (bucket_dirty)
_hash_squeezebucket(rel, cur_bucket, bucket_blkno,
info->strategy);
hashbucketcleanup(rel, cur_bucket, bucket_buf, blkno, info->strategy,
local_metapage.hashm_maxbucket,
local_metapage.hashm_highmask,
local_metapage.hashm_lowmask, &tuples_removed,
&num_index_tuples, split_cleanup,
callback, callback_state);
/* Release bucket lock */
_hash_droplock(rel, bucket_blkno, HASH_EXCLUSIVE);
_hash_dropbuf(rel, bucket_buf);
/* Advance to next bucket */
cur_bucket++;
@@ -705,6 +655,210 @@ hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
return stats;
}
/*
* Helper function to perform deletion of index entries from a bucket.
*
* This function expects that the caller has acquired a cleanup lock on the
* primary bucket page, and will return with a write lock again held on the
* primary bucket page. The lock won't necessarily be held continuously,
* though, because we'll release it when visiting overflow pages.
*
* It would be very bad if this function cleaned a page while some other
* backend was in the midst of scanning it, because hashgettuple assumes
* that the next valid TID will be greater than or equal to the current
* valid TID. There can't be any concurrent scans in progress when we first
* enter this function because of the cleanup lock we hold on the primary
* bucket page, but as soon as we release that lock, there might be. We
* handle that by conspiring to prevent those scans from passing our cleanup
* scan. To do that, we lock the next page in the bucket chain before
* releasing the lock on the previous page. (This type of lock chaining is
* not ideal, so we might want to look for a better solution at some point.)
*
* We need to retain a pin on the primary bucket to ensure that no concurrent
* split can start.
*/
void
hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
BlockNumber bucket_blkno, BufferAccessStrategy bstrategy,
uint32 maxbucket, uint32 highmask, uint32 lowmask,
double *tuples_removed, double *num_index_tuples,
bool split_cleanup,
IndexBulkDeleteCallback callback, void *callback_state)
{
BlockNumber blkno;
Buffer buf;
Bucket new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
bool bucket_dirty = false;
blkno = bucket_blkno;
buf = bucket_buf;
if (split_cleanup)
new_bucket = _hash_get_newbucket_from_oldbucket(rel, cur_bucket,
lowmask, maxbucket);
/* Scan each page in bucket */
for (;;)
{
HashPageOpaque opaque;
OffsetNumber offno;
OffsetNumber maxoffno;
Buffer next_buf;
Page page;
OffsetNumber deletable[MaxOffsetNumber];
int ndeletable = 0;
bool retain_pin = false;
bool curr_page_dirty = false;
vacuum_delay_point();
page = BufferGetPage(buf);
opaque = (HashPageOpaque) PageGetSpecialPointer(page);
/* Scan each tuple in page */
maxoffno = PageGetMaxOffsetNumber(page);
for (offno = FirstOffsetNumber;
offno <= maxoffno;
offno = OffsetNumberNext(offno))
{
ItemPointer htup;
IndexTuple itup;
Bucket bucket;
bool kill_tuple = false;
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offno));
htup = &(itup->t_tid);
/*
* To remove the dead tuples, we strictly want to rely on the results
* of the callback function. Refer to btvacuumpage for the detailed reason.
*/
if (callback && callback(htup, callback_state))
{
kill_tuple = true;
if (tuples_removed)
*tuples_removed += 1;
}
else if (split_cleanup)
{
/* delete the tuples that are moved by split. */
bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
maxbucket,
highmask,
lowmask);
/* mark the item for deletion */
if (bucket != cur_bucket)
{
/*
* We expect tuples to either belong to current bucket or
* new_bucket. This is ensured because we don't allow
* further splits from bucket that contains garbage. See
* comments in _hash_expandtable.
*/
Assert(bucket == new_bucket);
kill_tuple = true;
}
}
if (kill_tuple)
{
/* mark the item for deletion */
deletable[ndeletable++] = offno;
}
else
{
/* we're keeping it, so count it */
if (num_index_tuples)
*num_index_tuples += 1;
}
}
/* retain the pin on primary bucket page till end of bucket scan */
if (blkno == bucket_blkno)
retain_pin = true;
else
retain_pin = false;
blkno = opaque->hasho_nextblkno;
/*
* Apply deletions, advance to next page and write page if needed.
*/
if (ndeletable > 0)
{
PageIndexMultiDelete(page, deletable, ndeletable);
bucket_dirty = true;
curr_page_dirty = true;
}
/* bail out if there are no more pages to scan. */
if (!BlockNumberIsValid(blkno))
break;
next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
LH_OVERFLOW_PAGE,
bstrategy);
/*
* release the lock on previous page after acquiring the lock on next
* page
*/
if (curr_page_dirty)
{
if (retain_pin)
_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
else
_hash_wrtbuf(rel, buf);
curr_page_dirty = false;
}
else if (retain_pin)
_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
else
_hash_relbuf(rel, buf);
buf = next_buf;
}
/*
* lock the bucket page to clear the garbage flag and squeeze the bucket.
* if the current buffer is same as bucket buffer, then we already have
* lock on bucket page.
*/
if (buf != bucket_buf)
{
_hash_relbuf(rel, buf);
_hash_chgbufaccess(rel, bucket_buf, HASH_NOLOCK, HASH_WRITE);
}
/*
* Clear the garbage flag from the bucket after deleting the tuples that
* are moved by split. We purposefully clear the flag before squeezing the
* bucket, so that after a restart, vacuum won't try again to delete the
* tuples moved by the split.
*/
if (split_cleanup)
{
HashPageOpaque bucket_opaque;
Page page;
page = BufferGetPage(bucket_buf);
bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
bucket_opaque->hasho_flag &= ~LH_BUCKET_NEEDS_SPLIT_CLEANUP;
}
/*
* If we have deleted anything, try to compact free space. For squeezing
* the bucket, we must have a cleanup lock, else it can impact the
* ordering of tuples for a scan that has started before it.
*/
if (bucket_dirty && IsBufferCleanupOK(bucket_buf))
_hash_squeezebucket(rel, cur_bucket, bucket_blkno, bucket_buf,
bstrategy);
else
_hash_chgbufaccess(rel, bucket_buf, HASH_WRITE, HASH_NOLOCK);
}
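The lock chaining described atop hashbucketcleanup can be sketched in isolation. This is a minimal toy model, not the PostgreSQL implementation: ToyPage, toy_lock, toy_unlock, and toy_chain_cleanup are hypothetical names, a boolean flag stands in for a buffer content lock, and the final step of re-locking the primary page is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one page in a bucket chain. */
typedef struct ToyPage
{
	bool		locked;			/* models the buffer content lock */
	int			ntuples;		/* tuples currently on the page */
	int			ndead;			/* tuples the callback wants removed */
	struct ToyPage *next;		/* next page in the chain, or NULL */
} ToyPage;

static void
toy_lock(ToyPage *p)
{
	assert(!p->locked);
	p->locked = true;
}

static void
toy_unlock(ToyPage *p)
{
	assert(p->locked);
	p->locked = false;
}

/*
 * Clean every page in the chain.  The caller must already hold the lock
 * on 'head'.  The lock on the next page is taken BEFORE the lock on the
 * current page is released, so a concurrent scan that follows the same
 * protocol can never overtake the cleanup.  Returns the number of tuples
 * removed; no locks are held on return.
 */
static int
toy_chain_cleanup(ToyPage *head)
{
	ToyPage    *cur = head;
	int			removed = 0;

	for (;;)
	{
		ToyPage    *next;

		/* apply deletions on the page we currently hold locked */
		cur->ntuples -= cur->ndead;
		removed += cur->ndead;
		cur->ndead = 0;

		next = cur->next;
		if (next == NULL)
			break;

		/* hand-over-hand: acquire next, then release current */
		toy_lock(next);
		toy_unlock(cur);
		cur = next;
	}

	toy_unlock(cur);
	return removed;
}
```

In this sketch there is always at least one locked page between the cleanup position and any scan that entered the chain later, which is the property the comment above relies on.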
void
hash_redo(XLogReaderState *record)
......
......@@ -28,18 +28,22 @@
void
_hash_doinsert(Relation rel, IndexTuple itup)
{
Buffer buf;
Buffer buf = InvalidBuffer;
Buffer bucket_buf;
Buffer metabuf;
HashMetaPage metap;
BlockNumber blkno;
BlockNumber oldblkno = InvalidBlockNumber;
bool retry = false;
BlockNumber oldblkno;
bool retry;
Page page;
HashPageOpaque pageopaque;
Size itemsz;
bool do_expand;
uint32 hashkey;
Bucket bucket;
uint32 maxbucket;
uint32 highmask;
uint32 lowmask;
/*
* Get the hash key for the item (it's stored in the index tuple itself).
......@@ -51,6 +55,7 @@ _hash_doinsert(Relation rel, IndexTuple itup)
itemsz = MAXALIGN(itemsz); /* be safe, PageAddItem will do this but we
* need to be consistent */
restart_insert:
/* Read the metapage */
metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
metap = HashPageGetMeta(BufferGetPage(metabuf));
......@@ -69,6 +74,9 @@ _hash_doinsert(Relation rel, IndexTuple itup)
itemsz, HashMaxItemSize((Page) metap)),
errhint("Values larger than a buffer page cannot be indexed.")));
oldblkno = InvalidBlockNumber;
retry = false;
/*
* Loop until we get a lock on the correct target bucket.
*/
......@@ -84,21 +92,32 @@ _hash_doinsert(Relation rel, IndexTuple itup)
blkno = BUCKET_TO_BLKNO(metap, bucket);
/*
* Copy bucket mapping info now; refer the comment in
* _hash_expandtable where we copy this information before calling
* _hash_splitbucket to see why this is okay.
*/
maxbucket = metap->hashm_maxbucket;
highmask = metap->hashm_highmask;
lowmask = metap->hashm_lowmask;
/* Release metapage lock, but keep pin. */
_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
/*
* If the previous iteration of this loop locked what is still the
* correct target bucket, we are done. Otherwise, drop any old lock
* and lock what now appears to be the correct bucket.
* If the previous iteration of this loop locked the primary page of
* what is still the correct target bucket, we are done. Otherwise,
* drop any old lock before acquiring the new one.
*/
if (retry)
{
if (oldblkno == blkno)
break;
_hash_droplock(rel, oldblkno, HASH_SHARE);
_hash_relbuf(rel, buf);
}
_hash_getlock(rel, blkno, HASH_SHARE);
/* Fetch and lock the primary bucket page for the target bucket */
buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
/*
* Reacquire metapage lock and check that no bucket split has taken
......@@ -109,12 +128,36 @@ _hash_doinsert(Relation rel, IndexTuple itup)
retry = true;
}
/* Fetch the primary bucket page for the bucket */
buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
/* remember the primary bucket buffer to release the pin on it at end. */
bucket_buf = buf;
page = BufferGetPage(buf);
pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
Assert(pageopaque->hasho_bucket == bucket);
/*
* If this bucket is in the process of being split, try to finish the
* split before inserting, because that might create room for the
* insertion to proceed without allocating an additional overflow page.
* It's only interesting to finish the split if we're trying to insert
* into the bucket from which we're removing tuples (the "old" bucket),
* not if we're trying to insert into the bucket into which tuples are
* being moved (the "new" bucket).
*/
if (H_BUCKET_BEING_SPLIT(pageopaque) && IsBufferCleanupOK(buf))
{
/* release the lock on bucket buffer, before completing the split. */
_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
_hash_finish_split(rel, metabuf, buf, pageopaque->hasho_bucket,
maxbucket, highmask, lowmask);
/* release the pin on old and meta buffer. retry for insert. */
_hash_dropbuf(rel, buf);
_hash_dropbuf(rel, metabuf);
goto restart_insert;
}
/* Do the insertion */
while (PageGetFreeSpace(page) < itemsz)
{
......@@ -127,9 +170,15 @@ _hash_doinsert(Relation rel, IndexTuple itup)
{
/*
* ovfl page exists; go get it. if it doesn't have room, we'll
* find out next pass through the loop test above.
* find out next pass through the loop test above. we always
* release both the lock and pin if this is an overflow page, but
* only the lock if this is the primary bucket page, since the pin
* on the primary bucket must be retained throughout the scan.
*/
_hash_relbuf(rel, buf);
if (buf != bucket_buf)
_hash_relbuf(rel, buf);
else
_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
page = BufferGetPage(buf);
}
......@@ -144,7 +193,7 @@ _hash_doinsert(Relation rel, IndexTuple itup)
_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
/* chain to a new overflow page */
buf = _hash_addovflpage(rel, metabuf, buf);
buf = _hash_addovflpage(rel, metabuf, buf, (buf == bucket_buf) ? true : false);
page = BufferGetPage(buf);
/* should fit now, given test above */
......@@ -158,11 +207,14 @@ _hash_doinsert(Relation rel, IndexTuple itup)
/* found page with enough space, so add the item here */
(void) _hash_pgaddtup(rel, buf, itemsz, itup);
/* write and release the modified page */
/*
* write and release the modified page. if the page we modified was an
* overflow page, we also need to separately drop the pin we retained on
* the primary bucket page.
*/
_hash_wrtbuf(rel, buf);
/* We can drop the bucket lock now */
_hash_droplock(rel, blkno, HASH_SHARE);
if (buf != bucket_buf)
_hash_dropbuf(rel, bucket_buf);
/*
* Write-lock the metapage so we can increment the tuple count. After
......
......@@ -82,23 +82,20 @@ blkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
*
* On entry, the caller must hold a pin but no lock on 'buf'. The pin is
* dropped before exiting (we assume the caller is not interested in 'buf'
* anymore). The returned overflow page will be pinned and write-locked;
* it is guaranteed to be empty.
* anymore) if not asked to retain. The pin will be retained only for the
* primary bucket. The returned overflow page will be pinned and
* write-locked; it is guaranteed to be empty.
*
* The caller must hold a pin, but no lock, on the metapage buffer.
* That buffer is returned in the same state.
*
* The caller must hold at least share lock on the bucket, to ensure that
* no one else tries to compact the bucket meanwhile. This guarantees that
* 'buf' won't stop being part of the bucket while it's unlocked.
*
* NB: since this could be executed concurrently by multiple processes,
* one should not assume that the returned overflow page will be the
* immediate successor of the originally passed 'buf'. Additional overflow
* pages might have been added to the bucket chain in between.
*/
Buffer
_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
{
Buffer ovflbuf;
Page page;
......@@ -131,7 +128,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
break;
/* we assume we do not need to write the unmodified page */
_hash_relbuf(rel, buf);
if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
else
_hash_relbuf(rel, buf);
buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
}
......@@ -149,7 +149,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
/* logically chain overflow page to previous page */
pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
_hash_wrtbuf(rel, buf);
if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
else
_hash_wrtbuf(rel, buf);
return ovflbuf;
}
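The release pattern that recurs in this patch (dirty or clean page, pin retained or dropped) reduces to two independent decisions. The sketch below is only an illustration under that reading; ToyRelease and toy_release_page are hypothetical names, and the comments map each case to the call used in the surrounding code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of what happens when a page lock is given up. */
typedef struct ToyRelease
{
	bool		mark_dirty;		/* page must be written out */
	bool		drop_pin;		/* pin is released along with the lock */
} ToyRelease;

/*
 * The lock is always released; only dirtiness and the pin vary:
 *
 *   dirty,  retain pin -> _hash_chgbufaccess(..., HASH_WRITE, HASH_NOLOCK)
 *   dirty,  drop pin   -> _hash_wrtbuf(...)
 *   clean,  retain pin -> _hash_chgbufaccess(..., HASH_READ, HASH_NOLOCK)
 *   clean,  drop pin   -> _hash_relbuf(...)
 */
static ToyRelease
toy_release_page(bool page_dirty, bool retain_pin)
{
	ToyRelease	r;

	r.mark_dirty = page_dirty;
	r.drop_pin = !retain_pin;
	return r;
}
```

The pin is retained only for the primary bucket page, so for overflow pages retain_pin would always be false in this model.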
......@@ -369,21 +372,25 @@ _hash_firstfreebit(uint32 map)
* Returns the block number of the page that followed the given page
* in the bucket, or InvalidBlockNumber if no following page.
*
* NB: caller must not hold lock on metapage, nor on either page that's
* adjacent in the bucket chain. The caller had better hold exclusive lock
* on the bucket, too.
* NB: caller must not hold a lock on the metapage, nor on the page that's
* next to ovflbuf in the bucket chain. We don't acquire the lock on the page
* that's prior to ovflbuf in the chain if it is the same as wbuf, because the
* caller already has a lock on it. This function releases the lock on wbuf,
* and the caller is responsible for releasing the pin on it.
*/
BlockNumber
_hash_freeovflpage(Relation rel, Buffer ovflbuf,
BufferAccessStrategy bstrategy)
_hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
bool wbuf_dirty, BufferAccessStrategy bstrategy)
{
HashMetaPage metap;
Buffer metabuf;
Buffer mapbuf;
Buffer prevbuf = InvalidBuffer;
BlockNumber ovflblkno;
BlockNumber prevblkno;
BlockNumber blkno;
BlockNumber nextblkno;
BlockNumber writeblkno;
HashPageOpaque ovflopaque;
Page ovflpage;
Page mappage;
......@@ -400,6 +407,7 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf,
ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
nextblkno = ovflopaque->hasho_nextblkno;
prevblkno = ovflopaque->hasho_prevblkno;
writeblkno = BufferGetBlockNumber(wbuf);
bucket = ovflopaque->hasho_bucket;
/*
......@@ -413,23 +421,39 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf,
/*
* Fix up the bucket chain. this is a doubly-linked list, so we must fix
* up the bucket chain members behind and ahead of the overflow page being
* deleted. No concurrency issues since we hold exclusive lock on the
* entire bucket.
* deleted. Concurrency issues are avoided by using lock chaining as
* described atop hashbucketcleanup.
*/
if (BlockNumberIsValid(prevblkno))
{
Buffer prevbuf = _hash_getbuf_with_strategy(rel,
prevblkno,
HASH_WRITE,
Page prevpage;
HashPageOpaque prevopaque;
if (prevblkno == writeblkno)
prevbuf = wbuf;
else
prevbuf = _hash_getbuf_with_strategy(rel,
prevblkno,
HASH_WRITE,
LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
bstrategy);
Page prevpage = BufferGetPage(prevbuf);
HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
bstrategy);
prevpage = BufferGetPage(prevbuf);
prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
Assert(prevopaque->hasho_bucket == bucket);
prevopaque->hasho_nextblkno = nextblkno;
_hash_wrtbuf(rel, prevbuf);
if (prevblkno != writeblkno)
_hash_wrtbuf(rel, prevbuf);
}
/* write and unlock the write buffer */
if (wbuf_dirty)
_hash_chgbufaccess(rel, wbuf, HASH_WRITE, HASH_NOLOCK);
else
_hash_chgbufaccess(rel, wbuf, HASH_READ, HASH_NOLOCK);
if (BlockNumberIsValid(nextblkno))
{
Buffer nextbuf = _hash_getbuf_with_strategy(rel,
......@@ -570,8 +594,15 @@ _hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
* required that to be true on entry as well, but it's a lot easier for
* callers to leave empty overflow pages and let this guy clean it up.
*
* Caller must hold exclusive lock on the target bucket. This allows
* us to safely lock multiple pages in the bucket.
* Caller must acquire cleanup lock on the primary page of the target
* bucket to exclude any scans that are in progress, which could easily
* be confused into returning the same tuple more than once or some tuples
* not at all by the rearrangement we are performing here. To prevent
* any concurrent scan from crossing the squeeze scan, we use lock chaining
* similar to hashbucketcleanup. Refer to the comments atop hashbucketcleanup.
*
* We need to retain a pin on the primary bucket to ensure that no concurrent
* split can start.
*
* Since this function is invoked in VACUUM, we provide an access strategy
* parameter that controls fetches of the bucket pages.
......@@ -580,6 +611,7 @@ void
_hash_squeezebucket(Relation rel,
Bucket bucket,
BlockNumber bucket_blkno,
Buffer bucket_buf,
BufferAccessStrategy bstrategy)
{
BlockNumber wblkno;
......@@ -593,23 +625,20 @@ _hash_squeezebucket(Relation rel,
bool wbuf_dirty;
/*
* start squeezing into the base bucket page.
* start squeezing into the primary bucket page.
*/
wblkno = bucket_blkno;
wbuf = _hash_getbuf_with_strategy(rel,
wblkno,
HASH_WRITE,
LH_BUCKET_PAGE,
bstrategy);
wbuf = bucket_buf;
wpage = BufferGetPage(wbuf);
wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
/*
* if there aren't any overflow pages, there's nothing to squeeze.
* if there aren't any overflow pages, there's nothing to squeeze. caller
* is responsible for releasing the pin on primary bucket page.
*/
if (!BlockNumberIsValid(wopaque->hasho_nextblkno))
{
_hash_relbuf(rel, wbuf);
_hash_chgbufaccess(rel, wbuf, HASH_WRITE, HASH_NOLOCK);
return;
}
......@@ -646,6 +675,7 @@ _hash_squeezebucket(Relation rel,
OffsetNumber maxroffnum;
OffsetNumber deletable[MaxOffsetNumber];
int ndeletable = 0;
bool retain_pin = false;
/* Scan each tuple in "read" page */
maxroffnum = PageGetMaxOffsetNumber(rpage);
......@@ -671,13 +701,37 @@ _hash_squeezebucket(Relation rel,
*/
while (PageGetFreeSpace(wpage) < itemsz)
{
Buffer next_wbuf = InvalidBuffer;
Assert(!PageIsEmpty(wpage));
if (wblkno == bucket_blkno)
retain_pin = true;
wblkno = wopaque->hasho_nextblkno;
Assert(BlockNumberIsValid(wblkno));
/* don't need to move to next page if we reached the read page */
if (wblkno != rblkno)
next_wbuf = _hash_getbuf_with_strategy(rel,
wblkno,
HASH_WRITE,
LH_OVERFLOW_PAGE,
bstrategy);
/*
* release the lock on previous page after acquiring the lock
* on next page
*/
if (wbuf_dirty)
_hash_wrtbuf(rel, wbuf);
{
if (retain_pin)
_hash_chgbufaccess(rel, wbuf, HASH_WRITE, HASH_NOLOCK);
else
_hash_wrtbuf(rel, wbuf);
}
else if (retain_pin)
_hash_chgbufaccess(rel, wbuf, HASH_READ, HASH_NOLOCK);
else
_hash_relbuf(rel, wbuf);
......@@ -695,15 +749,12 @@ _hash_squeezebucket(Relation rel,
return;
}
wbuf = _hash_getbuf_with_strategy(rel,
wblkno,
HASH_WRITE,
LH_OVERFLOW_PAGE,
bstrategy);
wbuf = next_wbuf;
wpage = BufferGetPage(wbuf);
wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
Assert(wopaque->hasho_bucket == bucket);
wbuf_dirty = false;
retain_pin = false;
}
/*
......@@ -728,28 +779,29 @@ _hash_squeezebucket(Relation rel,
* Tricky point here: if our read and write pages are adjacent in the
* bucket chain, our write lock on wbuf will conflict with
* _hash_freeovflpage's attempt to update the sibling links of the
* removed page. However, in that case we are done anyway, so we can
* simply drop the write lock before calling _hash_freeovflpage.
* removed page. In that case, we don't need to lock it again and we
* always release the lock on wbuf in _hash_freeovflpage and then
* retake it again here. This will not only simplify the code, but is
* required to atomically log the changes which will be helpful when
* we write WAL for hash indexes.
*/
rblkno = ropaque->hasho_prevblkno;
Assert(BlockNumberIsValid(rblkno));
/* free this overflow page (releases rbuf) */
_hash_freeovflpage(rel, rbuf, wbuf, wbuf_dirty, bstrategy);
/* are we freeing the page adjacent to wbuf? */
if (rblkno == wblkno)
{
/* yes, so release wbuf lock first */
if (wbuf_dirty)
_hash_wrtbuf(rel, wbuf);
else
_hash_relbuf(rel, wbuf);
/* free this overflow page (releases rbuf) */
_hash_freeovflpage(rel, rbuf, bstrategy);
/* done */
/* retain the pin on primary bucket page till end of bucket scan */
if (wblkno != bucket_blkno)
_hash_dropbuf(rel, wbuf);
return;
}
/* free this overflow page, then get the previous one */
_hash_freeovflpage(rel, rbuf, bstrategy);
/* lock the overflow page being written, then get the previous one */
_hash_chgbufaccess(rel, wbuf, HASH_NOLOCK, HASH_WRITE);
rbuf = _hash_getbuf_with_strategy(rel,
rblkno,
......
......@@ -38,10 +38,14 @@ static bool _hash_alloc_buckets(Relation rel, BlockNumber firstblock,
uint32 nblocks);
static void _hash_splitbucket(Relation rel, Buffer metabuf,
Bucket obucket, Bucket nbucket,
BlockNumber start_oblkno,
Buffer obuf,
Buffer nbuf,
uint32 maxbucket,
uint32 highmask, uint32 lowmask);
static void _hash_splitbucket_guts(Relation rel, Buffer metabuf,
Bucket obucket, Bucket nbucket, Buffer obuf,
Buffer nbuf, HTAB *htab, uint32 maxbucket,
uint32 highmask, uint32 lowmask);
/*
......@@ -54,46 +58,6 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
#define USELOCKING(rel) (!RELATION_IS_LOCAL(rel))
/*
* _hash_getlock() -- Acquire an lmgr lock.
*
* 'whichlock' should be the block number of a bucket's primary bucket page to
* acquire the per-bucket lock. (See README for details of the use of these
* locks.)
*
* 'access' must be HASH_SHARE or HASH_EXCLUSIVE.
*/
void
_hash_getlock(Relation rel, BlockNumber whichlock, int access)
{
if (USELOCKING(rel))
LockPage(rel, whichlock, access);
}
/*
* _hash_try_getlock() -- Acquire an lmgr lock, but only if it's free.
*
* Same as above except we return FALSE without blocking if lock isn't free.
*/
bool
_hash_try_getlock(Relation rel, BlockNumber whichlock, int access)
{
if (USELOCKING(rel))
return ConditionalLockPage(rel, whichlock, access);
else
return true;
}
/*
* _hash_droplock() -- Release an lmgr lock.
*/
void
_hash_droplock(Relation rel, BlockNumber whichlock, int access)
{
if (USELOCKING(rel))
UnlockPage(rel, whichlock, access);
}
/*
* _hash_getbuf() -- Get a buffer by block number for read or write.
*
......@@ -131,6 +95,35 @@ _hash_getbuf(Relation rel, BlockNumber blkno, int access, int flags)
return buf;
}
/*
* _hash_getbuf_with_condlock_cleanup() -- Try to get a buffer for cleanup.
*
* We read the page and try to acquire a cleanup lock. If we get it,
* we return the buffer; otherwise, we return InvalidBuffer.
*/
Buffer
_hash_getbuf_with_condlock_cleanup(Relation rel, BlockNumber blkno, int flags)
{
Buffer buf;
if (blkno == P_NEW)
elog(ERROR, "hash AM does not use P_NEW");
buf = ReadBuffer(rel, blkno);
if (!ConditionalLockBufferForCleanup(buf))
{
ReleaseBuffer(buf);
return InvalidBuffer;
}
/* ref count and lock type are correct */
_hash_checkpage(rel, buf, flags);
return buf;
}
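A cleanup lock is an exclusive lock that is granted only when the requester holds the sole pin on the buffer. The conditional variant used above can be sketched as follows; this is a toy model under that assumption, with hypothetical names (ToyBuffer, toy_conditional_cleanup_lock), and it ignores share locks and real concurrency.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a shared buffer header. */
typedef struct ToyBuffer
{
	int			pin_count;		/* number of pins held by all backends */
	bool		exclusively_locked;
} ToyBuffer;

/*
 * Pin the buffer, then try to take a cleanup lock without waiting.
 * Succeeds only if nobody holds the lock and ours is the only pin.
 * On failure the pin is dropped again, mirroring the ReleaseBuffer()
 * call in _hash_getbuf_with_condlock_cleanup.
 */
static bool
toy_conditional_cleanup_lock(ToyBuffer *buf)
{
	buf->pin_count++;			/* the pin taken by reading the buffer */

	if (buf->exclusively_locked || buf->pin_count != 1)
	{
		buf->pin_count--;		/* give up: unpin and report failure */
		return false;
	}

	buf->exclusively_locked = true;
	return true;
}
```

Because a scan keeps a pin on the primary bucket page for its whole lifetime, any in-progress scan makes this call fail in the model, which is how a split is kept from starting underneath it.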
/*
* _hash_getinitbuf() -- Get and initialize a buffer by block number.
*
......@@ -265,6 +258,37 @@ _hash_dropbuf(Relation rel, Buffer buf)
ReleaseBuffer(buf);
}
/*
* _hash_dropscanbuf() -- release buffers used in scan.
*
* This routine unpins the buffers used during scan on which we
* hold no lock.
*/
void
_hash_dropscanbuf(Relation rel, HashScanOpaque so)
{
/* release pin we hold on primary bucket page */
if (BufferIsValid(so->hashso_bucket_buf) &&
so->hashso_bucket_buf != so->hashso_curbuf)
_hash_dropbuf(rel, so->hashso_bucket_buf);
so->hashso_bucket_buf = InvalidBuffer;
/* release pin we hold on primary bucket page of bucket being split */
if (BufferIsValid(so->hashso_split_bucket_buf) &&
so->hashso_split_bucket_buf != so->hashso_curbuf)
_hash_dropbuf(rel, so->hashso_split_bucket_buf);
so->hashso_split_bucket_buf = InvalidBuffer;
/* release any pin we still hold */
if (BufferIsValid(so->hashso_curbuf))
_hash_dropbuf(rel, so->hashso_curbuf);
so->hashso_curbuf = InvalidBuffer;
/* reset split scan */
so->hashso_buc_populated = false;
so->hashso_buc_split = false;
}
/*
* _hash_wrtbuf() -- write a hash page to disk.
*
......@@ -489,9 +513,11 @@ _hash_pageinit(Page page, Size size)
/*
* Attempt to expand the hash table by creating one new bucket.
*
* This will silently do nothing if it cannot get the needed locks.
* This will silently do nothing if we don't get cleanup lock on old or
* new bucket.
*
* The caller should hold no locks on the hash index.
* Complete the pending splits and remove the tuples from old bucket,
* if there are any left over from the previous split.
*
* The caller must hold a pin, but no lock, on the metapage buffer.
* The buffer is returned in the same state.
......@@ -506,10 +532,15 @@ _hash_expandtable(Relation rel, Buffer metabuf)
BlockNumber start_oblkno;
BlockNumber start_nblkno;
Buffer buf_nblkno;
Buffer buf_oblkno;
Page opage;
HashPageOpaque oopaque;
uint32 maxbucket;
uint32 highmask;
uint32 lowmask;
restart_expand:
/*
* Write-lock the meta page. It used to be necessary to acquire a
* heavyweight lock to begin a split, but that is no longer required.
......@@ -548,11 +579,16 @@ _hash_expandtable(Relation rel, Buffer metabuf)
goto fail;
/*
* Determine which bucket is to be split, and attempt to lock the old
* bucket. If we can't get the lock, give up.
* Determine which bucket is to be split, and attempt to take cleanup lock
* on the old bucket. If we can't get the lock, give up.
*
* The cleanup lock protects us not only against other backends, but
* against our own backend as well.
*
* The lock protects us against other backends, but not against our own
* backend. Must check for active scans separately.
* The cleanup lock is mainly to protect the split from concurrent
* inserts. See src/backend/access/hash/README, Lock Definitions for
* further details. Due to this locking restriction, if there is any
* pending scan, the split will give up which is not good, but harmless.
*/
new_bucket = metap->hashm_maxbucket + 1;
......@@ -560,14 +596,78 @@ _hash_expandtable(Relation rel, Buffer metabuf)
start_oblkno = BUCKET_TO_BLKNO(metap, old_bucket);
if (_hash_has_active_scan(rel, old_bucket))
buf_oblkno = _hash_getbuf_with_condlock_cleanup(rel, start_oblkno, LH_BUCKET_PAGE);
if (!buf_oblkno)
goto fail;
if (!_hash_try_getlock(rel, start_oblkno, HASH_EXCLUSIVE))
goto fail;
opage = BufferGetPage(buf_oblkno);
oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
/*
* We want to finish the split from a bucket as there is no apparent
* benefit by not doing so and it will make the code complicated to finish
* the split that involves multiple buckets considering the case where new
* split also fails. We don't need to consider the new bucket for
* completing the split here as it is not possible that a re-split of new
* bucket starts when there is still a pending split from old bucket.
*/
if (H_BUCKET_BEING_SPLIT(oopaque))
{
/*
* Copy bucket mapping info now; refer the comment in code below where
* we copy this information before calling _hash_splitbucket to see
* why this is okay.
*/
maxbucket = metap->hashm_maxbucket;
highmask = metap->hashm_highmask;
lowmask = metap->hashm_lowmask;
/*
* Release the lock on metapage and old_bucket, before completing the
* split.
*/
_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
_hash_chgbufaccess(rel, buf_oblkno, HASH_READ, HASH_NOLOCK);
_hash_finish_split(rel, metabuf, buf_oblkno, old_bucket, maxbucket,
highmask, lowmask);
/* release the pin on old buffer and retry for expand. */
_hash_dropbuf(rel, buf_oblkno);
goto restart_expand;
}
/*
* Likewise lock the new bucket (should never fail).
* Clean up the tuples remaining from the previous split. This operation
* requires a cleanup lock and we already have one on the old bucket, so
* let's do it. We also don't want to allow further splits from the bucket
* till the garbage of the previous split is cleaned. This has two
* advantages: first, it helps in avoiding the bloat due to garbage, and
* second, during cleanup of the bucket, we are always sure that the
* garbage tuples belong to the most recently split bucket. On the contrary,
* if we allowed cleanup of the bucket after the meta page is updated to
* indicate the new split and before the actual split, the cleanup
* operation wouldn't be able to decide whether a tuple has been moved to
* the newly created bucket, and could end up deleting such tuples.
*/
if (H_NEEDS_SPLIT_CLEANUP(oopaque))
{
/* Release the metapage lock. */
_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
hashbucketcleanup(rel, old_bucket, buf_oblkno, start_oblkno, NULL,
metap->hashm_maxbucket, metap->hashm_highmask,
metap->hashm_lowmask, NULL,
NULL, true, NULL, NULL);
_hash_dropbuf(rel, buf_oblkno);
goto restart_expand;
}
/*
* There shouldn't be any active scan on new bucket.
*
* Note: it is safe to compute the new bucket's blkno here, even though we
* may still need to update the BUCKET_TO_BLKNO mapping. This is because
......@@ -576,12 +676,6 @@ _hash_expandtable(Relation rel, Buffer metabuf)
*/
start_nblkno = BUCKET_TO_BLKNO(metap, new_bucket);
if (_hash_has_active_scan(rel, new_bucket))
elog(ERROR, "scan in progress on supposedly new bucket");
if (!_hash_try_getlock(rel, start_nblkno, HASH_EXCLUSIVE))
elog(ERROR, "could not get lock on supposedly new bucket");
/*
* If the split point is increasing (hashm_maxbucket's log base 2
* increases), we need to allocate a new batch of bucket pages.
......@@ -600,8 +694,7 @@ _hash_expandtable(Relation rel, Buffer metabuf)
if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
{
/* can't split due to BlockNumber overflow */
_hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
_hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
_hash_relbuf(rel, buf_oblkno);
goto fail;
}
}
......@@ -609,9 +702,18 @@ _hash_expandtable(Relation rel, Buffer metabuf)
/*
* Physically allocate the new bucket's primary page. We want to do this
* before changing the metapage's mapping info, in case we can't get the
* disk space.
* disk space. Ideally, we don't need to check for cleanup lock on new
* bucket as no other backend could find this bucket unless meta page is
* updated. However, it is good to be consistent with old bucket locking.
*/
buf_nblkno = _hash_getnewbuf(rel, start_nblkno, MAIN_FORKNUM);
if (!IsBufferCleanupOK(buf_nblkno))
{
_hash_relbuf(rel, buf_oblkno);
_hash_relbuf(rel, buf_nblkno);
goto fail;
}
/*
* Okay to proceed with split. Update the metapage bucket mapping info.
......@@ -665,13 +767,9 @@ _hash_expandtable(Relation rel, Buffer metabuf)
/* Relocate records to the new bucket */
_hash_splitbucket(rel, metabuf,
old_bucket, new_bucket,
start_oblkno, buf_nblkno,
buf_oblkno, buf_nblkno,
maxbucket, highmask, lowmask);
/* Release bucket locks, allowing others to access them */
_hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
_hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
return;
/* Here if we decide not to split or fail to acquire the old bucket lock */
......@@ -738,13 +836,17 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
* belong in the new bucket, and compress out any free space in the old
* bucket.
*
* The caller must hold exclusive locks on both buckets to ensure that
* The caller must hold cleanup locks on both buckets to ensure that
* no one else is trying to access them (see README).
*
* The caller must hold a pin, but no lock, on the metapage buffer.
* The buffer is returned in the same state. (The metapage is only
* touched if it becomes necessary to add or remove overflow pages.)
*
* Split needs to retain pin on primary bucket pages of both old and new
* buckets till end of operation. This is to prevent vacuum from starting
* while a split is in progress.
*
* In addition, the caller must have created the new bucket's base page,
* which is passed in buffer nbuf, pinned and write-locked. That lock and
* pin are released here. (The API is set up this way because we must do
......@@ -756,37 +858,86 @@ _hash_splitbucket(Relation rel,
Buffer metabuf,
Bucket obucket,
Bucket nbucket,
BlockNumber start_oblkno,
Buffer obuf,
Buffer nbuf,
uint32 maxbucket,
uint32 highmask,
uint32 lowmask)
{
Buffer obuf;
Page opage;
Page npage;
HashPageOpaque oopaque;
HashPageOpaque nopaque;
/*
* It should be okay to simultaneously write-lock pages from each bucket,
* since no one else can be trying to acquire buffer lock on pages of
* either bucket.
*/
obuf = _hash_getbuf(rel, start_oblkno, HASH_WRITE, LH_BUCKET_PAGE);
opage = BufferGetPage(obuf);
oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
/*
* Mark the old bucket to indicate that split is in progress. At
* operation end, we clear split-in-progress flag.
*/
oopaque->hasho_flag |= LH_BUCKET_BEING_SPLIT;
npage = BufferGetPage(nbuf);
/* initialize the new bucket's primary page */
/*
* initialize the new bucket's primary page and mark it to indicate that
* split is in progress.
*/
nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
nopaque->hasho_prevblkno = InvalidBlockNumber;
nopaque->hasho_nextblkno = InvalidBlockNumber;
nopaque->hasho_bucket = nbucket;
nopaque->hasho_flag = LH_BUCKET_PAGE;
nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_BEING_POPULATED;
nopaque->hasho_page_id = HASHO_PAGE_ID;
_hash_splitbucket_guts(rel, metabuf, obucket,
nbucket, obuf, nbuf, NULL,
maxbucket, highmask, lowmask);
/* all done, now release the locks and pins on primary buckets. */
_hash_relbuf(rel, obuf);
_hash_relbuf(rel, nbuf);
}
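The maxbucket, highmask, and lowmask values passed around above decide whether a tuple stays in the old bucket or moves to the new one. A minimal sketch of that mapping, assuming the usual linear-hashing rule (mask with highmask, fall back to lowmask when the result exceeds maxbucket); toy_hashkey2bucket is a hypothetical name and not the function in the tree.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map a hash key to a bucket number.  Buckets above maxbucket do not
 * exist yet, so keys that would land there fall back to the bucket
 * selected by the smaller mask.
 */
static uint32_t
toy_hashkey2bucket(uint32_t hashkey, uint32_t maxbucket,
				   uint32_t highmask, uint32_t lowmask)
{
	uint32_t	bucket = hashkey & highmask;

	if (bucket > maxbucket)
		bucket = bucket & lowmask;

	return bucket;
}
```

For example, with highmask 7 and lowmask 3, hash key 4 maps to bucket 0 while maxbucket is 3, and to bucket 4 once maxbucket has been advanced to 4. Such a tuple is one the split has to move from old bucket 0 to new bucket 4, while hash key 8 stays in bucket 0 in both cases.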
/*
* _hash_splitbucket_guts -- Helper function to perform the split operation
*
* This routine is used to partition the tuples between old and new bucket and
* to finish incomplete split operations. To finish the previously
* interrupted split operation, caller needs to fill htab. If htab is set, then
* we skip the movement of tuples that exist in htab; a NULL htab means that
* all the tuples that belong to the new bucket are moved.
*
* Caller needs to lock and unlock the old and new primary buckets.
*/
static void
_hash_splitbucket_guts(Relation rel,
Buffer metabuf,
Bucket obucket,
Bucket nbucket,
Buffer obuf,
Buffer nbuf,
HTAB *htab,
uint32 maxbucket,
uint32 highmask,
uint32 lowmask)
{
Buffer bucket_obuf;
Buffer bucket_nbuf;
Page opage;
Page npage;
HashPageOpaque oopaque;
HashPageOpaque nopaque;
bucket_obuf = obuf;
opage = BufferGetPage(obuf);
oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
bucket_nbuf = nbuf;
npage = BufferGetPage(nbuf);
nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
/*
* Partition the tuples in the old bucket between the old bucket and the
* new bucket, advancing along the old bucket's overflow bucket chain and
......@@ -798,8 +949,6 @@ _hash_splitbucket(Relation rel,
BlockNumber oblkno;
OffsetNumber ooffnum;
OffsetNumber omaxoffnum;
OffsetNumber deletable[MaxOffsetNumber];
int ndeletable = 0;
/* Scan each tuple in old page */
omaxoffnum = PageGetMaxOffsetNumber(opage);
......@@ -810,33 +959,52 @@ _hash_splitbucket(Relation rel,
IndexTuple itup;
Size itemsz;
Bucket bucket;
bool found = false;
/* skip dead tuples */
if (ItemIdIsDead(PageGetItemId(opage, ooffnum)))
continue;
/*
* Fetch the item's hash key (conveniently stored in the item) and
* determine which bucket it now belongs in.
* Before inserting a tuple, probe the hash table containing TIDs
* of tuples belonging to new bucket, if we find a match, then
* skip that tuple, else fetch the item's hash key (conveniently
* stored in the item) and determine which bucket it now belongs
* in.
*/
itup = (IndexTuple) PageGetItem(opage,
PageGetItemId(opage, ooffnum));
if (htab)
(void) hash_search(htab, &itup->t_tid, HASH_FIND, &found);
if (found)
continue;
bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
maxbucket, highmask, lowmask);
if (bucket == nbucket)
{
IndexTuple new_itup;
/*
* make a copy of index tuple as we have to scribble on it.
*/
new_itup = CopyIndexTuple(itup);
/*
* mark the index tuple as moved by split, such tuples are
* skipped by scan if there is split in progress for a bucket.
*/
new_itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK;
/*
* insert the tuple into the new bucket. if it doesn't fit on
* the current page in the new bucket, we must allocate a new
* overflow page and place the tuple on that page instead.
*
* XXX we have a problem here if we fail to get space for a
* new overflow page: we'll error out leaving the bucket split
* only partially complete, meaning the index is corrupt,
* since searches may fail to find entries they should find.
*/
itemsz = IndexTupleDSize(*itup);
itemsz = IndexTupleDSize(*new_itup);
itemsz = MAXALIGN(itemsz);
if (PageGetFreeSpace(npage) < itemsz)
......@@ -844,9 +1012,9 @@ _hash_splitbucket(Relation rel,
/* write out nbuf and drop lock, but keep pin */
_hash_chgbufaccess(rel, nbuf, HASH_WRITE, HASH_NOLOCK);
/* chain to a new overflow page */
nbuf = _hash_addovflpage(rel, metabuf, nbuf);
nbuf = _hash_addovflpage(rel, metabuf, nbuf, (nbuf == bucket_nbuf) ? true : false);
npage = BufferGetPage(nbuf);
/* we don't need nopaque within the loop */
nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
}
/*
......@@ -856,12 +1024,10 @@ _hash_splitbucket(Relation rel,
* Possible future improvement: accumulate all the items for
* the new page and qsort them before insertion.
*/
(void) _hash_pgaddtup(rel, nbuf, itemsz, itup);
(void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
/*
* Mark tuple for deletion from old page.
*/
deletable[ndeletable++] = ooffnum;
/* be tidy */
pfree(new_itup);
}
else
{
......@@ -874,15 +1040,9 @@ _hash_splitbucket(Relation rel,
oblkno = oopaque->hasho_nextblkno;
/*
* Done scanning this old page. If we moved any tuples, delete them
* from the old page.
*/
if (ndeletable > 0)
{
PageIndexMultiDelete(opage, deletable, ndeletable);
_hash_wrtbuf(rel, obuf);
}
/* retain the pin on the old primary bucket */
if (obuf == bucket_obuf)
_hash_chgbufaccess(rel, obuf, HASH_READ, HASH_NOLOCK);
else
_hash_relbuf(rel, obuf);
......@@ -891,18 +1051,169 @@ _hash_splitbucket(Relation rel,
break;
/* Else, advance to next old page */
obuf = _hash_getbuf(rel, oblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
obuf = _hash_getbuf(rel, oblkno, HASH_READ, LH_OVERFLOW_PAGE);
opage = BufferGetPage(obuf);
oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
}
/*
* We're at the end of the old bucket chain, so we're done partitioning
* the tuples. Before quitting, call _hash_squeezebucket to ensure the
* tuples remaining in the old bucket (including the overflow pages) are
* packed as tightly as possible. The new bucket is already tight.
* the tuples. Mark the old and new buckets to indicate split is
* finished.
*
* To avoid deadlocks due to locking order of buckets, first lock the old
* bucket and then the new bucket.
*/
_hash_wrtbuf(rel, nbuf);
if (nbuf == bucket_nbuf)
_hash_chgbufaccess(rel, bucket_nbuf, HASH_WRITE, HASH_NOLOCK);
else
_hash_wrtbuf(rel, nbuf);
_hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
opage = BufferGetPage(bucket_obuf);
oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
_hash_chgbufaccess(rel, bucket_nbuf, HASH_NOLOCK, HASH_WRITE);
npage = BufferGetPage(bucket_nbuf);
nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
nopaque->hasho_flag &= ~LH_BUCKET_BEING_POPULATED;
/*
* After the split is finished, mark the old bucket to indicate that it
* contains deletable tuples. Vacuum will clear split-cleanup flag after
* deleting such tuples.
*/
oopaque->hasho_flag |= LH_BUCKET_NEEDS_SPLIT_CLEANUP;
/*
* now write the buffers, here we don't release the locks as caller is
* responsible to release locks.
*/
MarkBufferDirty(bucket_obuf);
MarkBufferDirty(bucket_nbuf);
}
/*
* _hash_finish_split() -- Finish the previously interrupted split operation
*
* To complete the split operation, we form the hash table of TIDs in new
* bucket which is then used by split operation to skip tuples that are
* already moved before the split operation was previously interrupted.
*
* The caller must hold a pin, but no lock, on the metapage and old bucket's
* primary page buffer. The buffers are returned in the same state. (The
* metapage is only touched if it becomes necessary to add or remove overflow
* pages.)
*/
void
_hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Bucket obucket,
uint32 maxbucket, uint32 highmask, uint32 lowmask)
{
HASHCTL hash_ctl;
HTAB *tidhtab;
Buffer bucket_nbuf = InvalidBuffer;
Buffer nbuf;
Page npage;
BlockNumber nblkno;
BlockNumber bucket_nblkno;
HashPageOpaque npageopaque;
Bucket nbucket;
bool found;
/* Initialize hash tables used to track TIDs */
memset(&hash_ctl, 0, sizeof(hash_ctl));
hash_ctl.keysize = sizeof(ItemPointerData);
hash_ctl.entrysize = sizeof(ItemPointerData);
hash_ctl.hcxt = CurrentMemoryContext;
tidhtab =
hash_create("bucket ctids",
256, /* arbitrary initial size */
&hash_ctl,
HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
bucket_nblkno = nblkno = _hash_get_newblock_from_oldbucket(rel, obucket);
/*
* Scan the new bucket and build hash table of TIDs
*/
for (;;)
{
OffsetNumber noffnum;
OffsetNumber nmaxoffnum;
nbuf = _hash_getbuf(rel, nblkno, HASH_READ,
LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
/* remember the primary bucket buffer to acquire cleanup lock on it. */
if (nblkno == bucket_nblkno)
bucket_nbuf = nbuf;
npage = BufferGetPage(nbuf);
npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
/* Scan each tuple in new page */
nmaxoffnum = PageGetMaxOffsetNumber(npage);
for (noffnum = FirstOffsetNumber;
noffnum <= nmaxoffnum;
noffnum = OffsetNumberNext(noffnum))
{
IndexTuple itup;
/* Fetch the item's TID and insert it in hash table. */
itup = (IndexTuple) PageGetItem(npage,
PageGetItemId(npage, noffnum));
(void) hash_search(tidhtab, &itup->t_tid, HASH_ENTER, &found);
Assert(!found);
}
nblkno = npageopaque->hasho_nextblkno;
/*
* release our read lock without modifying the buffer, making sure to
* retain the pin on the primary bucket.
*/
if (nbuf == bucket_nbuf)
_hash_chgbufaccess(rel, nbuf, HASH_READ, HASH_NOLOCK);
else
_hash_relbuf(rel, nbuf);
/* Exit loop if no more overflow pages in new bucket */
if (!BlockNumberIsValid(nblkno))
break;
}
/*
* Conditionally get the cleanup lock on old and new buckets to perform
* the split operation. If we don't get the cleanup locks, silently give
* up and next insertion on old bucket will try again to complete the
* split.
*/
if (!ConditionalLockBufferForCleanup(obuf))
{
hash_destroy(tidhtab);
return;
}
if (!ConditionalLockBufferForCleanup(bucket_nbuf))
{
_hash_chgbufaccess(rel, obuf, HASH_READ, HASH_NOLOCK);
hash_destroy(tidhtab);
return;
}
npage = BufferGetPage(bucket_nbuf);
npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
nbucket = npageopaque->hasho_bucket;
_hash_splitbucket_guts(rel, metabuf, obucket,
nbucket, obuf, bucket_nbuf, tidhtab,
maxbucket, highmask, lowmask);
_hash_squeezebucket(rel, obucket, start_oblkno, NULL);
_hash_relbuf(rel, bucket_nbuf);
_hash_chgbufaccess(rel, obuf, HASH_READ, HASH_NOLOCK);
hash_destroy(tidhtab);
}
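The resume logic of _hash_finish_split can be sketched in isolation. The following is a minimal model, not the patch's code: buckets are plain arrays, the TID lookup is a linear search instead of a dynahash table, and the names Tid, Tuple, hashkey_to_bucket, tid_already_moved and resume_split are hypothetical. The bucket mapping is assumed to follow the usual highmask/lowmask scheme used by _hash_hashkey2bucket.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for a heap TID and an index tuple. */
typedef struct { uint32_t block; uint16_t offset; } Tid;
typedef struct { Tid tid; uint32_t hashkey; } Tuple;

/* Assumed mapping from hash key to bucket number. */
static uint32_t
hashkey_to_bucket(uint32_t hashkey, uint32_t maxbucket,
                  uint32_t highmask, uint32_t lowmask)
{
    uint32_t bucket = hashkey & highmask;

    if (bucket > maxbucket)
        bucket &= lowmask;
    return bucket;
}

/* Linear search standing in for the hash table of TIDs. */
static int
tid_already_moved(const Tuple *moved, size_t nmoved, Tid tid)
{
    size_t i;

    for (i = 0; i < nmoved; i++)
        if (moved[i].tid.block == tid.block &&
            moved[i].tid.offset == tid.offset)
            return 1;
    return 0;
}

/*
 * Copy every tuple of the old bucket that belongs to nbucket and was not
 * copied before the interruption. Returns the new tuple count of the new
 * bucket. The old bucket is left untouched; cleanup happens later.
 */
static size_t
resume_split(const Tuple *old_tuples, size_t nold,
             Tuple *new_tuples, size_t nnew, size_t capacity,
             uint32_t nbucket, uint32_t maxbucket,
             uint32_t highmask, uint32_t lowmask)
{
    size_t already_moved = nnew;    /* tuples present before resuming */
    size_t i;

    for (i = 0; i < nold; i++)
    {
        if (tid_already_moved(new_tuples, already_moved, old_tuples[i].tid))
            continue;
        if (hashkey_to_bucket(old_tuples[i].hashkey, maxbucket,
                              highmask, lowmask) != nbucket)
            continue;
        assert(nnew < capacity);
        new_tuples[nnew++] = old_tuples[i];
    }
    return nnew;
}
```

Because the set of already-moved TIDs is built before any copying starts, running the sketch again after another interruption copies only what is still missing.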
/*-------------------------------------------------------------------------
*
* hashscan.c
* manage scans on hash tables
*
* Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
*
* IDENTIFICATION
* src/backend/access/hash/hashscan.c
*
*-------------------------------------------------------------------------
*/
#include "postgres.h"
#include "access/hash.h"
#include "access/relscan.h"
#include "utils/memutils.h"
#include "utils/rel.h"
#include "utils/resowner.h"
/*
* We track all of a backend's active scans on hash indexes using a list
* of HashScanListData structs, which are allocated in TopMemoryContext.
* It's okay to use a long-lived context because we rely on the ResourceOwner
* mechanism to clean up unused entries after transaction or subtransaction
* abort. We can't safely keep the entries in the executor's per-query
* context, because that might be already freed before we get a chance to
* clean up the list. (XXX seems like there should be a better way to
* manage this...)
*/
typedef struct HashScanListData
{
IndexScanDesc hashsl_scan;
ResourceOwner hashsl_owner;
struct HashScanListData *hashsl_next;
} HashScanListData;
typedef HashScanListData *HashScanList;
static HashScanList HashScans = NULL;
/*
* ReleaseResources_hash() --- clean up hash subsystem resources.
*
* This is here because it needs to touch this module's static var HashScans.
*/
void
ReleaseResources_hash(void)
{
HashScanList l;
HashScanList prev;
HashScanList next;
/*
* Release all HashScanList items belonging to the current ResourceOwner.
* Note that we do not release the underlying IndexScanDesc; that's in
* executor memory and will go away on its own (in fact quite possibly has
* gone away already, so we mustn't try to touch it here).
*
* Note: this should be a no-op during normal query shutdown. However, in
* an abort situation ExecutorEnd is not called and so there may be open
* index scans to clean up.
*/
prev = NULL;
for (l = HashScans; l != NULL; l = next)
{
next = l->hashsl_next;
if (l->hashsl_owner == CurrentResourceOwner)
{
if (prev == NULL)
HashScans = next;
else
prev->hashsl_next = next;
pfree(l);
/* prev does not change */
}
else
prev = l;
}
}
/*
* _hash_regscan() -- register a new scan.
*/
void
_hash_regscan(IndexScanDesc scan)
{
HashScanList new_el;
new_el = (HashScanList) MemoryContextAlloc(TopMemoryContext,
sizeof(HashScanListData));
new_el->hashsl_scan = scan;
new_el->hashsl_owner = CurrentResourceOwner;
new_el->hashsl_next = HashScans;
HashScans = new_el;
}
/*
* _hash_dropscan() -- drop a scan from the scan list
*/
void
_hash_dropscan(IndexScanDesc scan)
{
HashScanList chk,
last;
last = NULL;
for (chk = HashScans;
chk != NULL && chk->hashsl_scan != scan;
chk = chk->hashsl_next)
last = chk;
if (chk == NULL)
elog(ERROR, "hash scan list trashed; cannot find 0x%p", (void *) scan);
if (last == NULL)
HashScans = chk->hashsl_next;
else
last->hashsl_next = chk->hashsl_next;
pfree(chk);
}
/*
* Is there an active scan in this bucket?
*/
bool
_hash_has_active_scan(Relation rel, Bucket bucket)
{
Oid relid = RelationGetRelid(rel);
HashScanList l;
for (l = HashScans; l != NULL; l = l->hashsl_next)
{
if (relid == l->hashsl_scan->indexRelation->rd_id)
{
HashScanOpaque so = (HashScanOpaque) l->hashsl_scan->opaque;
if (so->hashso_bucket_valid &&
so->hashso_bucket == bucket)
return true;
}
}
return false;
}
......@@ -63,38 +63,94 @@ _hash_next(IndexScanDesc scan, ScanDirection dir)
}
/*
* Advance to next page in a bucket, if any.
* Advance to next page in a bucket, if any. If we are scanning the bucket
* being populated during split operation then this function advances to the
* bucket being split after the last bucket page of bucket being populated.
*/
static void
_hash_readnext(Relation rel,
_hash_readnext(IndexScanDesc scan,
Buffer *bufp, Page *pagep, HashPageOpaque *opaquep)
{
BlockNumber blkno;
Relation rel = scan->indexRelation;
HashScanOpaque so = (HashScanOpaque) scan->opaque;
bool block_found = false;
blkno = (*opaquep)->hasho_nextblkno;
_hash_relbuf(rel, *bufp);
/*
* Retain the pin on primary bucket page till the end of scan. Refer to the
* comments in _hash_first for the reason the pin is retained.
*/
if (*bufp == so->hashso_bucket_buf || *bufp == so->hashso_split_bucket_buf)
_hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
else
_hash_relbuf(rel, *bufp);
*bufp = InvalidBuffer;
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
if (BlockNumberIsValid(blkno))
{
*bufp = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
block_found = true;
}
else if (so->hashso_buc_populated && !so->hashso_buc_split)
{
/*
* end of bucket, scan bucket being split if there was a split in
* progress at the start of scan.
*/
*bufp = so->hashso_split_bucket_buf;
/*
* buffer for bucket being split must be valid as we acquire the pin
* on it before the start of scan and retain it till end of scan.
*/
Assert(BufferIsValid(*bufp));
_hash_chgbufaccess(rel, *bufp, HASH_NOLOCK, HASH_READ);
/*
* setting hashso_buc_split to true indicates that we are scanning
* bucket being split.
*/
so->hashso_buc_split = true;
block_found = true;
}
if (block_found)
{
*pagep = BufferGetPage(*bufp);
*opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
}
}
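The scan order described above (bucket being populated first, then the bucket being split) can be modeled without any buffer management. This is a simplified sketch under stated assumptions: each bucket is a flat array, and the Tuple struct and count_matches_during_split are hypothetical names. It only illustrates why skipping moved-by-split tuples in the first phase avoids both duplicates and misses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INDEX_MOVED_BY_SPLIT_MASK 0x2000

/* Hypothetical stand-in for an index tuple: hash key plus t_info bits. */
typedef struct { uint32_t hashkey; uint16_t t_info; } Tuple;

/*
 * Count the tuples matching sk_hash for a scan that started while a split
 * was in progress.
 */
static size_t
count_matches_during_split(const Tuple *populated, size_t npop,
                           const Tuple *being_split, size_t nsplit,
                           uint32_t sk_hash)
{
    size_t matches = 0;
    size_t i;

    assert(populated != NULL || npop == 0);
    assert(being_split != NULL || nsplit == 0);

    /* Phase 1: bucket being populated; ignore copies made by the split. */
    for (i = 0; i < npop; i++)
    {
        if (populated[i].t_info & INDEX_MOVED_BY_SPLIT_MASK)
            continue;
        if (populated[i].hashkey == sk_hash)
            matches++;
    }

    /* Phase 2: bucket being split still holds the originals. */
    for (i = 0; i < nsplit; i++)
        if (being_split[i].hashkey == sk_hash)
            matches++;

    return matches;
}
```

A tuple that was copied appears once in each bucket, but only the original in the bucket being split is counted; a tuple inserted directly into the new bucket carries no moved-by-split bit and is counted in phase 1.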
/*
* Advance to previous page in a bucket, if any.
* Advance to previous page in a bucket, if any. If the current scan has
* started during split operation then this function advances to bucket
* being populated after the first bucket page of bucket being split.
*/
static void
_hash_readprev(Relation rel,
_hash_readprev(IndexScanDesc scan,
Buffer *bufp, Page *pagep, HashPageOpaque *opaquep)
{
BlockNumber blkno;
Relation rel = scan->indexRelation;
HashScanOpaque so = (HashScanOpaque) scan->opaque;
blkno = (*opaquep)->hasho_prevblkno;
_hash_relbuf(rel, *bufp);
/*
* Retain the pin on primary bucket page till the end of scan. Refer to the
* comments in _hash_first for the reason the pin is retained.
*/
if (*bufp == so->hashso_bucket_buf || *bufp == so->hashso_split_bucket_buf)
_hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
else
_hash_relbuf(rel, *bufp);
*bufp = InvalidBuffer;
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
......@@ -104,6 +160,41 @@ _hash_readprev(Relation rel,
LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
*pagep = BufferGetPage(*bufp);
*opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
/*
* We always maintain the pin on bucket page for whole scan operation,
* so releasing the additional pin we have acquired here.
*/
if (*bufp == so->hashso_bucket_buf || *bufp == so->hashso_split_bucket_buf)
_hash_dropbuf(rel, *bufp);
}
else if (so->hashso_buc_populated && so->hashso_buc_split)
{
/*
* end of bucket, scan bucket being populated if there was a split in
* progress at the start of scan.
*/
*bufp = so->hashso_bucket_buf;
/*
* buffer for bucket being populated must be valid as we acquire the
* pin on it before the start of scan and retain it till end of scan.
*/
Assert(BufferIsValid(*bufp));
_hash_chgbufaccess(rel, *bufp, HASH_NOLOCK, HASH_READ);
*pagep = BufferGetPage(*bufp);
*opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
/* move to the end of bucket chain */
while (BlockNumberIsValid((*opaquep)->hasho_nextblkno))
_hash_readnext(scan, bufp, pagep, opaquep);
/*
* setting hashso_buc_split to false indicates that we are scanning
* bucket being populated.
*/
so->hashso_buc_split = false;
}
}
......@@ -218,9 +309,11 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
{
if (oldblkno == blkno)
break;
_hash_droplock(rel, oldblkno, HASH_SHARE);
_hash_relbuf(rel, buf);
}
_hash_getlock(rel, blkno, HASH_SHARE);
/* Fetch the primary bucket page for the bucket */
buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
/*
* Reacquire metapage lock and check that no bucket split has taken
......@@ -234,22 +327,73 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
/* done with the metapage */
_hash_dropbuf(rel, metabuf);
/* Update scan opaque state to show we have lock on the bucket */
so->hashso_bucket = bucket;
so->hashso_bucket_valid = true;
so->hashso_bucket_blkno = blkno;
/* Fetch the primary bucket page for the bucket */
buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
page = BufferGetPage(buf);
opaque = (HashPageOpaque) PageGetSpecialPointer(page);
Assert(opaque->hasho_bucket == bucket);
so->hashso_bucket_buf = buf;
/*
* If the bucket split is in progress, then while scanning the bucket
* being populated, we need to skip tuples that are moved from bucket
* being split. We need to maintain the pin on bucket being split to
* ensure that split-cleanup work done by vacuum doesn't remove tuples
* from it till this scan is done. We need to maintain the pin on
* bucket being populated to ensure that vacuum doesn't squeeze that
* bucket till this scan is complete, otherwise the ordering of tuples
* can't be maintained during forward and backward scans. Here, we have
* to be cautious about locking order, first acquire the lock on bucket
* being split, release the lock on it, but not pin, then acquire the lock
* on bucket being populated and again re-verify whether the bucket split
* still is in progress. First acquiring lock on bucket being split
* ensures that the vacuum waits for this scan to finish.
*/
if (H_BUCKET_BEING_POPULATED(opaque))
{
BlockNumber old_blkno;
Buffer old_buf;
old_blkno = _hash_get_oldblock_from_newbucket(rel, bucket);
/*
* release the lock on new bucket and re-acquire it after acquiring
* the lock on old bucket.
*/
_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
old_buf = _hash_getbuf(rel, old_blkno, HASH_READ, LH_BUCKET_PAGE);
/*
* remember the split bucket buffer so as to use it later for
* scanning.
*/
so->hashso_split_bucket_buf = old_buf;
_hash_chgbufaccess(rel, old_buf, HASH_READ, HASH_NOLOCK);
_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
page = BufferGetPage(buf);
opaque = (HashPageOpaque) PageGetSpecialPointer(page);
Assert(opaque->hasho_bucket == bucket);
if (H_BUCKET_BEING_POPULATED(opaque))
so->hashso_buc_populated = true;
else
{
_hash_dropbuf(rel, so->hashso_split_bucket_buf);
so->hashso_split_bucket_buf = InvalidBuffer;
}
}
/* If a backwards scan is requested, move to the end of the chain */
if (ScanDirectionIsBackward(dir))
{
while (BlockNumberIsValid(opaque->hasho_nextblkno))
_hash_readnext(rel, &buf, &page, &opaque);
/*
* Backward scans that start during a split need to start from the end
* of the bucket being split.
*/
while (BlockNumberIsValid(opaque->hasho_nextblkno) ||
(so->hashso_buc_populated && !so->hashso_buc_split))
_hash_readnext(scan, &buf, &page, &opaque);
}
/* Now find the first tuple satisfying the qualification */
......@@ -273,6 +417,12 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
* false. Else, return true and set the hashso_curpos for the
* scan to the right thing.
*
* Here we need to ensure that if the scan has started during split, then
* skip the tuples that are moved by split while scanning bucket being
* populated and then scan the bucket being split to cover all such
* tuples. This is done to ensure that we don't miss tuples in the scans
* that are started during split.
*
* 'bufP' points to the current buffer, which is pinned and read-locked.
* On success exit, we have pin and read-lock on whichever page
* contains the right item; on failure, we have released all buffers.
......@@ -338,6 +488,19 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
{
Assert(offnum >= FirstOffsetNumber);
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
/*
* skip the tuples that are moved by split operation
* for the scan that has started when split was in
* progress
*/
if (so->hashso_buc_populated && !so->hashso_buc_split &&
(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
{
offnum = OffsetNumberNext(offnum); /* move forward */
continue;
}
if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
break; /* yes, so exit for-loop */
}
......@@ -345,7 +508,7 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
/*
* ran off the end of this page, try the next
*/
_hash_readnext(rel, &buf, &page, &opaque);
_hash_readnext(scan, &buf, &page, &opaque);
if (BufferIsValid(buf))
{
maxoff = PageGetMaxOffsetNumber(page);
......@@ -353,7 +516,6 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
}
else
{
/* end of bucket */
itup = NULL;
break; /* exit for-loop */
}
......@@ -379,6 +541,19 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
{
Assert(offnum <= maxoff);
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
/*
* skip the tuples that are moved by split operation
* for the scan that has started when split was in
* progress
*/
if (so->hashso_buc_populated && !so->hashso_buc_split &&
(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
{
offnum = OffsetNumberPrev(offnum); /* move back */
continue;
}
if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
break; /* yes, so exit for-loop */
}
......@@ -386,7 +561,7 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
/*
* ran off the end of this page, try the next
*/
_hash_readprev(rel, &buf, &page, &opaque);
_hash_readprev(scan, &buf, &page, &opaque);
if (BufferIsValid(buf))
{
maxoff = PageGetMaxOffsetNumber(page);
......@@ -394,7 +569,6 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
}
else
{
/* end of bucket */
itup = NULL;
break; /* exit for-loop */
}
......@@ -410,9 +584,16 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
if (itup == NULL)
{
/* we ran off the end of the bucket without finding a match */
/*
* We ran off the end of the bucket without finding a match.
* Release the pin on bucket buffers. Normally, such pins are
* released at end of scan, however scrolling cursors can
* reacquire the bucket lock and pin in the same scan multiple
* times.
*/
*bufP = so->hashso_curbuf = InvalidBuffer;
ItemPointerSetInvalid(current);
_hash_dropscanbuf(rel, so);
return false;
}
......
......@@ -20,6 +20,8 @@
#include "utils/lsyscache.h"
#include "utils/rel.h"
#define CALC_NEW_BUCKET(old_bucket, lowmask) \
((old_bucket) | ((lowmask) + 1))
/*
* _hash_checkqual -- does the index tuple satisfy the scan conditions?
......@@ -352,3 +354,95 @@ _hash_binsearch_last(Page page, uint32 hash_value)
return lower;
}
/*
* _hash_get_oldblock_from_newbucket() -- get the block number of a bucket
* from which current (new) bucket is being split.
*/
BlockNumber
_hash_get_oldblock_from_newbucket(Relation rel, Bucket new_bucket)
{
Bucket old_bucket;
uint32 mask;
Buffer metabuf;
HashMetaPage metap;
BlockNumber blkno;
/*
* To get the old bucket from the current bucket, we need a mask to modulo
* into lower half of table. This mask is stored in meta page as
* hashm_lowmask, but here we can't rely on the same, because we need a
* value of lowmask that was prevalent at the time when bucket split was
* started. Masking the most significant bit of new bucket would give us
* old bucket.
*/
mask = (((uint32) 1) << (fls(new_bucket) - 1)) - 1;
old_bucket = new_bucket & mask;
metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
metap = HashPageGetMeta(BufferGetPage(metabuf));
blkno = BUCKET_TO_BLKNO(metap, old_bucket);
_hash_relbuf(rel, metabuf);
return blkno;
}
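The mask arithmetic in _hash_get_oldblock_from_newbucket can be tried on its own. The sketch below assumes nothing about the metapage: it only shows that clearing the most significant set bit of the new bucket number yields the old bucket number. msb_position is a hypothetical portable replacement for fls(), which is not part of standard C, and old_bucket_from_new is an illustrative name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Position of the most significant set bit, counting from 1
 * (returns 0 for x == 0). Stand-in for fls().
 */
static int
msb_position(uint32_t x)
{
    int pos = 0;

    while (x != 0)
    {
        pos++;
        x >>= 1;
    }
    return pos;
}

/*
 * Clear the most significant set bit of new_bucket to recover the bucket
 * it was split from. Bucket 0 is never the product of a split.
 */
static uint32_t
old_bucket_from_new(uint32_t new_bucket)
{
    uint32_t mask;

    assert(new_bucket != 0);
    mask = (((uint32_t) 1) << (msb_position(new_bucket) - 1)) - 1;
    return new_bucket & mask;
}
```

For example, bucket 5 (binary 101) maps back to bucket 1, and bucket 6 (binary 110) maps back to bucket 2, independent of the lowmask currently stored in the metapage.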
/*
* _hash_get_newblock_from_oldbucket() -- get the block number of a bucket
* that will be generated after split from old bucket.
*
* This is used to find the new bucket from old bucket based on current table
* half. It is mainly required to finish the incomplete splits where we are
* sure that not more than one bucket could have split in progress from old
* bucket.
*/
BlockNumber
_hash_get_newblock_from_oldbucket(Relation rel, Bucket old_bucket)
{
Bucket new_bucket;
Buffer metabuf;
HashMetaPage metap;
BlockNumber blkno;
metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
metap = HashPageGetMeta(BufferGetPage(metabuf));
new_bucket = _hash_get_newbucket_from_oldbucket(rel, old_bucket,
metap->hashm_lowmask,
metap->hashm_maxbucket);
blkno = BUCKET_TO_BLKNO(metap, new_bucket);
_hash_relbuf(rel, metabuf);
return blkno;
}
/*
* _hash_get_newbucket_from_oldbucket() -- get the new bucket that will be
* generated after split from current (old) bucket.
*
* This is used to find the new bucket from old bucket. New bucket can be
* obtained by OR'ing old bucket with most significant bit of current table
* half (lowmask passed in this function can be used to identify msb of
* current table half). There could be multiple buckets that could have
* been split from current bucket. We need the first such bucket that exists.
* Caller must ensure that no more than one split has happened from old
* bucket.
*/
Bucket
_hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
uint32 lowmask, uint32 maxbucket)
{
Bucket new_bucket;
new_bucket = CALC_NEW_BUCKET(old_bucket, lowmask);
if (new_bucket > maxbucket)
{
lowmask = lowmask >> 1;
new_bucket = CALC_NEW_BUCKET(old_bucket, lowmask);
}
return new_bucket;
}
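The fallback in _hash_get_newbucket_from_oldbucket is easier to follow with concrete numbers. The sketch below restates the same computation as plain functions (calc_new_bucket and new_bucket_from_old are illustrative names, and the Relation argument is dropped since it is not needed for the arithmetic).

```c
#include <assert.h>
#include <stdint.h>

/* OR the old bucket with the most significant bit of the table half. */
static uint32_t
calc_new_bucket(uint32_t old_bucket, uint32_t lowmask)
{
    return old_bucket | (lowmask + 1);
}

/*
 * If the candidate in the current table half does not exist yet
 * (it is beyond maxbucket), fall back to the previous table half.
 */
static uint32_t
new_bucket_from_old(uint32_t old_bucket, uint32_t lowmask,
                    uint32_t maxbucket)
{
    uint32_t new_bucket;

    assert(old_bucket <= maxbucket);
    new_bucket = calc_new_bucket(old_bucket, lowmask);
    if (new_bucket > maxbucket)
    {
        lowmask >>= 1;
        new_bucket = calc_new_bucket(old_bucket, lowmask);
    }
    return new_bucket;
}
```

With lowmask = 3 and maxbucket = 5, old bucket 1 maps to new bucket 5. With maxbucket = 4, bucket 5 does not exist yet, so the previous table half is used and the result is bucket 3.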
......@@ -668,9 +668,6 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
PrintFileLeakWarning(res);
FileClose(res);
}
/* Clean up index scans too */
ReleaseResources_hash();
}
/* Let add-on modules get a chance too */
......
......@@ -24,6 +24,7 @@
#include "lib/stringinfo.h"
#include "storage/bufmgr.h"
#include "storage/lockdefs.h"
#include "utils/hsearch.h"
#include "utils/relcache.h"
/*
......@@ -32,6 +33,8 @@
*/
typedef uint32 Bucket;
#define InvalidBucket ((Bucket) 0xFFFFFFFF)
#define BUCKET_TO_BLKNO(metap,B) \
((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_log2((B)+1)-1] : 0)) + 1)
......@@ -51,6 +54,9 @@ typedef uint32 Bucket;
#define LH_BUCKET_PAGE (1 << 1)
#define LH_BITMAP_PAGE (1 << 2)
#define LH_META_PAGE (1 << 3)
#define LH_BUCKET_BEING_POPULATED (1 << 4)
#define LH_BUCKET_BEING_SPLIT (1 << 5)
#define LH_BUCKET_NEEDS_SPLIT_CLEANUP (1 << 6)
typedef struct HashPageOpaqueData
{
......@@ -63,6 +69,10 @@ typedef struct HashPageOpaqueData
typedef HashPageOpaqueData *HashPageOpaque;
#define H_NEEDS_SPLIT_CLEANUP(opaque) ((opaque)->hasho_flag & LH_BUCKET_NEEDS_SPLIT_CLEANUP)
#define H_BUCKET_BEING_SPLIT(opaque) ((opaque)->hasho_flag & LH_BUCKET_BEING_SPLIT)
#define H_BUCKET_BEING_POPULATED(opaque) ((opaque)->hasho_flag & LH_BUCKET_BEING_POPULATED)
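The three new page flags follow a simple lifecycle over the course of a split. The sketch below models only the flag words of the two primary bucket pages; begin_split and finish_split are hypothetical helpers, and the 16-bit flag width is an assumption about hasho_flag.

```c
#include <assert.h>
#include <stdint.h>

#define LH_BUCKET_PAGE                 (1 << 1)
#define LH_BUCKET_BEING_POPULATED      (1 << 4)
#define LH_BUCKET_BEING_SPLIT          (1 << 5)
#define LH_BUCKET_NEEDS_SPLIT_CLEANUP  (1 << 6)

/* Start of split: mark the old bucket and initialize the new one. */
static void
begin_split(uint16_t *old_flag, uint16_t *new_flag)
{
    *old_flag |= LH_BUCKET_BEING_SPLIT;
    *new_flag = LH_BUCKET_PAGE | LH_BUCKET_BEING_POPULATED;
}

/*
 * End of split: clear both in-progress flags and leave a note for
 * vacuum that the old bucket still holds the moved tuples.
 */
static void
finish_split(uint16_t *old_flag, uint16_t *new_flag)
{
    assert(*old_flag & LH_BUCKET_BEING_SPLIT);
    assert(*new_flag & LH_BUCKET_BEING_POPULATED);

    *old_flag &= (uint16_t) ~LH_BUCKET_BEING_SPLIT;
    *new_flag &= (uint16_t) ~LH_BUCKET_BEING_POPULATED;
    *old_flag |= LH_BUCKET_NEEDS_SPLIT_CLEANUP;
}
```

If a split is interrupted between the two steps, both in-progress flags remain set on disk, which is what lets a later insertion detect the incomplete split and finish it.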
/*
* The page ID is for the convenience of pg_filedump and similar utilities,
* which otherwise would have a hard time telling pages of different index
......@@ -79,19 +89,6 @@ typedef struct HashScanOpaqueData
/* Hash value of the scan key, ie, the hash key we seek */
uint32 hashso_sk_hash;
/*
* By definition, a hash scan should be examining only one bucket. We
* record the bucket number here as soon as it is known.
*/
Bucket hashso_bucket;
bool hashso_bucket_valid;
/*
* If we have a share lock on the bucket, we record it here. When
* hashso_bucket_blkno is zero, we have no such lock.
*/
BlockNumber hashso_bucket_blkno;
/*
* We also want to remember which buffer we're currently examining in the
* scan. We keep the buffer pinned (but not locked) across hashgettuple
......@@ -100,11 +97,30 @@ typedef struct HashScanOpaqueData
*/
Buffer hashso_curbuf;
/* remember the buffer associated with primary bucket */
Buffer hashso_bucket_buf;
/*
* remember the buffer associated with primary bucket page of bucket being
* split. it is required during the scan of the bucket which is being
* populated during split operation.
*/
Buffer hashso_split_bucket_buf;
/* Current position of the scan, as an index TID */
ItemPointerData hashso_curpos;
/* Current position of the scan, as a heap TID */
ItemPointerData hashso_heappos;
/* Whether scan starts on bucket being populated due to split */
bool hashso_buc_populated;
/*
* Whether scanning bucket being split? The value of this parameter is
* referred only when hashso_buc_populated is true.
*/
bool hashso_buc_split;
} HashScanOpaqueData;
typedef HashScanOpaqueData *HashScanOpaque;
......@@ -175,6 +191,8 @@ typedef HashMetaPageData *HashMetaPage;
sizeof(ItemIdData) - \
MAXALIGN(sizeof(HashPageOpaqueData)))
#define INDEX_MOVED_BY_SPLIT_MASK 0x2000
#define HASH_MIN_FILLFACTOR 10
#define HASH_DEFAULT_FILLFACTOR 75
......@@ -223,9 +241,6 @@ typedef HashMetaPageData *HashMetaPage;
#define HASH_WRITE BUFFER_LOCK_EXCLUSIVE
#define HASH_NOLOCK (-1)
#define HASH_SHARE ShareLock
#define HASH_EXCLUSIVE ExclusiveLock
/*
* Strategy number. There's only one valid strategy for hashing: equality.
*/
......@@ -297,21 +312,21 @@ extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
Size itemsize, IndexTuple itup);
/* hashovfl.c */
extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf);
extern BlockNumber _hash_freeovflpage(Relation rel, Buffer ovflbuf,
BufferAccessStrategy bstrategy);
extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin);
extern BlockNumber _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
bool wbuf_dirty, BufferAccessStrategy bstrategy);
extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
BlockNumber blkno, ForkNumber forkNum);
extern void _hash_squeezebucket(Relation rel,
Bucket bucket, BlockNumber bucket_blkno,
Buffer bucket_buf,
BufferAccessStrategy bstrategy);
/* hashpage.c */
extern void _hash_getlock(Relation rel, BlockNumber whichlock, int access);
extern bool _hash_try_getlock(Relation rel, BlockNumber whichlock, int access);
extern void _hash_droplock(Relation rel, BlockNumber whichlock, int access);
extern Buffer _hash_getbuf(Relation rel, BlockNumber blkno,
int access, int flags);
extern Buffer _hash_getbuf_with_condlock_cleanup(Relation rel,
BlockNumber blkno, int flags);
extern Buffer _hash_getinitbuf(Relation rel, BlockNumber blkno);
extern Buffer _hash_getnewbuf(Relation rel, BlockNumber blkno,
ForkNumber forkNum);
......@@ -320,6 +335,7 @@ extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
BufferAccessStrategy bstrategy);
extern void _hash_relbuf(Relation rel, Buffer buf);
extern void _hash_dropbuf(Relation rel, Buffer buf);
extern void _hash_dropscanbuf(Relation rel, HashScanOpaque so);
extern void _hash_wrtbuf(Relation rel, Buffer buf);
extern void _hash_chgbufaccess(Relation rel, Buffer buf, int from_access,
int to_access);
......@@ -327,12 +343,9 @@ extern uint32 _hash_metapinit(Relation rel, double num_tuples,
ForkNumber forkNum);
extern void _hash_pageinit(Page page, Size size);
extern void _hash_expandtable(Relation rel, Buffer metabuf);
/* hashscan.c */
extern void _hash_regscan(IndexScanDesc scan);
extern void _hash_dropscan(IndexScanDesc scan);
extern bool _hash_has_active_scan(Relation rel, Bucket bucket);
extern void ReleaseResources_hash(void);
extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
Bucket obucket, uint32 maxbucket, uint32 highmask,
uint32 lowmask);
/* hashsearch.c */
extern bool _hash_next(IndexScanDesc scan, ScanDirection dir);
......@@ -362,5 +375,18 @@ extern bool _hash_convert_tuple(Relation index,
Datum *index_values, bool *index_isnull);
extern OffsetNumber _hash_binsearch(Page page, uint32 hash_value);
extern OffsetNumber _hash_binsearch_last(Page page, uint32 hash_value);
extern BlockNumber _hash_get_oldblock_from_newbucket(Relation rel, Bucket new_bucket);
extern BlockNumber _hash_get_newblock_from_oldbucket(Relation rel, Bucket old_bucket);
extern Bucket _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
uint32 lowmask, uint32 maxbucket);
/* hash.c */
extern void hashbucketcleanup(Relation rel, Bucket cur_bucket,
Buffer bucket_buf, BlockNumber bucket_blkno,
BufferAccessStrategy bstrategy,
uint32 maxbucket, uint32 highmask, uint32 lowmask,
double *tuples_removed, double *num_index_tuples,
bool bucket_has_garbage,
IndexBulkDeleteCallback callback, void *callback_state);
#endif /* HASH_H */
......@@ -63,7 +63,7 @@ typedef IndexAttributeBitMapData *IndexAttributeBitMap;
* t_info manipulation macros
*/
#define INDEX_SIZE_MASK 0x1FFF
/* bit 0x2000 is not used at present */
/* bit 0x2000 is reserved for index-AM specific usage */
#define INDEX_VAR_MASK 0x4000
#define INDEX_NULL_MASK 0x8000
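The bit layout of t_info shown above leaves 0x2000 free for the hash AM's moved-by-split marker. As a small sketch (the helper names tuple_size, is_moved_by_split and mark_moved_by_split are hypothetical), setting that bit leaves the size and the other flag bits unchanged:

```c
#include <assert.h>
#include <stdint.h>

#define INDEX_SIZE_MASK            0x1FFF
#define INDEX_MOVED_BY_SPLIT_MASK  0x2000  /* hash AM's use of the AM-specific bit */
#define INDEX_VAR_MASK             0x4000
#define INDEX_NULL_MASK            0x8000

/* Tuple size lives in the low 13 bits. */
static uint16_t
tuple_size(uint16_t t_info)
{
    return (uint16_t) (t_info & INDEX_SIZE_MASK);
}

static int
is_moved_by_split(uint16_t t_info)
{
    return (t_info & INDEX_MOVED_BY_SPLIT_MASK) != 0;
}

/* Set the marker; the stored size must be unaffected. */
static uint16_t
mark_moved_by_split(uint16_t t_info)
{
    uint16_t marked = (uint16_t) (t_info | INDEX_MOVED_BY_SPLIT_MASK);

    assert(tuple_size(marked) == tuple_size(t_info));
    return marked;
}
```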
......