Commit c11453ce authored by Robert Haas

hash: Add write-ahead logging support.

The warning about hash indexes not being write-ahead logged and their
use being discouraged has been removed.  "snapshot too old" is now
supported for tables with hash indexes.  Most importantly, barring
bugs, hash indexes will now be crash-safe and usable on standbys.

This commit doesn't yet add WAL consistency checking for hash
indexes, as we now have for other index types; a separate patch has
been submitted to cure that lack.

Amit Kapila, reviewed and slightly modified by me.  The larger patch
series of which this is a part has been reviewed and tested by Álvaro
Herrera, Ashutosh Sharma, Mark Kirkwood, Jeff Janes, and Jesper
Pedersen.

Discussion: http://postgr.es/m/CAA4eK1JOBX=YU33631Qh-XivYXtPSALh514+jR8XeD7v+K3r_Q@mail.gmail.com
parent 2b32ac2a
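The hunks below repeat one pattern at every modification site: change the page inside a critical section, mark it dirty, emit a WAL record, and stamp the page with that record's LSN. A toy model of that pattern, and of why the LSN stamp lets replay skip changes that already reached the page, might look like the sketch below. This is not PostgreSQL code: `ToyPage`, `ToyWal`, `toy_insert`, and `toy_redo` are hypothetical names, and the "log" is just an in-memory array.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical toy types; none of these exist in PostgreSQL. */
#define TOY_MAX_ITEMS   8
#define TOY_MAX_RECORDS 16

typedef struct ToyPage
{
    uint64_t lsn;                   /* LSN of the last record applied */
    int      nitems;
    int      items[TOY_MAX_ITEMS];
} ToyPage;

typedef struct ToyRecord
{
    uint64_t lsn;                   /* position of this record in the log */
    int      value;                 /* the item that was inserted */
} ToyRecord;

typedef struct ToyWal
{
    int       nrecords;
    ToyRecord records[TOY_MAX_RECORDS];
} ToyWal;

/* Append a record and return its LSN (here simply 1, 2, 3, ...). */
static uint64_t
toy_log_insert(ToyWal *wal, int value)
{
    ToyRecord *rec;

    assert(wal->nrecords < TOY_MAX_RECORDS);
    rec = &wal->records[wal->nrecords++];
    rec->lsn = (uint64_t) wal->nrecords;
    rec->value = value;
    return rec->lsn;
}

/* Change the page, log the change, then stamp the page with the LSN. */
static void
toy_insert(ToyPage *page, ToyWal *wal, int value)
{
    assert(page->nitems < TOY_MAX_ITEMS);
    page->items[page->nitems++] = value;
    page->lsn = toy_log_insert(wal, value);
}

/*
 * Replay the log onto a page, skipping records whose effects the page
 * already contains.  Returns the number of records applied.
 */
static int
toy_redo(ToyPage *page, const ToyWal *wal)
{
    int applied = 0;

    for (int i = 0; i < wal->nrecords; i++)
    {
        const ToyRecord *rec = &wal->records[i];

        if (rec->lsn <= page->lsn)
            continue;               /* change is already on the page */
        assert(page->nitems < TOY_MAX_ITEMS);
        page->items[page->nitems++] = rec->value;
        page->lsn = rec->lsn;
        applied++;
    }
    return applied;
}
```

Replaying onto a page that was never written out applies every record, while replaying onto an up-to-date page applies none. The real code additionally deals with full-page images, buffer locks, and multi-page records, which this sketch leaves out.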
CREATE TABLE test_hash (a int, b text);
INSERT INTO test_hash VALUES (1, 'one');
CREATE INDEX test_hash_a_idx ON test_hash USING hash (a);
WARNING: hash indexes are not WAL-logged and their use is discouraged
\x
SELECT hash_page_type(get_raw_page('test_hash_a_idx', 0));
-[ RECORD 1 ]--+---------
......
......@@ -131,7 +131,6 @@ select * from pgstatginindex('test_ginidx');
(1 row)
create index test_hashidx on test using hash (b);
WARNING: hash indexes are not WAL-logged and their use is discouraged
select * from pgstathashindex('test_hashidx');
version | bucket_pages | overflow_pages | bitmap_pages | zero_pages | live_items | dead_items | free_percent
---------+--------------+----------------+--------------+------------+------------+------------+--------------
......@@ -226,7 +225,6 @@ ERROR: "test_partition" is not an index
-- an actual index of a partitioned table should work though
create index test_partition_idx on test_partition(a);
create index test_partition_hash_idx on test_partition using hash (a);
WARNING: hash indexes are not WAL-logged and their use is discouraged
-- these should work
select pgstatindex('test_partition_idx');
pgstatindex
......
......@@ -1536,19 +1536,6 @@ archive_command = 'local_backup_script.sh "%p" "%f"'
technique. These will probably be fixed in future releases:
<itemizedlist>
<listitem>
<para>
Operations on hash indexes are not presently WAL-logged, so
replay will not update these indexes. This will mean that any new inserts
will be ignored by the index, updated rows will apparently disappear and
deleted rows will still retain pointers. In other words, if you modify a
table with a hash index on it then you will get incorrect query results
on a standby server. When recovery completes it is recommended that you
manually <xref linkend="sql-reindex">
each such index after completing a recovery operation.
</para>
</listitem>
<listitem>
<para>
If a <xref linkend="sql-createdatabase">
......
......@@ -2153,10 +2153,9 @@ include_dir 'conf.d'
has materialized a result set, no error will be generated even if the
underlying rows in the referenced table have been vacuumed away.
Some tables cannot safely be vacuumed early, and so will not be
affected by this setting. Examples include system catalogs and any
table which has a hash index. For such tables this setting will
neither reduce bloat nor create a possibility of a <literal>snapshot
too old</> error on scanning.
affected by this setting, such as system catalogs. For such tables
this setting will neither reduce bloat nor create a possibility
of a <literal>snapshot too old</> error on scanning.
</para>
</listitem>
</varlistentry>
......
......@@ -2351,12 +2351,6 @@ LOG: database system is ready to accept read only connections
These can and probably will be fixed in future releases:
<itemizedlist>
<listitem>
<para>
Operations on hash indexes are not presently WAL-logged, so
replay will not update these indexes.
</para>
</listitem>
<listitem>
<para>
Full knowledge of running transactions is required before snapshots
......
......@@ -193,18 +193,6 @@ CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable>
</synopsis>
</para>
<caution>
<para>
Hash index operations are not presently WAL-logged,
so hash indexes might need to be rebuilt with <command>REINDEX</>
after a database crash if there were unwritten changes.
Also, changes to hash indexes are not replicated over streaming or
file-based replication after the initial base backup, so they
give wrong answers to queries that subsequently use them.
For these reasons, hash index use is presently discouraged.
</para>
</caution>
<para>
<indexterm>
<primary>index</primary>
......
......@@ -510,19 +510,6 @@ Indexes:
they can be useful.
</para>
<caution>
<para>
Hash index operations are not presently WAL-logged,
so hash indexes might need to be rebuilt with <command>REINDEX</>
after a database crash if there were unwritten changes.
Also, changes to hash indexes are not replicated over streaming or
file-based replication after the initial base backup, so they
give wrong answers to queries that subsequently use them.
Hash indexes are also not properly restored during point-in-time
recovery. For these reasons, hash index use is presently discouraged.
</para>
</caution>
<para>
Currently, only the B-tree, GiST, GIN, and BRIN index methods support
multicolumn indexes. Up to 32 fields can be specified by default.
......
......@@ -13,6 +13,6 @@ top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = hash.o hashfunc.o hashinsert.o hashovfl.o hashpage.o hashsearch.o \
hashsort.o hashutil.o hashvalidate.o
hashsort.o hashutil.o hashvalidate.o hash_xlog.o
include $(top_srcdir)/src/backend/common.mk
......@@ -28,6 +28,7 @@
#include "utils/builtins.h"
#include "utils/index_selfuncs.h"
#include "utils/rel.h"
#include "miscadmin.h"
/* Working state for hashbuild and its callback */
......@@ -303,6 +304,11 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
buf = so->hashso_curbuf;
Assert(BufferIsValid(buf));
page = BufferGetPage(buf);
/*
* We don't need to test for an old snapshot here because the current
* buffer is pinned, so vacuum can't clean the page.
*/
maxoffnum = PageGetMaxOffsetNumber(page);
for (offnum = ItemPointerGetOffsetNumber(current);
offnum <= maxoffnum;
......@@ -623,6 +629,7 @@ loop_top:
}
/* Okay, we're really done. Update tuple count in metapage. */
START_CRIT_SECTION();
if (orig_maxbucket == metap->hashm_maxbucket &&
orig_ntuples == metap->hashm_ntuples)
......@@ -649,6 +656,26 @@ loop_top:
}
MarkBufferDirty(metabuf);
/* XLOG stuff */
if (RelationNeedsWAL(rel))
{
xl_hash_update_meta_page xlrec;
XLogRecPtr recptr;
xlrec.ntuples = metap->hashm_ntuples;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHashUpdateMetaPage);
XLogRegisterBuffer(0, metabuf, REGBUF_STANDARD);
recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_UPDATE_META_PAGE);
PageSetLSN(BufferGetPage(metabuf), recptr);
}
END_CRIT_SECTION();
_hash_relbuf(rel, metabuf);
/* return statistics */
......@@ -816,9 +843,40 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
*/
if (ndeletable > 0)
{
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
PageIndexMultiDelete(page, deletable, ndeletable);
bucket_dirty = true;
MarkBufferDirty(buf);
/* XLOG stuff */
if (RelationNeedsWAL(rel))
{
xl_hash_delete xlrec;
XLogRecPtr recptr;
xlrec.is_primary_bucket_page = (buf == bucket_buf) ? true : false;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHashDelete);
/*
* bucket buffer needs to be registered to ensure that we can
* acquire a cleanup lock on it during replay.
*/
if (!xlrec.is_primary_bucket_page)
XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD | REGBUF_NO_IMAGE);
XLogRegisterBuffer(1, buf, REGBUF_STANDARD);
XLogRegisterBufData(1, (char *) deletable,
ndeletable * sizeof(OffsetNumber));
recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_DELETE);
PageSetLSN(BufferGetPage(buf), recptr);
}
END_CRIT_SECTION();
}
/* bail out if there are no more pages to scan. */
......@@ -866,8 +924,25 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
page = BufferGetPage(bucket_buf);
bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
bucket_opaque->hasho_flag &= ~LH_BUCKET_NEEDS_SPLIT_CLEANUP;
MarkBufferDirty(bucket_buf);
/* XLOG stuff */
if (RelationNeedsWAL(rel))
{
XLogRecPtr recptr;
XLogBeginInsert();
XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_CLEANUP);
PageSetLSN(page, recptr);
}
END_CRIT_SECTION();
}
/*
......@@ -881,9 +956,3 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
else
LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK);
}
void
hash_redo(XLogReaderState *record)
{
elog(PANIC, "hash_redo: unimplemented");
}
......@@ -16,6 +16,8 @@
#include "postgres.h"
#include "access/hash.h"
#include "access/hash_xlog.h"
#include "miscadmin.h"
#include "utils/rel.h"
......@@ -40,6 +42,7 @@ _hash_doinsert(Relation rel, IndexTuple itup)
bool do_expand;
uint32 hashkey;
Bucket bucket;
OffsetNumber itup_off;
/*
* Get the hash key for the item (it's stored in the index tuple itself).
......@@ -158,25 +161,20 @@ restart_insert:
Assert(pageopaque->hasho_bucket == bucket);
}
/* found page with enough space, so add the item here */
(void) _hash_pgaddtup(rel, buf, itemsz, itup);
/*
* dirty and release the modified page. if the page we modified was an
* overflow page, we also need to separately drop the pin we retained on
* the primary bucket page.
*/
MarkBufferDirty(buf);
_hash_relbuf(rel, buf);
if (buf != bucket_buf)
_hash_dropbuf(rel, bucket_buf);
/*
* Write-lock the metapage so we can increment the tuple count. After
* incrementing it, check to see if it's time for a split.
*/
LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
/* Do the update. No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
/* found page with enough space, so add the item here */
itup_off = _hash_pgaddtup(rel, buf, itemsz, itup);
MarkBufferDirty(buf);
/* metapage operations */
metap = HashPageGetMeta(metapage);
metap->hashm_ntuples += 1;
......@@ -184,10 +182,43 @@ restart_insert:
do_expand = metap->hashm_ntuples >
(double) metap->hashm_ffactor * (metap->hashm_maxbucket + 1);
/* Write out the metapage and drop lock, but keep pin */
MarkBufferDirty(metabuf);
/* XLOG stuff */
if (RelationNeedsWAL(rel))
{
xl_hash_insert xlrec;
XLogRecPtr recptr;
xlrec.offnum = itup_off;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHashInsert);
XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
XLogRegisterBufData(0, (char *) itup, IndexTupleDSize(*itup));
recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INSERT);
PageSetLSN(BufferGetPage(buf), recptr);
PageSetLSN(BufferGetPage(metabuf), recptr);
}
END_CRIT_SECTION();
/* drop lock on metapage, but keep pin */
LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
/*
* Release the modified page, and make sure to also release the pin on the
* primary bucket page.
*/
_hash_relbuf(rel, buf);
if (buf != bucket_buf)
_hash_dropbuf(rel, bucket_buf);
/* Attempt to split if a split is needed */
if (do_expand)
_hash_expandtable(rel, metabuf);
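The `do_expand` test in the hunk above compares the tuple count against the fill factor multiplied by the number of buckets. A minimal standalone sketch of that comparison follows; `toy_needs_split` is a hypothetical helper, its parameter types are assumptions chosen to resemble the metapage fields, and unlike the original expression it widens the bucket count to `double` before adding one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the do_expand test: a split is wanted
 * once the index holds more tuples than ffactor per existing bucket.
 * maxbucket is the highest bucket number, so maxbucket + 1 buckets exist.
 */
static bool
toy_needs_split(double ntuples, uint16_t ffactor, uint32_t maxbucket)
{
    /* Assumed sanity check: a zero fill factor would split on every insert. */
    assert(ffactor > 0);
    return ntuples > (double) ffactor * ((double) maxbucket + 1.0);
}
```

With a fill factor of 75 and four buckets (`maxbucket == 3`), the threshold is 300 tuples: the 301st tuple makes the function return true, while exactly 300 does not.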
......
......@@ -18,6 +18,8 @@
#include "postgres.h"
#include "access/hash.h"
#include "access/hash_xlog.h"
#include "miscadmin.h"
#include "utils/rel.h"
......@@ -136,6 +138,13 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
* page is released, then finally acquire the lock on new overflow buffer.
* We need this locking order to avoid deadlock with backends that are
* doing inserts.
*
* Note: We could have avoided locking many buffers here if we made two
* WAL records for acquiring an overflow page (one to allocate an overflow
* page and another to add it to overflow bucket chain). However, doing
* so can leak an overflow page, if the system crashes after allocation.
* Needless to say, it is better to have a single record from a
* performance point of view as well.
*/
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
......@@ -303,8 +312,12 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
found:
/*
* Do the update.
* Do the update. No ereport(ERROR) until changes are logged. We want to
* log the changes for bitmap page and overflow page together to avoid
* loss of pages in case the new page is added.
*/
START_CRIT_SECTION();
if (page_found)
{
Assert(BufferIsValid(mapbuf));
......@@ -362,6 +375,51 @@ found:
MarkBufferDirty(buf);
/* XLOG stuff */
if (RelationNeedsWAL(rel))
{
XLogRecPtr recptr;
xl_hash_add_ovfl_page xlrec;
xlrec.bmpage_found = page_found;
xlrec.bmsize = metap->hashm_bmsize;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHashAddOvflPage);
XLogRegisterBuffer(0, ovflbuf, REGBUF_WILL_INIT);
XLogRegisterBufData(0, (char *) &pageopaque->hasho_bucket, sizeof(Bucket));
XLogRegisterBuffer(1, buf, REGBUF_STANDARD);
if (BufferIsValid(mapbuf))
{
XLogRegisterBuffer(2, mapbuf, REGBUF_STANDARD);
XLogRegisterBufData(2, (char *) &bitmap_page_bit, sizeof(uint32));
}
if (BufferIsValid(newmapbuf))
XLogRegisterBuffer(3, newmapbuf, REGBUF_WILL_INIT);
XLogRegisterBuffer(4, metabuf, REGBUF_STANDARD);
XLogRegisterBufData(4, (char *) &metap->hashm_firstfree, sizeof(uint32));
recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_ADD_OVFL_PAGE);
PageSetLSN(BufferGetPage(ovflbuf), recptr);
PageSetLSN(BufferGetPage(buf), recptr);
if (BufferIsValid(mapbuf))
PageSetLSN(BufferGetPage(mapbuf), recptr);
if (BufferIsValid(newmapbuf))
PageSetLSN(BufferGetPage(newmapbuf), recptr);
PageSetLSN(BufferGetPage(metabuf), recptr);
}
END_CRIT_SECTION();
if (retain_pin)
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
else
......@@ -408,7 +466,11 @@ _hash_firstfreebit(uint32 map)
* Remove this overflow page from its bucket's chain, and mark the page as
* free. On entry, ovflbuf is write-locked; it is released before exiting.
*
* Add the tuples (itups) to wbuf.
* Add the tuples (itups) to wbuf in this function. We could do that in the
* caller as well, but the advantage of doing it here is we can easily write
* the WAL for XLOG_HASH_SQUEEZE_PAGE operation. Addition of tuples and
* removal of overflow page has to be done as an atomic operation; otherwise,
* during replay on a standby, users might find duplicate records.
*
* Since this function is invoked in VACUUM, we provide an access strategy
* parameter that controls fetches of the bucket pages.
......@@ -430,8 +492,6 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
HashMetaPage metap;
Buffer metabuf;
Buffer mapbuf;
Buffer prevbuf = InvalidBuffer;
Buffer nextbuf = InvalidBuffer;
BlockNumber ovflblkno;
BlockNumber prevblkno;
BlockNumber blkno;
......@@ -445,6 +505,9 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
int32 bitmappage,
bitmapbit;
Bucket bucket PG_USED_FOR_ASSERTS_ONLY;
Buffer prevbuf = InvalidBuffer;
Buffer nextbuf = InvalidBuffer;
bool update_metap = false;
/* Get information from the doomed page */
_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
......@@ -508,6 +571,12 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
/* Get write-lock on metapage to update firstfree */
LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
/* This operation needs to log multiple tuples, prepare WAL for that */
if (RelationNeedsWAL(rel))
XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
START_CRIT_SECTION();
/*
* we have to insert tuples on the "write" page, being careful to preserve
* hashkey ordering. (If we insert many tuples into the same "write" page
......@@ -519,7 +588,11 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
MarkBufferDirty(wbuf);
}
/* Initialize the freed overflow page. */
/*
* Initialize the freed overflow page. Just zeroing the page won't work,
* because WAL replay routines expect pages to be initialized. See
* explanation of RBM_NORMAL mode atop XLogReadBufferExtended.
*/
_hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf));
MarkBufferDirty(ovflbuf);
......@@ -550,9 +623,83 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
if (ovflbitno < metap->hashm_firstfree)
{
metap->hashm_firstfree = ovflbitno;
update_metap = true;
MarkBufferDirty(metabuf);
}
/* XLOG stuff */
if (RelationNeedsWAL(rel))
{
xl_hash_squeeze_page xlrec;
XLogRecPtr recptr;
int i;
xlrec.prevblkno = prevblkno;
xlrec.nextblkno = nextblkno;
xlrec.ntups = nitups;
xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf);
xlrec.is_prev_bucket_same_wrt = (wbuf == prevbuf);
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHashSqueezePage);
/*
* bucket buffer needs to be registered to ensure that we can acquire
* a cleanup lock on it during replay.
*/
if (!xlrec.is_prim_bucket_same_wrt)
XLogRegisterBuffer(0, bucketbuf, REGBUF_STANDARD | REGBUF_NO_IMAGE);
XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
if (xlrec.ntups > 0)
{
XLogRegisterBufData(1, (char *) itup_offsets,
nitups * sizeof(OffsetNumber));
for (i = 0; i < nitups; i++)
XLogRegisterBufData(1, (char *) itups[i], tups_size[i]);
}
XLogRegisterBuffer(2, ovflbuf, REGBUF_STANDARD);
/*
* If prevpage and the writepage (block in which we are moving tuples
* from overflow) are same, then no need to separately register
* prevpage. During replay, we can directly update the nextblock in
* writepage.
*/
if (BufferIsValid(prevbuf) && !xlrec.is_prev_bucket_same_wrt)
XLogRegisterBuffer(3, prevbuf, REGBUF_STANDARD);
if (BufferIsValid(nextbuf))
XLogRegisterBuffer(4, nextbuf, REGBUF_STANDARD);
XLogRegisterBuffer(5, mapbuf, REGBUF_STANDARD);
XLogRegisterBufData(5, (char *) &bitmapbit, sizeof(uint32));
if (update_metap)
{
XLogRegisterBuffer(6, metabuf, REGBUF_STANDARD);
XLogRegisterBufData(6, (char *) &metap->hashm_firstfree, sizeof(uint32));
}
recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SQUEEZE_PAGE);
PageSetLSN(BufferGetPage(wbuf), recptr);
PageSetLSN(BufferGetPage(ovflbuf), recptr);
if (BufferIsValid(prevbuf) && !xlrec.is_prev_bucket_same_wrt)
PageSetLSN(BufferGetPage(prevbuf), recptr);
if (BufferIsValid(nextbuf))
PageSetLSN(BufferGetPage(nextbuf), recptr);
PageSetLSN(BufferGetPage(mapbuf), recptr);
if (update_metap)
PageSetLSN(BufferGetPage(metabuf), recptr);
}
END_CRIT_SECTION();
/* release previous bucket if it is not same as write bucket */
if (BufferIsValid(prevbuf) && prevblkno != writeblkno)
_hash_relbuf(rel, prevbuf);
......@@ -601,7 +748,11 @@ _hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage)
freep = HashPageGetBitmap(pg);
MemSet(freep, 0xFF, bmsize);
/* Set pd_lower just past the end of the bitmap page data. */
/*
* Set pd_lower just past the end of the bitmap page data. We could even
* set pd_lower equal to pd_upper, but this is more precise and makes the
* page look compressible to xlog.c.
*/
((PageHeader) pg)->pd_lower = ((char *) freep + bmsize) - (char *) pg;
}
......@@ -760,6 +911,15 @@ readpage:
{
Assert(nitups == ndeletable);
/*
* This operation needs to log multiple tuples, prepare
* WAL for that.
*/
if (RelationNeedsWAL(rel))
XLogEnsureRecordSpace(0, 3 + nitups);
START_CRIT_SECTION();
/*
* we have to insert tuples on the "write" page, being
* careful to preserve hashkey ordering. (If we insert
......@@ -773,6 +933,43 @@ readpage:
PageIndexMultiDelete(rpage, deletable, ndeletable);
MarkBufferDirty(rbuf);
/* XLOG stuff */
if (RelationNeedsWAL(rel))
{
XLogRecPtr recptr;
xl_hash_move_page_contents xlrec;
xlrec.ntups = nitups;
xlrec.is_prim_bucket_same_wrt = (wbuf == bucket_buf) ? true : false;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHashMovePageContents);
/*
* bucket buffer needs to be registered to ensure that
* we can acquire a cleanup lock on it during replay.
*/
if (!xlrec.is_prim_bucket_same_wrt)
XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD | REGBUF_NO_IMAGE);
XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
XLogRegisterBufData(1, (char *) itup_offsets,
nitups * sizeof(OffsetNumber));
for (i = 0; i < nitups; i++)
XLogRegisterBufData(1, (char *) itups[i], tups_size[i]);
XLogRegisterBuffer(2, rbuf, REGBUF_STANDARD);
XLogRegisterBufData(2, (char *) deletable,
ndeletable * sizeof(OffsetNumber));
recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_MOVE_PAGE_CONTENTS);
PageSetLSN(BufferGetPage(wbuf), recptr);
PageSetLSN(BufferGetPage(rbuf), recptr);
}
END_CRIT_SECTION();
tups_moved = true;
}
......
......@@ -123,6 +123,7 @@ _hash_readnext(IndexScanDesc scan,
if (block_found)
{
*pagep = BufferGetPage(*bufp);
TestForOldSnapshot(scan->xs_snapshot, rel, *pagep);
*opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
}
}
......@@ -168,6 +169,7 @@ _hash_readprev(IndexScanDesc scan,
*bufp = _hash_getbuf(rel, blkno, HASH_READ,
LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
*pagep = BufferGetPage(*bufp);
TestForOldSnapshot(scan->xs_snapshot, rel, *pagep);
*opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
/*
......@@ -283,6 +285,7 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
buf = _hash_getbucketbuf_from_hashkey(rel, hashkey, HASH_READ, NULL);
page = BufferGetPage(buf);
TestForOldSnapshot(scan->xs_snapshot, rel, page);
opaque = (HashPageOpaque) PageGetSpecialPointer(page);
bucket = opaque->hasho_bucket;
......@@ -318,6 +321,7 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
old_buf = _hash_getbuf(rel, old_blkno, HASH_READ, LH_BUCKET_PAGE);
TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(old_buf));
/*
* remember the split bucket buffer so as to use it later for
......@@ -520,6 +524,7 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
_hash_readprev(scan, &buf, &page, &opaque);
if (BufferIsValid(buf))
{
TestForOldSnapshot(scan->xs_snapshot, rel, page);
maxoff = PageGetMaxOffsetNumber(page);
offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
}
......
......@@ -19,10 +19,142 @@
void
hash_desc(StringInfo buf, XLogReaderState *record)
{
char *rec = XLogRecGetData(record);
uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
switch (info)
{
case XLOG_HASH_INIT_META_PAGE:
{
xl_hash_init_meta_page *xlrec = (xl_hash_init_meta_page *) rec;
appendStringInfo(buf, "num_tuples %g, fillfactor %d",
xlrec->num_tuples, xlrec->ffactor);
break;
}
case XLOG_HASH_INIT_BITMAP_PAGE:
{
xl_hash_init_bitmap_page *xlrec = (xl_hash_init_bitmap_page *) rec;
appendStringInfo(buf, "bmsize %d", xlrec->bmsize);
break;
}
case XLOG_HASH_INSERT:
{
xl_hash_insert *xlrec = (xl_hash_insert *) rec;
appendStringInfo(buf, "off %u", xlrec->offnum);
break;
}
case XLOG_HASH_ADD_OVFL_PAGE:
{
xl_hash_add_ovfl_page *xlrec = (xl_hash_add_ovfl_page *) rec;
appendStringInfo(buf, "bmsize %d, bmpage_found %c",
xlrec->bmsize, (xlrec->bmpage_found) ? 'T' : 'F');
break;
}
case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
{
xl_hash_split_allocate_page *xlrec = (xl_hash_split_allocate_page *) rec;
appendStringInfo(buf, "new_bucket %u, meta_page_masks_updated %c, issplitpoint_changed %c",
xlrec->new_bucket,
(xlrec->flags & XLH_SPLIT_META_UPDATE_MASKS) ? 'T' : 'F',
(xlrec->flags & XLH_SPLIT_META_UPDATE_SPLITPOINT) ? 'T' : 'F');
break;
}
case XLOG_HASH_SPLIT_COMPLETE:
{
xl_hash_split_complete *xlrec = (xl_hash_split_complete *) rec;
appendStringInfo(buf, "old_bucket_flag %u, new_bucket_flag %u",
xlrec->old_bucket_flag, xlrec->new_bucket_flag);
break;
}
case XLOG_HASH_MOVE_PAGE_CONTENTS:
{
xl_hash_move_page_contents *xlrec = (xl_hash_move_page_contents *) rec;
appendStringInfo(buf, "ntups %d, is_primary %c",
xlrec->ntups,
xlrec->is_prim_bucket_same_wrt ? 'T' : 'F');
break;
}
case XLOG_HASH_SQUEEZE_PAGE:
{
xl_hash_squeeze_page *xlrec = (xl_hash_squeeze_page *) rec;
appendStringInfo(buf, "prevblkno %u, nextblkno %u, ntups %d, is_primary %c",
xlrec->prevblkno,
xlrec->nextblkno,
xlrec->ntups,
xlrec->is_prim_bucket_same_wrt ? 'T' : 'F');
break;
}
case XLOG_HASH_DELETE:
{
xl_hash_delete *xlrec = (xl_hash_delete *) rec;
appendStringInfo(buf, "is_primary %c",
xlrec->is_primary_bucket_page ? 'T' : 'F');
break;
}
case XLOG_HASH_UPDATE_META_PAGE:
{
xl_hash_update_meta_page *xlrec = (xl_hash_update_meta_page *) rec;
appendStringInfo(buf, "ntuples %g",
xlrec->ntuples);
break;
}
}
}
const char *
hash_identify(uint8 info)
{
return NULL;
const char *id = NULL;
switch (info & ~XLR_INFO_MASK)
{
case XLOG_HASH_INIT_META_PAGE:
id = "INIT_META_PAGE";
break;
case XLOG_HASH_INIT_BITMAP_PAGE:
id = "INIT_BITMAP_PAGE";
break;
case XLOG_HASH_INSERT:
id = "INSERT";
break;
case XLOG_HASH_ADD_OVFL_PAGE:
id = "ADD_OVFL_PAGE";
break;
case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
id = "SPLIT_ALLOCATE_PAGE";
break;
case XLOG_HASH_SPLIT_PAGE:
id = "SPLIT_PAGE";
break;
case XLOG_HASH_SPLIT_COMPLETE:
id = "SPLIT_COMPLETE";
break;
case XLOG_HASH_MOVE_PAGE_CONTENTS:
id = "MOVE_PAGE_CONTENTS";
break;
case XLOG_HASH_SQUEEZE_PAGE:
id = "SQUEEZE_PAGE";
break;
case XLOG_HASH_DELETE:
id = "DELETE";
break;
case XLOG_HASH_SPLIT_CLEANUP:
id = "SPLIT_CLEANUP";
break;
case XLOG_HASH_UPDATE_META_PAGE:
id = "UPDATE_META_PAGE";
break;
}
return id;
}
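Both `hash_desc` and `hash_identify` above mask the record's info byte with `~XLR_INFO_MASK` before switching on it, so only the high four bits select the hash operation. The sketch below shows that decoding in isolation. It is an illustration, not the committed code: it swaps the `switch` for a table lookup, `toy_hash_identify` and `toy_hash_ops` are hypothetical names, and the mask value `0x0F` is reproduced here as an assumption about PostgreSQL's definition. The opcode order follows the `XLOG_HASH_*` values listed in `hash_xlog.h` in this commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed value of XLR_INFO_MASK: the low four bits belong to the WAL core. */
#define TOY_XLR_INFO_MASK 0x0F

/* Names indexed by opcode >> 4, i.e. 0x00, 0x10, ..., 0xB0. */
static const char *const toy_hash_ops[] = {
    "INIT_META_PAGE", "INIT_BITMAP_PAGE", "INSERT", "ADD_OVFL_PAGE",
    "SPLIT_ALLOCATE_PAGE", "SPLIT_PAGE", "SPLIT_COMPLETE",
    "MOVE_PAGE_CONTENTS", "SQUEEZE_PAGE", "DELETE", "SPLIT_CLEANUP",
    "UPDATE_META_PAGE"
};

/* Return the operation name for an info byte, or NULL if it is unknown. */
static const char *
toy_hash_identify(uint8_t info)
{
    uint8_t op = (uint8_t) (info & (uint8_t) ~TOY_XLR_INFO_MASK);
    size_t  idx = (size_t) (op >> 4);

    /* The table must cover the twelve opcodes 0x00 through 0xB0. */
    assert(sizeof(toy_hash_ops) / sizeof(toy_hash_ops[0]) == 12);

    if (idx >= sizeof(toy_hash_ops) / sizeof(toy_hash_ops[0]))
        return NULL;
    return toy_hash_ops[idx];
}
```

An info byte of `0x20` and one of `0x2F` both decode to `INSERT`, because the low nibble is masked off; `0xC0` has no entry and yields `NULL`, matching the fall-through behavior of the `switch` in `hash_identify`.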
......@@ -506,11 +506,6 @@ DefineIndex(Oid relationId,
accessMethodForm = (Form_pg_am) GETSTRUCT(tuple);
amRoutine = GetIndexAmRoutine(accessMethodForm->amhandler);
if (strcmp(accessMethodName, "hash") == 0 &&
RelationNeedsWAL(rel))
ereport(WARNING,
(errmsg("hash indexes are not WAL-logged and their use is discouraged")));
if (stmt->unique && !amRoutine->amcanunique)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
......
......@@ -5880,13 +5880,10 @@ RelationIdIsInInitFile(Oid relationId)
/*
* Tells whether any index for the relation is unlogged.
*
* Any index using the hash AM is implicitly unlogged.
*
* Note: There doesn't seem to be any way to have an unlogged index attached
* to a permanent table except to create a hash index, but it seems best to
* keep this general so that it returns sensible results even when they seem
* obvious (like for an unlogged table) and to handle possible future unlogged
* indexes on permanent tables.
* to a permanent table, but it seems best to keep this general so that it
* returns sensible results even when they seem obvious (like for an unlogged
* table) and to handle possible future unlogged indexes on permanent tables.
*/
bool
RelationHasUnloggedIndex(Relation rel)
......@@ -5908,8 +5905,7 @@ RelationHasUnloggedIndex(Relation rel)
elog(ERROR, "cache lookup failed for relation %u", indexoid);
reltup = (Form_pg_class) GETSTRUCT(tp);
if (reltup->relpersistence == RELPERSISTENCE_UNLOGGED
|| reltup->relam == HASH_AM_OID)
if (reltup->relpersistence == RELPERSISTENCE_UNLOGGED)
result = true;
ReleaseSysCache(tp);
......
......@@ -16,7 +16,239 @@
#include "access/xlogreader.h"
#include "lib/stringinfo.h"
#include "storage/off.h"
/* Number of buffers required for XLOG_HASH_SQUEEZE_PAGE operation */
#define HASH_XLOG_FREE_OVFL_BUFS 6
/*
* XLOG records for hash operations
*/
#define XLOG_HASH_INIT_META_PAGE 0x00 /* initialize the meta page */
#define XLOG_HASH_INIT_BITMAP_PAGE 0x10 /* initialize the bitmap page */
#define XLOG_HASH_INSERT 0x20 /* add index tuple without split */
#define XLOG_HASH_ADD_OVFL_PAGE 0x30 /* add overflow page */
#define XLOG_HASH_SPLIT_ALLOCATE_PAGE 0x40 /* allocate new page for split */
#define XLOG_HASH_SPLIT_PAGE 0x50 /* split page */
#define XLOG_HASH_SPLIT_COMPLETE 0x60 /* completion of split
* operation */
#define XLOG_HASH_MOVE_PAGE_CONTENTS 0x70 /* remove tuples from one page
* and add to another page */
#define XLOG_HASH_SQUEEZE_PAGE 0x80 /* add tuples to one of the previous
* pages in chain and free the ovfl
* page */
#define XLOG_HASH_DELETE 0x90 /* delete index tuples from a page */
#define XLOG_HASH_SPLIT_CLEANUP 0xA0 /* clear split-cleanup flag in primary
* bucket page after deleting tuples
* that are moved due to split */
#define XLOG_HASH_UPDATE_META_PAGE 0xB0 /* update meta page after
* vacuum */
/*
* xl_hash_split_allocate_page flag values, 8 bits are available.
*/
#define XLH_SPLIT_META_UPDATE_MASKS (1<<0)
#define XLH_SPLIT_META_UPDATE_SPLITPOINT (1<<1)
/*
* This is what we need to know about a HASH index create.
*
* Backup block 0: metapage
*/
typedef struct xl_hash_createidx
{
double num_tuples;
RegProcedure procid;
uint16 ffactor;
} xl_hash_createidx;
#define SizeOfHashCreateIdx (offsetof(xl_hash_createidx, ffactor) + sizeof(uint16))
/*
* This is what we need to know about simple (without split) insert.
*
* This data record is used for XLOG_HASH_INSERT
*
* Backup Blk 0: original page (data contains the inserted tuple)
* Backup Blk 1: metapage (HashMetaPageData)
*/
typedef struct xl_hash_insert
{
OffsetNumber offnum;
} xl_hash_insert;
#define SizeOfHashInsert (offsetof(xl_hash_insert, offnum) + sizeof(OffsetNumber))
/*
* This is what we need to know about addition of overflow page.
*
* This data record is used for XLOG_HASH_ADD_OVFL_PAGE
*
* Backup Blk 0: newly allocated overflow page
* Backup Blk 1: page before new overflow page in the bucket chain
* Backup Blk 2: bitmap page
* Backup Blk 3: new bitmap page
* Backup Blk 4: metapage
*/
typedef struct xl_hash_add_ovfl_page
{
uint16 bmsize;
bool bmpage_found;
} xl_hash_add_ovfl_page;
#define SizeOfHashAddOvflPage \
(offsetof(xl_hash_add_ovfl_page, bmpage_found) + sizeof(bool))
/*
* This is what we need to know about allocating a page for split.
*
* This data record is used for XLOG_HASH_SPLIT_ALLOCATE_PAGE
*
* Backup Blk 0: page for old bucket
* Backup Blk 1: page for new bucket
* Backup Blk 2: metapage
*/
typedef struct xl_hash_split_allocate_page
{
uint32 new_bucket;
uint16 old_bucket_flag;
uint16 new_bucket_flag;
uint8 flags;
} xl_hash_split_allocate_page;
#define SizeOfHashSplitAllocPage \
(offsetof(xl_hash_split_allocate_page, flags) + sizeof(uint8))
/*
* This is what we need to know about completing the split operation.
*
* This data record is used for XLOG_HASH_SPLIT_COMPLETE
*
* Backup Blk 0: page for old bucket
* Backup Blk 1: page for new bucket
*/
typedef struct xl_hash_split_complete
{
uint16 old_bucket_flag;
uint16 new_bucket_flag;
} xl_hash_split_complete;
#define SizeOfHashSplitComplete \
(offsetof(xl_hash_split_complete, new_bucket_flag) + sizeof(uint16))
/*
* This is what we need to know about move page contents required during
* squeeze operation.
*
* This data record is used for XLOG_HASH_MOVE_PAGE_CONTENTS
*
* Backup Blk 0: bucket page
* Backup Blk 1: page containing moved tuples
* Backup Blk 2: page from which tuples will be removed
*/
typedef struct xl_hash_move_page_contents
{
uint16 ntups;
bool is_prim_bucket_same_wrt; /* TRUE if the page to which
* tuples are moved is same as
* primary bucket page */
} xl_hash_move_page_contents;
#define SizeOfHashMovePageContents \
(offsetof(xl_hash_move_page_contents, is_prim_bucket_same_wrt) + sizeof(bool))
/*
* This is what we need to know about the squeeze page operation.
*
* This data record is used for XLOG_HASH_SQUEEZE_PAGE
*
* Backup Blk 0: page containing tuples moved from freed overflow page
* Backup Blk 1: freed overflow page
* Backup Blk 2: page previous to the freed overflow page
* Backup Blk 3: page next to the freed overflow page
* Backup Blk 4: bitmap page containing info of freed overflow page
* Backup Blk 5: meta page
*/
typedef struct xl_hash_squeeze_page
{
BlockNumber prevblkno;
BlockNumber nextblkno;
uint16 ntups;
bool is_prim_bucket_same_wrt; /* TRUE if the page to which
* tuples are moved is same as
* primary bucket page */
bool is_prev_bucket_same_wrt; /* TRUE if the page to which
* tuples are moved is the
* page previous to the freed
* overflow page */
} xl_hash_squeeze_page;
#define SizeOfHashSqueezePage \
(offsetof(xl_hash_squeeze_page, is_prev_bucket_same_wrt) + sizeof(bool))
/*
* This is what we need to know about the deletion of index tuples from a page.
*
* This data record is used for XLOG_HASH_DELETE
*
* Backup Blk 0: primary bucket page
* Backup Blk 1: page from which tuples are deleted
*/
typedef struct xl_hash_delete
{
bool is_primary_bucket_page; /* TRUE if the operation is for
* primary bucket page */
} xl_hash_delete;
#define SizeOfHashDelete (offsetof(xl_hash_delete, is_primary_bucket_page) + sizeof(bool))
/*
* This is what we need for metapage update operation.
*
* This data record is used for XLOG_HASH_UPDATE_META_PAGE
*
* Backup Blk 0: meta page
*/
typedef struct xl_hash_update_meta_page
{
    double      ntuples;
} xl_hash_update_meta_page;
#define SizeOfHashUpdateMetaPage \
    (offsetof(xl_hash_update_meta_page, ntuples) + sizeof(double))
/*
* This is what we need to initialize metapage.
*
* This data record is used for XLOG_HASH_INIT_META_PAGE
*
* Backup Blk 0: meta page
*/
typedef struct xl_hash_init_meta_page
{
    double      num_tuples;
    RegProcedure procid;
    uint16      ffactor;
} xl_hash_init_meta_page;
#define SizeOfHashInitMetaPage \
    (offsetof(xl_hash_init_meta_page, ffactor) + sizeof(uint16))
/*
* This is what we need to initialize bitmap page.
*
* This data record is used for XLOG_HASH_INIT_BITMAP_PAGE
*
* Backup Blk 0: bitmap page
* Backup Blk 1: meta page
*/
typedef struct xl_hash_init_bitmap_page
{
    uint16      bmsize;
} xl_hash_init_bitmap_page;
#define SizeOfHashInitBitmapPage \
    (offsetof(xl_hash_init_bitmap_page, bmsize) + sizeof(uint16))
extern void hash_redo(XLogReaderState *record);
extern void hash_desc(StringInfo buf, XLogReaderState *record);
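
Each record struct above is paired with a `SizeOf...` macro built from `offsetof` of the last field plus that field's size, rather than plain `sizeof` on the struct. A likely reason is that this excludes trailing alignment padding, so only meaningful bytes are counted. The following is a minimal standalone sketch of that pattern, not PostgreSQL code: `demo_move_page_contents`, `SizeOfDemoMovePageContents`, `demo_serialize`, and `demo_deserialize` are hypothetical names that only mirror the layout of `xl_hash_move_page_contents`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in with the same field layout as xl_hash_move_page_contents. */
typedef struct demo_move_page_contents
{
    uint16_t    ntups;
    bool        is_prim_bucket_same_wrt;
} demo_move_page_contents;

/* Bytes up to and including the last field; trailing padding is excluded. */
#define SizeOfDemoMovePageContents \
    (offsetof(demo_move_page_contents, is_prim_bucket_same_wrt) + sizeof(bool))

/*
 * Copy only the meaningful prefix of the struct into buf.
 * Returns the number of bytes written, or 0 if buf is too small.
 */
static size_t
demo_serialize(const demo_move_page_contents *rec,
               unsigned char *buf, size_t buflen)
{
    if (buflen < SizeOfDemoMovePageContents)
        return 0;
    memcpy(buf, rec, SizeOfDemoMovePageContents);
    return SizeOfDemoMovePageContents;
}

/* Rebuild a struct from the serialized prefix; padding bytes are zeroed. */
static demo_move_page_contents
demo_deserialize(const unsigned char *buf)
{
    demo_move_page_contents rec;

    memset(&rec, 0, sizeof(rec));
    memcpy(&rec, buf, SizeOfDemoMovePageContents);
    return rec;
}
```

On a platform where `uint16_t` is 2-byte aligned, `sizeof(demo_move_page_contents)` is typically larger than `SizeOfDemoMovePageContents`, and the macro value is never larger than `sizeof`. Serializing with the macro keeps the copied length independent of whatever padding the compiler appends.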
......
......@@ -2335,13 +2335,9 @@ Options: fastupdate=on, gin_pending_list_limit=128
-- HASH
--
CREATE INDEX hash_i4_index ON hash_i4_heap USING hash (random int4_ops);
WARNING: hash indexes are not WAL-logged and their use is discouraged
CREATE INDEX hash_name_index ON hash_name_heap USING hash (random name_ops);
WARNING: hash indexes are not WAL-logged and their use is discouraged
CREATE INDEX hash_txt_index ON hash_txt_heap USING hash (random text_ops);
WARNING: hash indexes are not WAL-logged and their use is discouraged
CREATE INDEX hash_f8_index ON hash_f8_heap USING hash (random float8_ops);
WARNING: hash indexes are not WAL-logged and their use is discouraged
CREATE UNLOGGED TABLE unlogged_hash_table (id int4);
CREATE INDEX unlogged_hash_index ON unlogged_hash_table USING hash (id int4_ops);
DROP TABLE unlogged_hash_table;
......@@ -2350,7 +2346,6 @@ DROP TABLE unlogged_hash_table;
-- maintenance_work_mem setting and fillfactor:
SET maintenance_work_mem = '1MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
WARNING: hash indexes are not WAL-logged and their use is discouraged
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
QUERY PLAN
......
......@@ -383,7 +383,6 @@ DROP INDEX enumtest_btree;
-- Hash index / opclass with the = operator
--
CREATE INDEX enumtest_hash ON enumtest USING hash (col);
WARNING: hash indexes are not WAL-logged and their use is discouraged
SELECT * FROM enumtest WHERE col = 'orange';
col
--------
......
......@@ -201,7 +201,6 @@ SELECT h.seqno AS f20000
--
CREATE TABLE hash_split_heap (keycol INT);
CREATE INDEX hash_split_index on hash_split_heap USING HASH (keycol);
WARNING: hash indexes are not WAL-logged and their use is discouraged
INSERT INTO hash_split_heap SELECT 1 FROM generate_series(1, 70000) a;
VACUUM FULL hash_split_heap;
-- Let's do a backward scan.
......@@ -230,5 +229,4 @@ DROP TABLE hash_temp_heap CASCADE;
CREATE TABLE hash_heap_float4 (x float4, y int);
INSERT INTO hash_heap_float4 VALUES (1.1,1);
CREATE INDEX hash_idx ON hash_heap_float4 USING hash (x);
WARNING: hash indexes are not WAL-logged and their use is discouraged
DROP TABLE hash_heap_float4 CASCADE;
......@@ -41,7 +41,6 @@ SELECT * FROM macaddr_data;
CREATE INDEX macaddr_data_btree ON macaddr_data USING btree (b);
CREATE INDEX macaddr_data_hash ON macaddr_data USING hash (b);
WARNING: hash indexes are not WAL-logged and their use is discouraged
SELECT a, b, trunc(b) FROM macaddr_data ORDER BY 2, 1;
a | b | trunc
----+-------------------+-------------------
......
......@@ -12,7 +12,6 @@ CREATE UNIQUE INDEX test_replica_identity_keyab_key ON test_replica_identity (ke
CREATE UNIQUE INDEX test_replica_identity_oid_idx ON test_replica_identity (oid);
CREATE UNIQUE INDEX test_replica_identity_nonkey ON test_replica_identity (keya, nonkey);
CREATE INDEX test_replica_identity_hash ON test_replica_identity USING hash (nonkey);
WARNING: hash indexes are not WAL-logged and their use is discouraged
CREATE UNIQUE INDEX test_replica_identity_expr ON test_replica_identity (keya, keyb, (3));
CREATE UNIQUE INDEX test_replica_identity_partial ON test_replica_identity (keya, keyb) WHERE keyb != '3';
-- default is 'd'/DEFAULT for user created tables
......
......@@ -114,7 +114,6 @@ SELECT COUNT(*) FROM guid1 WHERE guid_field >= '22222222-2222-2222-2222-22222222
-- btree and hash index creation test
CREATE INDEX guid1_btree ON guid1 USING BTREE (guid_field);
CREATE INDEX guid1_hash ON guid1 USING HASH (guid_field);
WARNING: hash indexes are not WAL-logged and their use is discouraged
-- unique index test
CREATE UNIQUE INDEX guid1_unique_BTREE ON guid1 USING BTREE (guid_field);
-- should fail
......