Commit dd299df8 authored by Peter Geoghegan

Make heap TID a tiebreaker nbtree index column.

Make nbtree treat all index tuples as having a heap TID attribute.
Index searches can distinguish duplicates by heap TID, since heap TID is
always guaranteed to be unique.  This general approach has numerous
benefits for performance, and is prerequisite to teaching VACUUM to
perform "retail index tuple deletion".

Naively adding a new attribute to every pivot tuple has unacceptable
overhead (it bloats internal pages), so suffix truncation of pivot
tuples is added.  This will usually truncate away the "extra" heap TID
attribute from pivot tuples during a leaf page split, and may also
truncate away additional user attributes.  This can increase fan-out,
especially in a multi-column index.  Truncation can only occur at the
attribute granularity, which isn't particularly effective, but works
well enough for now.  A future patch may add support for truncating
"within" text attributes by generating truncated key values using new
opclass infrastructure.
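
Sketch of the attribute-granularity decision, assuming for simplicity that each
key attribute is a plain integer (the real code goes through the opclass
comparators): the new pivot keeps the smallest prefix of key attributes that
still distinguishes the last tuple on the new left page from the first tuple on
the new right page, and only when every key attribute is equal must it also
carry a heap TID.  demo_pivot_natts() is a hypothetical helper, not the commit's
_bt_truncate().

    /*
     * How many leading key attributes must a new pivot tuple (leaf high
     * key) keep?  Integer keys stand in for real datums here.  A result of
     * nkeyatts + 1 means even the full set of user attributes cannot tell
     * the two tuples apart, so the pivot must also keep a heap TID.
     */
    static int
    demo_pivot_natts(const long long *lastleft, const long long *firstright,
                     int nkeyatts)
    {
        int     keepnatts = 1;

        for (int attnum = 1; attnum <= nkeyatts; attnum++)
        {
            if (lastleft[attnum - 1] != firstright[attnum - 1])
                break;      /* this attribute already separates the pages */
            keepnatts++;
        }

        return keepnatts;
    }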

Only new indexes (BTREE_VERSION 4 indexes) will have insertions that
treat heap TID as a tiebreaker attribute, or will have pivot tuples
undergo suffix truncation during a leaf page split (on-disk
compatibility with versions 2 and 3 is preserved).  Upgrades to version
4 cannot be performed on-the-fly, unlike upgrades from version 2 to
version 3.  contrib/amcheck continues to work with version 2 and 3
indexes, while also enforcing stricter invariants when verifying version
4 indexes.  These stricter invariants are the same invariants described
by "3.1.12 Sequencing" from the Lehman and Yao paper.

A later patch will enhance the logic used by nbtree to pick a split
point.  Without smarter choices about the precise point at which to split
leaf pages, this patch is likely to hurt performance.  Making
these two mostly-distinct sets of enhancements into distinct commits
seems like it might clarify their design, even though neither commit is
particularly useful on its own.

The maximum allowed size of new tuples is reduced by an amount equal to
the space required to store an extra MAXALIGN()'d TID in a new high key
during leaf page splits.  The user-facing definition of the "1/3 of a
page" restriction is already imprecise, and so does not need to be
revised.  However, there should be a compatibility note in the v12
release notes.
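
As a back-of-envelope check (the 8-byte MAXALIGN and 6-byte ItemPointerData
sizes are typical 64-bit assumptions, not values read from nbtree.h), the
reserved amount works out to one MAXALIGN()'d heap TID:

    #include <stdio.h>

    /* Typical 64-bit assumptions; the real limit comes from BTMaxItemSize() */
    #define DEMO_MAXIMUM_ALIGNOF    8
    #define DEMO_MAXALIGN(len) \
        (((len) + DEMO_MAXIMUM_ALIGNOF - 1) & ~((size_t) (DEMO_MAXIMUM_ALIGNOF - 1)))
    #define DEMO_SIZEOF_ITEMPOINTER 6   /* 4-byte block number + 2-byte offset */

    int
    main(void)
    {
        size_t      reserved = DEMO_MAXALIGN((size_t) DEMO_SIZEOF_ITEMPOINTER);

        /*
         * A version 4 leaf page keeps this much headroom so that a split
         * can append a heap TID to the new high key without breaking the
         * "1/3 of a page" ceiling.
         */
        printf("leaf tuple size limit reduced by %zu bytes\n", reserved);
        return 0;
    }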

Author: Peter Geoghegan
Reviewed-By: Heikki Linnakangas, Alexander Korotkov
Discussion: https://postgr.es/m/CAH2-WzkVb0Kom=R+88fDFb=JSxZMFvbHVC6Mn9LJ2n=X=kS-Uw@mail.gmail.com
parent e5adcb78
......@@ -130,9 +130,12 @@ SELECT bt_index_parent_check('bttest_multi_idx', true);
--
INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
-- Delete many entries, and vacuum. This causes page deletions.
DELETE FROM delete_test_table WHERE a > 40000;
VACUUM delete_test_table;
DELETE FROM delete_test_table WHERE a > 10;
-- Delete most entries, and vacuum, deleting internal pages and creating "fast
-- root"
DELETE FROM delete_test_table WHERE a < 79990;
VACUUM delete_test_table;
SELECT bt_index_parent_check('delete_test_table_pkey', true);
bt_index_parent_check
......
......@@ -82,9 +82,12 @@ SELECT bt_index_parent_check('bttest_multi_idx', true);
--
INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
-- Delete many entries, and vacuum. This causes page deletions.
DELETE FROM delete_test_table WHERE a > 40000;
VACUUM delete_test_table;
DELETE FROM delete_test_table WHERE a > 10;
-- Delete most entries, and vacuum, deleting internal pages and creating "fast
-- root"
DELETE FROM delete_test_table WHERE a < 79990;
VACUUM delete_test_table;
SELECT bt_index_parent_check('delete_test_table_pkey', true);
......
This diff is collapsed.
......@@ -561,7 +561,7 @@ bt_metap(PG_FUNCTION_ARGS)
* Get values of extended metadata if available, use default values
* otherwise.
*/
if (metad->btm_version == BTREE_VERSION)
if (metad->btm_version >= BTREE_NOVAC_VERSION)
{
values[j++] = psprintf("%u", metad->btm_oldest_btpo_xact);
values[j++] = psprintf("%f", metad->btm_last_cleanup_num_heap_tuples);
......
......@@ -5,7 +5,7 @@ CREATE INDEX test1_a_idx ON test1 USING btree (a);
SELECT * FROM bt_metap('test1_a_idx');
-[ RECORD 1 ]-----------+-------
magic | 340322
version | 3
version | 4
root | 1
level | 0
fastroot | 1
......
......@@ -48,7 +48,7 @@ select version, tree_level,
from pgstatindex('test_pkey');
version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation
---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | NaN | NaN
4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | NaN | NaN
(1 row)
select version, tree_level,
......@@ -58,7 +58,7 @@ select version, tree_level,
from pgstatindex('test_pkey'::text);
version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation
---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | NaN | NaN
4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | NaN | NaN
(1 row)
select version, tree_level,
......@@ -68,7 +68,7 @@ select version, tree_level,
from pgstatindex('test_pkey'::name);
version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation
---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | NaN | NaN
4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | NaN | NaN
(1 row)
select version, tree_level,
......@@ -78,7 +78,7 @@ select version, tree_level,
from pgstatindex('test_pkey'::regclass);
version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation
---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | NaN | NaN
4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | NaN | NaN
(1 row)
select pg_relpages('test');
......@@ -232,7 +232,7 @@ create index test_partition_hash_idx on test_partition using hash (a);
select pgstatindex('test_partition_idx');
pgstatindex
------------------------------
(3,0,8192,0,0,0,0,0,NaN,NaN)
(4,0,8192,0,0,0,0,0,NaN,NaN)
(1 row)
select pgstathashindex('test_partition_hash_idx');
......
......@@ -504,8 +504,9 @@ CREATE INDEX test2_mm_idx ON test2 (major, minor);
<para>
By default, B-tree indexes store their entries in ascending order
with nulls last. This means that a forward scan of an index on
column <literal>x</literal> produces output satisfying <literal>ORDER BY x</literal>
with nulls last (table TID is treated as a tiebreaker column among
otherwise equal entries). This means that a forward scan of an
index on column <literal>x</literal> produces output satisfying <literal>ORDER BY x</literal>
(or more verbosely, <literal>ORDER BY x ASC NULLS LAST</literal>). The
index can also be scanned backward, producing output satisfying
<literal>ORDER BY x DESC</literal>
......@@ -1162,10 +1163,21 @@ CREATE INDEX tab_x_y ON tab(x, y);
the extra columns are trailing columns; making them be leading columns is
unwise for the reasons explained in <xref linkend="indexes-multicolumn"/>.
However, this method doesn't support the case where you want the index to
enforce uniqueness on the key column(s). Also, explicitly marking
non-searchable columns as <literal>INCLUDE</literal> columns makes the
index slightly smaller, because such columns need not be stored in upper
tree levels.
enforce uniqueness on the key column(s).
</para>
<para>
<firstterm>Suffix truncation</firstterm> always removes non-key
columns from upper B-Tree levels. As payload columns, they are
never used to guide index scans. The truncation process also
removes one or more trailing key column(s) when the remaining
prefix of key column(s) happens to be sufficient to describe tuples
on the lowest B-Tree level. In practice, covering indexes without
an <literal>INCLUDE</literal> clause often avoid storing columns
that are effectively payload in the upper levels. However,
explicitly defining payload columns as non-key columns
<emphasis>reliably</emphasis> keeps the tuples in upper levels
small.
</para>
<para>
......
......@@ -536,7 +536,11 @@ index_truncate_tuple(TupleDesc sourceDescriptor, IndexTuple source,
bool isnull[INDEX_MAX_KEYS];
IndexTuple truncated;
Assert(leavenatts < sourceDescriptor->natts);
Assert(leavenatts <= sourceDescriptor->natts);
/* Easy case: no truncation actually required */
if (leavenatts == sourceDescriptor->natts)
return CopyIndexTuple(source);
/* Create temporary descriptor to scribble on */
truncdesc = palloc(TupleDescSize(sourceDescriptor));
......
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
......@@ -794,7 +794,7 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
if (metad->btm_version < BTREE_VERSION)
if (metad->btm_version < BTREE_NOVAC_VERSION)
{
/*
* Do cleanup if metapage needs upgrade, because we don't have
......
......@@ -152,8 +152,12 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
* downlink (block) to uniquely identify the index entry, in case it
* moves right while we're working lower in the tree. See the paper
* by Lehman and Yao for how this is detected and handled. (We use the
* child link to disambiguate duplicate keys in the index -- Lehman
* and Yao disallow duplicate keys.)
* child link during the second half of a page split -- if caller ends
* up splitting the child it usually ends up inserting a new pivot
* tuple for child's new right sibling immediately after the original
* bts_offset offset recorded here. The downlink block will be needed
* to check if bts_offset remains the position of this same pivot
* tuple.)
*/
new_stack = (BTStack) palloc(sizeof(BTStackData));
new_stack->bts_blkno = par_blkno;
......@@ -251,11 +255,13 @@ _bt_moveright(Relation rel,
/*
* When nextkey = false (normal case): if the scan key that brought us to
* this page is > the high key stored on the page, then the page has split
* and we need to move right. (If the scan key is equal to the high key,
* we might or might not need to move right; have to scan the page first
* anyway.)
* and we need to move right. (pg_upgrade'd !heapkeyspace indexes could
* have some duplicates to the right as well as the left, but that's
* something that's only ever dealt with on the leaf level, after
* _bt_search has found an initial leaf page.)
*
* When nextkey = true: move right if the scan key is >= page's high key.
* (Note that key.scantid cannot be set in this case.)
*
* The page could even have split more than once, so scan as far as
* needed.
......@@ -347,6 +353,9 @@ _bt_binsrch(Relation rel,
int32 result,
cmpval;
/* Requesting nextkey semantics while using scantid seems nonsensical */
Assert(!key->nextkey || key->scantid == NULL);
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
......@@ -554,10 +563,14 @@ _bt_compare(Relation rel,
TupleDesc itupdesc = RelationGetDescr(rel);
BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
IndexTuple itup;
ItemPointer heapTid;
ScanKey scankey;
int ncmpkey;
int ntupatts;
Assert(_bt_check_natts(rel, page, offnum));
Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
Assert(key->heapkeyspace || key->scantid == NULL);
/*
* Force result ">" if target item is first data item on an internal page
......@@ -567,6 +580,7 @@ _bt_compare(Relation rel,
return 1;
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
ntupatts = BTreeTupleGetNAtts(itup, rel);
/*
* The scan key is set up with the attribute number associated with each
......@@ -580,8 +594,10 @@ _bt_compare(Relation rel,
* _bt_first).
*/
ncmpkey = Min(ntupatts, key->keysz);
Assert(key->heapkeyspace || ncmpkey == key->keysz);
scankey = key->scankeys;
for (int i = 1; i <= key->keysz; i++)
for (int i = 1; i <= ncmpkey; i++)
{
Datum datum;
bool isNull;
......@@ -632,8 +648,77 @@ _bt_compare(Relation rel,
scankey++;
}
/* if we get here, the keys are equal */
/*
* All non-truncated attributes (other than heap TID) were found to be
* equal. Treat truncated attributes as minus infinity when scankey has a
* key attribute value that would otherwise be compared directly.
*
* Note: it doesn't matter if ntupatts includes non-key attributes;
* scankey won't, so explicitly excluding non-key attributes isn't
* necessary.
*/
if (key->keysz > ntupatts)
return 1;
/*
* Use the heap TID attribute and scantid to try to break the tie. The
* rules are the same as any other key attribute -- only the
* representation differs.
*/
heapTid = BTreeTupleGetHeapTID(itup);
if (key->scantid == NULL)
{
/*
* Most searches have a scankey that is considered greater than a
* truncated pivot tuple if and when the scankey has equal values for
* attributes up to and including the least significant untruncated
* attribute in tuple.
*
* For example, if an index has the minimum two attributes (single
* user key attribute, plus heap TID attribute), and a page's high key
* is ('foo', -inf), and scankey is ('foo', <omitted>), the search
* will not descend to the page to the left. The search will descend
* right instead. The truncated attribute in pivot tuple means that
* all non-pivot tuples on the page to the left are strictly < 'foo',
* so it isn't necessary to descend left. In other words, search
* doesn't have to descend left because it isn't interested in a match
* that has a heap TID value of -inf.
*
* However, some searches (pivotsearch searches) actually require that
* we descend left when this happens. -inf is treated as a possible
* match for omitted scankey attribute(s). This is needed by page
* deletion, which must re-find leaf pages that are targets for
* deletion using their high keys.
*
* Note: the heap TID part of the test ensures that scankey is being
* compared to a pivot tuple with one or more truncated key
* attributes.
*
* Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
* left here, since they have no heap TID attribute (and cannot have
* any -inf key values in any case, since truncation can only remove
* non-key attributes). !heapkeyspace searches must always be
* prepared to deal with matches on both sides of the pivot once the
* leaf level is reached.
*/
if (key->heapkeyspace && !key->pivotsearch &&
key->keysz == ntupatts && heapTid == NULL)
return 1;
/* All provided scankey arguments found to be equal */
return 0;
}
/*
* Treat truncated heap TID as minus infinity, since scankey has a key
* attribute value (scantid) that would otherwise be compared directly
*/
Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
if (heapTid == NULL)
return 1;
Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
return ItemPointerCompare(key->scantid, heapTid);
}
/*
......@@ -1148,7 +1233,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
}
/* Initialize remaining insertion scan key fields */
inskey.heapkeyspace = _bt_heapkeyspace(rel);
inskey.nextkey = nextkey;
inskey.pivotsearch = false;
inskey.scantid = NULL;
inskey.keysz = keysCount;
/*
......
......@@ -755,6 +755,7 @@ _bt_sortaddtup(Page page,
{
trunctuple = *itup;
trunctuple.t_info = sizeof(IndexTupleData);
/* Deliberately zero INDEX_ALT_TID_MASK bits */
BTreeTupleSetNAtts(&trunctuple, 0);
itup = &trunctuple;
itemsize = sizeof(IndexTupleData);
......@@ -808,8 +809,6 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
OffsetNumber last_off;
Size pgspc;
Size itupsz;
int indnatts = IndexRelationGetNumberOfAttributes(wstate->index);
int indnkeyatts = IndexRelationGetNumberOfKeyAttributes(wstate->index);
/*
* This is a handy place to check for cancel interrupts during the btree
......@@ -826,27 +825,21 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
itupsz = MAXALIGN(itupsz);
/*
* Check whether the item can fit on a btree page at all. (Eventually, we
* ought to try to apply TOAST methods if not.) We actually need to be
* able to fit three items on every page, so restrict any one item to 1/3
* the per-page available space. Note that at this point, itupsz doesn't
* include the ItemId.
* Check whether the item can fit on a btree page at all.
*
* NOTE: similar code appears in _bt_insertonpg() to defend against
* oversize items being inserted into an already-existing index. But
* during creation of an index, we don't go through there.
*/
if (itupsz > BTMaxItemSize(npage))
ereport(ERROR,
(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
itupsz, BTMaxItemSize(npage),
RelationGetRelationName(wstate->index)),
errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
"Consider a function index of an MD5 hash of the value, "
"or use full text indexing."),
errtableconstraint(wstate->heap,
RelationGetRelationName(wstate->index))));
* Every newly built index will treat heap TID as part of the keyspace,
* which imposes the requirement that new high keys must occasionally have
* a heap TID appended within _bt_truncate(). That may leave a new pivot
* tuple one or two MAXALIGN() quantums larger than the original first
* right tuple it's derived from. v4 deals with the problem by decreasing
* the limit on the size of tuples inserted on the leaf level by the same
* small amount. Enforce the new v4+ limit on the leaf level, and the old
* limit on internal levels, since pivot tuples may need to make use of
the reserved space. This should never fail on internal pages.
*/
if (unlikely(itupsz > BTMaxItemSize(npage)))
_bt_check_third_page(wstate->index, wstate->heap,
state->btps_level == 0, npage, itup);
/*
* Check to see if page is "full". It's definitely full if the item won't
......@@ -892,24 +885,35 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
ItemIdSetUnused(ii); /* redundant */
((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
if (indnkeyatts != indnatts && P_ISLEAF(opageop))
if (P_ISLEAF(opageop))
{
IndexTuple lastleft;
IndexTuple truncated;
Size truncsz;
/*
* Truncate any non-key attributes from high key on leaf level
* (i.e. truncate on leaf level if we're building an INCLUDE
* index). This is only done at the leaf level because downlinks
* Truncate away any unneeded attributes from high key on leaf
* level. This is only done at the leaf level because downlinks
* in internal pages are either negative infinity items, or get
* their contents from copying from one level down. See also:
* _bt_split().
*
* We don't try to bias our choice of split point to make it more
* likely that _bt_truncate() can truncate away more attributes,
* whereas the split point passed to _bt_split() is chosen much
* more delicately. Suffix truncation is mostly useful because it
* improves space utilization for workloads with random
* insertions. It doesn't seem worthwhile to add logic for
* choosing a split point here for a benefit that is bound to be
* much smaller.
*
* Since the truncated tuple is probably smaller than the
* original, it cannot just be copied in place (besides, we want
* to actually save space on the leaf page). We delete the
* original high key, and add our own truncated high key at the
* same offset.
* same offset. It's okay if the truncated tuple is slightly
* larger due to containing a heap TID value, since this case is
* known to _bt_check_third_page(), which reserves space.
*
* Note that the page layout won't be changed very much. oitup is
* already located at the physical beginning of tuple space, so we
......@@ -917,7 +921,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* the latter portion of the space occupied by the original tuple.
* This is fairly cheap.
*/
truncated = _bt_nonkey_truncate(wstate->index, oitup);
ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
lastleft = (IndexTuple) PageGetItem(opage, ii);
truncated = _bt_truncate(wstate->index, lastleft, oitup,
wstate->inskey);
truncsz = IndexTupleSize(truncated);
PageIndexTupleDelete(opage, P_HIKEY);
_bt_sortaddtup(opage, truncsz, truncated, P_HIKEY);
......@@ -936,8 +944,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
if (state->btps_next == NULL)
state->btps_next = _bt_pagestate(wstate, state->btps_level + 1);
Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) ==
IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
Assert((BTreeTupleGetNAtts(state->btps_minkey, wstate->index) <=
IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
BTreeTupleGetNAtts(state->btps_minkey, wstate->index) > 0) ||
P_LEFTMOST(opageop));
Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) == 0 ||
!P_LEFTMOST(opageop));
......@@ -982,7 +991,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* the first item for a page is copied from the prior page in the code
* above. Since the minimum key for an entire level is only used as a
* minus infinity downlink, and never as a high key, there is no need to
* truncate away non-key attributes at this point.
* truncate away suffix attributes at this point.
*/
if (last_off == P_HIKEY)
{
......@@ -1041,8 +1050,9 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
}
else
{
Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) ==
IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
Assert((BTreeTupleGetNAtts(s->btps_minkey, wstate->index) <=
IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
BTreeTupleGetNAtts(s->btps_minkey, wstate->index) > 0) ||
P_LEFTMOST(opaque));
Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) == 0 ||
!P_LEFTMOST(opaque));
......@@ -1135,6 +1145,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
}
else if (itup != NULL)
{
int32 compare = 0;
for (i = 1; i <= keysz; i++)
{
SortSupport entry;
......@@ -1142,7 +1154,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
attrDatum2;
bool isNull1,
isNull2;
int32 compare;
entry = sortKeys + i - 1;
attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
......@@ -1159,6 +1170,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
else if (compare < 0)
break;
}
/*
* If key values are equal, we sort on ItemPointer. This is
* required for btree indexes, since heap TID is treated as an
* implicit last key attribute in order to ensure that all
* keys in the index are physically unique.
*/
if (compare == 0)
{
compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
Assert(compare != 0);
if (compare > 0)
load1 = false;
}
}
else
load1 = false;
......
This diff is collapsed.
......@@ -103,7 +103,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
md = BTPageGetMeta(metapg);
md->btm_magic = BTREE_MAGIC;
md->btm_version = BTREE_VERSION;
md->btm_version = xlrec->version;
md->btm_root = xlrec->root;
md->btm_level = xlrec->level;
md->btm_fastroot = xlrec->fastroot;
......@@ -202,7 +202,7 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
}
static void
btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
btree_xlog_split(bool onleft, XLogReaderState *record)
{
XLogRecPtr lsn = record->EndRecPtr;
xl_btree_split *xlrec = (xl_btree_split *) XLogRecGetData(record);
......@@ -213,8 +213,6 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
BTPageOpaque ropaque;
char *datapos;
Size datalen;
IndexTuple left_hikey = NULL;
Size left_hikeysz = 0;
BlockNumber leftsib;
BlockNumber rightsib;
BlockNumber rnext;
......@@ -248,20 +246,6 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
_bt_restore_page(rpage, datapos, datalen);
/*
* When the high key isn't present is the wal record, then we assume it to
* be equal to the first key on the right page. It must be from the leaf
* level.
*/
if (!lhighkey)
{
ItemId hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));
Assert(isleaf);
left_hikey = (IndexTuple) PageGetItem(rpage, hiItemId);
left_hikeysz = ItemIdGetLength(hiItemId);
}
PageSetLSN(rpage, lsn);
MarkBufferDirty(rbuf);
......@@ -282,8 +266,10 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
Page lpage = (Page) BufferGetPage(lbuf);
BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
OffsetNumber off;
IndexTuple newitem = NULL;
Size newitemsz = 0;
IndexTuple newitem,
left_hikey;
Size newitemsz,
left_hikeysz;
Page newlpage;
OffsetNumber leftoff;
......@@ -298,13 +284,10 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
}
/* Extract left hikey and its size (assuming 16-bit alignment) */
if (lhighkey)
{
left_hikey = (IndexTuple) datapos;
left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
datapos += left_hikeysz;
datalen -= left_hikeysz;
}
Assert(datalen == 0);
......@@ -1003,16 +986,10 @@ btree_redo(XLogReaderState *record)
btree_xlog_insert(false, true, record);
break;
case XLOG_BTREE_SPLIT_L:
btree_xlog_split(true, false, record);
break;
case XLOG_BTREE_SPLIT_L_HIGHKEY:
btree_xlog_split(true, true, record);
btree_xlog_split(true, record);
break;
case XLOG_BTREE_SPLIT_R:
btree_xlog_split(false, false, record);
break;
case XLOG_BTREE_SPLIT_R_HIGHKEY:
btree_xlog_split(false, true, record);
btree_xlog_split(false, record);
break;
case XLOG_BTREE_VACUUM:
btree_xlog_vacuum(record);
......
......@@ -35,8 +35,6 @@ btree_desc(StringInfo buf, XLogReaderState *record)
}
case XLOG_BTREE_SPLIT_L:
case XLOG_BTREE_SPLIT_R:
case XLOG_BTREE_SPLIT_L_HIGHKEY:
case XLOG_BTREE_SPLIT_R_HIGHKEY:
{
xl_btree_split *xlrec = (xl_btree_split *) rec;
......@@ -130,12 +128,6 @@ btree_identify(uint8 info)
case XLOG_BTREE_SPLIT_R:
id = "SPLIT_R";
break;
case XLOG_BTREE_SPLIT_L_HIGHKEY:
id = "SPLIT_L_HIGHKEY";
break;
case XLOG_BTREE_SPLIT_R_HIGHKEY:
id = "SPLIT_R_HIGHKEY";
break;
case XLOG_BTREE_VACUUM:
id = "VACUUM";
break;
......
......@@ -4057,9 +4057,10 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
}
/*
* If key values are equal, we sort on ItemPointer. This does not affect
* validity of the finished index, but it may be useful to have index
* scans in physical order.
* If key values are equal, we sort on ItemPointer. This is required for
* btree indexes, since heap TID is treated as an implicit last key
* attribute in order to ensure that all keys in the index are physically
* unique.
*/
{
BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
......@@ -4076,6 +4077,9 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
return (pos1 < pos2) ? -1 : 1;
}
/* ItemPointer values should never be equal */
Assert(false);
return 0;
}
......@@ -4128,6 +4132,9 @@ comparetup_index_hash(const SortTuple *a, const SortTuple *b,
return (pos1 < pos2) ? -1 : 1;
}
/* ItemPointer values should never be equal */
Assert(false);
return 0;
}
......
This diff is collapsed.
......@@ -28,8 +28,7 @@
#define XLOG_BTREE_INSERT_META 0x20 /* same, plus update metapage */
#define XLOG_BTREE_SPLIT_L 0x30 /* add index tuple with split */
#define XLOG_BTREE_SPLIT_R 0x40 /* as above, new item on right */
#define XLOG_BTREE_SPLIT_L_HIGHKEY 0x50 /* as above, include truncated highkey */
#define XLOG_BTREE_SPLIT_R_HIGHKEY 0x60 /* as above, include truncated highkey */
/* 0x50 and 0x60 are unused */
#define XLOG_BTREE_DELETE 0x70 /* delete leaf index tuples for a page */
#define XLOG_BTREE_UNLINK_PAGE 0x80 /* delete a half-dead page */
#define XLOG_BTREE_UNLINK_PAGE_META 0x90 /* same, and update metapage */
......@@ -47,6 +46,7 @@
*/
typedef struct xl_btree_metadata
{
uint32 version;
BlockNumber root;
uint32 level;
BlockNumber fastroot;
......@@ -80,27 +80,30 @@ typedef struct xl_btree_insert
* whole page image. The left page, however, is handled in the normal
* incremental-update fashion.
*
* Note: the four XLOG_BTREE_SPLIT xl_info codes all use this data record.
* The _L and _R variants indicate whether the inserted tuple went into the
* left or right split page (and thus, whether newitemoff and the new item
* are stored or not). The _HIGHKEY variants indicate that we've logged
* explicitly left page high key value, otherwise redo should use right page
* leftmost key as a left page high key. _HIGHKEY is specified for internal
* pages where right page leftmost key is suppressed, and for leaf pages
* of covering indexes where high key have non-key attributes truncated.
* Note: XLOG_BTREE_SPLIT_L and XLOG_BTREE_SPLIT_R share this data record.
* There are two variants to indicate whether the inserted tuple went into the
* left or right split page (and thus, whether newitemoff and the new item are
* stored or not). We always log the left page high key because suffix
* truncation can generate a new leaf high key using user-defined code. This
* is also necessary on internal pages, since the first right item that the
* left page's high key was based on will have been truncated to zero
* attributes in the right page (the original is unavailable from the right
* page).
*
* Backup Blk 0: original page / new left page
*
* The left page's data portion contains the new item, if it's the _L variant.
* (In the _R variants, the new item is one of the right page's tuples.)
* If level > 0, an IndexTuple representing the HIKEY of the left page
* follows. We don't need this on leaf pages, because it's the same as the
* leftmost key in the new right page.
* An IndexTuple representing the high key of the left page must follow with
* either variant.
*
* Backup Blk 1: new right page
*
* The right page's data portion contains the right page's tuples in the
* form used by _bt_restore_page.
* The right page's data portion contains the right page's tuples in the form
* used by _bt_restore_page. This includes the new item, if it's the _R
* variant. The right page's tuples also include the right page's high key
* with either variant (moved from the left/original page during the split),
* unless the split happened to be of the rightmost page on its level, where
* there is no high key for new right page.
*
* Backup Blk 2: next block (orig page's rightlink), if any
* Backup Blk 3: child's left sibling, if non-leaf split
......
......@@ -199,28 +199,22 @@ reset enable_seqscan;
reset enable_indexscan;
reset enable_bitmapscan;
--
-- Test B-tree page deletion. In particular, deleting a non-leaf page.
-- Test B-tree fast path (cache rightmost leaf page) optimization.
--
-- First create a tree that's at least four levels deep. The text inserted
-- is long and poorly compressible. That way only a few index tuples fit on
-- each page, allowing us to get a tall tree with fewer pages.
create table btree_tall_tbl(id int4, t text);
create index btree_tall_idx on btree_tall_tbl (id, t) with (fillfactor = 10);
insert into btree_tall_tbl
select g, g::text || '_' ||
(select string_agg(md5(i::text), '_') from generate_series(1, 50) i)
from generate_series(1, 100) g;
-- Delete most entries, and vacuum. This causes page deletions.
delete from btree_tall_tbl where id < 950;
vacuum btree_tall_tbl;
--
-- Test B-tree insertion with a metapage update (XLOG_BTREE_INSERT_META
-- WAL record type). This happens when a "fast root" page is split.
-- First create a tree that's at least three levels deep (i.e. has one level
-- between the root and leaf levels). The text inserted is long. It won't be
-- compressed because we use plain storage in the table. Only a few index
-- tuples fit on each internal page, allowing us to get a tall tree with few
-- pages. (A tall tree is required to trigger caching.)
--
-- The vacuum above should've turned the leaf page into a fast root. We just
-- need to insert some rows to cause the fast root page to split.
insert into btree_tall_tbl (id, t)
select g, repeat('x', 100) from generate_series(1, 500) g;
-- The text column must be the leading column in the index, since suffix
-- truncation would otherwise truncate tuples on internal pages, leaving us
-- with a short tree.
create table btree_tall_tbl(id int4, t text);
alter table btree_tall_tbl alter COLUMN t set storage plain;
create index btree_tall_idx on btree_tall_tbl (t, id) with (fillfactor = 10);
insert into btree_tall_tbl select g, repeat('x', 250)
from generate_series(1, 130) g;
--
-- Test vacuum_cleanup_index_scale_factor
--
......
......@@ -3225,11 +3225,22 @@ explain (costs off)
CREATE TABLE delete_test_table (a bigint, b bigint, c bigint, d bigint);
INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
-- Delete many entries, and vacuum. This causes page deletions.
DELETE FROM delete_test_table WHERE a > 40000;
VACUUM delete_test_table;
DELETE FROM delete_test_table WHERE a > 10;
-- Delete most entries, and vacuum, deleting internal pages and creating "fast
-- root"
DELETE FROM delete_test_table WHERE a < 79990;
VACUUM delete_test_table;
--
-- Test B-tree insertion with a metapage update (XLOG_BTREE_INSERT_META
-- WAL record type). This happens when a "fast root" page is split. This
-- also creates coverage for nbtree FSM page recycling.
--
-- The vacuum above should've turned the leaf page into a fast root. We just
-- need to insert some rows to cause the fast root page to split.
INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,1000) i;
--
-- REINDEX (VERBOSE)
--
CREATE TABLE reindex_verbose(id integer primary key);
......
......@@ -128,9 +128,9 @@ FROM pg_type JOIN pg_class c ON typrelid = c.oid WHERE typname = 'deptest_t';
-- doesn't work: grant still exists
DROP USER regress_dep_user1;
ERROR: role "regress_dep_user1" cannot be dropped because some objects depend on it
DETAIL: owner of default privileges on new relations belonging to role regress_dep_user1 in schema deptest
DETAIL: privileges for table deptest1
privileges for database regression
privileges for table deptest1
owner of default privileges on new relations belonging to role regress_dep_user1 in schema deptest
DROP OWNED BY regress_dep_user1;
DROP USER regress_dep_user1;
\set VERBOSITY terse
......
......@@ -187,9 +187,9 @@ ERROR: event trigger "regress_event_trigger" does not exist
-- should fail, regress_evt_user owns some objects
drop role regress_evt_user;
ERROR: role "regress_evt_user" cannot be dropped because some objects depend on it
DETAIL: owner of event trigger regress_event_trigger3
DETAIL: owner of user mapping for regress_evt_user on server useless_server
owner of default privileges on new relations belonging to role regress_evt_user
owner of user mapping for regress_evt_user on server useless_server
owner of event trigger regress_event_trigger3
-- cleanup before next test
-- these are all OK; the second one should emit a NOTICE
drop event trigger if exists regress_event_trigger2;
......
......@@ -441,8 +441,8 @@ ALTER SERVER s1 OWNER TO regress_test_indirect;
RESET ROLE;
DROP ROLE regress_test_indirect; -- ERROR
ERROR: role "regress_test_indirect" cannot be dropped because some objects depend on it
DETAIL: owner of server s1
privileges for foreign-data wrapper foo
DETAIL: privileges for foreign-data wrapper foo
owner of server s1
\des+
List of foreign servers
Name | Owner | Foreign-data wrapper | Access privileges | Type | Version | FDW options | Description
......@@ -1995,16 +1995,13 @@ ERROR: cannot attach a permanent relation as partition of temporary relation "t
DROP FOREIGN TABLE foreign_part;
DROP TABLE temp_parted;
-- Cleanup
\set VERBOSITY terse
DROP SCHEMA foreign_schema CASCADE;
DROP ROLE regress_test_role; -- ERROR
ERROR: role "regress_test_role" cannot be dropped because some objects depend on it
DETAIL: privileges for server s4
privileges for foreign-data wrapper foo
owner of user mapping for regress_test_role on server s6
DROP SERVER t1 CASCADE;
NOTICE: drop cascades to user mapping for public on server t1
DROP USER MAPPING FOR regress_test_role SERVER s6;
\set VERBOSITY terse
DROP FOREIGN DATA WRAPPER foo CASCADE;
NOTICE: drop cascades to 5 other objects
DROP SERVER s8 CASCADE;
......
......@@ -3503,8 +3503,8 @@ SELECT refclassid::regclass, deptype
SAVEPOINT q;
DROP ROLE regress_rls_eve; --fails due to dependency on POLICY p
ERROR: role "regress_rls_eve" cannot be dropped because some objects depend on it
DETAIL: target of policy p on table tbl1
privileges for table tbl1
DETAIL: privileges for table tbl1
target of policy p on table tbl1
ROLLBACK TO q;
ALTER POLICY p ON tbl1 TO regress_rls_frank USING (true);
SAVEPOINT q;
......
......@@ -84,32 +84,23 @@ reset enable_indexscan;
reset enable_bitmapscan;
--
-- Test B-tree page deletion. In particular, deleting a non-leaf page.
-- Test B-tree fast path (cache rightmost leaf page) optimization.
--
-- First create a tree that's at least four levels deep. The text inserted
-- is long and poorly compressible. That way only a few index tuples fit on
-- each page, allowing us to get a tall tree with fewer pages.
create table btree_tall_tbl(id int4, t text);
create index btree_tall_idx on btree_tall_tbl (id, t) with (fillfactor = 10);
insert into btree_tall_tbl
select g, g::text || '_' ||
(select string_agg(md5(i::text), '_') from generate_series(1, 50) i)
from generate_series(1, 100) g;
-- Delete most entries, and vacuum. This causes page deletions.
delete from btree_tall_tbl where id < 950;
vacuum btree_tall_tbl;
-- First create a tree that's at least three levels deep (i.e. has one level
-- between the root and leaf levels). The text inserted is long. It won't be
-- compressed because we use plain storage in the table. Only a few index
-- tuples fit on each internal page, allowing us to get a tall tree with few
-- pages. (A tall tree is required to trigger caching.)
--
-- Test B-tree insertion with a metapage update (XLOG_BTREE_INSERT_META
-- WAL record type). This happens when a "fast root" page is split.
--
-- The vacuum above should've turned the leaf page into a fast root. We just
-- need to insert some rows to cause the fast root page to split.
insert into btree_tall_tbl (id, t)
select g, repeat('x', 100) from generate_series(1, 500) g;
-- The text column must be the leading column in the index, since suffix
-- truncation would otherwise truncate tuples on internal pages, leaving us
-- with a short tree.
create table btree_tall_tbl(id int4, t text);
alter table btree_tall_tbl alter COLUMN t set storage plain;
create index btree_tall_idx on btree_tall_tbl (t, id) with (fillfactor = 10);
insert into btree_tall_tbl select g, repeat('x', 250)
from generate_series(1, 130) g;
--
-- Test vacuum_cleanup_index_scale_factor
......
......@@ -1146,11 +1146,23 @@ explain (costs off)
CREATE TABLE delete_test_table (a bigint, b bigint, c bigint, d bigint);
INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
-- Delete many entries, and vacuum. This causes page deletions.
DELETE FROM delete_test_table WHERE a > 40000;
VACUUM delete_test_table;
DELETE FROM delete_test_table WHERE a > 10;
-- Delete most entries, and vacuum, deleting internal pages and creating "fast
-- root"
DELETE FROM delete_test_table WHERE a < 79990;
VACUUM delete_test_table;
--
-- Test B-tree insertion with a metapage update (XLOG_BTREE_INSERT_META
-- WAL record type). This happens when a "fast root" page is split. This
-- also creates coverage for nbtree FSM page recycling.
--
-- The vacuum above should've turned the leaf page into a fast root. We just
-- need to insert some rows to cause the fast root page to split.
INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,1000) i;
--
-- REINDEX (VERBOSE)
--
......
......@@ -805,11 +805,11 @@ DROP FOREIGN TABLE foreign_part;
DROP TABLE temp_parted;
-- Cleanup
\set VERBOSITY terse
DROP SCHEMA foreign_schema CASCADE;
DROP ROLE regress_test_role; -- ERROR
DROP SERVER t1 CASCADE;
DROP USER MAPPING FOR regress_test_role SERVER s6;
\set VERBOSITY terse
DROP FOREIGN DATA WRAPPER foo CASCADE;
DROP SERVER s8 CASCADE;
\set VERBOSITY default
......