Commit 0d861bbb authored by Peter Geoghegan

Add deduplication to nbtree.

Deduplication reduces the storage overhead of duplicates in indexes that
use the standard nbtree index access method.  The deduplication process
is applied lazily, after the point where opportunistic deletion of
LP_DEAD-marked index tuples occurs.  Deduplication is only applied at
the point where a leaf page split would otherwise be required.  New
posting list tuples are formed by merging together existing duplicate
tuples.  The physical representation of the items on an nbtree leaf page
is made more space efficient by deduplication, but the logical contents
of the page are not changed.  Even unique indexes make use of
deduplication as a way of controlling bloat from duplicates whose TIDs
point to different versions of the same logical table row.

The lazy approach taken by nbtree has significant advantages over a GIN
style eager approach.  Most individual inserts of index tuples have
exactly the same overhead as before.  The extra overhead of
deduplication is amortized across insertions, just like the overhead of
page splits.  The key space of indexes works in the same way as it has
since commit dd299df8 (the commit that made heap TID a tiebreaker
column).

Testing has shown that nbtree deduplication can generally make indexes
that have about 10 or 15 tuples for each distinct key value some 2.5X - 4X
smaller, even with single-column integer indexes (e.g., an index on a
referencing column that accompanies a foreign key).
single column nbtree indexes comes close to the final size of a similar
contrib/btree_gin index, at least in cases where GIN's posting list
compression isn't very effective.  This can significantly improve
transaction throughput, and significantly reduce the cost of vacuuming
indexes.
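
As an illustrative sketch of how to measure this on a given installation
(table and index names are hypothetical, and the exact ratio depends on the
data), the same low-cardinality integer column can be indexed with and
without deduplication and the two sizes compared:

    CREATE TABLE referencing_tab (fk_col integer);
    INSERT INTO referencing_tab
      SELECT i / 10 FROM generate_series(1, 1000000) AS s(i);

    -- deduplication is on by default
    CREATE INDEX referencing_fk_idx ON referencing_tab (fk_col);
    CREATE INDEX referencing_fk_idx_nodedup ON referencing_tab (fk_col)
      WITH (deduplicate_items = off);

    SELECT pg_size_pretty(pg_relation_size('referencing_fk_idx'))         AS dedup_on,
           pg_size_pretty(pg_relation_size('referencing_fk_idx_nodedup')) AS dedup_off;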

A new index storage parameter (deduplicate_items) controls the use of
deduplication.  The default setting is 'on', so all new B-Tree indexes
automatically use deduplication where possible.  This decision will be
reviewed at the end of the Postgres 13 beta period.
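
To check how the parameter is set for a particular index, its reloptions can
be inspected directly (index name hypothetical); a NULL reloptions value
means the default of 'on' is in effect:

    SELECT relname, reloptions
    FROM pg_class
    WHERE relname = 'referencing_fk_idx';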

There is a regression of approximately 2% of transaction throughput with
synthetic workloads that consist of append-only inserts into a table
with several non-unique indexes, where all indexes have few or no
repeated values.  The underlying issue is that cycles are wasted on
unsuccessful attempts at deduplicating items in non-unique indexes.
There doesn't seem to be a way around it short of disabling
deduplication entirely.  Note that deduplication of items in unique
indexes is fairly well targeted in general, which avoids the problem
there (we can use a special heuristic to trigger deduplication passes in
unique indexes, since we're specifically targeting "version bloat").

Bump XLOG_PAGE_MAGIC because xl_btree_vacuum changed.

No bump in BTREE_VERSION, since the representation of posting list
tuples works in a way that's backwards compatible with version 4 indexes
(i.e. indexes built on PostgreSQL 12).  However, users must still
REINDEX a pg_upgrade'd index to use deduplication, regardless of the
Postgres version they've upgraded from.  This is the only way to set the
new nbtree metapage flag indicating that deduplication is generally
safe.
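
As a sketch (index name hypothetical), rebuilding such an index and then
checking the metapage flag might look as follows; the allequalimage column
of pageinspect's bt_metap() is assumed here:

    REINDEX INDEX referencing_fk_idx;

    CREATE EXTENSION IF NOT EXISTS pageinspect;
    -- true means deduplication is considered generally safe for the index
    SELECT allequalimage FROM bt_metap('referencing_fk_idx');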

Author: Anastasia Lubennikova, Peter Geoghegan
Reviewed-By: Peter Geoghegan, Heikki Linnakangas
Discussion:
    https://postgr.es/m/55E4051B.7020209@postgrespro.ru
    https://postgr.es/m/4ab6e2db-bcee-f4cf-0916-3a06e6ccbb55@postgrespro.ru
parent 612a1ab7
...@@ -557,11 +557,208 @@ equalimage(<replaceable>opcintype</replaceable> <type>oid</type>) returns bool ...@@ -557,11 +557,208 @@ equalimage(<replaceable>opcintype</replaceable> <type>oid</type>) returns bool
<sect1 id="btree-implementation"> <sect1 id="btree-implementation">
<title>Implementation</title> <title>Implementation</title>
<para>
This section covers B-Tree index implementation details that may be
of use to advanced users. See
<filename>src/backend/access/nbtree/README</filename> in the source
distribution for a much more detailed, internals-focused description
of the B-Tree implementation.
</para>
<sect2 id="btree-structure">
<title>B-Tree Structure</title>
<para>
<productname>PostgreSQL</productname> B-Tree indexes are
multi-level tree structures, where each level of the tree can be
used as a doubly-linked list of pages. A single metapage is stored
in a fixed position at the start of the first segment file of the
index. All other pages are either leaf pages or internal pages.
Leaf pages are the pages on the lowest level of the tree. All
other levels consist of internal pages. Each leaf page contains
tuples that point to table rows. Each internal page contains
tuples that point to the next level down in the tree. Typically,
over 99% of all pages are leaf pages. Both internal pages and leaf
pages use the standard page format described in <xref
linkend="storage-page-layout"/>.
</para>
<para>
New leaf pages are added to a B-Tree index when an existing leaf
page cannot fit an incoming tuple. A <firstterm>page
split</firstterm> operation makes room for items that originally
belonged on the overflowing page by moving a portion of the items
to a new page. Page splits must also insert a new
<firstterm>downlink</firstterm> to the new page in the parent page,
which may cause the parent to split in turn. Page splits
<quote>cascade upwards</quote> in a recursive fashion. When the
root page finally cannot fit a new downlink, a <firstterm>root page
split</firstterm> operation takes place. This adds a new level to
the tree structure by creating a new root page that is one level
above the original root page.
</para>
</sect2>
<sect2 id="btree-deduplication">
<title>Deduplication</title>
<para>
A duplicate is a leaf page tuple (a tuple that points to a table
row) where <emphasis>all</emphasis> indexed key columns have values
that match corresponding column values from at least one other leaf
page tuple that's close by in the same index. Duplicate tuples are
quite common in practice. B-Tree indexes can use a special,
space-efficient representation for duplicates when an optional
technique is enabled: <firstterm>deduplication</firstterm>.
</para>
<para>
Deduplication works by periodically merging groups of duplicate
tuples together, forming a single posting list tuple for each
group. The column key value(s) only appear once in this
representation. This is followed by a sorted array of
<acronym>TID</acronym>s that point to rows in the table. This
significantly reduces the storage size of indexes where each value
(or each distinct combination of column values) appears several
times on average.  As a result, query latency can be reduced
significantly, overall query throughput may increase, and the
overhead of routine index vacuuming may also be reduced significantly.
</para>
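
One way to look at posting list tuples directly is pageinspect's
bt_page_items() against a leaf page.  This is only a sketch: the index name
is hypothetical, block 1 simply happens to be the first leaf page of a small
index, and the htid/tids output columns are assumed to be the ones added to
pageinspect alongside this feature.

    CREATE EXTENSION IF NOT EXISTS pageinspect;
    SELECT itemoffset, ctid, itemlen, htid, tids
    FROM bt_page_items('referencing_fk_idx', 1)
    LIMIT 5;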
<note>
<para>
Although NULL is generally not considered to be equal to any
value, not even to another NULL, the B-Tree implementation
nevertheless treats NULL as just another value from the domain of
indexed values (except when enforcing uniqueness in a unique
index).  B-Tree deduplication is therefore just as effective with
<quote>duplicates</quote> that contain a NULL value.
</para>
</note>
<para>
The deduplication process occurs lazily, when a new item is
inserted that cannot fit on an existing leaf page. This prevents
(or at least delays) leaf page splits. Unlike GIN posting list
tuples, B-Tree posting list tuples do not need to expand every time
a new duplicate is inserted; they are merely an alternative
physical representation of the original logical contents of the
leaf page. This design prioritizes consistent performance with
mixed read-write workloads. Most client applications will at least
see a moderate performance benefit from using deduplication.
Deduplication is enabled by default.
</para>
<para>
Write-heavy workloads that don't benefit from deduplication due to
having few or no duplicate values in indexes will incur a small,
fixed performance penalty (unless deduplication is explicitly
disabled). The <literal>deduplicate_items</literal> storage
parameter can be used to disable deduplication within individual
indexes. There is never any performance penalty with read-only
workloads, since reading posting list tuples is at least as
efficient as reading the standard tuple representation. Disabling
deduplication isn't usually helpful.
</para>
<para>
B-Tree indexes are not directly aware that under MVCC, there might
be multiple extant versions of the same logical table row; to an
index, each tuple is an independent object that needs its own index
entry. Thus, an update of a row always creates all-new index
entries for the row, even if the key values did not change. Some
workloads suffer from index bloat caused by these
implementation-level version duplicates (this is typically a
problem for <command>UPDATE</command>-heavy workloads that cannot
apply the <acronym>HOT</acronym> optimization due to modifying at
least one indexed column). B-Tree deduplication does not
distinguish between these implementation-level version duplicates
and conventional duplicates. Deduplication can nevertheless help
with controlling index bloat caused by implementation-level version
churn.
</para>
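
A rough way to tell whether a workload is exposed to this kind of version
churn is to compare the HOT and non-HOT update counters for the table (table
name hypothetical); a low share of HOT updates on an UPDATE-heavy table
suggests that its indexes are receiving version duplicates:

    SELECT relname, n_tup_upd, n_tup_hot_upd,
           round(100.0 * n_tup_hot_upd / NULLIF(n_tup_upd, 0), 1) AS hot_update_pct
    FROM pg_stat_user_tables
    WHERE relname = 'referencing_tab';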
<tip>
<para>
A special heuristic is applied to determine whether a
deduplication pass in a unique index should take place. It can
often skip straight to splitting a leaf page, avoiding a
performance penalty from wasting cycles on unhelpful deduplication
passes. If you're concerned about the overhead of deduplication,
consider setting <literal>deduplicate_items = off</literal>
selectively. Leaving deduplication enabled in unique indexes has
little downside.
</para>
</tip>
<para>
Deduplication cannot be used in all cases due to
implementation-level restrictions. Deduplication safety is
determined when <command>CREATE INDEX</command> or
<command>REINDEX</command> is run.
</para>
<para>
Note that deduplication is deemed unsafe and cannot be used in the
following cases involving semantically significant differences
among equal datums:
</para>
<para>
<itemizedlist>
<listitem>
<para>
<type>text</type>, <type>varchar</type>, and <type>char</type>
cannot use deduplication when a
<emphasis>nondeterministic</emphasis> collation is used. Case
and accent differences must be preserved among equal datums.
</para>
</listitem>
<listitem>
<para>
<type>numeric</type> cannot use deduplication. Numeric display
scale must be preserved among equal datums.
</para>
</listitem>
<listitem>
<para>
<type>jsonb</type> cannot use deduplication, since the
<type>jsonb</type> B-Tree operator class uses
<type>numeric</type> internally.
</para>
</listitem>
<listitem>
<para>
<type>float4</type> and <type>float8</type> cannot use
deduplication. These types have distinct representations for
<literal>-0</literal> and <literal>0</literal>, which are
nevertheless considered equal. This difference must be
preserved.
</para>
</listitem>
</itemizedlist>
</para>
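
The "equal but visibly different" behavior behind these restrictions is easy
to reproduce with plain comparisons; both of the following return true even
though the inputs have different output representations:

    -- numeric display scale differs between equal values
    SELECT '1.0'::numeric = '1.00'::numeric AS numeric_equal;
    -- float zero and negative zero compare as equal
    SELECT '-0'::float8 = '0'::float8 AS float_equal;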
<para>
There is one further implementation-level restriction that may be
lifted in a future version of
<productname>PostgreSQL</productname>:
</para>
<para>
<itemizedlist>
<listitem>
<para>
Container types (such as composite types, arrays, or range
types) cannot use deduplication.
</para>
</listitem>
</itemizedlist>
</para>
<para>
There is one further implementation-level restriction that applies
regardless of the operator class or collation used:
</para>
<para> <para>
An introduction to the btree index implementation can be found in <itemizedlist>
<filename>src/backend/access/nbtree/README</filename>. <listitem>
<para>
<literal>INCLUDE</literal> indexes can never use deduplication.
</para>
</listitem>
</itemizedlist>
</para> </para>
</sect2>
</sect1> </sect1>
</chapter> </chapter>
...@@ -928,10 +928,11 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr ...@@ -928,10 +928,11 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
nondeterministic collations give a more <quote>correct</quote> behavior, nondeterministic collations give a more <quote>correct</quote> behavior,
especially when considering the full power of Unicode and its many especially when considering the full power of Unicode and its many
special cases, they also have some drawbacks. Foremost, their use leads special cases, they also have some drawbacks. Foremost, their use leads
to a performance penalty. Also, certain operations are not possible with to a performance penalty. Note, in particular, that B-tree cannot use
nondeterministic collations, such as pattern matching operations. deduplication with indexes that use a nondeterministic collation. Also,
Therefore, they should be used only in cases where they are specifically certain operations are not possible with nondeterministic collations,
wanted. such as pattern matching operations. Therefore, they should be used
only in cases where they are specifically wanted.
</para> </para>
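
As a minimal sketch of this interaction (assuming an ICU-enabled build; the
collation matches the documentation example above, the table is
hypothetical), an index on a column with a nondeterministic collation
remains perfectly usable, it simply never uses deduplication:

    CREATE COLLATION ignore_accents (provider = icu,
        locale = 'und-u-ks-level1-kc-true', deterministic = false);
    CREATE TABLE names_tab (name text COLLATE ignore_accents);
    CREATE INDEX names_tab_name_idx ON names_tab (name);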
</sect3> </sect3>
</sect2> </sect2>
......
...@@ -233,9 +233,10 @@ SELECT * FROM users WHERE nick = 'Larry'; ...@@ -233,9 +233,10 @@ SELECT * FROM users WHERE nick = 'Larry';
<para> <para>
<type>citext</type> is not as efficient as <type>text</type> because the <type>citext</type> is not as efficient as <type>text</type> because the
operator functions and the B-tree comparison functions must make copies operator functions and the B-tree comparison functions must make copies
of the data and convert it to lower case for comparisons. It is, of the data and convert it to lower case for comparisons. Also, only
however, slightly more efficient than using <function>lower</function> to get <type>text</type> can support B-Tree deduplication. However,
case-insensitive matching. <type>citext</type> is slightly more efficient than using
<function>lower</function> to get case-insensitive matching.
</para> </para>
</listitem> </listitem>
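
As an illustration of that difference (names hypothetical, exact sizes
vary), an index on a text column can come out noticeably smaller than the
equivalent citext index when there are many duplicates, since only the text
index can use deduplication:

    CREATE EXTENSION IF NOT EXISTS citext;
    CREATE TABLE users_tab (nick_ci citext, nick_txt text);
    INSERT INTO users_tab
      SELECT 'larry', 'larry' FROM generate_series(1, 100000);
    CREATE INDEX users_nick_ci_idx  ON users_tab (nick_ci);
    CREATE INDEX users_nick_txt_idx ON users_tab (nick_txt);
    SELECT pg_size_pretty(pg_relation_size('users_nick_ci_idx'))  AS citext_idx,
           pg_size_pretty(pg_relation_size('users_nick_txt_idx')) AS text_idx;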
......
...@@ -16561,10 +16561,11 @@ AND ...@@ -16561,10 +16561,11 @@ AND
rows. Two rows might have a different binary representation even rows. Two rows might have a different binary representation even
though comparisons of the two rows with the equality operator is true. though comparisons of the two rows with the equality operator is true.
The ordering of rows under these comparison operators is deterministic The ordering of rows under these comparison operators is deterministic
but not otherwise meaningful. These operators are used internally for but not otherwise meaningful. These operators are used internally
materialized views and might be useful for other specialized purposes for materialized views and might be useful for other specialized
such as replication but are not intended to be generally useful for purposes such as replication and B-Tree deduplication (see <xref
writing queries. linkend="btree-deduplication"/>). They are not intended to be
generally useful for writing queries, though.
</para> </para>
</sect2> </sect2>
</sect1> </sect1>
......
...@@ -171,6 +171,8 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class= ...@@ -171,6 +171,8 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
maximum size allowed for the index type, data insertion will fail. maximum size allowed for the index type, data insertion will fail.
In any case, non-key columns duplicate data from the index's table In any case, non-key columns duplicate data from the index's table
and bloat the size of the index, thus potentially slowing searches. and bloat the size of the index, thus potentially slowing searches.
Furthermore, B-tree deduplication is never used with indexes
that have a non-key column.
</para> </para>
<para> <para>
...@@ -393,10 +395,39 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class= ...@@ -393,10 +395,39 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
</variablelist> </variablelist>
<para> <para>
B-tree indexes additionally accept this parameter: B-tree indexes also accept these parameters:
</para> </para>
<variablelist> <variablelist>
<varlistentry id="index-reloption-deduplication" xreflabel="deduplicate_items">
<term><literal>deduplicate_items</literal>
<indexterm>
<primary><varname>deduplicate_items</varname></primary>
<secondary>storage parameter</secondary>
</indexterm>
</term>
<listitem>
<para>
Controls usage of the B-tree deduplication technique described
in <xref linkend="btree-deduplication"/>. Set to
<literal>ON</literal> or <literal>OFF</literal> to enable or
disable the optimization. (Alternative spellings of
<literal>ON</literal> and <literal>OFF</literal> are allowed as
described in <xref linkend="config-setting"/>.) The default is
<literal>ON</literal>.
</para>
<note>
<para>
Turning <literal>deduplicate_items</literal> off via
<command>ALTER INDEX</command> prevents future insertions from
triggering deduplication, but does not in itself make existing
posting list tuples use the standard tuple representation.
</para>
</note>
</listitem>
</varlistentry>
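
A short sketch of the note above (index name hypothetical): turning the
parameter off only affects future insertions, so a REINDEX is needed if the
goal is to also remove the posting list tuples that already exist:

    ALTER INDEX referencing_fk_idx SET (deduplicate_items = off);
    REINDEX INDEX referencing_fk_idx;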
<varlistentry id="index-reloption-vacuum-cleanup-index-scale-factor" xreflabel="vacuum_cleanup_index_scale_factor"> <varlistentry id="index-reloption-vacuum-cleanup-index-scale-factor" xreflabel="vacuum_cleanup_index_scale_factor">
<term><literal>vacuum_cleanup_index_scale_factor</literal> <term><literal>vacuum_cleanup_index_scale_factor</literal>
<indexterm> <indexterm>
...@@ -451,9 +482,7 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class= ...@@ -451,9 +482,7 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
This setting controls usage of the fast update technique described in This setting controls usage of the fast update technique described in
<xref linkend="gin-fast-update"/>. It is a Boolean parameter: <xref linkend="gin-fast-update"/>. It is a Boolean parameter:
<literal>ON</literal> enables fast update, <literal>OFF</literal> disables it. <literal>ON</literal> enables fast update, <literal>OFF</literal> disables it.
(Alternative spellings of <literal>ON</literal> and <literal>OFF</literal> are The default is <literal>ON</literal>.
allowed as described in <xref linkend="config-setting"/>.) The
default is <literal>ON</literal>.
</para> </para>
<note> <note>
...@@ -805,6 +834,13 @@ CREATE UNIQUE INDEX title_idx ON films (title) INCLUDE (director, rating); ...@@ -805,6 +834,13 @@ CREATE UNIQUE INDEX title_idx ON films (title) INCLUDE (director, rating);
</programlisting> </programlisting>
</para> </para>
<para>
To create a B-Tree index with deduplication disabled:
<programlisting>
CREATE INDEX title_idx ON films (title) WITH (deduplicate_items = off);
</programlisting>
</para>
<para> <para>
To create an index on the expression <literal>lower(title)</literal>, To create an index on the expression <literal>lower(title)</literal>,
allowing efficient case-insensitive searches: allowing efficient case-insensitive searches:
......
...@@ -158,6 +158,16 @@ static relopt_bool boolRelOpts[] = ...@@ -158,6 +158,16 @@ static relopt_bool boolRelOpts[] =
}, },
true true
}, },
{
{
"deduplicate_items",
"Enables \"deduplicate items\" feature for this btree index",
RELOPT_KIND_BTREE,
ShareUpdateExclusiveLock /* since it applies only to later
* inserts */
},
true
},
/* list terminator */ /* list terminator */
{{NULL}} {{NULL}}
}; };
......
...@@ -276,6 +276,10 @@ BuildIndexValueDescription(Relation indexRelation, ...@@ -276,6 +276,10 @@ BuildIndexValueDescription(Relation indexRelation,
/* /*
* Get the latestRemovedXid from the table entries pointed at by the index * Get the latestRemovedXid from the table entries pointed at by the index
* tuples being deleted. * tuples being deleted.
*
* Note: index access methods that don't consistently use the standard
* IndexTuple + heap TID item pointer representation will need to provide
* their own version of this function.
*/ */
TransactionId TransactionId
index_compute_xid_horizon_for_tuples(Relation irel, index_compute_xid_horizon_for_tuples(Relation irel,
......
...@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global ...@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \ OBJS = \
nbtcompare.o \ nbtcompare.o \
nbtdedup.o \
nbtinsert.o \ nbtinsert.o \
nbtpage.o \ nbtpage.o \
nbtree.o \ nbtree.o \
......
...@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly ...@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
like a hint bit for a heap tuple), but physically removing tuples requires like a hint bit for a heap tuple), but physically removing tuples requires
exclusive lock. In the current code we try to remove LP_DEAD tuples when exclusive lock. In the current code we try to remove LP_DEAD tuples when
we are otherwise faced with having to split a page to do an insertion (and we are otherwise faced with having to split a page to do an insertion (and
hence have exclusive lock on it already). hence have exclusive lock on it already). Deduplication can also prevent
a page split, but removing LP_DEAD tuples is the preferred approach.
(Note that posting list tuples can only have their LP_DEAD bit set when
every table TID within the posting list is known dead.)
This leaves the index in a state where it has no entry for a dead tuple This leaves the index in a state where it has no entry for a dead tuple
that still exists in the heap. This is not a problem for the current that still exists in the heap. This is not a problem for the current
...@@ -726,6 +729,134 @@ if it must. When a page that's already full of duplicates must be split, ...@@ -726,6 +729,134 @@ if it must. When a page that's already full of duplicates must be split,
the fallback strategy assumes that duplicates are mostly inserted in the fallback strategy assumes that duplicates are mostly inserted in
ascending heap TID order. The page is split in a way that leaves the left ascending heap TID order. The page is split in a way that leaves the left
half of the page mostly full, and the right half of the page mostly empty. half of the page mostly full, and the right half of the page mostly empty.
The overall effect is that leaf page splits gracefully adapt to inserts of
large groups of duplicates, maximizing space utilization. Note also that
"trapping" large groups of duplicates on the same leaf page like this makes
deduplication more efficient. Deduplication can be performed infrequently,
without merging together existing posting list tuples too often.
Notes about deduplication
-------------------------
We deduplicate non-pivot tuples in non-unique indexes to reduce storage
overhead, and to avoid (or at least delay) page splits. Note that the
goals for deduplication in unique indexes are rather different; see later
section for details. Deduplication alters the physical representation of
tuples without changing the logical contents of the index, and without
adding overhead to read queries. Non-pivot tuples are merged together
into a single physical tuple with a posting list (a simple array of heap
TIDs with the standard item pointer format). Deduplication is always
applied lazily, at the point where it would otherwise be necessary to
perform a page split. It occurs only when LP_DEAD items have been
removed, as our last line of defense against splitting a leaf page. We
can set the LP_DEAD bit with posting list tuples, though only when all
TIDs are known dead.
Our lazy approach to deduplication allows the page space accounting used
during page splits to have absolutely minimal special case logic for
posting lists. Posting lists can be thought of as extra payload that
suffix truncation will reliably truncate away as needed during page
splits, just like non-key columns from an INCLUDE index tuple.
Incoming/new tuples can generally be treated as non-overlapping plain
items (though see section on posting list splits for information about how
overlapping new/incoming items are really handled).
The representation of posting lists is almost identical to the posting
lists used by GIN, so it would be straightforward to apply GIN's varbyte
encoding compression scheme to individual posting lists. Posting list
compression would break the assumptions made by posting list splits about
page space accounting (see later section), so it's not clear how
compression could be integrated with nbtree. Besides, posting list
compression does not offer a compelling trade-off for nbtree, since in
general nbtree is optimized for consistent performance with many
concurrent readers and writers.
A major goal of our lazy approach to deduplication is to limit the
performance impact of deduplication with random updates. Even concurrent
append-only inserts of the same key value will tend to have inserts of
individual index tuples in an order that doesn't quite match heap TID
order. Delaying deduplication minimizes page level fragmentation.
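
The page-level effect of a lazy deduplication pass can be observed from SQL
with pageinspect's bt_page_stats(): when a pass runs in place of a page
split, live_items drops and free_size jumps on the affected leaf page, while
the logical contents stay the same.  The index name and block number here
are hypothetical.

    CREATE EXTENSION IF NOT EXISTS pageinspect;
    SELECT blkno, live_items, dead_items, avg_item_size, free_size
    FROM bt_page_stats('referencing_fk_idx', 1);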
Deduplication in unique indexes
-------------------------------
Very often, the range of values that can be placed on a given leaf page in
a unique index is fixed and permanent. For example, a primary key on an
identity column will usually only have page splits caused by the insertion
of new logical rows within the rightmost leaf page. If there is a split
of a non-rightmost leaf page, then the split must have been triggered by
inserts associated with an UPDATE of an existing logical row. Splitting a
leaf page purely to store multiple versions should be considered
pathological, since it permanently degrades the index structure in order
to absorb a temporary burst of duplicates. Deduplication in unique
indexes helps to prevent these pathological page splits. Storing
duplicates in a space efficient manner is not the goal, since in the long
run there won't be any duplicates anyway. Rather, we're buying time for
standard garbage collection mechanisms to run before a page split is
needed.
Unique index leaf pages only get a deduplication pass when an insertion
(that might have to split the page) observed an existing duplicate on the
page in passing. This is based on the assumption that deduplication will
only work out when _all_ new insertions are duplicates from UPDATEs. This
may mean that we miss an opportunity to delay a page split, but that's
okay because our ultimate goal is to delay leaf page splits _indefinitely_
(i.e. to prevent them altogether). There is little point in trying to
delay a split that is probably inevitable anyway. This allows us to avoid
the overhead of attempting to deduplicate with unique indexes that always
have few or no duplicates.
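
A rough way to watch this kind of version bloat from SQL (names
hypothetical) is to force non-HOT updates against a table whose primary key
values never change and then look at the primary key index with
pgstatindex() from the pgstattuple extension; without deduplication or
timely cleanup, leaf_pages grows and avg_leaf_density falls even though the
set of live keys is constant.

    CREATE EXTENSION IF NOT EXISTS pgstattuple;
    CREATE TABLE accounts_tab (id integer PRIMARY KEY, flag integer);
    CREATE INDEX accounts_flag_idx ON accounts_tab (flag);  -- defeats HOT
    INSERT INTO accounts_tab SELECT i, 0 FROM generate_series(1, 100000) AS s(i);
    UPDATE accounts_tab SET flag = flag + 1;  -- new entries in accounts_tab_pkey, same keys
    SELECT leaf_pages, avg_leaf_density FROM pgstatindex('accounts_tab_pkey');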
Posting list splits
-------------------
When the incoming tuple happens to overlap with an existing posting list,
a posting list split is performed. Like a page split, a posting list
split resolves a situation where a new/incoming item "won't fit", while
inserting the incoming item in passing (i.e. as part of the same atomic
action). It's possible (though not particularly likely) that an insert of
a new item on to an almost-full page will overlap with a posting list,
resulting in both a posting list split and a page split. Even then, the
atomic action that splits the posting list also inserts the new item
(since page splits always insert the new item in passing). Including the
posting list split in the same atomic action as the insert avoids problems
caused by concurrent inserts into the same posting list -- the exact
details of how we change the posting list depend upon the new item, and
vice-versa. A single atomic action also minimizes the volume of extra
WAL required for a posting list split, since we don't have to explicitly
WAL-log the original posting list tuple.
Despite piggy-backing on the same atomic action that inserts a new tuple,
posting list splits can be thought of as a separate, extra action to the
insert itself (or to the page split itself). Posting list splits
conceptually "rewrite" an insert that overlaps with an existing posting
list into an insert that adds its final new item just to the right of the
posting list instead. The size of the posting list won't change, and so
page space accounting code does not need to care about posting list splits
at all. This is an important upside of our design; the page split point
choice logic is very subtle even without it needing to deal with posting
list splits.
Only a few isolated extra steps are required to preserve the illusion that
the new item never overlapped with an existing posting list in the first
place: the heap TID of the incoming tuple is swapped with the rightmost/max
heap TID from the existing/originally overlapping posting list. Also, the
posting-split-with-page-split case must generate a new high key based on
an imaginary version of the original page that has both the final new item
and the after-list-split posting tuple (page splits usually just operate
against an imaginary version that contains the new item/item that won't
fit).
This approach avoids inventing an "eager" atomic posting split operation
that splits the posting list without simultaneously finishing the insert
of the incoming item. This alternative design might seem cleaner, but it
creates subtle problems for page space accounting. In general, there
might not be enough free space on the page to split a posting list such
that the incoming/new item no longer overlaps with either posting list
half --- the operation could fail before the actual retail insert of the
new item even begins. We'd end up having to handle posting list splits
that need a page split anyway. Besides, supporting variable "split points"
while splitting posting lists won't actually improve overall space
utilization.
Notes About Data Representation Notes About Data Representation
------------------------------- -------------------------------
......
...@@ -95,6 +95,10 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats, ...@@ -95,6 +95,10 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BTCycleId cycleid, TransactionId *oldestBtpoXact); BTCycleId cycleid, TransactionId *oldestBtpoXact);
static void btvacuumpage(BTVacState *vstate, BlockNumber blkno, static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
BlockNumber orig_blkno); BlockNumber orig_blkno);
static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
IndexTuple posting,
OffsetNumber updatedoffset,
int *nremaining);
/* /*
...@@ -161,7 +165,7 @@ btbuildempty(Relation index) ...@@ -161,7 +165,7 @@ btbuildempty(Relation index)
/* Construct metapage. */ /* Construct metapage. */
metapage = (Page) palloc(BLCKSZ); metapage = (Page) palloc(BLCKSZ);
_bt_initmetapage(metapage, P_NONE, 0); _bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
/* /*
* Write the page and log it. It might seem that an immediate sync would * Write the page and log it. It might seem that an immediate sync would
...@@ -264,8 +268,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir) ...@@ -264,8 +268,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
*/ */
if (so->killedItems == NULL) if (so->killedItems == NULL)
so->killedItems = (int *) so->killedItems = (int *)
palloc(MaxIndexTuplesPerPage * sizeof(int)); palloc(MaxTIDsPerBTreePage * sizeof(int));
if (so->numKilled < MaxIndexTuplesPerPage) if (so->numKilled < MaxTIDsPerBTreePage)
so->killedItems[so->numKilled++] = so->currPos.itemIndex; so->killedItems[so->numKilled++] = so->currPos.itemIndex;
} }
...@@ -1154,11 +1158,15 @@ restart: ...@@ -1154,11 +1158,15 @@ restart:
} }
else if (P_ISLEAF(opaque)) else if (P_ISLEAF(opaque))
{ {
OffsetNumber deletable[MaxOffsetNumber]; OffsetNumber deletable[MaxIndexTuplesPerPage];
int ndeletable; int ndeletable;
BTVacuumPosting updatable[MaxIndexTuplesPerPage];
int nupdatable;
OffsetNumber offnum, OffsetNumber offnum,
minoff, minoff,
maxoff; maxoff;
int nhtidsdead,
nhtidslive;
/* /*
* Trade in the initial read lock for a super-exclusive write lock on * Trade in the initial read lock for a super-exclusive write lock on
...@@ -1190,8 +1198,11 @@ restart: ...@@ -1190,8 +1198,11 @@ restart:
* point using callback. * point using callback.
*/ */
ndeletable = 0; ndeletable = 0;
nupdatable = 0;
minoff = P_FIRSTDATAKEY(opaque); minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page); maxoff = PageGetMaxOffsetNumber(page);
nhtidsdead = 0;
nhtidslive = 0;
if (callback) if (callback)
{ {
for (offnum = minoff; for (offnum = minoff;
...@@ -1199,11 +1210,9 @@ restart: ...@@ -1199,11 +1210,9 @@ restart:
offnum = OffsetNumberNext(offnum)) offnum = OffsetNumberNext(offnum))
{ {
IndexTuple itup; IndexTuple itup;
ItemPointer htup;
itup = (IndexTuple) PageGetItem(page, itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum)); PageGetItemId(page, offnum));
htup = &(itup->t_tid);
/* /*
* Hot Standby assumes that it's okay that XLOG_BTREE_VACUUM * Hot Standby assumes that it's okay that XLOG_BTREE_VACUUM
...@@ -1226,22 +1235,82 @@ restart: ...@@ -1226,22 +1235,82 @@ restart:
* simple, and allows us to always avoid generating our own * simple, and allows us to always avoid generating our own
* conflicts. * conflicts.
*/ */
if (callback(htup, callback_state)) Assert(!BTreeTupleIsPivot(itup));
deletable[ndeletable++] = offnum; if (!BTreeTupleIsPosting(itup))
{
/* Regular tuple, standard table TID representation */
if (callback(&itup->t_tid, callback_state))
{
deletable[ndeletable++] = offnum;
nhtidsdead++;
}
else
nhtidslive++;
}
else
{
BTVacuumPosting vacposting;
int nremaining;
/* Posting list tuple */
vacposting = btreevacuumposting(vstate, itup, offnum,
&nremaining);
if (vacposting == NULL)
{
/*
* All table TIDs from the posting tuple remain, so no
* delete or update required
*/
Assert(nremaining == BTreeTupleGetNPosting(itup));
}
else if (nremaining > 0)
{
/*
* Store metadata about posting list tuple in
* updatable array for entire page. Existing tuple
* will be updated during the later call to
* _bt_delitems_vacuum().
*/
Assert(nremaining < BTreeTupleGetNPosting(itup));
updatable[nupdatable++] = vacposting;
nhtidsdead += BTreeTupleGetNPosting(itup) - nremaining;
}
else
{
/*
* All table TIDs from the posting list must be
* deleted. We'll delete the index tuple completely
* (no update required).
*/
Assert(nremaining == 0);
deletable[ndeletable++] = offnum;
nhtidsdead += BTreeTupleGetNPosting(itup);
pfree(vacposting);
}
nhtidslive += nremaining;
}
} }
} }
/* /*
* Apply any needed deletes. We issue just one _bt_delitems_vacuum() * Apply any needed deletes or updates. We issue just one
* call per page, so as to minimize WAL traffic. * _bt_delitems_vacuum() call per page, so as to minimize WAL traffic.
*/ */
if (ndeletable > 0) if (ndeletable > 0 || nupdatable > 0)
{ {
_bt_delitems_vacuum(rel, buf, deletable, ndeletable); Assert(nhtidsdead >= Max(ndeletable, 1));
_bt_delitems_vacuum(rel, buf, deletable, ndeletable, updatable,
nupdatable);
stats->tuples_removed += ndeletable; stats->tuples_removed += nhtidsdead;
/* must recompute maxoff */ /* must recompute maxoff */
maxoff = PageGetMaxOffsetNumber(page); maxoff = PageGetMaxOffsetNumber(page);
/* can't leak memory here */
for (int i = 0; i < nupdatable; i++)
pfree(updatable[i]);
} }
else else
{ {
...@@ -1254,6 +1323,7 @@ restart: ...@@ -1254,6 +1323,7 @@ restart:
* We treat this like a hint-bit update because there's no need to * We treat this like a hint-bit update because there's no need to
* WAL-log it. * WAL-log it.
*/ */
Assert(nhtidsdead == 0);
if (vstate->cycleid != 0 && if (vstate->cycleid != 0 &&
opaque->btpo_cycleid == vstate->cycleid) opaque->btpo_cycleid == vstate->cycleid)
{ {
...@@ -1263,15 +1333,18 @@ restart: ...@@ -1263,15 +1333,18 @@ restart:
} }
/* /*
* If it's now empty, try to delete; else count the live tuples. We * If it's now empty, try to delete; else count the live tuples (live
* don't delete when recursing, though, to avoid putting entries into * table TIDs in posting lists are counted as separate live tuples).
* freePages out-of-order (doesn't seem worth any extra code to handle * We don't delete when recursing, though, to avoid putting entries
* the case). * into freePages out-of-order (doesn't seem worth any extra code to
* handle the case).
*/ */
if (minoff > maxoff) if (minoff > maxoff)
delete_now = (blkno == orig_blkno); delete_now = (blkno == orig_blkno);
else else
stats->num_index_tuples += maxoff - minoff + 1; stats->num_index_tuples += nhtidslive;
Assert(!delete_now || nhtidslive == 0);
} }
if (delete_now) if (delete_now)
...@@ -1303,9 +1376,10 @@ restart: ...@@ -1303,9 +1376,10 @@ restart:
/* /*
* This is really tail recursion, but if the compiler is too stupid to * This is really tail recursion, but if the compiler is too stupid to
* optimize it as such, we'd eat an uncomfortably large amount of stack * optimize it as such, we'd eat an uncomfortably large amount of stack
* space per recursion level (due to the deletable[] array). A failure is * space per recursion level (due to the arrays used to track details of
* improbable since the number of levels isn't likely to be large ... but * deletable/updatable items). A failure is improbable since the number
* just in case, let's hand-optimize into a loop. * of levels isn't likely to be large ... but just in case, let's
* hand-optimize into a loop.
*/ */
if (recurse_to != P_NONE) if (recurse_to != P_NONE)
{ {
...@@ -1314,6 +1388,61 @@ restart: ...@@ -1314,6 +1388,61 @@ restart:
} }
} }
/*
* btreevacuumposting --- determine TIDs still needed in posting list
*
* Returns metadata describing how to build replacement tuple without the TIDs
* that VACUUM needs to delete. Returned value is NULL in the common case
* where no changes are needed to caller's posting list tuple (we avoid
* allocating memory here as an optimization).
*
* The number of TIDs that should remain in the posting list tuple is set for
* caller in *nremaining.
*/
static BTVacuumPosting
btreevacuumposting(BTVacState *vstate, IndexTuple posting,
OffsetNumber updatedoffset, int *nremaining)
{
int live = 0;
int nitem = BTreeTupleGetNPosting(posting);
ItemPointer items = BTreeTupleGetPosting(posting);
BTVacuumPosting vacposting = NULL;
for (int i = 0; i < nitem; i++)
{
if (!vstate->callback(items + i, vstate->callback_state))
{
/* Live table TID */
live++;
}
else if (vacposting == NULL)
{
/*
* First dead table TID encountered.
*
* It's now clear that we need to delete one or more dead table
* TIDs, so start maintaining metadata describing how to update
* existing posting list tuple.
*/
vacposting = palloc(offsetof(BTVacuumPostingData, deletetids) +
nitem * sizeof(uint16));
vacposting->itup = posting;
vacposting->updatedoffset = updatedoffset;
vacposting->ndeletedtids = 0;
vacposting->deletetids[vacposting->ndeletedtids++] = i;
}
else
{
/* Second or subsequent dead table TID */
vacposting->deletetids[vacposting->ndeletedtids++] = i;
}
}
*nremaining = live;
return vacposting;
}
/* /*
* btcanreturn() -- Check whether btree indexes support index-only scans. * btcanreturn() -- Check whether btree indexes support index-only scans.
* *
......
...@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel, ...@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
state.minfirstrightsz = SIZE_MAX; state.minfirstrightsz = SIZE_MAX;
state.newitemoff = newitemoff; state.newitemoff = newitemoff;
/* newitem cannot be a posting list item */
Assert(!BTreeTupleIsPosting(newitem));
/* /*
* maxsplits should never exceed maxoff because there will be at most as * maxsplits should never exceed maxoff because there will be at most as
* many candidate split points as there are points _between_ tuples, once * many candidate split points as there are points _between_ tuples, once
...@@ -459,6 +462,7 @@ _bt_recsplitloc(FindSplitData *state, ...@@ -459,6 +462,7 @@ _bt_recsplitloc(FindSplitData *state,
int16 leftfree, int16 leftfree,
rightfree; rightfree;
Size firstrightitemsz; Size firstrightitemsz;
Size postingsz = 0;
bool newitemisfirstonright; bool newitemisfirstonright;
/* Is the new item going to be the first item on the right page? */ /* Is the new item going to be the first item on the right page? */
...@@ -468,8 +472,30 @@ _bt_recsplitloc(FindSplitData *state, ...@@ -468,8 +472,30 @@ _bt_recsplitloc(FindSplitData *state,
if (newitemisfirstonright) if (newitemisfirstonright)
firstrightitemsz = state->newitemsz; firstrightitemsz = state->newitemsz;
else else
{
firstrightitemsz = firstoldonrightsz; firstrightitemsz = firstoldonrightsz;
/*
* Calculate suffix truncation space saving when firstright is a
* posting list tuple, though only when the firstright is over 64
* bytes including line pointer overhead (arbitrary). This avoids
* accessing the tuple in cases where its posting list must be very
* small (if firstright has one at all).
*/
if (state->is_leaf && firstrightitemsz > 64)
{
ItemId itemid;
IndexTuple newhighkey;
itemid = PageGetItemId(state->page, firstoldonright);
newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
if (BTreeTupleIsPosting(newhighkey))
postingsz = IndexTupleSize(newhighkey) -
BTreeTupleGetPostingOffset(newhighkey);
}
}
/* Account for all the old tuples */ /* Account for all the old tuples */
leftfree = state->leftspace - olddataitemstoleft; leftfree = state->leftspace - olddataitemstoleft;
rightfree = state->rightspace - rightfree = state->rightspace -
...@@ -491,11 +517,17 @@ _bt_recsplitloc(FindSplitData *state, ...@@ -491,11 +517,17 @@ _bt_recsplitloc(FindSplitData *state,
* If we are on the leaf level, assume that suffix truncation cannot avoid * If we are on the leaf level, assume that suffix truncation cannot avoid
* adding a heap TID to the left half's new high key when splitting at the * adding a heap TID to the left half's new high key when splitting at the
* leaf level. In practice the new high key will often be smaller and * leaf level. In practice the new high key will often be smaller and
* will rarely be larger, but conservatively assume the worst case. * will rarely be larger, but conservatively assume the worst case. We do
* go to the trouble of subtracting away posting list overhead, though
* only when it looks like it will make an appreciable difference.
* (Posting lists are the only case where truncation will typically make
* the final high key far smaller than firstright, so being a bit more
* precise there noticeably improves the balance of free space.)
*/ */
if (state->is_leaf) if (state->is_leaf)
leftfree -= (int16) (firstrightitemsz + leftfree -= (int16) (firstrightitemsz +
MAXALIGN(sizeof(ItemPointerData))); MAXALIGN(sizeof(ItemPointerData)) -
postingsz);
else else
leftfree -= (int16) firstrightitemsz; leftfree -= (int16) firstrightitemsz;
...@@ -691,7 +723,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff, ...@@ -691,7 +723,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff)); itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
tup = (IndexTuple) PageGetItem(state->page, itemid); tup = (IndexTuple) PageGetItem(state->page, itemid);
/* Do cheaper test first */ /* Do cheaper test first */
if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid)) if (BTreeTupleIsPosting(tup) ||
!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
return false; return false;
/* Check same conditions as rightmost item case, too */ /* Check same conditions as rightmost item case, too */
keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem); keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
......
...@@ -27,6 +27,7 @@ btree_desc(StringInfo buf, XLogReaderState *record) ...@@ -27,6 +27,7 @@ btree_desc(StringInfo buf, XLogReaderState *record)
case XLOG_BTREE_INSERT_LEAF: case XLOG_BTREE_INSERT_LEAF:
case XLOG_BTREE_INSERT_UPPER: case XLOG_BTREE_INSERT_UPPER:
case XLOG_BTREE_INSERT_META: case XLOG_BTREE_INSERT_META:
case XLOG_BTREE_INSERT_POST:
{ {
xl_btree_insert *xlrec = (xl_btree_insert *) rec; xl_btree_insert *xlrec = (xl_btree_insert *) rec;
...@@ -38,15 +39,24 @@ btree_desc(StringInfo buf, XLogReaderState *record) ...@@ -38,15 +39,24 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{ {
xl_btree_split *xlrec = (xl_btree_split *) rec; xl_btree_split *xlrec = (xl_btree_split *) rec;
appendStringInfo(buf, "level %u, firstright %d, newitemoff %d", appendStringInfo(buf, "level %u, firstright %d, newitemoff %d, postingoff %d",
xlrec->level, xlrec->firstright, xlrec->newitemoff); xlrec->level, xlrec->firstright,
xlrec->newitemoff, xlrec->postingoff);
break;
}
case XLOG_BTREE_DEDUP:
{
xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
appendStringInfo(buf, "nintervals %u", xlrec->nintervals);
break; break;
} }
case XLOG_BTREE_VACUUM: case XLOG_BTREE_VACUUM:
{ {
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec; xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
appendStringInfo(buf, "ndeleted %u", xlrec->ndeleted); appendStringInfo(buf, "ndeleted %u; nupdated %u",
xlrec->ndeleted, xlrec->nupdated);
break; break;
} }
case XLOG_BTREE_DELETE: case XLOG_BTREE_DELETE:
...@@ -130,6 +140,12 @@ btree_identify(uint8 info) ...@@ -130,6 +140,12 @@ btree_identify(uint8 info)
case XLOG_BTREE_SPLIT_R: case XLOG_BTREE_SPLIT_R:
id = "SPLIT_R"; id = "SPLIT_R";
break; break;
case XLOG_BTREE_INSERT_POST:
id = "INSERT_POST";
break;
case XLOG_BTREE_DEDUP:
id = "DEDUP";
break;
case XLOG_BTREE_VACUUM: case XLOG_BTREE_VACUUM:
id = "VACUUM"; id = "VACUUM";
break; break;
......
...@@ -1048,8 +1048,10 @@ PageIndexTupleDeleteNoCompact(Page page, OffsetNumber offnum) ...@@ -1048,8 +1048,10 @@ PageIndexTupleDeleteNoCompact(Page page, OffsetNumber offnum)
* This is better than deleting and reinserting the tuple, because it * This is better than deleting and reinserting the tuple, because it
* avoids any data shifting when the tuple size doesn't change; and * avoids any data shifting when the tuple size doesn't change; and
* even when it does, we avoid moving the line pointers around. * even when it does, we avoid moving the line pointers around.
* Conceivably this could also be of use to an index AM that cares about * This could be used by an index AM that doesn't want to unset the
* the physical order of tuples as well as their ItemId order. * LP_DEAD bit when it happens to be set. It could conceivably also be
* used by an index AM that cares about the physical order of tuples as
* well as their logical/ItemId order.
* *
* If there's insufficient space for the new tuple, return false. Other * If there's insufficient space for the new tuple, return false. Other
* errors represent data-corruption problems, so we just elog. * errors represent data-corruption problems, so we just elog.
...@@ -1134,8 +1136,9 @@ PageIndexTupleOverwrite(Page page, OffsetNumber offnum, ...@@ -1134,8 +1136,9 @@ PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
} }
} }
/* Update the item's tuple length (other fields shouldn't change) */ /* Update the item's tuple length without changing its lp_flags field */
ItemIdSetNormal(tupid, offset + size_diff, newsize); tupid->lp_off = offset + size_diff;
tupid->lp_len = newsize;
/* Copy new tuple data onto page */ /* Copy new tuple data onto page */
memcpy(PageGetItem(page, tupid), newtup, newsize); memcpy(PageGetItem(page, tupid), newtup, newsize);
......
...@@ -1731,14 +1731,14 @@ psql_completion(const char *text, int start, int end) ...@@ -1731,14 +1731,14 @@ psql_completion(const char *text, int start, int end)
/* ALTER INDEX <foo> SET|RESET ( */ /* ALTER INDEX <foo> SET|RESET ( */
else if (Matches("ALTER", "INDEX", MatchAny, "RESET", "(")) else if (Matches("ALTER", "INDEX", MatchAny, "RESET", "("))
COMPLETE_WITH("fillfactor", COMPLETE_WITH("fillfactor",
"vacuum_cleanup_index_scale_factor", /* BTREE */ "vacuum_cleanup_index_scale_factor", "deduplicate_items", /* BTREE */
"fastupdate", "gin_pending_list_limit", /* GIN */ "fastupdate", "gin_pending_list_limit", /* GIN */
"buffering", /* GiST */ "buffering", /* GiST */
"pages_per_range", "autosummarize" /* BRIN */ "pages_per_range", "autosummarize" /* BRIN */
); );
else if (Matches("ALTER", "INDEX", MatchAny, "SET", "(")) else if (Matches("ALTER", "INDEX", MatchAny, "SET", "("))
COMPLETE_WITH("fillfactor =", COMPLETE_WITH("fillfactor =",
"vacuum_cleanup_index_scale_factor =", /* BTREE */ "vacuum_cleanup_index_scale_factor =", "deduplicate_items =", /* BTREE */
"fastupdate =", "gin_pending_list_limit =", /* GIN */ "fastupdate =", "gin_pending_list_limit =", /* GIN */
"buffering =", /* GiST */ "buffering =", /* GiST */
"pages_per_range =", "autosummarize =" /* BRIN */ "pages_per_range =", "autosummarize =" /* BRIN */
......
...@@ -28,7 +28,8 @@ ...@@ -28,7 +28,8 @@
#define XLOG_BTREE_INSERT_META 0x20 /* same, plus update metapage */ #define XLOG_BTREE_INSERT_META 0x20 /* same, plus update metapage */
#define XLOG_BTREE_SPLIT_L 0x30 /* add index tuple with split */ #define XLOG_BTREE_SPLIT_L 0x30 /* add index tuple with split */
#define XLOG_BTREE_SPLIT_R 0x40 /* as above, new item on right */ #define XLOG_BTREE_SPLIT_R 0x40 /* as above, new item on right */
/* 0x50 and 0x60 are unused */ #define XLOG_BTREE_INSERT_POST 0x50 /* add index tuple with posting split */
#define XLOG_BTREE_DEDUP 0x60 /* deduplicate tuples for a page */
#define XLOG_BTREE_DELETE 0x70 /* delete leaf index tuples for a page */ #define XLOG_BTREE_DELETE 0x70 /* delete leaf index tuples for a page */
#define XLOG_BTREE_UNLINK_PAGE 0x80 /* delete a half-dead page */ #define XLOG_BTREE_UNLINK_PAGE 0x80 /* delete a half-dead page */
#define XLOG_BTREE_UNLINK_PAGE_META 0x90 /* same, and update metapage */ #define XLOG_BTREE_UNLINK_PAGE_META 0x90 /* same, and update metapage */
...@@ -53,21 +54,34 @@ typedef struct xl_btree_metadata ...@@ -53,21 +54,34 @@ typedef struct xl_btree_metadata
uint32 fastlevel; uint32 fastlevel;
TransactionId oldest_btpo_xact; TransactionId oldest_btpo_xact;
float8 last_cleanup_num_heap_tuples; float8 last_cleanup_num_heap_tuples;
bool allequalimage;
} xl_btree_metadata; } xl_btree_metadata;
/* /*
* This is what we need to know about simple (without split) insert. * This is what we need to know about simple (without split) insert.
* *
* This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META. * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META, and
* Note that INSERT_META implies it's not a leaf page. * INSERT_POST. Note that INSERT_META and INSERT_UPPER implies it's not a
* leaf page, while INSERT_POST and INSERT_LEAF imply that it must be a leaf
* page.
* *
* Backup Blk 0: original page (data contains the inserted tuple) * Backup Blk 0: original page
* Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META * Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
* Backup Blk 2: xl_btree_metadata, if INSERT_META * Backup Blk 2: xl_btree_metadata, if INSERT_META
*
* Note: The new tuple is actually the "original" new item in the posting
* list split insert case (i.e. the INSERT_POST case). A split offset for
* the posting list is logged before the original new item. Recovery needs
* both, since it must do an in-place update of the existing posting list
* that was split as an extra step. Also, recovery generates a "final"
* newitem. See _bt_swap_posting() for details on posting list splits.
*/ */
typedef struct xl_btree_insert typedef struct xl_btree_insert
{ {
OffsetNumber offnum; OffsetNumber offnum;
/* POSTING SPLIT OFFSET FOLLOWS (INSERT_POST case) */
/* NEW TUPLE ALWAYS FOLLOWS AT THE END */
} xl_btree_insert; } xl_btree_insert;
#define SizeOfBtreeInsert (offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber)) #define SizeOfBtreeInsert (offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
...@@ -92,8 +106,37 @@ typedef struct xl_btree_insert ...@@ -92,8 +106,37 @@ typedef struct xl_btree_insert
* Backup Blk 0: original page / new left page * Backup Blk 0: original page / new left page
* *
* The left page's data portion contains the new item, if it's the _L variant. * The left page's data portion contains the new item, if it's the _L variant.
* An IndexTuple representing the high key of the left page must follow with * _R variant split records generally do not have a newitem (_R variant leaf
* either variant. * page split records that must deal with a posting list split will include an
* explicit newitem, though it is never used on the right page -- it is
* actually an orignewitem needed to update existing posting list). The new
* high key of the left/original page appears last of all (and must always be
* present).
*
* Page split records that need the REDO routine to deal with a posting list
* split directly will have an explicit newitem, which is actually an
* orignewitem (the newitem as it was before the posting list split, not
* after). A posting list split always has a newitem that comes immediately
* after the posting list being split (which would have overlapped with
* orignewitem prior to split). Usually REDO must deal with posting list
* splits with an _L variant page split record, and usually both the new
* posting list and the final newitem go on the left page (the existing
* posting list will be inserted instead of the old, and the final newitem
* will be inserted next to that). However, _R variant split records will
* include an orignewitem when the split point for the page happens to have a
* lastleft tuple that is also the posting list being split (leaving newitem
* as the page split's firstright tuple). The existence of this corner case
* does not change the basic fact about newitem/orignewitem for the REDO
* routine: it is always state used for the left page alone. (This is why the
* record's postingoff field isn't a reliable indicator of whether or not a
* posting list split occurred during the page split; a non-zero value merely
* indicates that the REDO routine must reconstruct a new posting list tuple
* that is needed for the left page.)
*
* This posting list split handling is equivalent to the xl_btree_insert REDO
* routine's INSERT_POST handling. While the details are more complicated
* here, the concept and goals are exactly the same. See _bt_swap_posting()
* for details on posting list splits.
* *
* Backup Blk 1: new right page * Backup Blk 1: new right page
* *
...@@ -111,15 +154,33 @@ typedef struct xl_btree_split ...@@ -111,15 +154,33 @@ typedef struct xl_btree_split
{ {
uint32 level; /* tree level of page being split */ uint32 level; /* tree level of page being split */
OffsetNumber firstright; /* first item moved to right page */ OffsetNumber firstright; /* first item moved to right page */
OffsetNumber newitemoff; /* new item's offset (useful for _L variant) */ OffsetNumber newitemoff; /* new item's offset */
uint16 postingoff; /* offset inside orig posting tuple */
} xl_btree_split; } xl_btree_split;
#define SizeOfBtreeSplit (offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber)) #define SizeOfBtreeSplit (offsetof(xl_btree_split, postingoff) + sizeof(uint16))
/*
* When page is deduplicated, consecutive groups of tuples with equal keys are
* merged together into posting list tuples.
*
* The WAL record represents a deduplication pass for a leaf page. An array
* of BTDedupInterval structs follows.
*/
typedef struct xl_btree_dedup
{
uint16 nintervals;
/* DEDUPLICATION INTERVALS FOLLOW */
} xl_btree_dedup;
#define SizeOfBtreeDedup (offsetof(xl_btree_dedup, nintervals) + sizeof(uint16))
/* /*
* This is what we need to know about delete of individual leaf index tuples. * This is what we need to know about delete of individual leaf index tuples.
* The WAL record can represent deletion of any number of index tuples on a * The WAL record can represent deletion of any number of index tuples on a
* single index page when *not* executed by VACUUM. * single index page when *not* executed by VACUUM. Deletion of a subset of
* the TIDs within a posting list tuple is not supported.
* *
* Backup Blk 0: index page * Backup Blk 0: index page
*/ */
...@@ -150,21 +211,43 @@ typedef struct xl_btree_reuse_page ...@@ -150,21 +211,43 @@ typedef struct xl_btree_reuse_page
#define SizeOfBtreeReusePage (sizeof(xl_btree_reuse_page)) #define SizeOfBtreeReusePage (sizeof(xl_btree_reuse_page))
/* /*
* This is what we need to know about vacuum of individual leaf index tuples. * This is what we need to know about which TIDs to remove from an individual
* The WAL record can represent deletion of any number of index tuples on a * posting list tuple during vacuuming. An array of these may appear at the
* single index page when executed by VACUUM. * end of xl_btree_vacuum records.
*/
typedef struct xl_btree_update
{
uint16 ndeletedtids;
/* POSTING LIST uint16 OFFSETS TO A DELETED TID FOLLOW */
} xl_btree_update;
#define SizeOfBtreeUpdate (offsetof(xl_btree_update, ndeletedtids) + sizeof(uint16))
/*
* This is what we need to know about a VACUUM of a leaf page. The WAL record
* can represent deletion of any number of index tuples on a single index page
* when executed by VACUUM. It can also support "updates" of index tuples,
* which is how deletes of a subset of TIDs contained in an existing posting
* list tuple are implemented. (Updates are only used when there will be some
* remaining TIDs once VACUUM finishes; otherwise the posting list tuple can
* just be deleted).
*
- * Note that the WAL record in any vacuum of an index must have at least one
- * item to delete.
+ * Updated posting list tuples are represented using xl_btree_update metadata.
+ * The REDO routine uses each xl_btree_update (plus its corresponding original
+ * index tuple from the target leaf page) to generate the final updated tuple.
*/
typedef struct xl_btree_vacuum
{
- uint32 ndeleted;
+ uint16 ndeleted;
+ uint16 nupdated;
/* DELETED TARGET OFFSET NUMBERS FOLLOW */
+ /* UPDATED TARGET OFFSET NUMBERS FOLLOW */
+ /* UPDATED TUPLES METADATA ARRAY FOLLOWS */
} xl_btree_vacuum;
- #define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
+ #define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, nupdated) + sizeof(uint16))
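For illustration only (not part of the patch), a sketch of stepping through the payload described above: ndeleted deleted offsets, then nupdated updated offsets, then one xl_btree_update header plus its uint16 TID offsets per updated posting list tuple. The pointer is assumed to reference those arrays (in the REDO routine they accompany the registered leaf page data); the function name is invented:

#include "postgres.h"

#include "access/nbtxlog.h"
#include "storage/off.h"

/* Hypothetical example: walk a vacuum record's variable-length payload */
static void
sketch_walk_vacuum_payload(xl_btree_vacuum *xlrec, char *ptr)
{
	/* ndeleted page offsets to delete outright come first */
	OffsetNumber *deletedoffsets = (OffsetNumber *) ptr;
	/* then nupdated page offsets of posting list tuples to shrink */
	OffsetNumber *updatedoffsets = deletedoffsets + xlrec->ndeleted;
	/* then the xl_btree_update metadata array, one entry per updated tuple */
	char	   *updates = (char *) (updatedoffsets + xlrec->nupdated);

	for (int i = 0; i < xlrec->nupdated; i++)
	{
		xl_btree_update *update = (xl_btree_update *) updates;

		elog(DEBUG2, "offset %u: drop %u TIDs from its posting list",
			 (unsigned) updatedoffsets[i], (unsigned) update->ndeletedtids);

		/* each header is followed by ndeletedtids uint16 posting list offsets */
		updates += SizeOfBtreeUpdate + update->ndeletedtids * sizeof(uint16);
	}
}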
/*
* This is what we need to know about marking an empty branch for deletion.
...@@ -245,6 +328,8 @@ typedef struct xl_btree_newroot
extern void btree_redo(XLogReaderState *record);
extern void btree_desc(StringInfo buf, XLogReaderState *record);
extern const char *btree_identify(uint8 info);
extern void btree_xlog_startup(void);
extern void btree_xlog_cleanup(void);
extern void btree_mask(char *pagedata, BlockNumber blkno);
#endif /* NBTXLOG_H */
...@@ -36,7 +36,7 @@ PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL,
PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask)
PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask)
- PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_mask)
+ PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask)
PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask)
PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask)
PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask)
...
...@@ -31,7 +31,7 @@
/*
* Each page of XLOG file has a header like this:
*/
- #define XLOG_PAGE_MAGIC 0xD104 /* can be used as WAL version indicator */
+ #define XLOG_PAGE_MAGIC 0xD105 /* can be used as WAL version indicator */
typedef struct XLogPageHeaderData
{
...
...@@ -200,7 +200,7 @@ reset enable_indexscan;
reset enable_bitmapscan;
-- Also check LIKE optimization with binary-compatible cases
create temp table btree_bpchar (f1 text collate "C");
- create index on btree_bpchar(f1 bpchar_ops);
+ create index on btree_bpchar(f1 bpchar_ops) WITH (deduplicate_items=on);
insert into btree_bpchar values ('foo'), ('fool'), ('bar'), ('quux');
-- doesn't match index:
explain (costs off)
...@@ -266,6 +266,24 @@ select * from btree_bpchar where f1::bpchar like 'foo%';
fool
(2 rows)
-- get test coverage for "single value" deduplication strategy:
insert into btree_bpchar select 'foo' from generate_series(1,1500);
--
-- Perform unique checking, with and without the use of deduplication
--
CREATE TABLE dedup_unique_test_table (a int) WITH (autovacuum_enabled=false);
CREATE UNIQUE INDEX dedup_unique ON dedup_unique_test_table (a) WITH (deduplicate_items=on);
CREATE UNIQUE INDEX plain_unique ON dedup_unique_test_table (a) WITH (deduplicate_items=off);
-- Generate enough garbage tuples in index to ensure that even the unique index
-- with deduplication enabled has to check multiple leaf pages during unique
-- checking (at least with a BLCKSZ of 8192 or less)
DO $$
BEGIN
FOR r IN 1..1350 LOOP
DELETE FROM dedup_unique_test_table;
INSERT INTO dedup_unique_test_table SELECT 1;
END LOOP;
END$$;
--
-- Test B-tree fast path (cache rightmost leaf page) optimization.
--
...
...@@ -86,7 +86,7 @@ reset enable_bitmapscan;
-- Also check LIKE optimization with binary-compatible cases
create temp table btree_bpchar (f1 text collate "C");
- create index on btree_bpchar(f1 bpchar_ops);
+ create index on btree_bpchar(f1 bpchar_ops) WITH (deduplicate_items=on);
insert into btree_bpchar values ('foo'), ('fool'), ('bar'), ('quux');
-- doesn't match index:
explain (costs off)
...@@ -103,6 +103,26 @@ explain (costs off)
select * from btree_bpchar where f1::bpchar like 'foo%';
select * from btree_bpchar where f1::bpchar like 'foo%';
-- get test coverage for "single value" deduplication strategy:
insert into btree_bpchar select 'foo' from generate_series(1,1500);
--
-- Perform unique checking, with and without the use of deduplication
--
CREATE TABLE dedup_unique_test_table (a int) WITH (autovacuum_enabled=false);
CREATE UNIQUE INDEX dedup_unique ON dedup_unique_test_table (a) WITH (deduplicate_items=on);
CREATE UNIQUE INDEX plain_unique ON dedup_unique_test_table (a) WITH (deduplicate_items=off);
-- Generate enough garbage tuples in index to ensure that even the unique index
-- with deduplication enabled has to check multiple leaf pages during unique
-- checking (at least with a BLCKSZ of 8192 or less)
DO $$
BEGIN
FOR r IN 1..1350 LOOP
DELETE FROM dedup_unique_test_table;
INSERT INTO dedup_unique_test_table SELECT 1;
END LOOP;
END$$;
--
-- Test B-tree fast path (cache rightmost leaf page) optimization.
--
...