Commit d168b666 authored by Peter Geoghegan

Enhance nbtree index tuple deletion.

Teach nbtree and heapam to cooperate in order to eagerly remove
duplicate tuples representing dead MVCC versions.  This is "bottom-up
deletion".  Each bottom-up deletion pass is triggered lazily in response
to a flood of versions on an nbtree leaf page.  This usually involves a
"logically unchanged index" hint (these are produced by the executor
mechanism added by commit 9dc718bd).

The immediate goal of bottom-up index deletion is to avoid "unnecessary"
page splits caused entirely by version duplicates.  It naturally has an
even more useful effect, though: it acts as a backstop against
accumulating an excessive number of index tuple versions for any given
_logical row_.  Bottom-up index deletion complements what we might now
call "top-down index deletion": index vacuuming performed by VACUUM.
Bottom-up index deletion responds to the immediate local needs of
queries, while leaving it up to autovacuum to perform infrequent clean
sweeps of the index.  The overall effect is to avoid certain
pathological performance issues related to "version churn" from UPDATEs.

The previous tableam interface used by index AMs to perform tuple
deletion (the table_compute_xid_horizon_for_tuples() function) has been
replaced with a new interface that supports certain new requirements.
Many (perhaps all) of the capabilities added to nbtree by this commit
could also be extended to other index AMs.  That is left as work for a
later commit.

Extend deletion of LP_DEAD-marked index tuples in nbtree by adding logic
to consider extra index tuples (that are not LP_DEAD-marked) for
deletion in passing.  This increases the number of index tuples deleted
significantly in many cases.  The LP_DEAD deletion process (which is now
called "simple deletion" to clearly distinguish it from bottom-up
deletion) won't usually need to visit any extra table blocks to check
these extra tuples.  We have to visit the same table blocks anyway to
generate a latestRemovedXid value (at least in the common case where the
index deletion operation's WAL record needs such a value).
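
The selection of "extra" tuples described above can be sketched roughly as
follows.  This is only an illustration under simplifying assumptions, not
the nbtree code: SketchIndexTuple, sketch_block_is_listed() and
sketch_pick_candidates() are hypothetical names, and a leaf page is modeled
as a flat array of tuples that each point to exactly one table block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an index tuple on one leaf page */
typedef struct SketchIndexTuple
{
	uint32_t	table_block;	/* table block that the tuple's TID points to */
	bool		lp_dead;		/* LP_DEAD bit already set by an index scan? */
	bool		candidate;		/* output: should the table AM check it? */
} SketchIndexTuple;

/* Returns true if blk is among the first nblocks entries of blocks[] */
static bool
sketch_block_is_listed(const uint32_t *blocks, size_t nblocks, uint32_t blk)
{
	for (size_t i = 0; i < nblocks; i++)
	{
		if (blocks[i] == blk)
			return true;
	}
	return false;
}

/*
 * Marks every tuple worth passing to the table AM, and returns how many were
 * marked.  scratch must have room for ntuples block numbers.
 */
static size_t
sketch_pick_candidates(SketchIndexTuple *tuples, size_t ntuples,
					   uint32_t *scratch)
{
	size_t		nblocks = 0;
	size_t		ncandidates = 0;

	assert(tuples != NULL && scratch != NULL);

	/* Pass 1: collect the distinct table blocks of LP_DEAD-marked tuples */
	for (size_t i = 0; i < ntuples; i++)
	{
		if (tuples[i].lp_dead &&
			!sketch_block_is_listed(scratch, nblocks, tuples[i].table_block))
			scratch[nblocks++] = tuples[i].table_block;
	}

	/* Pass 2: LP_DEAD tuples qualify, as do extra tuples on those blocks */
	for (size_t i = 0; i < ntuples; i++)
	{
		tuples[i].candidate = tuples[i].lp_dead ||
			sketch_block_is_listed(scratch, nblocks, tuples[i].table_block);
		if (tuples[i].candidate)
			ncandidates++;
	}

	return ncandidates;
}
```

A tuple that is not LP_DEAD-marked becomes a candidate only when its table
block is going to be visited anyway, which is why checking it adds no extra
table block accesses in this model.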

Testing has shown that the "extra tuples" simple deletion enhancement
increases the number of index tuples deleted with almost any workload
that has LP_DEAD bits set in leaf pages.  That is, it almost never fails
to delete at least a few extra index tuples.  It helps most of all in
cases that happen to naturally have a lot of delete-safe tuples.  It's
not uncommon for an individual deletion operation to end up deleting an
order of magnitude more index tuples compared to the old naive approach
(e.g., custom instrumentation of the patch shows that this happens
fairly often when the regression tests are run).

Add a further enhancement that augments simple deletion and bottom-up
deletion in indexes that make use of deduplication: Teach nbtree's
_bt_delitems_delete() function to support granular TID deletion in
posting list tuples.  It is now possible to delete individual TIDs from
posting list tuples provided the TIDs have a tableam block number of a
table block that gets visited as part of the deletion process (visiting
the table block can be triggered directly or indirectly).  Setting the
LP_DEAD bit of a posting list tuple is still an all-or-nothing thing,
but that matters much less now that deletion only needs to start out
with the right _general_ idea about which index tuples are deletable.
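
As a rough illustration of granular TID deletion, the sketch below compacts
a posting list in place once the table AM has decided which of its TIDs are
deletable.  It is a simplified model rather than the _bt_delitems_delete()
implementation: SketchTid and sketch_posting_delete() are hypothetical
names, and the posting list is assumed to be a plain array of TIDs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified table TID: block number plus offset number */
typedef struct SketchTid
{
	uint32_t	block;
	uint16_t	offset;
} SketchTid;

/*
 * Removes the TIDs flagged in deletable[] from a posting list, keeping the
 * survivors in their original (sorted) order.  Returns the number of TIDs
 * kept; a result of 0 means the whole posting list tuple can be removed.
 */
static size_t
sketch_posting_delete(SketchTid *tids, size_t ntids, const bool *deletable)
{
	size_t		nkept = 0;

	assert(tids != NULL && deletable != NULL);

	for (size_t i = 0; i < ntids; i++)
	{
		if (!deletable[i])
			tids[nkept++] = tids[i];
	}

	return nkept;
}
```

The LP_DEAD bit still covers the posting list tuple as a whole in this
model; the per-TID flags only exist for the duration of one deletion
operation.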

Bump XLOG_PAGE_MAGIC because xl_btree_delete changed.

No bump in BTREE_VERSION, since there are no changes to the on-disk
representation of nbtree indexes.  Indexes built on PostgreSQL 12 or
PostgreSQL 13 will automatically benefit from bottom-up index deletion
(i.e. no reindexing required) following a pg_upgrade.  The enhancement
to simple deletion is available with all B-Tree indexes following a
pg_upgrade, no matter what PostgreSQL version the user upgrades from.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-By: Victor Yegorov <vyegorov@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wzm+maE3apHB8NOtmM=p-DO65j2V5GzAWCOEEuy3JZgb2g@mail.gmail.com
parent 9dc718bd
@@ -629,6 +629,109 @@ options(<replaceable>relopts</replaceable> <type>local_relopts *</type>) returns
</para>
</sect2>
<sect2 id="btree-deletion">
<title>Bottom-up index deletion</title>
<para>
B-Tree indexes are not directly aware that under MVCC, there might
be multiple extant versions of the same logical table row; to an
index, each tuple is an independent object that needs its own index
entry. <quote>Version churn</quote> tuples may sometimes
accumulate and adversely affect query latency and throughput. This
typically occurs with <command>UPDATE</command>-heavy workloads
where most individual updates cannot apply the
<acronym>HOT</acronym> optimization. Changing the value of only
one column covered by one index during an <command>UPDATE</command>
<emphasis>always</emphasis> necessitates a new set of index tuples
&mdash; one for <emphasis>each and every</emphasis> index on the
table. Note in particular that this includes indexes that were not
<quote>logically modified</quote> by the <command>UPDATE</command>.
All indexes will need a successor physical index tuple that points
to the latest version in the table. Each new tuple within each
index will generally need to coexist with the original
<quote>updated</quote> tuple for a short period of time (typically
until shortly after the <command>UPDATE</command> transaction
commits).
</para>
<para>
B-Tree indexes incrementally delete version churn index tuples by
performing <firstterm>bottom-up index deletion</firstterm> passes.
Each deletion pass is triggered in reaction to an anticipated
<quote>version churn page split</quote>. This only happens with
indexes that are not logically modified by
<command>UPDATE</command> statements, where concentrated build up
of obsolete versions in particular pages would occur otherwise. A
page split will usually be avoided, though it's possible that
certain implementation-level heuristics will fail to identify and
delete even one garbage index tuple (in which case a page split or
deduplication pass resolves the issue of an incoming new tuple not
fitting on a leaf page). The worst case number of versions that
any index scan must traverse (for any single logical row) is an
important contributor to overall system responsiveness and
throughput. A bottom-up index deletion pass targets suspected
garbage tuples in a single leaf page based on
<emphasis>qualitative</emphasis> distinctions involving logical
rows and versions. This contrasts with the <quote>top-down</quote>
index cleanup performed by autovacuum workers, which is triggered
when certain <emphasis>quantitative</emphasis> table-level
thresholds are exceeded (see <xref linkend="autovacuum"/>).
</para>
<note>
<para>
Not all deletion operations that are performed within B-Tree
indexes are bottom-up deletion operations. There is a distinct
category of index tuple deletion: <firstterm>simple index tuple
deletion</firstterm>. This is a deferred maintenance operation
that deletes index tuples that are known to be safe to delete
(those whose item identifier's <literal>LP_DEAD</literal> bit is
already set). Like bottom-up index deletion, simple index
deletion takes place at the point that a page split is anticipated
as a way of avoiding the split.
</para>
<para>
Simple deletion is opportunistic in the sense that it can only
take place when recent index scans set the
<literal>LP_DEAD</literal> bits of affected items in passing.
Prior to <productname>PostgreSQL</productname> 14, the only
category of B-Tree deletion was simple deletion. The main
differences between it and bottom-up deletion are that only the
former is opportunistically driven by the activity of passing
index scans, while only the latter specifically targets version
churn from <command>UPDATE</command>s that do not logically modify
indexed columns.
</para>
</note>
<para>
Bottom-up index deletion performs the vast majority of all garbage
index tuple cleanup for particular indexes with certain workloads.
This is expected with any B-Tree index that is subject to
significant version churn from <command>UPDATE</command>s that
rarely or never logically modify the columns that the index covers.
The average and worst case number of versions per logical row can
be kept low purely through targeted incremental deletion passes.
It's quite possible that the on-disk size of certain indexes will
never increase by even one single page/block despite
<emphasis>constant</emphasis> version churn from
<command>UPDATE</command>s. Even then, an exhaustive <quote>clean
sweep</quote> by a <command>VACUUM</command> operation (typically
run in an autovacuum worker process) will eventually be required as
a part of <emphasis>collective</emphasis> cleanup of the table and
each of its indexes.
</para>
<para>
Unlike <command>VACUUM</command>, bottom-up index deletion does not
provide any strong guarantees about how old the oldest garbage
index tuple may be. No index can be permitted to retain
<quote>floating garbage</quote> index tuples that became dead prior
to a conservative cutoff point shared by the table and all of its
indexes collectively. This fundamental table-level invariant makes
it safe to recycle table <acronym>TID</acronym>s. This is how it
is possible for distinct logical rows to reuse the same table
<acronym>TID</acronym> over time (though this can never happen with
two logical rows whose lifetimes span the same
<command>VACUUM</command> cycle).
</para>
</sect2>
<sect2 id="btree-deduplication">
<title>Deduplication</title>
<para>
@@ -666,15 +769,17 @@ options(<replaceable>relopts</replaceable> <type>local_relopts *</type>) returns
</note>
<para>
The deduplication process occurs lazily, when a new item is
inserted that cannot fit on an existing leaf page. This prevents
(or at least delays) leaf page splits. Unlike GIN posting list
tuples, B-Tree posting list tuples do not need to expand every time
a new duplicate is inserted; they are merely an alternative
physical representation of the original logical contents of the
leaf page. This design prioritizes consistent performance with
mixed read-write workloads. Most client applications will at least
see a moderate performance benefit from using deduplication.
Deduplication is enabled by default.
inserted that cannot fit on an existing leaf page, though only when
index tuple deletion could not free sufficient space for the new
item (typically deletion is briefly considered and then skipped
over). Unlike GIN posting list tuples, B-Tree posting list tuples
do not need to expand every time a new duplicate is inserted; they
are merely an alternative physical representation of the original
logical contents of the leaf page. This design prioritizes
consistent performance with mixed read-write workloads. Most
client applications will at least see a moderate performance
benefit from using deduplication. Deduplication is enabled by
default.
</para>
<para>
<command>CREATE INDEX</command> and <command>REINDEX</command>
@@ -702,25 +807,16 @@ options(<replaceable>relopts</replaceable> <type>local_relopts *</type>) returns
deduplication isn't usually helpful.
</para>
<para>
B-Tree indexes are not directly aware that under MVCC, there might
be multiple extant versions of the same logical table row; to an
index, each tuple is an independent object that needs its own index
entry. <quote>Version duplicates</quote> may sometimes accumulate
and adversely affect query latency and throughput. This typically
occurs with <command>UPDATE</command>-heavy workloads where most
individual updates cannot apply the <acronym>HOT</acronym>
optimization (often because at least one indexed column gets
modified, necessitating a new set of index tuple versions &mdash;
one new tuple for <emphasis>each and every</emphasis> index). In
effect, B-Tree deduplication ameliorates index bloat caused by
version churn. Note that even the tuples from a unique index are
not necessarily <emphasis>physically</emphasis> unique when stored
on disk due to version churn. The deduplication optimization is
selectively applied within unique indexes. It targets those pages
that appear to have version duplicates. The high level goal is to
give <command>VACUUM</command> more time to run before an
<quote>unnecessary</quote> page split caused by version churn can
take place.
It is sometimes possible for unique indexes (as well as unique
constraints) to use deduplication. This allows leaf pages to
temporarily <quote>absorb</quote> extra version churn duplicates.
Deduplication in unique indexes augments bottom-up index deletion,
especially in cases where a long-running transaction holds a
snapshot that blocks garbage collection. The goal is to buy time
for the bottom-up index deletion strategy to become effective
again. Delaying page splits until a single long-running
transaction naturally goes away can allow a bottom-up deletion pass
to succeed where an earlier deletion pass failed.
</para>
<tip>
<para>
@@ -386,17 +386,39 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
<para>
The fillfactor for an index is a percentage that determines how full
the index method will try to pack index pages. For B-trees, leaf pages
are filled to this percentage during initial index build, and also
are filled to this percentage during initial index builds, and also
when extending the index at the right (adding new largest key values).
If pages
subsequently become completely full, they will be split, leading to
gradual degradation in the index's efficiency. B-trees use a default
fragmentation of the on-disk index structure. B-trees use a default
fillfactor of 90, but any integer value from 10 to 100 can be selected.
If the table is static then fillfactor 100 is best to minimize the
index's physical size, but for heavily updated tables a smaller
fillfactor is better to minimize the need for page splits. The
other index methods use fillfactor in different but roughly analogous
ways; the default fillfactor varies between methods.
</para>
<para>
B-tree indexes on tables where many inserts and/or updates are
anticipated can benefit from lower fillfactor settings at
<command>CREATE INDEX</command> time (following bulk loading into the
table). Values in the range of 50 - 90 can usefully <quote>smooth
out</quote> the <emphasis>rate</emphasis> of page splits during the
early life of the B-tree index (lowering fillfactor like this may even
lower the absolute number of page splits, though this effect is highly
workload dependent). The B-tree bottom-up index deletion technique
described in <xref linkend="btree-deletion"/> is dependent on having
some <quote>extra</quote> space on pages to store <quote>extra</quote>
tuple versions, and so can be affected by fillfactor (though the effect
is usually not significant).
</para>
<para>
In other specific cases it might be useful to increase fillfactor to
100 at <command>CREATE INDEX</command> time as a way of maximizing
space utilization. You should only consider this when you are
completely sure that the table is static (i.e. that it will never be
affected by either inserts or updates). A fillfactor setting of 100
otherwise risks <emphasis>harming</emphasis> performance: even a few
updates or inserts will cause a sudden flood of page splits.
</para>
<para>
The other index methods use fillfactor in different but roughly
analogous ways; the default fillfactor varies between methods.
</para>
</listitem>
</varlistentry>
@@ -55,6 +55,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "port/atomics.h"
#include "port/pg_bitutils.h"
#include "storage/bufmgr.h"
#include "storage/freespace.h"
#include "storage/lmgr.h"
@@ -102,6 +103,8 @@ static void MultiXactIdWait(MultiXactId multi, MultiXactStatus status, uint16 in
int *remaining);
static bool ConditionalMultiXactIdWait(MultiXactId multi, MultiXactStatus status,
uint16 infomask, Relation rel, int *remaining);
static void index_delete_sort(TM_IndexDeleteOp *delstate);
static int bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate);
static XLogRecPtr log_heap_new_cid(Relation relation, HeapTuple tup);
static HeapTuple ExtractReplicaIdentity(Relation rel, HeapTuple tup, bool key_changed,
bool *copy);
@@ -166,18 +169,33 @@ static const struct
#ifdef USE_PREFETCH
/*
* heap_compute_xid_horizon_for_tuples and xid_horizon_prefetch_buffer use
* this structure to coordinate prefetching activity.
* heap_index_delete_tuples and index_delete_prefetch_buffer use this
* structure to coordinate prefetching activity
*/
typedef struct
{
BlockNumber cur_hblkno;
int next_item;
int nitems;
ItemPointerData *tids;
} XidHorizonPrefetchState;
int ndeltids;
TM_IndexDelete *deltids;
} IndexDeletePrefetchState;
#endif
/* heap_index_delete_tuples bottom-up index deletion costing constants */
#define BOTTOMUP_MAX_NBLOCKS 6
#define BOTTOMUP_TOLERANCE_NBLOCKS 3
/*
* heap_index_delete_tuples uses this when determining which heap blocks it
* must visit to help its bottom-up index deletion caller
*/
typedef struct IndexDeleteCounts
{
int16 npromisingtids; /* Number of "promising" TIDs in group */
int16 ntids; /* Number of TIDs in group */
int16 ifirsttid; /* Offset to group's first deltid */
} IndexDeleteCounts;
/*
* This table maps tuple lock strength values for each particular
* MultiXactStatus value.
@@ -6936,28 +6954,31 @@ HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
#ifdef USE_PREFETCH
/*
* Helper function for heap_compute_xid_horizon_for_tuples. Issue prefetch
* requests for the number of buffers indicated by prefetch_count. The
* prefetch_state keeps track of all the buffers that we can prefetch and
* which ones have already been prefetched; each call to this function picks
* up where the previous call left off.
* Helper function for heap_index_delete_tuples. Issues prefetch requests for
* prefetch_count buffers. The prefetch_state keeps track of all the buffers
* we can prefetch, and which have already been prefetched; each call to this
* function picks up where the previous call left off.
*
* Note: we expect the deltids array to be sorted in an order that groups TIDs
* by heap block, with all TIDs for each block appearing together in exactly
* one group.
*/
static void
xid_horizon_prefetch_buffer(Relation rel,
XidHorizonPrefetchState *prefetch_state,
int prefetch_count)
index_delete_prefetch_buffer(Relation rel,
IndexDeletePrefetchState *prefetch_state,
int prefetch_count)
{
BlockNumber cur_hblkno = prefetch_state->cur_hblkno;
int count = 0;
int i;
int nitems = prefetch_state->nitems;
ItemPointerData *tids = prefetch_state->tids;
int ndeltids = prefetch_state->ndeltids;
TM_IndexDelete *deltids = prefetch_state->deltids;
for (i = prefetch_state->next_item;
i < nitems && count < prefetch_count;
i < ndeltids && count < prefetch_count;
i++)
{
ItemPointer htid = &tids[i];
ItemPointer htid = &deltids[i].tid;
if (cur_hblkno == InvalidBlockNumber ||
ItemPointerGetBlockNumber(htid) != cur_hblkno)
@@ -6978,24 +6999,20 @@ xid_horizon_prefetch_buffer(Relation rel,
#endif
/*
* Get the latestRemovedXid from the heap pages pointed at by the index
* tuples being deleted.
* heapam implementation of tableam's index_delete_tuples interface.
*
* We used to do this during recovery rather than on the primary, but that
* approach now appears inferior. It meant that the primary could generate
* a lot of work for the standby without any back-pressure to slow down the
* primary, and it required the standby to have reached consistency, whereas
* we want to have correct information available even before that point.
* This helper function is called by index AMs during index tuple deletion.
* See tableam header comments for an explanation of the interface implemented
* here and a general theory of operation. Note that each call here is either
* a simple index deletion call, or a bottom-up index deletion call.
*
* It's possible for this to generate a fair amount of I/O, since we may be
* deleting hundreds of tuples from a single index block. To amortize that
* cost to some degree, this uses prefetching and combines repeat accesses to
* the same block.
* the same heap block.
*/
TransactionId
heap_compute_xid_horizon_for_tuples(Relation rel,
ItemPointerData *tids,
int nitems)
heap_index_delete_tuples(Relation rel, TM_IndexDeleteOp *delstate)
{
/* Initial assumption is that earlier pruning took care of conflict */
TransactionId latestRemovedXid = InvalidTransactionId;
@@ -7005,28 +7022,44 @@ heap_compute_xid_horizon_for_tuples(Relation rel,
OffsetNumber maxoff = InvalidOffsetNumber;
TransactionId priorXmax;
#ifdef USE_PREFETCH
XidHorizonPrefetchState prefetch_state;
IndexDeletePrefetchState prefetch_state;
int prefetch_distance;
#endif
SnapshotData SnapshotNonVacuumable;
int finalndeltids = 0,
nblocksaccessed = 0;
/* State that's only used in bottom-up index deletion case */
int nblocksfavorable = 0;
int curtargetfreespace = delstate->bottomupfreespace,
lastfreespace = 0,
actualfreespace = 0;
bool bottomup_final_block = false;
InitNonVacuumableSnapshot(SnapshotNonVacuumable, GlobalVisTestFor(rel));
/* Sort caller's deltids array by TID for further processing */
index_delete_sort(delstate);
/*
* Sort to avoid repeated lookups for the same page, and to make it more
* likely to access items in an efficient order. In particular, this
* ensures that if there are multiple pointers to the same page, they all
* get processed looking up and locking the page just once.
* Bottom-up case: resort deltids array in an order attuned to where the
* greatest number of promising TIDs are to be found, and determine how
* many blocks from the start of sorted array should be considered
* favorable. This will also shrink the deltids array in order to
* eliminate completely unfavorable blocks up front.
*/
qsort((void *) tids, nitems, sizeof(ItemPointerData),
(int (*) (const void *, const void *)) ItemPointerCompare);
if (delstate->bottomup)
nblocksfavorable = bottomup_sort_and_shrink(delstate);
#ifdef USE_PREFETCH
/* Initialize prefetch state. */
prefetch_state.cur_hblkno = InvalidBlockNumber;
prefetch_state.next_item = 0;
prefetch_state.nitems = nitems;
prefetch_state.tids = tids;
prefetch_state.ndeltids = delstate->ndeltids;
prefetch_state.deltids = delstate->deltids;
/*
* Compute the prefetch distance that we will attempt to maintain.
* Determine the prefetch distance that we will attempt to maintain.
*
* Since the caller holds a buffer lock somewhere in rel, we'd better make
* sure that isn't a catalog relation before we call code that does
@@ -7038,33 +7071,111 @@ heap_compute_xid_horizon_for_tuples(Relation rel,
prefetch_distance =
get_tablespace_maintenance_io_concurrency(rel->rd_rel->reltablespace);
/* Cap initial prefetch distance for bottom-up deletion caller */
if (delstate->bottomup)
{
Assert(nblocksfavorable >= 1);
Assert(nblocksfavorable <= BOTTOMUP_MAX_NBLOCKS);
prefetch_distance = Min(prefetch_distance, nblocksfavorable);
}
/* Start prefetching. */
xid_horizon_prefetch_buffer(rel, &prefetch_state, prefetch_distance);
index_delete_prefetch_buffer(rel, &prefetch_state, prefetch_distance);
#endif
/* Iterate over all tids, and check their horizon */
for (int i = 0; i < nitems; i++)
/* Iterate over deltids, determine which to delete, check their horizon */
Assert(delstate->ndeltids > 0);
for (int i = 0; i < delstate->ndeltids; i++)
{
ItemPointer htid = &tids[i];
TM_IndexDelete *ideltid = &delstate->deltids[i];
TM_IndexStatus *istatus = delstate->status + ideltid->id;
ItemPointer htid = &ideltid->tid;
OffsetNumber offnum;
/*
* Read heap buffer, but avoid refetching if it's the same block as
* required for the last tid.
* Read buffer, and perform required extra steps each time a new block
* is encountered. Avoid refetching if it's the same block as the one
* from the last htid.
*/
if (blkno == InvalidBlockNumber ||
ItemPointerGetBlockNumber(htid) != blkno)
{
/* release old buffer */
if (BufferIsValid(buf))
/*
* Consider giving up early for bottom-up index deletion caller
* first. (Only prefetch next-next block afterwards, when it
* becomes clear that we're at least going to access the next
* block in line.)
*
* Sometimes the first block frees so much space for bottom-up
* caller that the deletion process can end without accessing any
* more blocks. It is usually necessary to access 2 or 3 blocks
* per bottom-up deletion operation, though.
*/
if (delstate->bottomup)
{
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
ReleaseBuffer(buf);
/*
* We often allow caller to delete a few additional items
* whose entries we reached after the point that space target
* from caller was satisfied. The cost of accessing the page
* was already paid at that point, so it made sense to finish
* it off. When that happened, we finalize everything here
* (by finishing off the whole bottom-up deletion operation
* without needlessly paying the cost of accessing any more
* blocks).
*/
if (bottomup_final_block)
break;
/*
* Give up when we didn't enable our caller to free any
* additional space as a result of processing the page that we
* just finished up with. This rule is the main way in which
* we keep the cost of bottom-up deletion under control.
*/
if (nblocksaccessed >= 1 && actualfreespace == lastfreespace)
break;
lastfreespace = actualfreespace; /* for next time */
/*
* Deletion operation (which is bottom-up) will definitely
* access the next block in line. Prepare for that now.
*
* Decay target free space so that we don't hang on for too
* long with a marginal case. (Space target is only truly
* helpful when it allows us to recognize that we don't need
* to access more than 1 or 2 blocks to satisfy caller due to
* agreeable workload characteristics.)
*
* We are a bit more patient when we encounter contiguous
* blocks, though: these are treated as favorable blocks. The
* decay process is only applied when the next block in line
* is not a favorable/contiguous block. This is not an
* exception to the general rule; we still insist on finding
* at least one deletable item per block accessed. See
* bottomup_nblocksfavorable() for full details of the theory
* behind favorable blocks and heap block locality in general.
*
* Note: The first block in line is always treated as a
* favorable block, so the earliest possible point that the
* decay can be applied is just before we access the second
* block in line. The Assert() verifies this for us.
*/
Assert(nblocksaccessed > 0 || nblocksfavorable > 0);
if (nblocksfavorable > 0)
nblocksfavorable--;
else
curtargetfreespace /= 2;
}
blkno = ItemPointerGetBlockNumber(htid);
/* release old buffer */
if (BufferIsValid(buf))
UnlockReleaseBuffer(buf);
blkno = ItemPointerGetBlockNumber(htid);
buf = ReadBuffer(rel, blkno);
nblocksaccessed++;
Assert(!delstate->bottomup ||
nblocksaccessed <= BOTTOMUP_MAX_NBLOCKS);
#ifdef USE_PREFETCH
@@ -7072,7 +7183,7 @@ heap_compute_xid_horizon_for_tuples(Relation rel,
* To maintain the prefetch distance, prefetch one more page for
* each page we read.
*/
xid_horizon_prefetch_buffer(rel, &prefetch_state, 1);
index_delete_prefetch_buffer(rel, &prefetch_state, 1);
#endif
LockBuffer(buf, BUFFER_LOCK_SHARE);
@@ -7081,6 +7192,31 @@ heap_compute_xid_horizon_for_tuples(Relation rel,
maxoff = PageGetMaxOffsetNumber(page);
}
if (istatus->knowndeletable)
Assert(!delstate->bottomup && !istatus->promising);
else
{
ItemPointerData tmp = *htid;
HeapTupleData heapTuple;
/* Are any tuples from this HOT chain non-vacuumable? */
if (heap_hot_search_buffer(&tmp, rel, buf, &SnapshotNonVacuumable,
&heapTuple, NULL, true))
continue; /* can't delete entry */
/* Caller will delete, since whole HOT chain is vacuumable */
istatus->knowndeletable = true;
/* Maintain index free space info for bottom-up deletion case */
if (delstate->bottomup)
{
Assert(istatus->freespace > 0);
actualfreespace += istatus->freespace;
if (actualfreespace >= curtargetfreespace)
bottomup_final_block = true;
}
}
/*
* Maintain latestRemovedXid value for deletion operation as a whole
* by advancing current value using heap tuple headers. This is
@@ -7108,17 +7244,18 @@ heap_compute_xid_horizon_for_tuples(Relation rel,
}
/*
* We'll often encounter LP_DEAD line pointers. No need to do
* anything more with htid when that happens. This is okay
* because the earlier pruning operation that made the line
* pointer LP_DEAD in the first place must have considered the
* tuple header as part of generating its own latestRemovedXid
* value.
* We'll often encounter LP_DEAD line pointers (especially with an
* entry marked knowndeletable by our caller up front). No heap
* tuple headers get examined for an htid that leads us to an
* LP_DEAD item. This is okay because the earlier pruning
* operation that made the line pointer LP_DEAD in the first place
* must have considered the original tuple header as part of
* generating its own latestRemovedXid value.
*
* Relying on XLOG_HEAP2_CLEANUP_INFO records like this is the
* same strategy that index vacuuming uses in all cases. Index
* VACUUM WAL records don't even have a latestRemovedXid field of
* their own for this reason.
* Relying on XLOG_HEAP2_CLEAN records like this is the same
* strategy that index vacuuming uses in all cases. Index VACUUM
* WAL records don't even have a latestRemovedXid field of their
* own for this reason.
*/
if (!ItemIdIsNormal(lp))
break;
@@ -7148,15 +7285,388 @@ heap_compute_xid_horizon_for_tuples(Relation rel,
offnum = ItemPointerGetOffsetNumber(&htup->t_ctid);
priorXmax = HeapTupleHeaderGetUpdateXid(htup);
}
/* Enable further/final shrinking of deltids for caller */
finalndeltids = i + 1;
}
if (BufferIsValid(buf))
UnlockReleaseBuffer(buf);
/*
* Shrink deltids array to exclude non-deletable entries at the end. This
* is not just a minor optimization. Final deltids array size might be
* zero for a bottom-up caller. Index AM is explicitly allowed to rely on
* ndeltids being zero in all cases with zero total deletable entries.
*/
Assert(finalndeltids > 0 || delstate->bottomup);
delstate->ndeltids = finalndeltids;
return latestRemovedXid;
}
/*
* Specialized inlineable comparison function for index_delete_sort()
*/
static inline int
index_delete_sort_cmp(TM_IndexDelete *deltid1, TM_IndexDelete *deltid2)
{
ItemPointer tid1 = &deltid1->tid;
ItemPointer tid2 = &deltid2->tid;
{
BlockNumber blk1 = ItemPointerGetBlockNumber(tid1);
BlockNumber blk2 = ItemPointerGetBlockNumber(tid2);
if (blk1 != blk2)
return (blk1 < blk2) ? -1 : 1;
}
{
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
ReleaseBuffer(buf);
OffsetNumber pos1 = ItemPointerGetOffsetNumber(tid1);
OffsetNumber pos2 = ItemPointerGetOffsetNumber(tid2);
if (pos1 != pos2)
return (pos1 < pos2) ? -1 : 1;
}
return latestRemovedXid;
pg_unreachable();
return 0;
}
/*
* Sort deltids array from delstate by TID. This prepares it for further
* processing by heap_index_delete_tuples().
*
* This operation becomes a noticeable consumer of CPU cycles with some
* workloads, so we go to the trouble of specialization/micro optimization.
* We use shellsort for this because it's easy to specialize, compiles to
* relatively few instructions, and is adaptive to presorted inputs/subsets
* (which are typical here).
*/
static void
index_delete_sort(TM_IndexDeleteOp *delstate)
{
TM_IndexDelete *deltids = delstate->deltids;
int ndeltids = delstate->ndeltids;
int low = 0;
/*
* Shellsort gap sequence (taken from Sedgewick-Incerpi paper).
*
* This implementation is fast with array sizes up to ~4500. This covers
* all supported BLCKSZ values.
*/
const int gaps[9] = {1968, 861, 336, 112, 48, 21, 7, 3, 1};
/* Think carefully before changing anything here -- keep swaps cheap */
StaticAssertStmt(sizeof(TM_IndexDelete) <= 8,
"element size exceeds 8 bytes");
for (int g = 0; g < lengthof(gaps); g++)
{
for (int hi = gaps[g], i = low + hi; i < ndeltids; i++)
{
TM_IndexDelete d = deltids[i];
int j = i;
while (j >= hi && index_delete_sort_cmp(&deltids[j - hi], &d) >= 0)
{
deltids[j] = deltids[j - hi];
j -= hi;
}
deltids[j] = d;
}
}
}
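The sort above can be tried outside the server with a minimal standalone sketch. `DemoDelete`, `demo_cmp()` and `demo_shellsort()` are hypothetical stand-ins (not PostgreSQL names) for `TM_IndexDelete`, `index_delete_sort_cmp()` and `index_delete_sort()`. Unlike the code above, the sketch tolerates equal elements, so its inner loop tests `> 0` rather than `>= 0`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 8-byte stand-in for TM_IndexDelete */
typedef struct DemoDelete
{
	uint32_t	block;		/* heap block number */
	uint16_t	offset;		/* offset number within the block */
	int16_t		id;			/* index into a separate status array */
} DemoDelete;

/* Order by block number first, then by offset number */
static int
demo_cmp(const DemoDelete *a, const DemoDelete *b)
{
	if (a->block != b->block)
		return (a->block < b->block) ? -1 : 1;
	if (a->offset != b->offset)
		return (a->offset < b->offset) ? -1 : 1;
	return 0;
}

/* Shellsort using the same gap sequence as the code above */
static void
demo_shellsort(DemoDelete *items, int nitems)
{
	static const int gaps[] = {1968, 861, 336, 112, 48, 21, 7, 3, 1};

	assert(nitems >= 0);
	for (size_t g = 0; g < sizeof(gaps) / sizeof(gaps[0]); g++)
	{
		int			hi = gaps[g];

		for (int i = hi; i < nitems; i++)
		{
			DemoDelete	d = items[i];
			int			j = i;

			/* Shift larger elements that sit 'hi' slots back to the right */
			while (j >= hi && demo_cmp(&items[j - hi], &d) > 0)
			{
				items[j] = items[j - hi];
				j -= hi;
			}
			items[j] = d;
		}
	}
}
```

Gaps larger than the array are skipped automatically, because the inner loop starts at `i = hi` and never runs when `hi >= nitems`.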
/*
* Returns how many blocks should be considered favorable/contiguous for a
* bottom-up index deletion pass. This is a number of heap blocks that starts
* from and includes the first block in line.
*
* There is always at least one favorable block during bottom-up index
* deletion. In the worst case (i.e. with totally random heap blocks) the
* first block in line (the only favorable block) can be thought of as a
* degenerate array of contiguous blocks that consists of a single block.
* heap_index_delete_tuples() will expect this.
*
* Caller passes blockgroups, a description of the final order that deltids
* will be sorted in for heap_index_delete_tuples() bottom-up index deletion
* processing. Note that deltids need not actually be sorted just yet (caller
* only passes deltids to us so that we can interpret blockgroups).
*
* You might guess that the existence of contiguous blocks cannot matter much,
* since in general the main factor that determines which blocks we visit is
* the number of promising TIDs, which is a fixed hint from the index AM.
* We're not really targeting the general case, though -- the actual goal is
* to adapt our behavior to a wide variety of naturally occurring conditions.
* The effects of most of the heuristics we apply are only noticeable in the
* aggregate, over time and across many _related_ bottom-up index deletion
* passes.
*
* Deeming certain blocks favorable allows heapam to recognize and adapt to
* workloads where heap blocks visited during bottom-up index deletion can be
* accessed contiguously, in the sense that each newly visited block is the
* neighbor of the block that bottom-up deletion just finished processing (or
* close enough to it). It will likely be cheaper to access more favorable
* blocks sooner rather than later (e.g. in this pass, not across a series of
* related bottom-up passes). Either way it is probably only a matter of time
* (or a matter of further correlated version churn) before all blocks that
* appear together as a single large batch of favorable blocks get accessed by
* _some_ bottom-up pass. Large batches of favorable blocks tend to either
* appear almost constantly or not even once (it all depends on per-index
* workload characteristics).
*
* Note that the blockgroups sort order applies a power-of-two bucketing
* scheme that creates opportunities for contiguous groups of blocks to get
* batched together, at least with workloads that are naturally amenable to
* being driven by heap block locality. This doesn't just enhance the spatial
* locality of bottom-up heap block processing in the obvious way. It also
* enables temporal locality of access, since sorting by heap block number
* naturally tends to make the bottom-up processing order deterministic.
*
* Consider the following example to get a sense of how temporal locality
* might matter: There is a heap relation with several indexes, each of which
* is low to medium cardinality. It is subject to constant non-HOT updates.
* The updates are skewed (in one part of the primary key, perhaps). None of
* the indexes are logically modified by the UPDATE statements (if they were
* then bottom-up index deletion would not be triggered in the first place).
* Naturally, each new round of index tuples (for each heap tuple that gets a
* heap_update() call) will have the same heap TID in each and every index.
* Since these indexes are low cardinality and never get logically modified,
* heapam processing during bottom-up deletion passes will access heap blocks
* in approximately sequential order. Temporal locality of access occurs due
* to bottom-up deletion passes behaving very similarly across each of the
* indexes at any given moment. This keeps the number of buffer misses needed
* to visit heap blocks to a minimum.
*/
static int
bottomup_nblocksfavorable(IndexDeleteCounts *blockgroups, int nblockgroups,
TM_IndexDelete *deltids)
{
int64 lastblock = -1;
int nblocksfavorable = 0;
Assert(nblockgroups >= 1);
Assert(nblockgroups <= BOTTOMUP_MAX_NBLOCKS);
/*
* We tolerate heap blocks that will be accessed only slightly out of
* physical order. Small blips occur when a pair of almost-contiguous
* blocks happen to fall into different buckets (perhaps due only to a
* small difference in npromisingtids that the bucketing scheme didn't
* quite manage to ignore). We effectively ignore these blips by applying
* a small tolerance. The precise tolerance we use is a little arbitrary,
* but it works well enough in practice.
*/
for (int b = 0; b < nblockgroups; b++)
{
IndexDeleteCounts *group = blockgroups + b;
TM_IndexDelete *firstdtid = deltids + group->ifirsttid;
BlockNumber block = ItemPointerGetBlockNumber(&firstdtid->tid);
if (lastblock != -1 &&
((int64) block < lastblock - BOTTOMUP_TOLERANCE_NBLOCKS ||
(int64) block > lastblock + BOTTOMUP_TOLERANCE_NBLOCKS))
break;
nblocksfavorable++;
lastblock = block;
}
/* Always indicate that there is at least 1 favorable block */
Assert(nblocksfavorable >= 1);
return nblocksfavorable;
}
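The tolerance check can be sketched in isolation as follows. The sketch works on a plain array of block numbers, already in final visiting order, instead of `blockgroups` and `deltids`. `demo_nblocksfavorable()` is a hypothetical name, and the tolerance of 3 blocks is an assumption made for the example, since `BOTTOMUP_TOLERANCE_NBLOCKS` is defined outside this excerpt.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value for illustration; see BOTTOMUP_TOLERANCE_NBLOCKS */
#define DEMO_TOLERANCE_NBLOCKS 3

/* Count the leading run of blocks that stay within the tolerance */
static int
demo_nblocksfavorable(const uint32_t *blocks, int nblocks)
{
	int64_t		lastblock = -1;
	int			nfavorable = 0;

	assert(nblocks >= 1);
	for (int b = 0; b < nblocks; b++)
	{
		int64_t		block = blocks[b];

		/* Stop at the first block that is too far from its predecessor */
		if (lastblock != -1 &&
			(block < lastblock - DEMO_TOLERANCE_NBLOCKS ||
			 block > lastblock + DEMO_TOLERANCE_NBLOCKS))
			break;

		nfavorable++;
		lastblock = block;
	}

	return nfavorable;
}
```

The first block is always counted, which is why the result is at least 1 even for totally random block numbers.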
/*
* qsort comparison function for bottomup_sort_and_shrink()
*/
static int
bottomup_sort_and_shrink_cmp(const void *arg1, const void *arg2)
{
const IndexDeleteCounts *group1 = (const IndexDeleteCounts *) arg1;
const IndexDeleteCounts *group2 = (const IndexDeleteCounts *) arg2;
/*
* Most significant field is npromisingtids (which we invert the order of
* so as to sort in desc order).
*
* Caller should have already normalized npromisingtids fields into
* power-of-two values (buckets).
*/
if (group1->npromisingtids > group2->npromisingtids)
return -1;
if (group1->npromisingtids < group2->npromisingtids)
return 1;
/*
* Tiebreak: desc ntids sort order.
*
* We cannot expect power-of-two values for ntids fields. We should
* behave as if they were already rounded up for us instead.
*/
if (group1->ntids != group2->ntids)
{
uint32 ntids1 = pg_nextpower2_32((uint32) group1->ntids);
uint32 ntids2 = pg_nextpower2_32((uint32) group2->ntids);
if (ntids1 > ntids2)
return -1;
if (ntids1 < ntids2)
return 1;
}
/*
* Tiebreak: asc offset-into-deltids-for-block (offset to first TID for
* block in deltids array) order.
*
* This is equivalent to sorting in ascending heap block number order
* (among otherwise equal subsets of the array). This approach allows us
* to avoid accessing the out-of-line TID. (We rely on the assumption
* that the deltids array was sorted in ascending heap TID order when
* these offsets to the first TID from each heap block group were formed.)
*/
if (group1->ifirsttid > group2->ifirsttid)
return 1;
if (group1->ifirsttid < group2->ifirsttid)
return -1;
pg_unreachable();
return 0;
}
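The three-level ordering can be exercised with a standalone sketch. `DemoGroup`, `demo_nextpower2()` and `demo_group_cmp()` are hypothetical stand-ins for `IndexDeleteCounts`, `pg_nextpower2_32()` and the comparator above. Unlike the original, the sketch returns 0 for fully equal groups instead of treating that case as unreachable.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for IndexDeleteCounts */
typedef struct DemoGroup
{
	int			npromisingtids; /* assumed already rounded into a bucket */
	int			ntids;			/* exact number of TIDs for the block */
	int			ifirsttid;		/* offset of the block's first TID */
} DemoGroup;

/* Smallest power of two >= v; portable stand-in for pg_nextpower2_32() */
static uint32_t
demo_nextpower2(uint32_t v)
{
	uint32_t	p = 1;

	assert(v >= 1 && v <= UINT32_C(0x80000000));
	while (p < v)
		p <<= 1;
	return p;
}

static int
demo_group_cmp(const void *arg1, const void *arg2)
{
	const DemoGroup *g1 = arg1;
	const DemoGroup *g2 = arg2;
	uint32_t	n1;
	uint32_t	n2;

	/* Most promising buckets first (descending order) */
	if (g1->npromisingtids != g2->npromisingtids)
		return (g1->npromisingtids > g2->npromisingtids) ? -1 : 1;

	/* Then larger ntids buckets first, bucketing on the fly */
	n1 = demo_nextpower2((uint32_t) g1->ntids);
	n2 = demo_nextpower2((uint32_t) g2->ntids);
	if (n1 != n2)
		return (n1 > n2) ? -1 : 1;

	/* Finally ascending array offset, i.e. ascending heap block number */
	if (g1->ifirsttid != g2->ifirsttid)
		return (g1->ifirsttid < g2->ifirsttid) ? -1 : 1;

	return 0;
}
```

Because `ntids` is bucketed only inside the comparator, groups with 5 and 7 TIDs compare as equal on the second key and fall through to the block-order tiebreak.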
/*
* heap_index_delete_tuples() helper function for bottom-up deletion callers.
*
* Sorts deltids array in the order needed for useful processing by bottom-up
* deletion. The array should already be sorted in TID order when we're
* called. The sort process groups heap TIDs from deltids into heap block
* groupings. Earlier/more-promising groups/blocks are usually those that are
* known to have the most "promising" TIDs.
*
* Sets new size of deltids array (ndeltids) in state. deltids will only have
* TIDs from the BOTTOMUP_MAX_NBLOCKS most promising heap blocks when we
* return. This often means that deltids will be shrunk to a small fraction
* of its original size (we eliminate many heap blocks from consideration for
* caller up front).
*
* Returns the number of "favorable" blocks. See bottomup_nblocksfavorable()
* for a definition and full details.
*/
static int
bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate)
{
IndexDeleteCounts *blockgroups;
TM_IndexDelete *reordereddeltids;
BlockNumber curblock = InvalidBlockNumber;
int nblockgroups = 0;
int ncopied = 0;
int nblocksfavorable = 0;
Assert(delstate->bottomup);
Assert(delstate->ndeltids > 0);
/* Calculate per-heap-block count of TIDs */
blockgroups = palloc(sizeof(IndexDeleteCounts) * delstate->ndeltids);
for (int i = 0; i < delstate->ndeltids; i++)
{
TM_IndexDelete *ideltid = &delstate->deltids[i];
TM_IndexStatus *istatus = delstate->status + ideltid->id;
ItemPointer htid = &ideltid->tid;
bool promising = istatus->promising;
if (curblock != ItemPointerGetBlockNumber(htid))
{
/* New block group */
nblockgroups++;
Assert(curblock < ItemPointerGetBlockNumber(htid) ||
!BlockNumberIsValid(curblock));
curblock = ItemPointerGetBlockNumber(htid);
blockgroups[nblockgroups - 1].ifirsttid = i;
blockgroups[nblockgroups - 1].ntids = 1;
blockgroups[nblockgroups - 1].npromisingtids = 0;
}
else
{
blockgroups[nblockgroups - 1].ntids++;
}
if (promising)
blockgroups[nblockgroups - 1].npromisingtids++;
}
/*
* We're about ready to sort block groups to determine the optimal order
* for visiting heap blocks. But before we do, round the number of
* promising tuples for each block group up to the next power-of-two,
* unless it is very low (4 or fewer), in which case we round up to 4.
*
* This scheme divides heap blocks/block groups into buckets. Each bucket
* contains blocks that have _approximately_ the same number of promising
* TIDs as each other. The goal is to ignore relatively small differences
* in the total number of promising entries, so that the whole process can
* give a little weight to heapam factors (like heap block locality)
* instead. This isn't a trade-off, really -- we have nothing to lose.
* It would be foolish to interpret small differences in npromisingtids
* values as anything more than noise.
*
* We tiebreak on ntids when sorting block group subsets that have the
* same npromisingtids, but this has the same issues as npromisingtids,
* and so ntids is subject to the same power-of-two bucketing scheme.
* The only reason that we don't fix ntids in the same way here too is
* that we'll need accurate ntids values after the sort. We handle
* ntids bucketization dynamically instead (in the sort comparator).
*
* See bottomup_nblocksfavorable() for a full explanation of when and how
* heap locality/favorable blocks can significantly influence when and how
* heap blocks are accessed.
*/
for (int b = 0; b < nblockgroups; b++)
{
IndexDeleteCounts *group = blockgroups + b;
/* Better off falling back on ntids with low npromisingtids */
if (group->npromisingtids <= 4)
group->npromisingtids = 4;
else
group->npromisingtids =
pg_nextpower2_32((uint32) group->npromisingtids);
}
/* Sort groups and rearrange caller's deltids array */
qsort(blockgroups, nblockgroups, sizeof(IndexDeleteCounts),
bottomup_sort_and_shrink_cmp);
reordereddeltids = palloc(delstate->ndeltids * sizeof(TM_IndexDelete));
nblockgroups = Min(BOTTOMUP_MAX_NBLOCKS, nblockgroups);
/* Determine number of favorable blocks at the start of final deltids */
nblocksfavorable = bottomup_nblocksfavorable(blockgroups, nblockgroups,
delstate->deltids);
for (int b = 0; b < nblockgroups; b++)
{
IndexDeleteCounts *group = blockgroups + b;
TM_IndexDelete *firstdtid = delstate->deltids + group->ifirsttid;
memcpy(reordereddeltids + ncopied, firstdtid,
sizeof(TM_IndexDelete) * group->ntids);
ncopied += group->ntids;
}
/* Copy final grouped and sorted TIDs back into start of caller's array */
memcpy(delstate->deltids, reordereddeltids,
sizeof(TM_IndexDelete) * ncopied);
delstate->ndeltids = ncopied;
pfree(reordereddeltids);
pfree(blockgroups);
return nblocksfavorable;
}
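The first two steps of the function above (per-block counting, then bucketing of the promising counts) can be sketched on their own. `DemoTid`, `DemoCounts`, `demo_count_groups()` and `demo_bucket()` are hypothetical names. The sketch assumes the input is already sorted by block number, as the function above requires.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified TID: only the block number and the hint */
typedef struct DemoTid
{
	uint32_t	block;
	bool		promising;
} DemoTid;

/* Hypothetical stand-in for IndexDeleteCounts */
typedef struct DemoCounts
{
	int			ifirsttid;
	int			ntids;
	int			npromisingtids;
} DemoCounts;

/* Fill groups[] (room for ntids entries); returns the number of groups */
static int
demo_count_groups(const DemoTid *tids, int ntids, DemoCounts *groups)
{
	int			ngroups = 0;

	assert(ntids > 0);
	for (int i = 0; i < ntids; i++)
	{
		if (ngroups == 0 ||
			tids[i].block != tids[groups[ngroups - 1].ifirsttid].block)
		{
			/* Input is sorted, so a new block number starts a new group */
			groups[ngroups].ifirsttid = i;
			groups[ngroups].ntids = 1;
			groups[ngroups].npromisingtids = 0;
			ngroups++;
		}
		else
			groups[ngroups - 1].ntids++;

		if (tids[i].promising)
			groups[ngroups - 1].npromisingtids++;
	}

	return ngroups;
}

/* Bucket a promising-TID count: at least 4, otherwise next power of two */
static int
demo_bucket(int npromisingtids)
{
	int			bucket = 4;

	assert(npromisingtids >= 0 && npromisingtids <= (1 << 30));
	while (bucket < npromisingtids)
		bucket <<= 1;
	return bucket;
}
```

With this bucketing, blocks that have 5, 6, 7 or 8 promising TIDs all land in bucket 8, so the sort is free to order them by the later tiebreaks instead.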
/*
......
......@@ -2563,7 +2563,7 @@ static const TableAmRoutine heapam_methods = {
.tuple_get_latest_tid = heap_get_latest_tid,
.tuple_tid_valid = heapam_tuple_tid_valid,
.tuple_satisfies_snapshot = heapam_tuple_satisfies_snapshot,
-.compute_xid_horizon_for_tuples = heap_compute_xid_horizon_for_tuples,
+.index_delete_tuples = heap_index_delete_tuples,
.relation_set_new_filenode = heapam_relation_set_new_filenode,
.relation_nontransactional_truncate = heapam_relation_nontransactional_truncate,
......
......@@ -276,11 +276,18 @@ BuildIndexValueDescription(Relation indexRelation,
/*
* Get the latestRemovedXid from the table entries pointed at by the index
-* tuples being deleted.
-*
-* Note: index access methods that don't consistently use the standard
-* IndexTuple + heap TID item pointer representation will need to provide
-* their own version of this function.
+* tuples being deleted using an AM-generic approach.
+*
+* This is a table_index_delete_tuples() shim used by index AMs that have
+* simple requirements. These callers only need to consult the tableam to get
+* a latestRemovedXid value, and only expect to delete tuples that are already
+* known deletable. When a latestRemovedXid value isn't needed in index AM's
+* deletion WAL record, it is safe for it to skip calling here entirely.
+*
+* We assume that caller index AM uses the standard IndexTuple representation,
+* with table TIDs stored in the t_tid field. We also expect (and assert)
+* that the line pointers on page for 'itemnos' offsets are already marked
+* LP_DEAD.
*/
TransactionId
index_compute_xid_horizon_for_tuples(Relation irel,
......@@ -289,12 +296,17 @@ index_compute_xid_horizon_for_tuples(Relation irel,
OffsetNumber *itemnos,
int nitems)
{
-ItemPointerData *ttids =
-(ItemPointerData *) palloc(sizeof(ItemPointerData) * nitems);
+TM_IndexDeleteOp delstate;
TransactionId latestRemovedXid = InvalidTransactionId;
Page ipage = BufferGetPage(ibuf);
IndexTuple itup;
+delstate.bottomup = false;
+delstate.bottomupfreespace = 0;
+delstate.ndeltids = 0;
+delstate.deltids = palloc(nitems * sizeof(TM_IndexDelete));
+delstate.status = palloc(nitems * sizeof(TM_IndexStatus));
/* identify what the index tuples about to be deleted point to */
for (int i = 0; i < nitems; i++)
{
......@@ -303,14 +315,26 @@ index_compute_xid_horizon_for_tuples(Relation irel,
iitemid = PageGetItemId(ipage, itemnos[i]);
itup = (IndexTuple) PageGetItem(ipage, iitemid);
-ItemPointerCopy(&itup->t_tid, &ttids[i]);
+Assert(ItemIdIsDead(iitemid));
+ItemPointerCopy(&itup->t_tid, &delstate.deltids[i].tid);
+delstate.deltids[i].id = delstate.ndeltids;
+delstate.status[i].idxoffnum = InvalidOffsetNumber; /* unused */
+delstate.status[i].knowndeletable = true; /* LP_DEAD-marked */
+delstate.status[i].promising = false; /* unused */
+delstate.status[i].freespace = 0; /* unused */
+delstate.ndeltids++;
}
/* determine the actual xid horizon */
-latestRemovedXid =
-table_compute_xid_horizon_for_tuples(hrel, ttids, nitems);
+latestRemovedXid = table_index_delete_tuples(hrel, &delstate);
+/* assert tableam agrees that all items are deletable */
+Assert(delstate.ndeltids == nitems);
-pfree(ttids);
+pfree(delstate.deltids);
+pfree(delstate.status);
return latestRemovedXid;
}
......
......@@ -82,8 +82,8 @@ page.) A backwards scan has one additional bit of complexity: after
following the left-link we must account for the possibility that the
left sibling page got split before we could read it. So, we have to
move right until we find a page whose right-link matches the page we
-came from. (Actually, it's even harder than that; see deletion discussion
-below.)
+came from. (Actually, it's even harder than that; see page deletion
+discussion below.)
Page read locks are held only for as long as a scan is examining a page.
To minimize lock/unlock traffic, an index scan always searches a leaf page
......@@ -163,16 +163,16 @@ pages (though suffix truncation is also considered). Note we must include
the incoming item in this calculation, otherwise it is possible to find
that the incoming item doesn't fit on the split page where it needs to go!
-The Deletion Algorithm
-----------------------
+Deleting index tuples during VACUUM
+-----------------------------------
Before deleting a leaf item, we get a super-exclusive lock on the target
page, so that no other backend has a pin on the page when the deletion
starts. This is not necessary for correctness in terms of the btree index
operations themselves; as explained above, index scans logically stop
"between" pages and so can't lose their place. The reason we do it is to
-provide an interlock between non-full VACUUM and indexscans. Since VACUUM
-deletes index entries before reclaiming heap tuple line pointers, the
+provide an interlock between VACUUM and indexscans. Since VACUUM deletes
+index entries before reclaiming heap tuple line pointers, the
super-exclusive lock guarantees that VACUUM can't reclaim for re-use a
line pointer that an indexscanning process might be about to visit. This
guarantee works only for simple indexscans that visit the heap in sync
......@@ -202,7 +202,8 @@ from the page have been processed. This guarantees that the btbulkdelete
call cannot return while any indexscan is still holding a copy of a
deleted index tuple if the scan could be confused by that. Note that this
requirement does not say that btbulkdelete must visit the pages in any
-particular order. (See also on-the-fly deletion, below.)
+particular order. (See also simple deletion and bottom-up deletion,
+below.)
There is no such interlocking for deletion of items in internal pages,
since backends keep no lock nor pin on a page they have descended past.
......@@ -213,8 +214,8 @@ page). Since we hold a lock on the lower page (per L&Y) until we have
re-found the parent item that links to it, we can be assured that the
parent item does still exist and can't have been deleted.
-Page Deletion
--------------
+Deleting entire pages during VACUUM
+-----------------------------------
We consider deleting an entire page from the btree only when it's become
completely empty of items. (Merging partly-full pages would allow better
......@@ -419,8 +420,8 @@ without a backend's cached page also being detected as invalidated, but
only when we happen to recycle a block that once again gets recycled as the
rightmost leaf page.
-On-the-Fly Deletion Of Index Tuples
------------------------------------
+Simple deletion
+---------------
If a process visits a heap tuple and finds that it's dead and removable
(ie, dead to all open transactions, not only that process), then we can
......@@ -434,24 +435,27 @@ LP_DEAD bits are often set when checking a unique index for conflicts on
insert (this is simpler because it takes place when we hold an exclusive
lock on the leaf page).
-Once an index tuple has been marked LP_DEAD it can actually be removed
+Once an index tuple has been marked LP_DEAD it can actually be deleted
from the index immediately; since index scans only stop "between" pages,
no scan can lose its place from such a deletion. We separate the steps
because we allow LP_DEAD to be set with only a share lock (it's exactly
like a hint bit for a heap tuple), but physically removing tuples requires
-exclusive lock. In the current code we try to remove LP_DEAD tuples when
-we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already). Deduplication can also prevent
-a page split, but removing LP_DEAD tuples is the preferred approach.
-(Note that posting list tuples can only have their LP_DEAD bit set when
-every table TID within the posting list is known dead.)
-This leaves the index in a state where it has no entry for a dead tuple
-that still exists in the heap. This is not a problem for the current
-implementation of VACUUM, but it could be a problem for anything that
-explicitly tries to find index entries for dead tuples. (However, the
-same situation is created by REINDEX, since it doesn't enter dead
-tuples into the index.)
+exclusive lock. Also, delaying the deletion often allows us to pick up
+extra index tuples that weren't initially safe for index scans to mark
+LP_DEAD. We do this with index tuples whose TIDs point to the same table
+blocks as an LP_DEAD-marked tuple. They're practically free to check in
+passing, and have a pretty good chance of being safe to delete due to
+various locality effects.
+We only try to delete LP_DEAD tuples (and nearby tuples) when we are
+otherwise faced with having to split a page to do an insertion (and hence
+have exclusive lock on it already). Deduplication and bottom-up index
+deletion can also prevent a page split, but simple deletion is always our
+preferred approach. (Note that posting list tuples can only have their
+LP_DEAD bit set when every table TID within the posting list is known
+dead. This isn't much of a problem in practice because LP_DEAD bits are
+just a starting point for simple deletion -- we still manage to perform
+granular deletes of posting list TIDs quite often.)
It's sufficient to have an exclusive lock on the index page, not a
super-exclusive lock, to do deletion of LP_DEAD items. It might seem
......@@ -469,6 +473,70 @@ LSN of the page, and only act to set LP_DEAD bits when the LSN has not
changed at all. (Avoiding dropping the pin entirely also makes it safe, of
course.)
Bottom-Up deletion
------------------
We attempt to delete whatever duplicates happen to be present on the page
when the duplicates are suspected to be caused by version churn from
successive UPDATEs. This only happens when we receive an executor hint
indicating that optimizations like heapam's HOT have not worked out for
the index -- the incoming tuple must be a logically unchanged duplicate
which is needed for MVCC purposes, suggesting that such duplicates might
well be the dominant source of new index tuples on the leaf page in
question. (Also,
bottom-up deletion is triggered within unique indexes in cases with
continual INSERT and DELETE related churn, since that is easy to detect
without any external hint.)
Simple deletion will already have failed to prevent a page split when a
bottom-up deletion pass takes place (often because no LP_DEAD bits were
ever set on the page). The two mechanisms have closely related
implementations. The same WAL records are used for each operation, and
the same tableam infrastructure is used to determine what TIDs/tuples are
actually safe to delete. The implementations only differ in how they pick
TIDs to consider for deletion, and whether or not the tableam will give up
before accessing all table blocks (bottom-up deletion lives with the
uncertainty of its success by keeping the cost of failure low). Even
still, the two mechanisms are clearly distinct at the conceptual level.
Bottom-up index deletion is driven entirely by heuristics (whereas simple
deletion is guaranteed to delete at least those index tuples that are
already LP_DEAD marked -- there must be at least one). We have no
certainty that we'll find even one index tuple to delete. That's why we
closely cooperate with the tableam to keep the costs it pays in balance
with the benefits we receive. The interface that we use for this is
described in detail in access/tableam.h.
Bottom-up index deletion can be thought of as a backstop mechanism against
unnecessary version-driven page splits. It is based in part on an idea
from generational garbage collection: the "generational hypothesis". This
is the empirical observation that "most objects die young". Within
nbtree, new index tuples often quickly appear in the same place, and then
quickly become garbage. There can be intense concentrations of garbage in
relatively few leaf pages with certain workloads (or there could be in
earlier versions of PostgreSQL without bottom-up index deletion, at
least). See doc/src/sgml/btree.sgml for a high-level description of the
design principles behind bottom-up index deletion in nbtree, including
details of how it complements VACUUM.
We expect to find a reasonably large number of tuples that are safe to
delete within each bottom-up pass. If we don't then we won't need to
consider the question of bottom-up deletion for the same leaf page for
quite a while (usually because the page splits, which resolves the
situation for the time being). We expect to perform regular bottom-up
deletion operations against pages that are at constant risk of unnecessary
page splits caused only by version churn. When the mechanism works well
we'll constantly be "on the verge" of having version-churn-driven page
splits, but never actually have even one.
Our duplicate heuristics work well despite being fairly simple.
Unnecessary page splits only occur when there are truly pathological
levels of version churn (in theory a small amount of version churn could
make a page split occur earlier than strictly necessary, but that's pretty
harmless). We don't have to understand the underlying workload; we only
have to understand the general nature of the pathology that we target.
Version churn is easy to spot when it is truly pathological. Affected
leaf pages are fairly homogeneous.
WAL Considerations
------------------
......@@ -767,9 +835,10 @@ into a single physical tuple with a posting list (a simple array of heap
TIDs with the standard item pointer format). Deduplication is always
applied lazily, at the point where it would otherwise be necessary to
perform a page split. It occurs only when LP_DEAD items have been
-removed, as our last line of defense against splitting a leaf page. We
-can set the LP_DEAD bit with posting list tuples, though only when all
-TIDs are known dead.
+removed, as our last line of defense against splitting a leaf page
+(bottom-up index deletion may be attempted first, as our second last line
+of defense). We can set the LP_DEAD bit with posting list tuples, though
+only when all TIDs are known dead.
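The order of defenses described here can be summarized with a deliberately simplified sketch. It is an illustration based only on this README's wording, not the nbtree implementation: `DemoPage`, `DemoOutcome` and `demo_avoid_split()` are hypothetical, and each mechanism is reduced to a fixed number of bytes it would free. Details such as disabled deduplication, or a bottom-up pass advising the caller to skip deduplication, are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum DemoOutcome
{
	DEMO_SIMPLE_DELETION,
	DEMO_BOTTOMUP_DELETION,
	DEMO_DEDUPLICATION,
	DEMO_PAGE_SPLIT
} DemoOutcome;

/* Hypothetical summary of one full leaf page */
typedef struct DemoPage
{
	size_t		freespace;		/* free bytes before any cleanup */
	size_t		lpdead_bytes;	/* bytes reclaimable by simple deletion */
	size_t		bottomup_bytes; /* bytes reclaimable by a bottom-up pass */
	size_t		dedup_bytes;	/* bytes saved by deduplication */
	bool		churn_hint;		/* version churn suspected for this insert */
} DemoPage;

/* Report which mechanism (if any) makes room for the incoming item */
static DemoOutcome
demo_avoid_split(DemoPage *page, size_t newitemsz)
{
	assert(page != NULL);

	/* Simple deletion is preferred whenever LP_DEAD items exist */
	if (page->lpdead_bytes > 0)
	{
		page->freespace += page->lpdead_bytes;
		page->lpdead_bytes = 0;
		if (page->freespace >= newitemsz)
			return DEMO_SIMPLE_DELETION;
	}

	/* Second last line of defense: bottom-up deletion, only when hinted */
	if (page->churn_hint)
	{
		page->freespace += page->bottomup_bytes;
		page->bottomup_bytes = 0;
		if (page->freespace >= newitemsz)
			return DEMO_BOTTOMUP_DELETION;
	}

	/* Last line of defense: deduplication */
	page->freespace += page->dedup_bytes;
	page->dedup_bytes = 0;
	if (page->freespace >= newitemsz)
		return DEMO_DEDUPLICATION;

	return DEMO_PAGE_SPLIT;
}
```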
Our lazy approach to deduplication allows the page space accounting used
during page splits to have absolutely minimal special case logic for
......@@ -788,7 +857,10 @@ page space accounting (see later section), so it's not clear how
compression could be integrated with nbtree. Besides, posting list
compression does not offer a compelling trade-off for nbtree, since in
general nbtree is optimized for consistent performance with many
-concurrent readers and writers.
+concurrent readers and writers. Compression would also make the deletion
+of a subset of TIDs from a posting list slow and complicated, which would
+be a big problem for workloads that depend heavily on bottom-up index
+deletion.
A major goal of our lazy approach to deduplication is to limit the
performance impact of deduplication with random updates. Even concurrent
......@@ -826,6 +898,16 @@ delay a split that is probably inevitable anyway. This allows us to avoid
the overhead of attempting to deduplicate with unique indexes that always
have few or no duplicates.
Note: Avoiding "unnecessary" page splits driven by version churn is also
the goal of bottom-up index deletion, which was added to PostgreSQL 14.
Bottom-up index deletion is now the preferred way to deal with this
problem (with all kinds of indexes, though especially with unique
indexes). Still, deduplication can sometimes augment bottom-up index
deletion. When deletion cannot free tuples (due to an old snapshot
holding up cleanup), falling back on deduplication provides additional
capacity. Delaying the page split by deduplicating can allow a future
bottom-up deletion pass of the same page to succeed.
Posting list splits
-------------------
......
/*-------------------------------------------------------------------------
*
* nbtdedup.c
-* Deduplicate items in Postgres btrees.
+* Deduplicate or bottom-up delete items in Postgres btrees.
*
* Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
......@@ -19,6 +19,8 @@
#include "miscadmin.h"
#include "utils/rel.h"
static void _bt_bottomupdel_finish_pending(Page page, BTDedupState state,
TM_IndexDeleteOp *delstate);
static bool _bt_do_singleval(Relation rel, Page page, BTDedupState state,
OffsetNumber minoff, IndexTuple newitem);
static void _bt_singleval_fillfactor(Page page, BTDedupState state,
......@@ -267,6 +269,147 @@ _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
pfree(state);
}
/*
* Perform bottom-up index deletion pass.
*
* See if duplicate index tuples (plus certain nearby tuples) are eligible to
* be deleted via bottom-up index deletion. The high level goal here is to
* entirely prevent "unnecessary" page splits caused by MVCC version churn
* from UPDATEs (when the UPDATEs don't logically modify any of the columns
* covered by the 'rel' index). This is qualitative, not quantitative -- we
* do not particularly care about once-off opportunities to delete many index
* tuples together.
*
* See nbtree/README for details on the design of nbtree bottom-up deletion.
* See access/tableam.h for a description of how we're expected to cooperate
* with the tableam.
*
* Returns true on success, in which case caller can assume page split will be
* avoided for a reasonable amount of time. Returns false when caller should
* deduplicate the page (if possible at all).
*
* Note: Occasionally we return true despite failing to delete enough items to
* avoid a split. This makes caller skip deduplication and go split the page
* right away. Our return value is always just advisory information.
*
* Note: Caller should have already deleted all existing items with their
* LP_DEAD bits set.
*/
bool
_bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
Size newitemsz)
{
OffsetNumber offnum,
minoff,
maxoff;
Page page = BufferGetPage(buf);
BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
BTDedupState state;
TM_IndexDeleteOp delstate;
bool neverdedup;
int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
newitemsz += sizeof(ItemIdData);
/* Initialize deduplication state */
state = (BTDedupState) palloc(sizeof(BTDedupStateData));
state->deduplicate = true;
state->nmaxitems = 0;
state->maxpostingsize = BLCKSZ; /* We're not really deduplicating */
state->base = NULL;
state->baseoff = InvalidOffsetNumber;
state->basetupsize = 0;
state->htids = palloc(state->maxpostingsize);
state->nhtids = 0;
state->nitems = 0;
state->phystupsize = 0;
state->nintervals = 0;
/*
* Initialize tableam state that describes bottom-up index deletion
* operation.
*
* We'll go on to ask the tableam to search for TIDs whose index tuples we
* can safely delete. The tableam will search until our leaf page space
* target is satisfied, or until the cost of continuing with the tableam
* operation seems too high. It focuses its efforts on TIDs associated
* with duplicate index tuples that we mark "promising".
*
* This space target is a little arbitrary. The tableam must be able to
* keep the costs and benefits in balance. We provide the tableam with
* exhaustive information about what might work, without directly
* concerning ourselves with avoiding work during the tableam call. Our
* role in costing the bottom-up deletion process is strictly advisory.
*/
delstate.bottomup = true;
delstate.bottomupfreespace = Max(BLCKSZ / 16, newitemsz);
delstate.ndeltids = 0;
delstate.deltids = palloc(MaxTIDsPerBTreePage * sizeof(TM_IndexDelete));
delstate.status = palloc(MaxTIDsPerBTreePage * sizeof(TM_IndexStatus));
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
for (offnum = minoff;
offnum <= maxoff;
offnum = OffsetNumberNext(offnum))
{
ItemId itemid = PageGetItemId(page, offnum);
IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
Assert(!ItemIdIsDead(itemid));
if (offnum == minoff)
{
/* itup starts first pending interval */
_bt_dedup_start_pending(state, itup, offnum);
}
else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
_bt_dedup_save_htid(state, itup))
{
/* Tuple is equal; just added its TIDs to pending interval */
}
else
{
/* Finalize interval -- move its TIDs to delete state */
_bt_bottomupdel_finish_pending(page, state, &delstate);
/* itup starts new pending interval */
_bt_dedup_start_pending(state, itup, offnum);
}
}
/* Finalize final interval -- move its TIDs to delete state */
_bt_bottomupdel_finish_pending(page, state, &delstate);
/*
* We don't give up now in the event of having few (or even zero)
* promising tuples for the tableam because it's not up to us as the index
* AM to manage costs (note that the tableam might have heuristics of its
* own that work out what to do). We should at least avoid having our
* caller do a useless deduplication pass after we return in the event of
* zero promising tuples, though.
*/
neverdedup = false;
if (state->nintervals == 0)
neverdedup = true;
pfree(state->htids);
pfree(state);
/* Ask tableam which TIDs are deletable, then physically delete them */
_bt_delitems_delete_check(rel, buf, heapRel, &delstate);
pfree(delstate.deltids);
pfree(delstate.status);
/* Report "success" to caller unconditionally to avoid deduplication */
if (neverdedup)
return true;
/* Don't dedup when we won't end up back here any time soon anyway */
return PageGetExactFreeSpace(page) >= Max(BLCKSZ / 24, newitemsz);
}
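The two thresholds used by the bottom-up pass above can be checked in isolation. The following is only a sketch: the `sketch_*` names are hypothetical, and `SKETCH_BLCKSZ` assumes the default 8192-byte page size rather than the build-time `BLCKSZ`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: default PostgreSQL page size; the real code uses BLCKSZ */
#define SKETCH_BLCKSZ 8192

/* Stand-in for the Max() macro used in the diff */
static size_t
sketch_max(size_t a, size_t b)
{
	return a > b ? a : b;
}

/*
 * Free space the tableam is asked to aim for, as assigned to
 * delstate.bottomupfreespace: Max(BLCKSZ / 16, newitemsz).
 */
static size_t
sketch_bottomup_target(size_t newitemsz)
{
	assert(newitemsz > 0);
	return sketch_max(SKETCH_BLCKSZ / 16, newitemsz);
}

/*
 * Mirrors the function's return value: true tells the caller to skip the
 * deduplication pass.  That happens unconditionally when no duplicate
 * interval was found (neverdedup), and otherwise only when the page now has
 * at least Max(BLCKSZ / 24, newitemsz) bytes free.
 */
static bool
sketch_skip_dedup(bool neverdedup, size_t freespace, size_t newitemsz)
{
	if (neverdedup)
		return true;
	return freespace >= sketch_max(SKETCH_BLCKSZ / 24, newitemsz);
}
```

With an 8192-byte page the deletion target is 512 bytes and the "skip deduplication" threshold is 341 bytes, unless the incoming tuple is larger than either.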
/*
* Create a new pending posting list tuple based on caller's base tuple.
*
......@@ -452,6 +595,150 @@ _bt_dedup_finish_pending(Page newpage, BTDedupState state)
return spacesaving;
}
/*
* Finalize interval during bottom-up index deletion.
*
* During a bottom-up pass we expect that TIDs will be recorded in dedup state
* first, and then get moved over to delstate (in variable-sized batches) by
* calling here. We get called once the number of TIDs in a dedup interval
* is known and the interval is being finalized (i.e. when caller sees that
* the next tuple on the page is not a duplicate, or when caller runs out of
* tuples to process from leaf page).
*
* This is where bottom-up deletion determines and remembers which entries are
* duplicates. This will be important information to the tableam delete
* infrastructure later on. Plain index tuple duplicates are marked
* "promising" here, per tableam contract.
*
* Our approach to marking entries whose TIDs come from posting lists is more
* complicated. Posting lists can only be formed by a deduplication pass (or
* during an index build), so recent version churn affecting the pointed-to
* logical rows is not particularly likely. We may still give a weak signal
* about posting list tuples' entries (by marking just one of its TIDs/entries
* promising), though this is only a possibility in the event of further
* duplicate index tuples in final interval that covers posting list tuple (as
* in the plain tuple case). A weak signal/hint will be useful to the tableam
* when it has no stronger signal to go with for the deletion operation as a
* whole.
*
* The heuristics we use work well in practice because we only need to give
* the tableam the right _general_ idea about where to look. Garbage tends to
* naturally get concentrated in relatively few table blocks with workloads
* that bottom-up deletion targets. The tableam cannot possibly rank all
* available table blocks sensibly based on the hints we provide, but that's
* okay -- only the extremes matter. The tableam just needs to be able to
* predict which few table blocks will have the most tuples that are safe to
* delete for each deletion operation, with low variance across related
* deletion operations.
*/
static void
_bt_bottomupdel_finish_pending(Page page, BTDedupState state,
TM_IndexDeleteOp *delstate)
{
bool dupinterval = (state->nitems > 1);
Assert(state->nitems > 0);
Assert(state->nitems <= state->nhtids);
Assert(state->intervals[state->nintervals].baseoff == state->baseoff);
for (int i = 0; i < state->nitems; i++)
{
OffsetNumber offnum = state->baseoff + i;
ItemId itemid = PageGetItemId(page, offnum);
IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
TM_IndexDelete *ideltid = &delstate->deltids[delstate->ndeltids];
TM_IndexStatus *istatus = &delstate->status[delstate->ndeltids];
if (!BTreeTupleIsPosting(itup))
{
/* Simple case: A plain non-pivot tuple */
ideltid->tid = itup->t_tid;
ideltid->id = delstate->ndeltids;
istatus->idxoffnum = offnum;
istatus->knowndeletable = false; /* for now */
istatus->promising = dupinterval; /* simple rule */
istatus->freespace = ItemIdGetLength(itemid) + sizeof(ItemIdData);
delstate->ndeltids++;
}
else
{
/*
* Complicated case: A posting list tuple.
*
* We make the conservative assumption that there can only be at
* most one affected logical row per posting list tuple. There
* will be at most one promising entry in deltids to represent
* this presumed lone logical row. Note that this isn't even
* considered unless the posting list tuple is also in an interval
* of duplicates -- this complicated rule is just a variant of the
* simple rule used to decide if plain index tuples are promising.
*/
int nitem = BTreeTupleGetNPosting(itup);
bool firstpromising = false;
bool lastpromising = false;
Assert(_bt_posting_valid(itup));
if (dupinterval)
{
/*
* Complicated rule: either the first or last TID in the
* posting list gets marked promising (if any at all)
*/
BlockNumber minblocklist,
midblocklist,
maxblocklist;
ItemPointer mintid,
midtid,
maxtid;
mintid = BTreeTupleGetHeapTID(itup);
midtid = BTreeTupleGetPostingN(itup, nitem / 2);
maxtid = BTreeTupleGetMaxHeapTID(itup);
minblocklist = ItemPointerGetBlockNumber(mintid);
midblocklist = ItemPointerGetBlockNumber(midtid);
maxblocklist = ItemPointerGetBlockNumber(maxtid);
/* Only entry with predominant table block can be promising */
firstpromising = (minblocklist == midblocklist);
lastpromising = (!firstpromising &&
midblocklist == maxblocklist);
}
for (int p = 0; p < nitem; p++)
{
ItemPointer htid = BTreeTupleGetPostingN(itup, p);
ideltid->tid = *htid;
ideltid->id = delstate->ndeltids;
istatus->idxoffnum = offnum;
istatus->knowndeletable = false; /* for now */
istatus->promising = false;
if ((firstpromising && p == 0) ||
(lastpromising && p == nitem - 1))
istatus->promising = true;
istatus->freespace = sizeof(ItemPointerData); /* at worst */
ideltid++;
istatus++;
delstate->ndeltids++;
}
}
}
if (dupinterval)
{
state->intervals[state->nintervals].nitems = state->nitems;
state->nintervals++;
}
/* Reset state for next interval */
state->nhtids = 0;
state->nitems = 0;
state->phystupsize = 0;
}
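The "complicated rule" for posting list tuples can be restated as a small standalone function. This is a sketch under assumptions: `sketch_promising_index` and `SketchBlockNumber` are hypothetical names, and the input array stands in for the table block numbers of a posting list's TIDs in ascending TID order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for BlockNumber */
typedef uint32_t SketchBlockNumber;

/*
 * Given the table block of each TID in a posting list, return the index of
 * the single entry that would be marked promising, or -1 if none is.
 *
 * The first entry wins when its block matches the block of the middle TID;
 * otherwise the last entry wins when the middle TID's block matches the
 * last TID's block.  Nothing is promising outside an interval of duplicates.
 */
static int
sketch_promising_index(const SketchBlockNumber *blocks, int nitem,
					   bool dupinterval)
{
	SketchBlockNumber minblock,
				midblock,
				maxblock;

	assert(nitem > 0);

	if (!dupinterval)
		return -1;

	minblock = blocks[0];
	midblock = blocks[nitem / 2];
	maxblock = blocks[nitem - 1];

	if (minblock == midblock)
		return 0;
	if (midblock == maxblock)
		return nitem - 1;
	return -1;
}
```

Comparing against the middle TID is a cheap way to ask whether one table block accounts for roughly half or more of the posting list, without counting every TID.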
/*
* Determine if page non-pivot tuples (data items) are all duplicates of the
* same value -- if they are, deduplication's "single value" strategy should
......@@ -622,8 +909,8 @@ _bt_form_posting(IndexTuple base, ItemPointer htids, int nhtids)
* Generate a replacement tuple by "updating" a posting list tuple so that it
* no longer has TIDs that need to be deleted.
*
* Used by VACUUM. Caller's vacposting argument points to the existing
* posting list tuple to be updated.
* Used by both VACUUM and index deletion. Caller's vacposting argument
* points to the existing posting list tuple to be updated.
*
* On return, caller's vacposting argument will point to final "updated"
* tuple, which will be palloc()'d in caller's memory context.
......@@ -17,9 +17,9 @@
#include "access/nbtree.h"
#include "access/nbtxlog.h"
#include "access/tableam.h"
#include "access/transam.h"
#include "access/xloginsert.h"
#include "lib/qunique.h"
#include "miscadmin.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
......@@ -37,6 +37,7 @@ static TransactionId _bt_check_unique(Relation rel, BTInsertState insertstate,
static OffsetNumber _bt_findinsertloc(Relation rel,
BTInsertState insertstate,
bool checkingunique,
bool indexUnchanged,
BTStack stack,
Relation heapRel);
static void _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack);
......@@ -60,8 +61,16 @@ static inline bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
OffsetNumber itup_off, bool newfirstdataitem);
static void _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel,
BTInsertState insertstate,
bool lpdeadonly, bool checkingunique,
bool uniquedup);
bool simpleonly, bool checkingunique,
bool uniquedup, bool indexUnchanged);
static void _bt_simpledel_pass(Relation rel, Buffer buffer, Relation heapRel,
OffsetNumber *deletable, int ndeletable,
IndexTuple newitem, OffsetNumber minoff,
OffsetNumber maxoff);
static BlockNumber *_bt_deadblocks(Page page, OffsetNumber *deletable,
int ndeletable, IndexTuple newitem,
int *nblocks);
static inline int _bt_blk_cmp(const void *arg1, const void *arg2);
/*
* _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
......@@ -75,6 +84,11 @@ static void _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel,
* For UNIQUE_CHECK_EXISTING we merely run the duplicate check, and
* don't actually insert.
*
* indexUnchanged executor hint indicates if itup is from an
* UPDATE that didn't logically change the indexed value, but
* must nevertheless have a new entry to point to a successor
* version.
*
* The result value is only significant for UNIQUE_CHECK_PARTIAL:
* it must be true if the entry is known unique, else false.
* (In the current implementation we'll also return true after a
......@@ -83,7 +97,8 @@ static void _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel,
*/
bool
_bt_doinsert(Relation rel, IndexTuple itup,
IndexUniqueCheck checkUnique, Relation heapRel)
IndexUniqueCheck checkUnique, bool indexUnchanged,
Relation heapRel)
{
bool is_unique = false;
BTInsertStateData insertstate;
......@@ -238,7 +253,7 @@ search:
* checkingunique.
*/
newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
stack, heapRel);
indexUnchanged, stack, heapRel);
_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
itup, insertstate.itemsz, newitemoff,
insertstate.postingoff, false);
......@@ -480,11 +495,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
* items as quickly as we can. We only apply _bt_compare() when
* we get to a non-killed item. We could reuse the bounds to
* avoid _bt_compare() calls for known equal tuples, but it
* doesn't seem worth it. Workloads with heavy update activity
* tend to have many deduplication passes, so we'll often avoid
* most of those comparisons, too (we call _bt_compare() when the
* posting list tuple is initially encountered, though not when
* processing later TIDs from the same tuple).
* doesn't seem worth it.
*/
if (!inposting)
curitemid = PageGetItemId(page, offset);
......@@ -777,6 +788,17 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
* room for the new tuple, this function moves right, trying to find a
* legal page that does.)
*
* If 'indexUnchanged' is true, this is for an UPDATE that didn't
* logically change the indexed value, but must nevertheless have a new
* entry to point to a successor version. This hint from the executor
* will influence our behavior when the page might have to be split and
* we must consider our options. Bottom-up index deletion can avoid
* pathological version-driven page splits, but we only want to go to the
* trouble of trying it when we already have moderate confidence that
* it's appropriate. The hint should not significantly affect our
* behavior over time unless practically all inserts onto the leaf page
* get the hint.
*
* On exit, insertstate buffer contains the chosen insertion page, and
* the offset within that page is returned. If _bt_findinsertloc needed
* to move right, the lock and pin on the original page are released, and
......@@ -793,6 +815,7 @@ static OffsetNumber
_bt_findinsertloc(Relation rel,
BTInsertState insertstate,
bool checkingunique,
bool indexUnchanged,
BTStack stack,
Relation heapRel)
{
......@@ -817,7 +840,7 @@ _bt_findinsertloc(Relation rel,
if (itup_key->heapkeyspace)
{
/* Keep track of whether checkingunique duplicate seen */
bool uniquedup = false;
bool uniquedup = indexUnchanged;
/*
* If we're inserting into a unique index, we may have to walk right
......@@ -874,14 +897,13 @@ _bt_findinsertloc(Relation rel,
}
/*
* If the target page is full, see if we can obtain enough space using
* one or more strategies (e.g. erasing LP_DEAD items, deduplication).
* Page splits are expensive, and should only go ahead when truly
* necessary.
* If the target page cannot fit newitem, try to avoid splitting the
* page on insert by performing deletion or deduplication now
*/
if (PageGetFreeSpace(page) < insertstate->itemsz)
_bt_delete_or_dedup_one_page(rel, heapRel, insertstate, false,
checkingunique, uniquedup);
checkingunique, uniquedup,
indexUnchanged);
}
else
{
......@@ -921,9 +943,9 @@ _bt_findinsertloc(Relation rel,
*/
if (P_HAS_GARBAGE(opaque))
{
/* Erase LP_DEAD items (won't deduplicate) */
/* Perform simple deletion */
_bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
checkingunique, false);
false, false, false);
if (PageGetFreeSpace(page) >= insertstate->itemsz)
break; /* OK, now we have enough space */
......@@ -970,14 +992,11 @@ _bt_findinsertloc(Relation rel,
/*
* There is an overlapping posting list tuple with its LP_DEAD bit
* set. We don't want to unnecessarily unset its LP_DEAD bit while
* performing a posting list split, so delete all LP_DEAD items early.
* This is the only case where LP_DEAD deletes happen even though
* there is space for newitem on the page.
*
* This can only erase LP_DEAD items (it won't deduplicate).
* performing a posting list split, so perform simple index tuple
* deletion early.
*/
_bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
checkingunique, false);
false, false, false);
/*
* Do new binary search. New insert location cannot overlap with any
......@@ -2606,21 +2625,19 @@ _bt_pgaddtup(Page page,
}
/*
* _bt_delete_or_dedup_one_page - Try to avoid a leaf page split by attempting
* a variety of operations.
*
* There are two operations performed here: deleting items already marked
* LP_DEAD, and deduplication. If both operations fail to free enough space
* for the incoming item then caller will go on to split the page. We always
* attempt our preferred strategy (which is to delete items whose LP_DEAD bit
* are set) first. If that doesn't work out we move on to deduplication.
* _bt_delete_or_dedup_one_page - Try to avoid a leaf page split.
*
* Caller's checkingunique and uniquedup arguments help us decide if we should
* perform deduplication, which is primarily useful with low cardinality data,
* but can sometimes absorb version churn.
* There are three operations performed here: simple index deletion, bottom-up
* index deletion, and deduplication. If all three operations fail to free
* enough space for the incoming item then caller will go on to split the
* page. We always consider simple deletion first. If that doesn't work out
* we consider alternatives. Callers that only want us to consider simple
* deletion (without any fallback) ask for that using the 'simpleonly'
* argument.
*
* Callers that only want us to look for/delete LP_DEAD items can ask for that
* directly by passing true 'lpdeadonly' argument.
* We usually pick only one alternative "complex" operation when simple
* deletion alone won't prevent a page split. The 'checkingunique',
* 'uniquedup', and 'indexUnchanged' arguments are used for that.
*
* Note: We used to only delete LP_DEAD items when the BTP_HAS_GARBAGE page
* level flag was found set. The flag was useful back when there wasn't
......@@ -2638,12 +2655,13 @@ _bt_pgaddtup(Page page,
static void
_bt_delete_or_dedup_one_page(Relation rel, Relation heapRel,
BTInsertState insertstate,
bool lpdeadonly, bool checkingunique,
bool uniquedup)
bool simpleonly, bool checkingunique,
bool uniquedup, bool indexUnchanged)
{
OffsetNumber deletable[MaxIndexTuplesPerPage];
int ndeletable = 0;
OffsetNumber offnum,
minoff,
maxoff;
Buffer buffer = insertstate->buf;
BTScanInsert itup_key = insertstate->itup_key;
......@@ -2651,14 +2669,19 @@ _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel,
BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
Assert(P_ISLEAF(opaque));
Assert(lpdeadonly || itup_key->heapkeyspace);
Assert(simpleonly || itup_key->heapkeyspace);
Assert(!simpleonly || (!checkingunique && !uniquedup && !indexUnchanged));
/*
* Scan over all items to see which ones need to be deleted according to
* LP_DEAD flags.
* LP_DEAD flags. We'll usually manage to delete a few extra items that
* are not marked LP_DEAD in passing. Often the extra items that actually
* end up getting deleted are items that would have had their LP_DEAD bit
* set before long anyway (if we opted not to include them as extras).
*/
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
for (offnum = P_FIRSTDATAKEY(opaque);
for (offnum = minoff;
offnum <= maxoff;
offnum = OffsetNumberNext(offnum))
{
......@@ -2670,7 +2693,8 @@ _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel,
if (ndeletable > 0)
{
_bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
_bt_simpledel_pass(rel, buffer, heapRel, deletable, ndeletable,
insertstate->itup, minoff, maxoff);
insertstate->bounds_valid = false;
/* Return when a page split has already been avoided */
......@@ -2682,37 +2706,288 @@ _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel,
}
/*
* Some callers only want to delete LP_DEAD items. Return early for these
* callers.
* We're done with simple deletion. Return early for callers that only
* call here so that simple deletion can be considered. This includes
* callers that explicitly ask for this and checkingunique callers that
* probably don't have any version churn duplicates on the page.
*
* Note: The page's BTP_HAS_GARBAGE hint flag may still be set when we
* return at this point (or when we go on to try either or both of our
* other strategies and they also fail). We do not bother expending a
* separate write to clear it, however. Caller will definitely clear it
* when it goes on to split the page (plus deduplication knows to clear
* the flag when it actually modifies the page).
* when it goes on to split the page (note also that the deduplication
* process will clear the flag in passing, just to keep things tidy).
*/
if (lpdeadonly)
return;
/*
* We can get called in the checkingunique case when there is no reason to
* believe that there are any duplicates on the page; we should at least
* still check for LP_DEAD items. If that didn't work out, give up and
* let caller split the page. Deduplication cannot be justified given
* there is no reason to think that there are duplicates.
*/
if (checkingunique && !uniquedup)
if (simpleonly || (checkingunique && !uniquedup))
{
Assert(!indexUnchanged);
return;
}
/* Assume bounds about to be invalidated (this is almost certain now) */
insertstate->bounds_valid = false;
/*
* Perform deduplication pass, though only when it is enabled for the
* index and known to be safe (it must be an allequalimage index).
* Perform bottom-up index deletion pass when executor hint indicated that
* incoming item is logically unchanged, or for a unique index that is
* known to have physical duplicates for some other reason. (There is a
* large overlap between these two cases for a unique index. It's worth
* having both triggering conditions in order to apply the optimization in
* the event of successive related INSERT and DELETE statements.)
*
* We'll go on to do a deduplication pass when a bottom-up pass fails to
* free an acceptable amount of space (a significant fraction of the page,
* or space for the new item, whichever is greater).
*
* Note: Bottom-up index deletion uses the same equality/equivalence
* routines as deduplication internally. However, it does not merge
* together index tuples, so the same correctness considerations do not
* apply. We deliberately omit an index-is-allequalimage test here.
*/
if ((indexUnchanged || uniquedup) &&
_bt_bottomupdel_pass(rel, buffer, heapRel, insertstate->itemsz))
return;
/* Perform deduplication pass (when enabled and index-is-allequalimage) */
if (BTGetDeduplicateItems(rel) && itup_key->allequalimage)
_bt_dedup_pass(rel, buffer, heapRel, insertstate->itup,
insertstate->itemsz, checkingunique);
}
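The order in which the new version of _bt_delete_or_dedup_one_page considers its alternatives, once simple deletion has failed to free enough space, can be summarized as a pure function. This is a simplified sketch: the `Sketch*` names are hypothetical, `dedup_ok` stands in for `BTGetDeduplicateItems(rel) && itup_key->allequalimage`, and only the first "complex" operation attempted is reported (a bottom-up pass that frees too little space still falls through to deduplication in the real code).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the operation tried after simple deletion */
typedef enum SketchNextOp
{
	SKETCH_OP_NONE,				/* return; caller may split the page */
	SKETCH_OP_BOTTOMUP,			/* bottom-up index deletion pass */
	SKETCH_OP_DEDUP				/* deduplication pass */
} SketchNextOp;

static SketchNextOp
sketch_next_op(bool simpleonly, bool checkingunique, bool uniquedup,
			   bool indexUnchanged, bool dedup_ok)
{
	/* Same precondition that the function asserts on entry */
	assert(!simpleonly || (!checkingunique && !uniquedup && !indexUnchanged));

	/* Callers that only wanted simple deletion, or saw no duplicates */
	if (simpleonly || (checkingunique && !uniquedup))
		return SKETCH_OP_NONE;

	/* Executor hint, or unique index with known physical duplicates */
	if (indexUnchanged || uniquedup)
		return SKETCH_OP_BOTTOMUP;

	/* Otherwise deduplicate, when enabled and safe for this index */
	return dedup_ok ? SKETCH_OP_DEDUP : SKETCH_OP_NONE;
}
```

Because _bt_findinsertloc initializes `uniquedup` from `indexUnchanged`, the early-return branch is never reached with the executor hint set, which is what the `Assert(!indexUnchanged)` in that branch relies on.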
/*
* _bt_simpledel_pass - Simple index tuple deletion pass.
*
* We delete all LP_DEAD-set index tuples on a leaf page. The offset numbers
* of all such tuples are determined by caller (caller passes these to us as
* its 'deletable' argument).
*
* We might also delete extra index tuples that turn out to be safe to delete
* in passing (though they must be cheap to check in passing to begin with).
* There is no certainty that any extra tuples will be deleted, though. The
* high level goal of the approach we take is to get the most out of each call
* here (without noticeably increasing the per-call overhead compared to what
* we need to do just to be able to delete the page's LP_DEAD-marked index
* tuples).
*
* The number of extra index tuples that turn out to be deletable might
* greatly exceed the number of LP_DEAD-marked index tuples due to various
* locality related effects. For example, it's possible that the total number
* of table blocks (pointed to by all TIDs on the leaf page) is naturally
* quite low, in which case we might end up checking if it's possible to
* delete _most_ index tuples on the page (without the tableam needing to
* access additional table blocks). The tableam will sometimes stumble upon
* _many_ extra deletable index tuples in indexes where this pattern is
* common.
*
* See nbtree/README for further details on simple index tuple deletion.
*/
static void
_bt_simpledel_pass(Relation rel, Buffer buffer, Relation heapRel,
OffsetNumber *deletable, int ndeletable, IndexTuple newitem,
OffsetNumber minoff, OffsetNumber maxoff)
{
Page page = BufferGetPage(buffer);
BlockNumber *deadblocks;
int ndeadblocks;
TM_IndexDeleteOp delstate;
OffsetNumber offnum;
/* Get array of table blocks pointed to by LP_DEAD-set tuples */
deadblocks = _bt_deadblocks(page, deletable, ndeletable, newitem,
&ndeadblocks);
/* Initialize tableam state that describes index deletion operation */
delstate.bottomup = false;
delstate.bottomupfreespace = 0;
delstate.ndeltids = 0;
delstate.deltids = palloc(MaxTIDsPerBTreePage * sizeof(TM_IndexDelete));
delstate.status = palloc(MaxTIDsPerBTreePage * sizeof(TM_IndexStatus));
for (offnum = minoff;
offnum <= maxoff;
offnum = OffsetNumberNext(offnum))
{
ItemId itemid = PageGetItemId(page, offnum);
IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
TM_IndexDelete *odeltid = &delstate.deltids[delstate.ndeltids];
TM_IndexStatus *ostatus = &delstate.status[delstate.ndeltids];
BlockNumber tidblock;
void *match;
if (!BTreeTupleIsPosting(itup))
{
tidblock = ItemPointerGetBlockNumber(&itup->t_tid);
match = bsearch(&tidblock, deadblocks, ndeadblocks,
sizeof(BlockNumber), _bt_blk_cmp);
if (!match)
{
Assert(!ItemIdIsDead(itemid));
continue;
}
/*
* TID's table block is among those pointed to by the TIDs from
* LP_DEAD-bit set tuples on page -- add TID to deltids
*/
odeltid->tid = itup->t_tid;
odeltid->id = delstate.ndeltids;
ostatus->idxoffnum = offnum;
ostatus->knowndeletable = ItemIdIsDead(itemid);
ostatus->promising = false; /* unused */
ostatus->freespace = 0; /* unused */
delstate.ndeltids++;
}
else
{
int nitem = BTreeTupleGetNPosting(itup);
for (int p = 0; p < nitem; p++)
{
ItemPointer tid = BTreeTupleGetPostingN(itup, p);
tidblock = ItemPointerGetBlockNumber(tid);
match = bsearch(&tidblock, deadblocks, ndeadblocks,
sizeof(BlockNumber), _bt_blk_cmp);
if (!match)
{
Assert(!ItemIdIsDead(itemid));
continue;
}
/*
* TID's table block is among those pointed to by the TIDs
* from LP_DEAD-bit set tuples on page -- add TID to deltids
*/
odeltid->tid = *tid;
odeltid->id = delstate.ndeltids;
ostatus->idxoffnum = offnum;
ostatus->knowndeletable = ItemIdIsDead(itemid);
ostatus->promising = false; /* unused */
ostatus->freespace = 0; /* unused */
odeltid++;
ostatus++;
delstate.ndeltids++;
}
}
}
pfree(deadblocks);
Assert(delstate.ndeltids >= ndeletable);
/* Physically delete LP_DEAD tuples (plus any delete-safe extra TIDs) */
_bt_delitems_delete_check(rel, buffer, heapRel, &delstate);
pfree(delstate.deltids);
pfree(delstate.status);
}
/*
* _bt_deadblocks() -- Get LP_DEAD related table blocks.
*
* Builds sorted and unique-ified array of table block numbers from index
* tuple TIDs whose line pointers are marked LP_DEAD. Also adds the table
* block from incoming newitem just in case it isn't among the LP_DEAD-related
* table blocks.
*
* Always counting the newitem's table block as an LP_DEAD related block makes
* sense because the cost is consistently low; it is practically certain that
* the table block will not incur a buffer miss in tableam. On the other hand
* the benefit is often quite high. There is a decent chance that there will
* be some deletable items from this block, since in general most garbage
* tuples became garbage in the recent past (in many cases this won't be the
* first logical row that core code added to/modified in table block
* recently).
*
* Returns final array, and sets *nblocks to its final size for caller.
*/
static BlockNumber *
_bt_deadblocks(Page page, OffsetNumber *deletable, int ndeletable,
IndexTuple newitem, int *nblocks)
{
int spacentids,
ntids;
BlockNumber *tidblocks;
/*
* Accumulate each TID's block in array whose initial size has space for
* one table block per LP_DEAD-set tuple (plus space for the newitem table
* block). Array will only need to grow when there are LP_DEAD-marked
* posting list tuples (which is not that common).
*/
spacentids = ndeletable + 1;
ntids = 0;
tidblocks = (BlockNumber *) palloc(sizeof(BlockNumber) * spacentids);
/*
* First add the table block for the incoming newitem. This is the one
* case where simple deletion can visit a table block that doesn't have
* any known deletable items.
*/
Assert(!BTreeTupleIsPosting(newitem) && !BTreeTupleIsPivot(newitem));
tidblocks[ntids++] = ItemPointerGetBlockNumber(&newitem->t_tid);
for (int i = 0; i < ndeletable; i++)
{
ItemId itemid = PageGetItemId(page, deletable[i]);
IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
Assert(ItemIdIsDead(itemid));
if (!BTreeTupleIsPosting(itup))
{
if (ntids + 1 > spacentids)
{
spacentids *= 2;
tidblocks = (BlockNumber *)
repalloc(tidblocks, sizeof(BlockNumber) * spacentids);
}
tidblocks[ntids++] = ItemPointerGetBlockNumber(&itup->t_tid);
}
else
{
int nposting = BTreeTupleGetNPosting(itup);
if (ntids + nposting > spacentids)
{
spacentids = Max(spacentids * 2, ntids + nposting);
tidblocks = (BlockNumber *)
repalloc(tidblocks, sizeof(BlockNumber) * spacentids);
}
for (int j = 0; j < nposting; j++)
{
ItemPointer tid = BTreeTupleGetPostingN(itup, j);
tidblocks[ntids++] = ItemPointerGetBlockNumber(tid);
}
}
}
qsort(tidblocks, ntids, sizeof(BlockNumber), _bt_blk_cmp);
*nblocks = qunique(tidblocks, ntids, sizeof(BlockNumber), _bt_blk_cmp);
return tidblocks;
}
/*
* _bt_blk_cmp() -- BlockNumber comparator for qsort (_bt_deadblocks) and
* bsearch (_bt_simpledel_pass)
*/
static inline int
_bt_blk_cmp(const void *arg1, const void *arg2)
{
BlockNumber b1 = *((BlockNumber *) arg1);
BlockNumber b2 = *((BlockNumber *) arg2);
if (b1 < b2)
return -1;
else if (b1 > b2)
return 1;
return 0;
}
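The sort, unique-ify, then binary-search pattern that _bt_deadblocks and _bt_simpledel_pass share can be tried outside of PostgreSQL. The sketch below is an illustration only: the `sketch_*` names are hypothetical, and a hand-written loop stands in for `qunique()` from `lib/qunique.h`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for BlockNumber */
typedef uint32_t SketchBlockNumber;

/* Three-way comparison, usable with both qsort() and bsearch() */
static int
sketch_blk_cmp(const void *arg1, const void *arg2)
{
	SketchBlockNumber b1 = *(const SketchBlockNumber *) arg1;
	SketchBlockNumber b2 = *(const SketchBlockNumber *) arg2;

	if (b1 < b2)
		return -1;
	else if (b1 > b2)
		return 1;
	return 0;
}

/* Sort blocks in place, drop duplicates, and return the new length */
static size_t
sketch_sort_unique(SketchBlockNumber *blocks, size_t n)
{
	size_t		nunique;

	if (n == 0)
		return 0;

	qsort(blocks, n, sizeof(SketchBlockNumber), sketch_blk_cmp);

	nunique = 1;
	for (size_t i = 1; i < n; i++)
	{
		if (blocks[i] != blocks[nunique - 1])
			blocks[nunique++] = blocks[i];
	}
	return nunique;
}

/* Is blk one of the table blocks that simple deletion will visit anyway? */
static bool
sketch_block_is_dead(SketchBlockNumber blk,
					 const SketchBlockNumber *deadblocks, size_t ndead)
{
	return bsearch(&blk, deadblocks, ndead, sizeof(SketchBlockNumber),
				   sketch_blk_cmp) != NULL;
}
```

Only TIDs whose table block passes the membership test are added to the deletion request, which is why the extra checks do not make the tableam visit additional table blocks.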
......@@ -38,8 +38,14 @@
static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf);
static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
TransactionId latestRemovedXid);
static TransactionId _bt_xid_horizon(Relation rel, Relation heapRel, Page page,
OffsetNumber *deletable, int ndeletable);
static void _bt_delitems_delete(Relation rel, Buffer buf,
TransactionId latestRemovedXid,
OffsetNumber *deletable, int ndeletable,
BTVacuumPosting *updatable, int nupdatable,
Relation heapRel);
static char *_bt_delitems_update(BTVacuumPosting *updatable, int nupdatable,
OffsetNumber *updatedoffsets,
Size *updatedbuflen, bool needswal);
static bool _bt_mark_page_halfdead(Relation rel, Buffer leafbuf,
BTStack stack);
static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
......@@ -1110,15 +1116,16 @@ _bt_page_recyclable(Page page)
* sorted in ascending order.
*
* Routine deals with deleting TIDs when some (but not all) of the heap TIDs
* in an existing posting list item are to be removed by VACUUM. This works
* by updating/overwriting an existing item with caller's new version of the
* item (a version that lacks the TIDs that are to be deleted).
* in an existing posting list item are to be removed. This works by
* updating/overwriting an existing item with caller's new version of the item
* (a version that lacks the TIDs that are to be deleted).
*
* We record VACUUMs and b-tree deletes differently in WAL. Deletes must
* generate their own latestRemovedXid by accessing the heap directly, whereas
* VACUUMs rely on the initial heap scan taking care of it indirectly. Also,
* only VACUUM can perform granular deletes of individual TIDs in posting list
* tuples.
* generate their own latestRemovedXid by accessing the table directly,
* whereas VACUUMs rely on the initial VACUUM table scan performing
* WAL-logging that takes care of the issue for the table's indexes
* indirectly. Also, we remove the VACUUM cycle ID from pages, which b-tree
* deletes don't do.
*/
void
_bt_delitems_vacuum(Relation rel, Buffer buf,
......@@ -1127,7 +1134,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
Size itemsz;
bool needswal = RelationNeedsWAL(rel);
char *updatedbuf = NULL;
Size updatedbuflen = 0;
OffsetNumber updatedoffsets[MaxIndexTuplesPerPage];
......@@ -1135,45 +1142,11 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
/* Shouldn't be called unless there's something to do */
Assert(ndeletable > 0 || nupdatable > 0);
for (int i = 0; i < nupdatable; i++)
{
/* Replace work area IndexTuple with updated version */
_bt_update_posting(updatable[i]);
/* Maintain array of updatable page offsets for WAL record */
updatedoffsets[i] = updatable[i]->updatedoffset;
}
/* XLOG stuff -- allocate and fill buffer before critical section */
if (nupdatable > 0 && RelationNeedsWAL(rel))
{
Size offset = 0;
for (int i = 0; i < nupdatable; i++)
{
BTVacuumPosting vacposting = updatable[i];
itemsz = SizeOfBtreeUpdate +
vacposting->ndeletedtids * sizeof(uint16);
updatedbuflen += itemsz;
}
updatedbuf = palloc(updatedbuflen);
for (int i = 0; i < nupdatable; i++)
{
BTVacuumPosting vacposting = updatable[i];
xl_btree_update update;
update.ndeletedtids = vacposting->ndeletedtids;
memcpy(updatedbuf + offset, &update.ndeletedtids,
SizeOfBtreeUpdate);
offset += SizeOfBtreeUpdate;
itemsz = update.ndeletedtids * sizeof(uint16);
memcpy(updatedbuf + offset, vacposting->deletetids, itemsz);
offset += itemsz;
}
}
/* Generate new version of posting lists without deleted TIDs */
if (nupdatable > 0)
updatedbuf = _bt_delitems_update(updatable, nupdatable,
updatedoffsets, &updatedbuflen,
needswal);
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
......@@ -1194,6 +1167,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
{
OffsetNumber updatedoffset = updatedoffsets[i];
IndexTuple itup;
Size itemsz;
itup = updatable[i]->itup;
itemsz = MAXALIGN(IndexTupleSize(itup));
......@@ -1218,7 +1192,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
* Clear the BTP_HAS_GARBAGE page flag.
*
* This flag indicates the presence of LP_DEAD items on the page (though
* not reliably). Note that we only trust it with pg_upgrade'd
* not reliably). Note that we only rely on it with pg_upgrade'd
* !heapkeyspace indexes. That's why clearing it here won't usually
* interfere with _bt_delitems_delete().
*/
......@@ -1227,7 +1201,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
MarkBufferDirty(buf);
/* XLOG stuff */
if (RelationNeedsWAL(rel))
if (needswal)
{
XLogRecPtr recptr;
xl_btree_vacuum xlrec_vacuum;
......@@ -1260,7 +1234,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
/* can't leak memory here */
if (updatedbuf != NULL)
pfree(updatedbuf);
/* free tuples generated by calling _bt_update_posting() */
/* free tuples allocated within _bt_delitems_update() */
for (int i = 0; i < nupdatable; i++)
pfree(updatable[i]->itup);
}
......@@ -1269,40 +1243,66 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
* Delete item(s) from a btree leaf page during single-page cleanup.
*
* This routine assumes that the caller has pinned and write locked the
* buffer. Also, the given deletable array *must* be sorted in ascending
* order.
* buffer. Also, the given deletable and updatable arrays *must* be sorted in
* ascending order.
*
* Routine deals with deleting TIDs when some (but not all) of the heap TIDs
* in an existing posting list item are to be removed. This works by
* updating/overwriting an existing item with caller's new version of the item
* (a version that lacks the TIDs that are to be deleted).
*
* This is nearly the same as _bt_delitems_vacuum as far as what it does to
* the page, but it needs to generate its own latestRemovedXid by accessing
* the heap. This is used by the REDO routine to generate recovery conflicts.
* Also, it doesn't handle posting list tuples unless the entire tuple can be
* deleted as a whole (since there is only one LP_DEAD bit per line pointer).
* the page, but it needs its own latestRemovedXid from caller (caller gets
* this from tableam). This is used by the REDO routine to generate recovery
* conflicts. The other difference is that only _bt_delitems_vacuum will
* clear page's VACUUM cycle ID.
*/
void
_bt_delitems_delete(Relation rel, Buffer buf,
static void
_bt_delitems_delete(Relation rel, Buffer buf, TransactionId latestRemovedXid,
OffsetNumber *deletable, int ndeletable,
BTVacuumPosting *updatable, int nupdatable,
Relation heapRel)
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
TransactionId latestRemovedXid = InvalidTransactionId;
bool needswal = RelationNeedsWAL(rel);
char *updatedbuf = NULL;
Size updatedbuflen = 0;
OffsetNumber updatedoffsets[MaxIndexTuplesPerPage];
/* Shouldn't be called unless there's something to do */
Assert(ndeletable > 0);
Assert(ndeletable > 0 || nupdatable > 0);
if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
latestRemovedXid =
_bt_xid_horizon(rel, heapRel, page, deletable, ndeletable);
/* Generate new versions of posting lists without deleted TIDs */
if (nupdatable > 0)
updatedbuf = _bt_delitems_update(updatable, nupdatable,
updatedoffsets, &updatedbuflen,
needswal);
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
/* Fix the page */
PageIndexMultiDelete(page, deletable, ndeletable);
/* Handle updates and deletes just like _bt_delitems_vacuum */
for (int i = 0; i < nupdatable; i++)
{
OffsetNumber updatedoffset = updatedoffsets[i];
IndexTuple itup;
Size itemsz;
itup = updatable[i]->itup;
itemsz = MAXALIGN(IndexTupleSize(itup));
if (!PageIndexTupleOverwrite(page, updatedoffset, (Item) itup,
itemsz))
elog(PANIC, "failed to update partially dead item in block %u of index \"%s\"",
BufferGetBlockNumber(buf), RelationGetRelationName(rel));
}
if (ndeletable > 0)
PageIndexMultiDelete(page, deletable, ndeletable);
/*
* Unlike _bt_delitems_vacuum, we *must not* clear the vacuum cycle ID,
* because this is not called by VACUUM
* Unlike _bt_delitems_vacuum, we *must not* clear the vacuum cycle ID at
* this point. The VACUUM command alone controls vacuum cycle IDs.
*/
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -1310,7 +1310,7 @@ _bt_delitems_delete(Relation rel, Buffer buf,
* Clear the BTP_HAS_GARBAGE page flag.
*
* This flag indicates the presence of LP_DEAD items on the page (though
* not reliably). Note that we only trust it with pg_upgrade'd
* not reliably). Note that we only rely on it with pg_upgrade'd
* !heapkeyspace indexes.
*/
opaque->btpo_flags &= ~BTP_HAS_GARBAGE;
@@ -1318,25 +1318,29 @@ _bt_delitems_delete(Relation rel, Buffer buf,
MarkBufferDirty(buf);
/* XLOG stuff */
if (RelationNeedsWAL(rel))
if (needswal)
{
XLogRecPtr recptr;
xl_btree_delete xlrec_delete;
xlrec_delete.latestRemovedXid = latestRemovedXid;
xlrec_delete.ndeleted = ndeletable;
xlrec_delete.nupdated = nupdatable;
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
XLogRegisterData((char *) &xlrec_delete, SizeOfBtreeDelete);
/*
* The deletable array is not in the buffer, but pretend that it is.
* When XLogInsert stores the whole buffer, the array need not be
* stored too.
*/
XLogRegisterBufData(0, (char *) deletable,
ndeletable * sizeof(OffsetNumber));
if (ndeletable > 0)
XLogRegisterBufData(0, (char *) deletable,
ndeletable * sizeof(OffsetNumber));
if (nupdatable > 0)
{
XLogRegisterBufData(0, (char *) updatedoffsets,
nupdatable * sizeof(OffsetNumber));
XLogRegisterBufData(0, updatedbuf, updatedbuflen);
}
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DELETE);
@@ -1344,83 +1348,313 @@ _bt_delitems_delete(Relation rel, Buffer buf,
}
END_CRIT_SECTION();
/* can't leak memory here */
if (updatedbuf != NULL)
pfree(updatedbuf);
/* free tuples allocated within _bt_delitems_update() */
for (int i = 0; i < nupdatable; i++)
pfree(updatable[i]->itup);
}
/*
* Get the latestRemovedXid from the table entries pointed to by the non-pivot
* tuples being deleted.
* Set up state needed to delete TIDs from posting list tuples via "updating"
* the tuple. Performs steps common to both _bt_delitems_vacuum and
* _bt_delitems_delete. These steps must take place before each function's
* critical section begins.
*
* updatable and nupdatable are inputs, though note that we will use
* _bt_update_posting() to replace the original itup with a pointer to a final
* version in palloc()'d memory. Caller should free the tuples when it's done.
*
* The first nupdatable entries from updatedoffsets are set to the page offset
* number for posting list tuples that caller updates. This is mostly useful
* because caller may need to WAL-log the page offsets (though we always do
* this for caller out of convenience).
*
* This is a specialized version of index_compute_xid_horizon_for_tuples().
* It's needed because btree tuples don't always store table TID using the
* standard index tuple header field.
* Returns buffer consisting of an array of xl_btree_update structs that
* describe the steps we perform here for caller (though only when needswal is
* true). Also sets *updatedbuflen to the final size of the buffer. This
* buffer is used by caller when WAL logging is required.
*/
static TransactionId
_bt_xid_horizon(Relation rel, Relation heapRel, Page page,
OffsetNumber *deletable, int ndeletable)
static char *
_bt_delitems_update(BTVacuumPosting *updatable, int nupdatable,
OffsetNumber *updatedoffsets, Size *updatedbuflen,
bool needswal)
{
TransactionId latestRemovedXid = InvalidTransactionId;
int spacenhtids;
int nhtids;
ItemPointer htids;
/* Array will grow iff there are posting list tuples to consider */
spacenhtids = ndeletable;
nhtids = 0;
htids = (ItemPointer) palloc(sizeof(ItemPointerData) * spacenhtids);
for (int i = 0; i < ndeletable; i++)
char *updatedbuf = NULL;
Size buflen = 0;
/* Shouldn't be called unless there's something to do */
Assert(nupdatable > 0);
for (int i = 0; i < nupdatable; i++)
{
ItemId itemid;
IndexTuple itup;
BTVacuumPosting vacposting = updatable[i];
Size itemsz;
itemid = PageGetItemId(page, deletable[i]);
itup = (IndexTuple) PageGetItem(page, itemid);
/* Replace work area IndexTuple with updated version */
_bt_update_posting(vacposting);
Assert(ItemIdIsDead(itemid));
Assert(!BTreeTupleIsPivot(itup));
/* Keep track of size of xl_btree_update for updatedbuf in passing */
itemsz = SizeOfBtreeUpdate + vacposting->ndeletedtids * sizeof(uint16);
buflen += itemsz;
if (!BTreeTupleIsPosting(itup))
/* Build updatedoffsets buffer in passing */
updatedoffsets[i] = vacposting->updatedoffset;
}
/* XLOG stuff */
if (needswal)
{
Size offset = 0;
/* Allocate, set final size for caller */
updatedbuf = palloc(buflen);
*updatedbuflen = buflen;
for (int i = 0; i < nupdatable; i++)
{
if (nhtids + 1 > spacenhtids)
{
spacenhtids *= 2;
htids = (ItemPointer)
repalloc(htids, sizeof(ItemPointerData) * spacenhtids);
}
BTVacuumPosting vacposting = updatable[i];
Size itemsz;
xl_btree_update update;
update.ndeletedtids = vacposting->ndeletedtids;
memcpy(updatedbuf + offset, &update.ndeletedtids,
SizeOfBtreeUpdate);
offset += SizeOfBtreeUpdate;
Assert(ItemPointerIsValid(&itup->t_tid));
ItemPointerCopy(&itup->t_tid, &htids[nhtids]);
nhtids++;
itemsz = update.ndeletedtids * sizeof(uint16);
memcpy(updatedbuf + offset, vacposting->deletetids, itemsz);
offset += itemsz;
}
else
}
return updatedbuf;
}
/*
* Comparator used by _bt_delitems_delete_check() to restore deltids array
* back to its original leaf-page-wise sort order
*/
static int
_bt_delitems_cmp(const void *a, const void *b)
{
TM_IndexDelete *indexdelete1 = (TM_IndexDelete *) a;
TM_IndexDelete *indexdelete2 = (TM_IndexDelete *) b;
if (indexdelete1->id > indexdelete2->id)
return 1;
if (indexdelete1->id < indexdelete2->id)
return -1;
Assert(false);
return 0;
}
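The round trip that _bt_delitems_cmp() supports can be sketched in isolation. The following is a minimal illustration only, not PostgreSQL code: DemoDelete, demo_sort_by_tid() and demo_restore_leaf_order() are hypothetical stand-ins, and the TID is reduced to a plain integer. It assumes, as the comment above describes, that each entry's id records its position in the original leaf-page-wise order, so sorting on id undoes whatever order the tableam left behind.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for TM_IndexDelete; the TID is simplified to an int */
typedef struct DemoDelete
{
    uint32_t    tid;    /* stand-in for the table TID */
    int16_t     id;     /* position in the original leaf-page-wise order */
} DemoDelete;

/* Orders entries by TID, as a tableam might for its own purposes */
static int
demo_cmp_tid(const void *a, const void *b)
{
    const DemoDelete *d1 = (const DemoDelete *) a;
    const DemoDelete *d2 = (const DemoDelete *) b;

    if (d1->tid > d2->tid)
        return 1;
    if (d1->tid < d2->tid)
        return -1;
    return 0;
}

/* Same shape as _bt_delitems_cmp: ids are unique, so ties are not expected */
static int
demo_cmp_id(const void *a, const void *b)
{
    const DemoDelete *d1 = (const DemoDelete *) a;
    const DemoDelete *d2 = (const DemoDelete *) b;

    if (d1->id > d2->id)
        return 1;
    if (d1->id < d2->id)
        return -1;
    return 0;
}

/* Simulates the reordering done on the tableam side */
void
demo_sort_by_tid(DemoDelete *deltids, int ndeltids)
{
    qsort(deltids, ndeltids, sizeof(DemoDelete), demo_cmp_tid);
}

/* Puts the (possibly reordered) array back in leaf-page-wise order */
void
demo_restore_leaf_order(DemoDelete *deltids, int ndeltids)
{
    qsort(deltids, ndeltids, sizeof(DemoDelete), demo_cmp_id);
}
```

Keeping the sort key in the small per-entry struct is what makes this cheap: only the compact array is reordered, while the larger status array stays in place and is reached through id.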
/*
* Try to delete item(s) from a btree leaf page during single-page cleanup.
*
* nbtree interface to table_index_delete_tuples(). Deletes a subset of index
* tuples from caller's deltids array: those whose TIDs are found safe to
* delete by the tableam (or already marked LP_DEAD in index, and so already
* known to be deletable by our simple index deletion caller). We physically
* delete index tuples from buf leaf page last of all (for index tuples where
* that is known to be safe following our table_index_delete_tuples() call).
*
* Simple index deletion caller only includes TIDs from index tuples marked
* LP_DEAD, as well as extra TIDs it found on the same leaf page that can be
* included without increasing the total number of distinct table blocks for
* the deletion operation as a whole. This approach often allows us to delete
* some extra index tuples that were practically free for tableam to check in
* passing (when they actually turn out to be safe to delete). It probably
* only makes sense for the tableam to go ahead with these extra checks when
* it is block-orientated (otherwise the checks probably won't be practically
* free, which we rely on). The tableam interface requires the tableam side
* to handle the problem, though, so this is okay (we as an index AM are free
* to make the simplifying assumption that all tableams must be block-based).
*
* Bottom-up index deletion caller provides all the TIDs from the leaf page,
* without expecting that tableam will check most of them. The tableam has
* considerable discretion around which entries/blocks it checks. Our role in
* costing the bottom-up deletion operation is strictly advisory.
*
* Note: Caller must have added deltids entries (i.e. entries that go in
* delstate's main array) in leaf-page-wise order: page offset number order,
* TID order among entries taken from the same posting list tuple (tiebreak on
* TID). This order is convenient to work with here.
*
* Note: We also rely on the id field of each deltids element "capturing" this
* original leaf-page-wise order. That is, we expect to be able to get back
* to the original leaf-page-wise order just by sorting deltids on the id
* field (tableam will sort deltids for its own reasons, so we'll need to put
* it back in leaf-page-wise order afterwards).
*/
void
_bt_delitems_delete_check(Relation rel, Buffer buf, Relation heapRel,
TM_IndexDeleteOp *delstate)
{
Page page = BufferGetPage(buf);
TransactionId latestRemovedXid;
OffsetNumber postingidxoffnum = InvalidOffsetNumber;
int ndeletable = 0,
nupdatable = 0;
OffsetNumber deletable[MaxIndexTuplesPerPage];
BTVacuumPosting updatable[MaxIndexTuplesPerPage];
/* Use tableam interface to determine which tuples to delete first */
latestRemovedXid = table_index_delete_tuples(heapRel, delstate);
/* Should not WAL-log latestRemovedXid unless it's required */
if (!XLogStandbyInfoActive() || !RelationNeedsWAL(rel))
latestRemovedXid = InvalidTransactionId;
/*
* Construct a leaf-page-wise description of what _bt_delitems_delete()
* needs to do to physically delete index tuples from the page.
*
* Must sort deltids array to restore leaf-page-wise order (original order
* before call to tableam). This is the order that the loop expects.
*
* Note that deltids array might be a lot smaller now. It might even have
* no entries at all (with bottom-up deletion caller), in which case there
* is nothing left to do.
*/
qsort(delstate->deltids, delstate->ndeltids, sizeof(TM_IndexDelete),
_bt_delitems_cmp);
if (delstate->ndeltids == 0)
{
Assert(delstate->bottomup);
return;
}
/* We definitely have to delete at least one index tuple (or one TID) */
for (int i = 0; i < delstate->ndeltids; i++)
{
TM_IndexStatus *dstatus = delstate->status + delstate->deltids[i].id;
OffsetNumber idxoffnum = dstatus->idxoffnum;
ItemId itemid = PageGetItemId(page, idxoffnum);
IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
int nestedi,
nitem;
BTVacuumPosting vacposting;
Assert(OffsetNumberIsValid(idxoffnum));
if (idxoffnum == postingidxoffnum)
{
/*
* This deltid entry is a TID from a posting list tuple that has
* already been completely processed
*/
Assert(BTreeTupleIsPosting(itup));
Assert(ItemPointerCompare(BTreeTupleGetHeapTID(itup),
&delstate->deltids[i].tid) < 0);
Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(itup),
&delstate->deltids[i].tid) >= 0);
continue;
}
if (!BTreeTupleIsPosting(itup))
{
/* Plain non-pivot tuple */
Assert(ItemPointerEquals(&itup->t_tid, &delstate->deltids[i].tid));
if (dstatus->knowndeletable)
deletable[ndeletable++] = idxoffnum;
continue;
}
/*
* itup is a posting list tuple whose lowest deltids entry (which may
* or may not be for the first TID from itup) is considered here now.
* We should process all of the deltids entries for the posting list
* together now, though (not just the lowest). Remember to skip over
* later itup-related entries during later iterations of outermost
* loop.
*/
postingidxoffnum = idxoffnum; /* Remember work in outermost loop */
nestedi = i; /* Initialize for first itup deltids entry */
vacposting = NULL; /* Describes final action for itup */
nitem = BTreeTupleGetNPosting(itup);
for (int p = 0; p < nitem; p++)
{
int nposting = BTreeTupleGetNPosting(itup);
ItemPointer ptid = BTreeTupleGetPostingN(itup, p);
int ptidcmp = -1;
if (nhtids + nposting > spacenhtids)
/*
* This nested loop reuses work across ptid TIDs taken from itup.
* We take advantage of the fact that both itup's TIDs and deltids
* entries (within a single itup/posting list grouping) must both
* be in ascending TID order.
*/
for (; nestedi < delstate->ndeltids; nestedi++)
{
spacenhtids = Max(spacenhtids * 2, nhtids + nposting);
htids = (ItemPointer)
repalloc(htids, sizeof(ItemPointerData) * spacenhtids);
TM_IndexDelete *tcdeltid = &delstate->deltids[nestedi];
TM_IndexStatus *tdstatus = (delstate->status + tcdeltid->id);
/* Stop once we get past all itup related deltids entries */
Assert(tdstatus->idxoffnum >= idxoffnum);
if (tdstatus->idxoffnum != idxoffnum)
break;
/* Skip past non-deletable itup related entries up front */
if (!tdstatus->knowndeletable)
continue;
/* Entry is first partial ptid match (or an exact match)? */
ptidcmp = ItemPointerCompare(&tcdeltid->tid, ptid);
if (ptidcmp >= 0)
{
/* Greater than or equal (partial or exact) match... */
break;
}
}
for (int j = 0; j < nposting; j++)
{
ItemPointer htid = BTreeTupleGetPostingN(itup, j);
/* ...exact ptid match to a deletable deltids entry? */
if (ptidcmp != 0)
continue;
Assert(ItemPointerIsValid(htid));
ItemPointerCopy(htid, &htids[nhtids]);
nhtids++;
/* Exact match for deletable deltids entry -- ptid gets deleted */
if (vacposting == NULL)
{
vacposting = palloc(offsetof(BTVacuumPostingData, deletetids) +
nitem * sizeof(uint16));
vacposting->itup = itup;
vacposting->updatedoffset = idxoffnum;
vacposting->ndeletedtids = 0;
}
vacposting->deletetids[vacposting->ndeletedtids++] = p;
}
}
Assert(nhtids >= ndeletable);
/* Final decision on itup, a posting list tuple */
latestRemovedXid =
table_compute_xid_horizon_for_tuples(heapRel, htids, nhtids);
if (vacposting == NULL)
{
/* No TIDs to delete from itup -- do nothing */
}
else if (vacposting->ndeletedtids == nitem)
{
/* Straight delete of itup (to delete all TIDs) */
deletable[ndeletable++] = idxoffnum;
/* Turns out we won't need granular information */
pfree(vacposting);
}
else
{
/* Delete some (but not all) TIDs from itup */
Assert(vacposting->ndeletedtids > 0 &&
vacposting->ndeletedtids < nitem);
updatable[nupdatable++] = vacposting;
}
}
pfree(htids);
/* Physically delete tuples (or TIDs) using deletable (or updatable) */
_bt_delitems_delete(rel, buf, latestRemovedXid, deletable, ndeletable,
updatable, nupdatable, heapRel);
return latestRemovedXid;
/* be tidy */
for (int i = 0; i < nupdatable; i++)
pfree(updatable[i]);
}
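The per-posting-list decision made inside the loop above (do nothing, delete the whole tuple, or update it) can be sketched on its own. This is a simplified illustration under stated assumptions, not the function's actual code: TIDs are reduced to plain integers, only already-deletable TIDs are passed in (the real loop also skips entries whose knowndeletable flag is false), and DemoAction and demo_classify_posting() are hypothetical names. It relies on the same property as the nested loop: both the posting list TIDs and the deletable TIDs are in ascending order, so a single cursor can be carried across iterations.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical outcome for one posting list tuple */
typedef enum DemoAction
{
    DEMO_KEEP,      /* no TIDs to delete: leave the tuple alone */
    DEMO_DELETE,    /* every TID is deletable: delete the whole tuple */
    DEMO_UPDATE     /* some TIDs are deletable: overwrite with a smaller tuple */
} DemoAction;

/*
 * posting[] holds the tuple's TIDs and deltids[] holds the deletable TIDs,
 * both ascending.  The 0-based positions of deleted TIDs are written to
 * deletedpos[], which must have room for nitem entries.
 */
DemoAction
demo_classify_posting(const uint32_t *posting, int nitem,
                      const uint32_t *deltids, int ndeltids,
                      uint16_t *deletedpos, int *ndeleted)
{
    int         d = 0;      /* cursor into deltids, reused across p */

    *ndeleted = 0;
    for (int p = 0; p < nitem; p++)
    {
        /* Skip deletable TIDs that sort before the current posting TID */
        while (d < ndeltids && deltids[d] < posting[p])
            d++;

        /* Exact match: this posting list entry gets deleted */
        if (d < ndeltids && deltids[d] == posting[p])
            deletedpos[(*ndeleted)++] = (uint16_t) p;
    }

    if (*ndeleted == 0)
        return DEMO_KEEP;
    if (*ndeleted == nitem)
        return DEMO_DELETE;
    return DEMO_UPDATE;
}
```

In the patch, the update outcome corresponds to adding a BTVacuumPosting entry to the updatable array, and the delete outcome corresponds to adding the page offset number to the deletable array.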
/*
......
@@ -209,7 +209,7 @@ btinsert(Relation rel, Datum *values, bool *isnull,
itup = index_form_tuple(RelationGetDescr(rel), values, isnull);
itup->t_tid = *ht_ctid;
result = _bt_doinsert(rel, itup, checkUnique, heapRel);
result = _bt_doinsert(rel, itup, checkUnique, indexUnchanged, heapRel);
pfree(itup);
@@ -1282,10 +1282,10 @@ backtrack:
* as long as the callback function only considers whether the
* index tuple refers to pre-cutoff heap tuples that were
* certainly already pruned away during VACUUM's initial heap
* scan by the time we get here. (XLOG_HEAP2_CLEANUP_INFO
* records produce conflicts using a latestRemovedXid value
* for the entire VACUUM, so there is no need to produce our
* own conflict now.)
* scan by the time we get here. (heapam's XLOG_HEAP2_CLEAN
* and XLOG_HEAP2_CLEANUP_INFO records produce conflicts using
* a latestRemovedXid value for the pointed-to heap tuples, so
* there is no need to produce our own conflict now.)
*
* Backends with snapshots acquired after a VACUUM starts but
* before it finishes could have visibility cutoff with a
......
@@ -49,7 +49,6 @@
#include "access/parallel.h"
#include "access/relscan.h"
#include "access/table.h"
#include "access/tableam.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
......
@@ -556,6 +556,47 @@ btree_xlog_dedup(XLogReaderState *record)
UnlockReleaseBuffer(buf);
}
static void
btree_xlog_updates(Page page, OffsetNumber *updatedoffsets,
xl_btree_update *updates, int nupdated)
{
BTVacuumPosting vacposting;
IndexTuple origtuple;
ItemId itemid;
Size itemsz;
for (int i = 0; i < nupdated; i++)
{
itemid = PageGetItemId(page, updatedoffsets[i]);
origtuple = (IndexTuple) PageGetItem(page, itemid);
vacposting = palloc(offsetof(BTVacuumPostingData, deletetids) +
updates->ndeletedtids * sizeof(uint16));
vacposting->updatedoffset = updatedoffsets[i];
vacposting->itup = origtuple;
vacposting->ndeletedtids = updates->ndeletedtids;
memcpy(vacposting->deletetids,
(char *) updates + SizeOfBtreeUpdate,
updates->ndeletedtids * sizeof(uint16));
_bt_update_posting(vacposting);
/* Overwrite updated version of tuple */
itemsz = MAXALIGN(IndexTupleSize(vacposting->itup));
if (!PageIndexTupleOverwrite(page, updatedoffsets[i],
(Item) vacposting->itup, itemsz))
elog(PANIC, "failed to update partially dead item");
pfree(vacposting->itup);
pfree(vacposting);
/* advance to next xl_btree_update from array */
updates = (xl_btree_update *)
((char *) updates + SizeOfBtreeUpdate +
updates->ndeletedtids * sizeof(uint16));
}
}
static void
btree_xlog_vacuum(XLogReaderState *record)
{
@@ -589,41 +630,7 @@ btree_xlog_vacuum(XLogReaderState *record)
xlrec->nupdated *
sizeof(OffsetNumber));
for (int i = 0; i < xlrec->nupdated; i++)
{
BTVacuumPosting vacposting;
IndexTuple origtuple;
ItemId itemid;
Size itemsz;
itemid = PageGetItemId(page, updatedoffsets[i]);
origtuple = (IndexTuple) PageGetItem(page, itemid);
vacposting = palloc(offsetof(BTVacuumPostingData, deletetids) +
updates->ndeletedtids * sizeof(uint16));
vacposting->updatedoffset = updatedoffsets[i];
vacposting->itup = origtuple;
vacposting->ndeletedtids = updates->ndeletedtids;
memcpy(vacposting->deletetids,
(char *) updates + SizeOfBtreeUpdate,
updates->ndeletedtids * sizeof(uint16));
_bt_update_posting(vacposting);
/* Overwrite updated version of tuple */
itemsz = MAXALIGN(IndexTupleSize(vacposting->itup));
if (!PageIndexTupleOverwrite(page, updatedoffsets[i],
(Item) vacposting->itup, itemsz))
elog(PANIC, "failed to update partially dead item");
pfree(vacposting->itup);
pfree(vacposting);
/* advance to next xl_btree_update from array */
updates = (xl_btree_update *)
((char *) updates + SizeOfBtreeUpdate +
updates->ndeletedtids * sizeof(uint16));
}
btree_xlog_updates(page, updatedoffsets, updates, xlrec->nupdated);
}
if (xlrec->ndeleted > 0)
@@ -675,7 +682,22 @@ btree_xlog_delete(XLogReaderState *record)
page = (Page) BufferGetPage(buffer);
PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
if (xlrec->nupdated > 0)
{
OffsetNumber *updatedoffsets;
xl_btree_update *updates;
updatedoffsets = (OffsetNumber *)
(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
updates = (xl_btree_update *) ((char *) updatedoffsets +
xlrec->nupdated *
sizeof(OffsetNumber));
btree_xlog_updates(page, updatedoffsets, updates, xlrec->nupdated);
}
if (xlrec->ndeleted > 0)
PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
/* Mark the page as not containing any LP_DEAD items */
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
......
@@ -63,8 +63,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_delete *xlrec = (xl_btree_delete *) rec;
appendStringInfo(buf, "latestRemovedXid %u; ndeleted %u",
xlrec->latestRemovedXid, xlrec->ndeleted);
appendStringInfo(buf, "latestRemovedXid %u; ndeleted %u; nupdated %u",
xlrec->latestRemovedXid, xlrec->ndeleted, xlrec->nupdated);
break;
}
case XLOG_BTREE_MARK_PAGE_HALFDEAD:
......
@@ -66,7 +66,7 @@ GetTableAmRoutine(Oid amhandler)
Assert(routine->tuple_tid_valid != NULL);
Assert(routine->tuple_get_latest_tid != NULL);
Assert(routine->tuple_satisfies_snapshot != NULL);
Assert(routine->compute_xid_horizon_for_tuples != NULL);
Assert(routine->index_delete_tuples != NULL);
Assert(routine->tuple_insert != NULL);
......
@@ -166,9 +166,8 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
extern void simple_heap_update(Relation relation, ItemPointer otid,
HeapTuple tup);
extern TransactionId heap_compute_xid_horizon_for_tuples(Relation rel,
ItemPointerData *items,
int nitems);
extern TransactionId heap_index_delete_tuples(Relation rel,
TM_IndexDeleteOp *delstate);
/* in heap/pruneheap.c */
struct GlobalVisState;
......
@@ -17,6 +17,7 @@
#include "access/amapi.h"
#include "access/itup.h"
#include "access/sdir.h"
#include "access/tableam.h"
#include "access/xlogreader.h"
#include "catalog/pg_am_d.h"
#include "catalog/pg_index.h"
@@ -168,7 +169,7 @@ typedef struct BTMetaPageData
/*
* MaxTIDsPerBTreePage is an upper bound on the number of heap TIDs
* that may be stored on a btree leaf page. It is used to size the
* per-page temporary buffers used by index scans.
* per-page temporary buffers.
*
* Note: we don't bother considering per-tuple overheads here to keep
* things simple (value is based on how many elements a single array of
@@ -766,8 +767,9 @@ typedef struct BTDedupStateData
typedef BTDedupStateData *BTDedupState;
/*
* BTVacuumPostingData is state that represents how to VACUUM a posting list
* tuple when some (though not all) of its TIDs are to be deleted.
* BTVacuumPostingData is state that represents how to VACUUM (or delete) a
* posting list tuple when some (though not all) of its TIDs are to be
* deleted.
*
* Convention is that itup field is the original posting list tuple on input,
* and palloc()'d final tuple used to overwrite existing tuple on output.
@@ -1031,6 +1033,8 @@ extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
extern void _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel,
IndexTuple newitem, Size newitemsz,
bool checkingunique);
extern bool _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
Size newitemsz);
extern void _bt_dedup_start_pending(BTDedupState state, IndexTuple base,
OffsetNumber baseoff);
extern bool _bt_dedup_save_htid(BTDedupState state, IndexTuple itup);
@@ -1045,7 +1049,8 @@ extern IndexTuple _bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
* prototypes for functions in nbtinsert.c
*/
extern bool _bt_doinsert(Relation rel, IndexTuple itup,
IndexUniqueCheck checkUnique, Relation heapRel);
IndexUniqueCheck checkUnique, bool indexUnchanged,
Relation heapRel);
extern void _bt_finish_split(Relation rel, Buffer lbuf, BTStack stack);
extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child);
@@ -1083,9 +1088,9 @@ extern bool _bt_page_recyclable(Page page);
extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *deletable, int ndeletable,
BTVacuumPosting *updatable, int nupdatable);
extern void _bt_delitems_delete(Relation rel, Buffer buf,
OffsetNumber *deletable, int ndeletable,
Relation heapRel);
extern void _bt_delitems_delete_check(Relation rel, Buffer buf,
Relation heapRel,
TM_IndexDeleteOp *delstate);
extern uint32 _bt_pagedel(Relation rel, Buffer leafbuf,
TransactionId *oldestBtpoXact);
......
@@ -176,24 +176,6 @@ typedef struct xl_btree_dedup
#define SizeOfBtreeDedup (offsetof(xl_btree_dedup, nintervals) + sizeof(uint16))
/*
* This is what we need to know about delete of individual leaf index tuples.
* The WAL record can represent deletion of any number of index tuples on a
* single index page when *not* executed by VACUUM. Deletion of a subset of
* the TIDs within a posting list tuple is not supported.
*
* Backup Blk 0: index page
*/
typedef struct xl_btree_delete
{
TransactionId latestRemovedXid;
uint32 ndeleted;
/* DELETED TARGET OFFSET NUMBERS FOLLOW */
} xl_btree_delete;
#define SizeOfBtreeDelete (offsetof(xl_btree_delete, ndeleted) + sizeof(uint32))
/*
* This is what we need to know about page reuse within btree. This record
* only exists to generate a conflict point for Hot Standby.
@@ -211,31 +193,30 @@ typedef struct xl_btree_reuse_page
#define SizeOfBtreeReusePage (sizeof(xl_btree_reuse_page))
/*
* This is what we need to know about which TIDs to remove from an individual
* posting list tuple during vacuuming. An array of these may appear at the
* end of xl_btree_vacuum records.
*/
typedef struct xl_btree_update
{
uint16 ndeletedtids;
/* POSTING LIST uint16 OFFSETS TO A DELETED TID FOLLOW */
} xl_btree_update;
#define SizeOfBtreeUpdate (offsetof(xl_btree_update, ndeletedtids) + sizeof(uint16))
/*
* This is what we need to know about a VACUUM of a leaf page. The WAL record
* can represent deletion of any number of index tuples on a single index page
* when executed by VACUUM. It can also support "updates" of index tuples,
* which is how deletes of a subset of TIDs contained in an existing posting
* list tuple are implemented. (Updates are only used when there will be some
* remaining TIDs once VACUUM finishes; otherwise the posting list tuple can
* just be deleted).
* xl_btree_vacuum and xl_btree_delete records describe deletion of index
* tuples on a leaf page. The former variant is used by VACUUM, while the
* latter variant is used by the ad-hoc deletions that sometimes take place
* when btinsert() is called.
*
* The records are very similar. The only difference is that xl_btree_delete
* has to include a latestRemovedXid field to generate recovery conflicts.
* (VACUUM operations can just rely on earlier conflicts generated during
* pruning of the table whose TIDs the to-be-deleted index tuples point to.
* There are also small differences between each REDO routine that we don't go
* into here.)
*
* xl_btree_vacuum and xl_btree_delete both represent deletion of any number
* of index tuples on a single leaf page using page offset numbers. Both also
* support "updates" of index tuples, which is how deletes of a subset of TIDs
* contained in an existing posting list tuple are implemented.
*
* Updated posting list tuples are represented using xl_btree_update metadata.
* The REDO routine uses each xl_btree_update (plus its corresponding original
* index tuple from the target leaf page) to generate the final updated tuple.
* The REDO routines each use the xl_btree_update entries (plus each
* corresponding original index tuple from the target leaf page) to generate
* the final updated tuple.
*
* Updates are only used when there will be some remaining TIDs left by the
* REDO routine. Otherwise the posting list tuple just gets deleted outright.
*/
typedef struct xl_btree_vacuum
{
@@ -244,11 +225,39 @@ typedef struct xl_btree_vacuum
/* DELETED TARGET OFFSET NUMBERS FOLLOW */
/* UPDATED TARGET OFFSET NUMBERS FOLLOW */
/* UPDATED TUPLES METADATA ARRAY FOLLOWS */
/* UPDATED TUPLES METADATA (xl_btree_update) ARRAY FOLLOWS */
} xl_btree_vacuum;
#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, nupdated) + sizeof(uint16))
typedef struct xl_btree_delete
{
TransactionId latestRemovedXid;
uint16 ndeleted;
uint16 nupdated;
/* DELETED TARGET OFFSET NUMBERS FOLLOW */
/* UPDATED TARGET OFFSET NUMBERS FOLLOW */
/* UPDATED TUPLES METADATA (xl_btree_update) ARRAY FOLLOWS */
} xl_btree_delete;
#define SizeOfBtreeDelete (offsetof(xl_btree_delete, nupdated) + sizeof(uint16))
/*
* The offsets that appear in xl_btree_update metadata are offsets into the
* original posting list from the tuple, not page offset numbers. These are
* 0-based. The page offset number for the original posting list tuple comes
* from the main xl_btree_vacuum/xl_btree_delete record.
*/
typedef struct xl_btree_update
{
uint16 ndeletedtids;
/* POSTING LIST uint16 OFFSETS TO A DELETED TID FOLLOW */
} xl_btree_update;
#define SizeOfBtreeUpdate (offsetof(xl_btree_update, ndeletedtids) + sizeof(uint16))
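The variable-length layout described here (a uint16 count followed by that many uint16 posting list offsets, repeated once per updated tuple) can be sketched outside of PostgreSQL. The code below is an illustration under assumptions, not the patch's implementation: DemoUpdate, DEMO_HDR_SIZE, demo_pack_updates() and demo_read_update() are hypothetical names, and malloc() stands in for palloc(). The packing step mirrors what _bt_delitems_update() does when building updatedbuf, and the reading step mirrors how btree_xlog_updates() advances from one entry to the next.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for SizeOfBtreeUpdate: the header is a single uint16 count */
#define DEMO_HDR_SIZE sizeof(uint16_t)

/* Hypothetical in-memory description of one updated posting list tuple */
typedef struct DemoUpdate
{
    uint16_t        ndeletedtids;   /* number of TIDs removed from tuple */
    const uint16_t *deletetids;     /* 0-based offsets into posting list */
} DemoUpdate;

/* Packs all updates back to back; returns a malloc'd buffer or NULL */
char *
demo_pack_updates(const DemoUpdate *updates, int nupdates, size_t *buflen)
{
    size_t      len = 0;
    size_t      offset = 0;
    char       *buf;

    /* First pass: total size of all variable-length entries */
    for (int i = 0; i < nupdates; i++)
        len += DEMO_HDR_SIZE + updates[i].ndeletedtids * sizeof(uint16_t);

    buf = malloc(len > 0 ? len : 1);
    if (buf == NULL)
        return NULL;

    /* Second pass: copy each count, then its array of offsets */
    for (int i = 0; i < nupdates; i++)
    {
        size_t      itemsz = updates[i].ndeletedtids * sizeof(uint16_t);

        memcpy(buf + offset, &updates[i].ndeletedtids, DEMO_HDR_SIZE);
        offset += DEMO_HDR_SIZE;
        memcpy(buf + offset, updates[i].deletetids, itemsz);
        offset += itemsz;
    }

    *buflen = len;
    return buf;
}

/* Walks to the nth entry, copies its offsets to out, returns their count */
uint16_t
demo_read_update(const char *buf, int n, uint16_t *out)
{
    const char *p = buf;
    uint16_t    count;

    /* Each entry's size depends on its own count, so step one at a time */
    for (int i = 0; i < n; i++)
    {
        memcpy(&count, p, DEMO_HDR_SIZE);
        p += DEMO_HDR_SIZE + count * sizeof(uint16_t);
    }

    memcpy(&count, p, DEMO_HDR_SIZE);
    memcpy(out, p + DEMO_HDR_SIZE, count * sizeof(uint16_t));
    return count;
}
```

Because entries have no fixed stride, a reader cannot index the array directly. That is why the page offset numbers travel in a separate fixed-width array in the main record, while this buffer is only ever consumed sequentially.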
/*
* This is what we need to know about marking an empty subtree for deletion.
* The target identifies the tuple removed from the parent page (note that we
......
@@ -128,6 +128,106 @@ typedef struct TM_FailureData
bool traversed;
} TM_FailureData;
/*
* State used when calling table_index_delete_tuples().
*
* Represents the status of table tuples, referenced by table TID and taken by
* index AM from index tuples. State consists of high level parameters of the
* deletion operation, plus two mutable palloc()'d arrays for information
* about the status of individual table tuples. These are conceptually one
* single array. Using two arrays keeps the TM_IndexDelete struct small,
* which makes sorting the first array (the deltids array) fast.
*
* Some index AM callers perform simple index tuple deletion (by specifying
* bottomup = false), and include only known-dead deltids. These known-dead
* entries are all marked knowndeletable = true directly (typically these are
* TIDs from LP_DEAD-marked index tuples), but that isn't strictly required.
*
* Callers that specify bottomup = true are "bottom-up index deletion"
* callers. The considerations for the tableam are more subtle with these
* callers because they ask the tableam to perform highly speculative work,
* and might only expect the tableam to check a small fraction of all entries.
* Caller is not allowed to specify knowndeletable = true for any entry
* because everything is highly speculative. Bottom-up caller provides
* context and hints to tableam -- see comments below for details on how index
* AMs and tableams should coordinate during bottom-up index deletion.
*
* Simple index deletion callers may ask the tableam to perform speculative
* work, too. This is a little like bottom-up deletion, but not too much.
* The tableam will only perform speculative work when it's practically free
* to do so in passing for simple deletion caller (while always performing
* whatever work is needed to enable knowndeletable/LP_DEAD index tuples to
* be deleted within index AM). This is the real reason why it's possible for
* simple index deletion caller to specify knowndeletable = false up front
* (this means "check if it's possible for me to delete corresponding index
* tuple when it's cheap to do so in passing"). The index AM should only
* include "extra" entries for index tuples whose TIDs point to a table block
* that tableam is expected to have to visit anyway (in the event of a block
* orientated tableam). The tableam isn't strictly obligated to check these
* "extra" TIDs, but a block-based AM should always manage to do so in
* practice.
*
* The final contents of the deltids/status arrays are interesting to callers
* that ask tableam to perform speculative work (i.e. when _any_ items have
* knowndeletable set to false up front). These index AM callers will
* naturally need to consult final state to determine which index tuples are
* in fact deletable.
*
* The index AM can keep track of which index tuple relates to which deltid by
* setting idxoffnum (and/or relying on each entry being uniquely identifiable
* using tid), which is important when the final contents of the array will
* need to be interpreted -- the array can shrink from initial size after
* tableam processing and/or have entries in a new order (tableam may sort
* deltids array for its own reasons). Bottom-up callers may find that final
* ndeltids is 0 on return from call to tableam, in which case no index tuple
* deletions are possible. Simple deletion callers can rely on any entries
* they know to be deletable appearing in the final array as deletable.
*/
typedef struct TM_IndexDelete
{
ItemPointerData tid; /* table TID from index tuple */
int16 id; /* Offset into TM_IndexStatus array */
} TM_IndexDelete;
typedef struct TM_IndexStatus
{
OffsetNumber idxoffnum; /* Index AM page offset number */
bool knowndeletable; /* Currently known to be deletable? */
/* Bottom-up index deletion specific fields follow */
bool promising; /* Promising (duplicate) index tuple? */
int16 freespace; /* Space freed in index if deleted */
} TM_IndexStatus;
/*
* Index AM/tableam coordination is central to the design of bottom-up index
* deletion. The index AM provides hints about where to look to the tableam
* by marking some entries as "promising". Index AM does this with duplicate
* index tuples that are strongly suspected to be old versions left behind by
* UPDATEs that did not logically modify indexed values. Index AM may find it
* helpful to only mark entries as promising when they're thought to have been
* affected by such an UPDATE in the recent past.
*
* Bottom-up index deletion casts a wide net at first, usually by including
* all TIDs on a target index page. It is up to the tableam to worry about
* the cost of checking transaction status information. The tableam is in
* control, but needs careful guidance from the index AM. Index AM requests
* that bottomupfreespace target be met, while tableam measures progress
* towards that goal by tallying the per-entry freespace value for known
* deletable entries. (All !bottomup callers can just set these space related
* fields to zero.)
*/
typedef struct TM_IndexDeleteOp
{
bool bottomup; /* Bottom-up (not simple) deletion? */
int bottomupfreespace; /* Bottom-up space target */
/* Mutable per-TID information follows (index AM initializes entries) */
int ndeltids; /* Current # of deltids/status elements */
TM_IndexDelete *deltids;
TM_IndexStatus *status;
} TM_IndexDeleteOp;
/* "options" flag bits for table_tuple_insert */
/* TABLE_INSERT_SKIP_WAL was 0x0001; RelationNeedsWAL() now governs */
#define TABLE_INSERT_SKIP_FSM 0x0002
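The comments above TM_IndexDelete say that the deltids array may come back from the tableam in a different order, so the index AM should map each entry back to its index tuple through `id` and `idxoffnum`. The following is a minimal sketch of that for a simple deletion caller. It is not code from this commit: the typedefs are simplified stand-ins for the PostgreSQL definitions, `add_simple_entry()`, `collect_deletable()` and `demo_simple_deletion()` are hypothetical helpers, and the tableam call is simulated by hand instead of calling table_index_delete_tuples().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the PostgreSQL types (assumptions of this sketch) */
typedef uint16_t OffsetNumber;
typedef struct ItemPointerData { uint32_t block; uint16_t offset; } ItemPointerData;
typedef struct TM_IndexDelete { ItemPointerData tid; int16_t id; } TM_IndexDelete;
typedef struct TM_IndexStatus
{
    OffsetNumber idxoffnum;
    bool         knowndeletable;
    bool         promising;
    int16_t      freespace;
} TM_IndexStatus;
typedef struct TM_IndexDeleteOp
{
    bool            bottomup;
    int             bottomupfreespace;
    int             ndeltids;
    TM_IndexDelete *deltids;
    TM_IndexStatus *status;
} TM_IndexDeleteOp;

/* Hypothetical helper: append one entry for a simple deletion call */
void
add_simple_entry(TM_IndexDeleteOp *op, ItemPointerData tid,
                 OffsetNumber idxoffnum, bool lpdead)
{
    int i = op->ndeltids++;

    op->deltids[i].tid = tid;
    op->deltids[i].id = (int16_t) i;           /* offset into status array */
    op->status[i].idxoffnum = idxoffnum;
    op->status[i].knowndeletable = lpdead;     /* false means "extra" entry */
    op->status[i].promising = false;           /* bottom-up only */
    op->status[i].freespace = 0;               /* bottom-up only */
}

/* Hypothetical helper: read the final state, whatever order deltids is in */
int
collect_deletable(const TM_IndexDeleteOp *op, OffsetNumber *out)
{
    int n = 0;

    for (int i = 0; i < op->ndeltids; i++)
    {
        const TM_IndexStatus *st = &op->status[op->deltids[i].id];

        if (st->knowndeletable)
            out[n++] = st->idxoffnum;
    }
    return n;
}

int
demo_simple_deletion(OffsetNumber *out)
{
    TM_IndexDelete   deltids[3];
    TM_IndexStatus   status[3];
    TM_IndexDeleteOp op = {.bottomup = false, .bottomupfreespace = 0,
                           .ndeltids = 0, .deltids = deltids, .status = status};

    add_simple_entry(&op, (ItemPointerData){10, 1}, 5, true);   /* LP_DEAD */
    add_simple_entry(&op, (ItemPointerData){10, 2}, 6, false);  /* extra */
    add_simple_entry(&op, (ItemPointerData){10, 3}, 7, false);  /* extra */

    /* Simulated tableam: finds entry 1 dead and reorders the deltids array */
    status[1].knowndeletable = true;
    TM_IndexDelete tmp = deltids[0];
    deltids[0] = deltids[2];
    deltids[2] = tmp;

    int n = collect_deletable(&op, out);
    assert(n >= 1);     /* the LP_DEAD entry must still be deletable */
    return n;
}
```

In this sketch the caller never indexes the status array by position in deltids; it always goes through `id`, which is what keeps the result correct after the simulated reordering.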
......@@ -342,10 +442,9 @@ typedef struct TableAmRoutine
TupleTableSlot *slot,
Snapshot snapshot);
-/* see table_compute_xid_horizon_for_tuples() */
-TransactionId (*compute_xid_horizon_for_tuples) (Relation rel,
-ItemPointerData *items,
-int nitems);
+/* see table_index_delete_tuples() */
+TransactionId (*index_delete_tuples) (Relation rel,
+TM_IndexDeleteOp *delstate);
/* ------------------------------------------------------------------------
......@@ -1122,16 +1221,23 @@ table_tuple_satisfies_snapshot(Relation rel, TupleTableSlot *slot,
}
/*
- * Compute the newest xid among the tuples pointed to by items. This is used
- * to compute what snapshots to conflict with when replaying WAL records for
- * page-level index vacuums.
+ * Determine which index tuples are safe to delete based on their table TID.
+ *
+ * Determines which entries from index AM caller's TM_IndexDeleteOp state
+ * point to vacuumable table tuples. Entries that are found by tableam to be
+ * vacuumable are naturally safe for index AM to delete, and so get directly
+ * marked as deletable. See comments above TM_IndexDelete and comments above
+ * TM_IndexDeleteOp for full details.
+ *
+ * Returns a latestRemovedXid transaction ID that caller generally places in
+ * its index deletion WAL record. This might be used during subsequent REDO
+ * of the WAL record when in Hot Standby mode -- a recovery conflict for the
+ * index deletion operation might be required on the standby.
*/
static inline TransactionId
-table_compute_xid_horizon_for_tuples(Relation rel,
-ItemPointerData *items,
-int nitems)
+table_index_delete_tuples(Relation rel, TM_IndexDeleteOp *delstate)
{
-return rel->rd_tableam->compute_xid_horizon_for_tuples(rel, items, nitems);
+return rel->rd_tableam->index_delete_tuples(rel, delstate);
}
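The comments above TM_IndexDeleteOp describe how a tableam measures progress towards bottomupfreespace by tallying the per-entry freespace of entries it finds deletable. A rough sketch of that accounting follows. It is an illustration under stated assumptions, not the heapam implementation: the typedefs are simplified stand-ins, `bottomup_tally()` and `demo_bottomup()` are hypothetical names, the `dead_by_id` array stands in for real visibility checks, and the ordering of table blocks by promising entries is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the PostgreSQL types (assumptions of this sketch) */
typedef uint16_t OffsetNumber;
typedef struct ItemPointerData { uint32_t block; uint16_t offset; } ItemPointerData;
typedef struct TM_IndexDelete { ItemPointerData tid; int16_t id; } TM_IndexDelete;
typedef struct TM_IndexStatus
{
    OffsetNumber idxoffnum;
    bool         knowndeletable;
    bool         promising;
    int16_t      freespace;
} TM_IndexStatus;
typedef struct TM_IndexDeleteOp
{
    bool            bottomup;
    int             bottomupfreespace;
    int             ndeltids;
    TM_IndexDelete *deltids;
    TM_IndexStatus *status;
} TM_IndexDeleteOp;

/*
 * Hypothetical tableam-side loop: mark dead entries as deletable, tally the
 * space they would free, stop once the target is met, and shrink deltids so
 * that it only holds deletable entries.  Returns the space tallied.
 */
int
bottomup_tally(TM_IndexDeleteOp *op, const bool *dead_by_id)
{
    int freed = 0;
    int kept = 0;

    for (int i = 0; i < op->ndeltids; i++)
    {
        TM_IndexDelete *d = &op->deltids[i];
        TM_IndexStatus *st = &op->status[d->id];

        if (freed >= op->bottomupfreespace)
            break;                      /* target met, stop speculating */
        if (!dead_by_id[d->id])
            continue;                   /* still visible to someone */

        st->knowndeletable = true;
        freed += st->freespace;
        op->deltids[kept++] = *d;       /* compact in place (kept <= i) */
    }
    op->ndeltids = kept;                /* may be 0: nothing deletable */
    return freed;
}

int
demo_bottomup(int *final_ndeltids)
{
    TM_IndexDelete   deltids[4];
    TM_IndexStatus   status[4];
    const bool       dead_by_id[4] = {true, false, true, true};
    TM_IndexDeleteOp op = {.bottomup = true, .bottomupfreespace = 32,
                           .ndeltids = 4, .deltids = deltids, .status = status};

    for (int i = 0; i < 4; i++)
    {
        deltids[i].tid = (ItemPointerData){20, (uint16_t) (i + 1)};
        deltids[i].id = (int16_t) i;
        status[i].idxoffnum = (OffsetNumber) (i + 1);
        status[i].knowndeletable = false;   /* required for bottom-up callers */
        status[i].promising = true;
        status[i].freespace = 16;
    }

    int freed = bottomup_tally(&op, dead_by_id);

    assert(op.ndeltids <= 4);               /* the array can only shrink */
    *final_ndeltids = op.ndeltids;
    return freed;
}
```

With a target of 32 and 16 bytes per entry, the loop in this sketch stops after the second dead entry, so the fourth entry is never examined even though it is also dead. That mirrors the point that a bottom-up caller should only expect a fraction of its entries to be checked.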
......
......@@ -31,7 +31,7 @@
/*
* Each page of XLOG file has a header like this:
*/
-#define XLOG_PAGE_MAGIC 0xD108 /* can be used as WAL version indicator */
+#define XLOG_PAGE_MAGIC 0xD109 /* can be used as WAL version indicator */
typedef struct XLogPageHeaderData
{
......