Commit d168b666 authored by Peter Geoghegan

Enhance nbtree index tuple deletion.

Teach nbtree and heapam to cooperate in order to eagerly remove
duplicate tuples representing dead MVCC versions.  This is "bottom-up
deletion".  Each bottom-up deletion pass is triggered lazily in response
to a flood of versions on an nbtree leaf page.  This usually involves a
"logically unchanged index" hint (these are produced by the executor
mechanism added by commit 9dc718bd).

The immediate goal of bottom-up index deletion is to avoid "unnecessary"
page splits caused entirely by version duplicates.  It naturally has an
even more useful effect, though: it acts as a backstop against
accumulating an excessive number of index tuple versions for any given
_logical row_.  Bottom-up index deletion complements what we might now
call "top-down index deletion": index vacuuming performed by VACUUM.
Bottom-up index deletion responds to the immediate local needs of
queries, while leaving it up to autovacuum to perform infrequent clean
sweeps of the index.  The overall effect is to avoid certain
pathological performance issues related to "version churn" from UPDATEs.

The previous tableam interface used by index AMs to perform tuple
deletion (the table_compute_xid_horizon_for_tuples() function) has been
replaced with a new interface that supports certain new requirements.
Many (perhaps all) of the capabilities added to nbtree by this commit
could also be extended to other index AMs.  That is left as work for a
later commit.

Extend deletion of LP_DEAD-marked index tuples in nbtree by adding logic
to consider extra index tuples (that are not LP_DEAD-marked) for
deletion in passing.  This increases the number of index tuples deleted
significantly in many cases.  The LP_DEAD deletion process (which is now
called "simple deletion" to clearly distinguish it from bottom-up
deletion) won't usually need to visit any extra table blocks to check
these extra tuples.  We have to visit the same table blocks anyway to
generate a latestRemovedXid value (at least in the common case where the
index deletion operation's WAL record needs such a value).

Testing has shown that the "extra tuples" simple deletion enhancement
increases the number of index tuples deleted with almost any workload
that has LP_DEAD bits set in leaf pages.  That is, it almost never fails
to delete at least a few extra index tuples.  It helps most of all in
cases that happen to naturally have a lot of delete-safe tuples.  It's
not uncommon for an individual deletion operation to end up deleting an
order of magnitude more index tuples compared to the old naive approach
(e.g., custom instrumentation of the patch shows that this happens
fairly often when the regression tests are run).

Add a further enhancement that augments simple deletion and bottom-up
deletion in indexes that make use of deduplication: Teach nbtree's
_bt_delitems_delete() function to support granular TID deletion in
posting list tuples.  It is now possible to delete individual TIDs from
posting list tuples provided the TIDs have a tableam block number of a
table block that gets visited as part of the deletion process (visiting
the table block can be triggered directly or indirectly).  Setting the
LP_DEAD bit of a posting list tuple is still an all-or-nothing thing,
but that matters much less now that deletion only needs to start out
with the right _general_ idea about which index tuples are deletable.

Bump XLOG_PAGE_MAGIC because xl_btree_delete changed.

No bump in BTREE_VERSION, since there are no changes to the on-disk
representation of nbtree indexes.  Indexes built on PostgreSQL 12 or
PostgreSQL 13 will automatically benefit from bottom-up index deletion
(i.e. no reindexing required) following a pg_upgrade.  The enhancement
to simple deletion is available with all B-Tree indexes following a
pg_upgrade, no matter what PostgreSQL version the user upgrades from.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-By: Victor Yegorov <vyegorov@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wzm+maE3apHB8NOtmM=p-DO65j2V5GzAWCOEEuy3JZgb2g@mail.gmail.com
parent 9dc718bd
@@ -629,6 +629,109 @@ options(<replaceable>relopts</replaceable> <type>local_relopts *</type>) returns
</para>
</sect2>
<sect2 id="btree-deletion">
<title>Bottom-up index deletion</title>
<para>
B-Tree indexes are not directly aware that under MVCC, there might
be multiple extant versions of the same logical table row; to an
index, each tuple is an independent object that needs its own index
entry. <quote>Version churn</quote> tuples may sometimes
accumulate and adversely affect query latency and throughput. This
typically occurs with <command>UPDATE</command>-heavy workloads
where most individual updates cannot apply the
<acronym>HOT</acronym> optimization. Changing the value of only
one column covered by one index during an <command>UPDATE</command>
<emphasis>always</emphasis> necessitates a new set of index tuples
&mdash; one for <emphasis>each and every</emphasis> index on the
table. Note in particular that this includes indexes that were not
<quote>logically modified</quote> by the <command>UPDATE</command>.
All indexes will need a successor physical index tuple that points
to the latest version in the table. Each new tuple within each
index will generally need to coexist with the original
<quote>updated</quote> tuple for a short period of time (typically
until shortly after the <command>UPDATE</command> transaction
commits).
</para>
<para>
B-Tree indexes incrementally delete version churn index tuples by
performing <firstterm>bottom-up index deletion</firstterm> passes.
Each deletion pass is triggered in reaction to an anticipated
<quote>version churn page split</quote>. This only happens with
indexes that are not logically modified by
<command>UPDATE</command> statements, where concentrated build-up
of obsolete versions in particular pages would occur otherwise. A
page split will usually be avoided, though it's possible that
certain implementation-level heuristics will fail to identify and
delete even one garbage index tuple (in which case a page split or
deduplication pass resolves the issue of an incoming new tuple not
fitting on a leaf page). The worst case number of versions that
any index scan must traverse (for any single logical row) is an
important contributor to overall system responsiveness and
throughput. A bottom-up index deletion pass targets suspected
garbage tuples in a single leaf page based on
<emphasis>qualitative</emphasis> distinctions involving logical
rows and versions. This contrasts with the <quote>top-down</quote>
index cleanup performed by autovacuum workers, which is triggered
when certain <emphasis>quantitative</emphasis> table-level
thresholds are exceeded (see <xref linkend="autovacuum"/>).
</para>
<note>
<para>
Not all deletion operations that are performed within B-Tree
indexes are bottom-up deletion operations. There is a distinct
category of index tuple deletion: <firstterm>simple index tuple
deletion</firstterm>. This is a deferred maintenance operation
that deletes index tuples that are known to be safe to delete
(those whose item identifier's <literal>LP_DEAD</literal> bit is
already set). Like bottom-up index deletion, simple index
deletion takes place at the point that a page split is anticipated
as a way of avoiding the split.
</para>
<para>
Simple deletion is opportunistic in the sense that it can only
take place when recent index scans set the
<literal>LP_DEAD</literal> bits of affected items in passing.
Prior to <productname>PostgreSQL</productname> 14, the only
category of B-Tree deletion was simple deletion. The main
differences between it and bottom-up deletion are that only the
former is opportunistically driven by the activity of passing
index scans, while only the latter specifically targets version
churn from <command>UPDATE</command>s that do not logically modify
indexed columns.
</para>
</note>
<para>
Bottom-up index deletion performs the vast majority of all garbage
index tuple cleanup for particular indexes with certain workloads.
This is expected with any B-Tree index that is subject to
significant version churn from <command>UPDATE</command>s that
rarely or never logically modify the columns that the index covers.
The average and worst case number of versions per logical row can
be kept low purely through targeted incremental deletion passes.
It's quite possible that the on-disk size of certain indexes will
never increase by even a single page/block despite
<emphasis>constant</emphasis> version churn from
<command>UPDATE</command>s. Even then, an exhaustive <quote>clean
sweep</quote> by a <command>VACUUM</command> operation (typically
run in an autovacuum worker process) will eventually be required as
a part of <emphasis>collective</emphasis> cleanup of the table and
each of its indexes.
</para>
<para>
Unlike <command>VACUUM</command>, bottom-up index deletion does not
provide any strong guarantees about how old the oldest garbage
index tuple may be. No index can be permitted to retain
<quote>floating garbage</quote> index tuples that became dead prior
to a conservative cutoff point shared by the table and all of its
indexes collectively. This fundamental table-level invariant makes
it safe to recycle table <acronym>TID</acronym>s. This is how it
is possible for distinct logical rows to reuse the same table
<acronym>TID</acronym> over time (though this can never happen with
two logical rows whose lifetimes span the same
<command>VACUUM</command> cycle).
</para>
</sect2>
<sect2 id="btree-deduplication">
<title>Deduplication</title>
<para>
@@ -666,15 +769,17 @@ options(<replaceable>relopts</replaceable> <type>local_relopts *</type>) returns
</note>
<para>
The deduplication process occurs lazily, when a new item is
inserted that cannot fit on an existing leaf page. This prevents
(or at least delays) leaf page splits. Unlike GIN posting list
tuples, B-Tree posting list tuples do not need to expand every time
a new duplicate is inserted; they are merely an alternative
physical representation of the original logical contents of the
leaf page. This design prioritizes consistent performance with
mixed read-write workloads. Most client applications will at least
see a moderate performance benefit from using deduplication.
Deduplication is enabled by default.
inserted that cannot fit on an existing leaf page, though only when
index tuple deletion could not free sufficient space for the new
item (typically deletion is briefly considered and then skipped
over). Unlike GIN posting list tuples, B-Tree posting list tuples
do not need to expand every time a new duplicate is inserted; they
are merely an alternative physical representation of the original
logical contents of the leaf page. This design prioritizes
consistent performance with mixed read-write workloads. Most
client applications will at least see a moderate performance
benefit from using deduplication. Deduplication is enabled by
default.
</para>
<para>
<command>CREATE INDEX</command> and <command>REINDEX</command>
@@ -702,25 +807,16 @@ options(<replaceable>relopts</replaceable> <type>local_relopts *</type>) returns
deduplication isn't usually helpful.
</para>
<para>
B-Tree indexes are not directly aware that under MVCC, there might
be multiple extant versions of the same logical table row; to an
index, each tuple is an independent object that needs its own index
entry. <quote>Version duplicates</quote> may sometimes accumulate
and adversely affect query latency and throughput. This typically
occurs with <command>UPDATE</command>-heavy workloads where most
individual updates cannot apply the <acronym>HOT</acronym>
optimization (often because at least one indexed column gets
modified, necessitating a new set of index tuple versions &mdash;
one new tuple for <emphasis>each and every</emphasis> index). In
effect, B-Tree deduplication ameliorates index bloat caused by
version churn. Note that even the tuples from a unique index are
not necessarily <emphasis>physically</emphasis> unique when stored
on disk due to version churn. The deduplication optimization is
selectively applied within unique indexes. It targets those pages
that appear to have version duplicates. The high level goal is to
give <command>VACUUM</command> more time to run before an
<quote>unnecessary</quote> page split caused by version churn can
take place.
It is sometimes possible for unique indexes (as well as unique
constraints) to use deduplication. This allows leaf pages to
temporarily <quote>absorb</quote> extra version churn duplicates.
Deduplication in unique indexes augments bottom-up index deletion,
especially in cases where a long-running transaction holds a
snapshot that blocks garbage collection. The goal is to buy time
for the bottom-up index deletion strategy to become effective
again. Delaying page splits until a single long-running
transaction naturally goes away can allow a bottom-up deletion pass
to succeed where an earlier deletion pass failed.
</para>
<tip>
<para>
@@ -386,17 +386,39 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
<para>
The fillfactor for an index is a percentage that determines how full
the index method will try to pack index pages. For B-trees, leaf pages
are filled to this percentage during initial index build, and also
are filled to this percentage during initial index builds, and also
when extending the index at the right (adding new largest key values).
If pages
subsequently become completely full, they will be split, leading to
gradual degradation in the index's efficiency. B-trees use a default
fragmentation of the on-disk index structure. B-trees use a default
fillfactor of 90, but any integer value from 10 to 100 can be selected.
If the table is static then fillfactor 100 is best to minimize the
index's physical size, but for heavily updated tables a smaller
fillfactor is better to minimize the need for page splits. The
other index methods use fillfactor in different but roughly analogous
ways; the default fillfactor varies between methods.
</para>
<para>
B-tree indexes on tables where many inserts and/or updates are
anticipated can benefit from lower fillfactor settings at
<command>CREATE INDEX</command> time (following bulk loading into the
table). Values in the range of 50 - 90 can usefully <quote>smooth
out</quote> the <emphasis>rate</emphasis> of page splits during the
early life of the B-tree index (lowering fillfactor like this may even
lower the absolute number of page splits, though this effect is highly
workload dependent). The B-tree bottom-up index deletion technique
described in <xref linkend="btree-deletion"/> is dependent on having
some <quote>extra</quote> space on pages to store <quote>extra</quote>
tuple versions, and so can be affected by fillfactor (though the effect
is usually not significant).
</para>
<para>
In other specific cases it might be useful to increase fillfactor to
100 at <command>CREATE INDEX</command> time as a way of maximizing
space utilization. You should only consider this when you are
completely sure that the table is static (i.e. that it will never be
affected by either inserts or updates). A fillfactor setting of 100
otherwise risks <emphasis>harming</emphasis> performance: even a few
updates or inserts will cause a sudden flood of page splits.
</para>
<para>
The other index methods use fillfactor in different but roughly
analogous ways; the default fillfactor varies between methods.
</para>
</listitem>
</varlistentry>
This diff is collapsed.
@@ -2563,7 +2563,7 @@ static const TableAmRoutine heapam_methods = {
.tuple_get_latest_tid = heap_get_latest_tid,
.tuple_tid_valid = heapam_tuple_tid_valid,
.tuple_satisfies_snapshot = heapam_tuple_satisfies_snapshot,
.compute_xid_horizon_for_tuples = heap_compute_xid_horizon_for_tuples,
.index_delete_tuples = heap_index_delete_tuples,
.relation_set_new_filenode = heapam_relation_set_new_filenode,
.relation_nontransactional_truncate = heapam_relation_nontransactional_truncate,
@@ -276,11 +276,18 @@ BuildIndexValueDescription(Relation indexRelation,
/*
* Get the latestRemovedXid from the table entries pointed at by the index
* tuples being deleted.
*
* Note: index access methods that don't consistently use the standard
* IndexTuple + heap TID item pointer representation will need to provide
* their own version of this function.
* tuples being deleted using an AM-generic approach.
*
* This is a table_index_delete_tuples() shim used by index AMs that have
* simple requirements. These callers only need to consult the tableam to get
* a latestRemovedXid value, and only expect to delete tuples that are already
* known deletable.  When a latestRemovedXid value isn't needed in the index
* AM's deletion WAL record, it is safe to skip calling here entirely.
*
* We assume that the calling index AM uses the standard IndexTuple
* representation, with table TIDs stored in the t_tid field.  We also expect
* (and assert) that the line pointers on the page for 'itemnos' offsets are
* already marked LP_DEAD.
*/
TransactionId
index_compute_xid_horizon_for_tuples(Relation irel,
@@ -289,12 +296,17 @@ index_compute_xid_horizon_for_tuples(Relation irel,
OffsetNumber *itemnos,
int nitems)
{
ItemPointerData *ttids =
(ItemPointerData *) palloc(sizeof(ItemPointerData) * nitems);
TM_IndexDeleteOp delstate;
TransactionId latestRemovedXid = InvalidTransactionId;
Page ipage = BufferGetPage(ibuf);
IndexTuple itup;
delstate.bottomup = false;
delstate.bottomupfreespace = 0;
delstate.ndeltids = 0;
delstate.deltids = palloc(nitems * sizeof(TM_IndexDelete));
delstate.status = palloc(nitems * sizeof(TM_IndexStatus));
/* identify what the index tuples about to be deleted point to */
for (int i = 0; i < nitems; i++)
{
@@ -303,14 +315,26 @@ index_compute_xid_horizon_for_tuples(Relation irel,
iitemid = PageGetItemId(ipage, itemnos[i]);
itup = (IndexTuple) PageGetItem(ipage, iitemid);
ItemPointerCopy(&itup->t_tid, &ttids[i]);
Assert(ItemIdIsDead(iitemid));
ItemPointerCopy(&itup->t_tid, &delstate.deltids[i].tid);
delstate.deltids[i].id = delstate.ndeltids;
delstate.status[i].idxoffnum = InvalidOffsetNumber; /* unused */
delstate.status[i].knowndeletable = true; /* LP_DEAD-marked */
delstate.status[i].promising = false; /* unused */
delstate.status[i].freespace = 0; /* unused */
delstate.ndeltids++;
}
/* determine the actual xid horizon */
latestRemovedXid =
table_compute_xid_horizon_for_tuples(hrel, ttids, nitems);
latestRemovedXid = table_index_delete_tuples(hrel, &delstate);
/* assert tableam agrees that all items are deletable */
Assert(delstate.ndeltids == nitems);
pfree(ttids);
pfree(delstate.deltids);
pfree(delstate.status);
return latestRemovedXid;
}
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
@@ -209,7 +209,7 @@ btinsert(Relation rel, Datum *values, bool *isnull,
itup = index_form_tuple(RelationGetDescr(rel), values, isnull);
itup->t_tid = *ht_ctid;
result = _bt_doinsert(rel, itup, checkUnique, heapRel);
result = _bt_doinsert(rel, itup, checkUnique, indexUnchanged, heapRel);
pfree(itup);
@@ -1282,10 +1282,10 @@ backtrack:
* as long as the callback function only considers whether the
* index tuple refers to pre-cutoff heap tuples that were
* certainly already pruned away during VACUUM's initial heap
* scan by the time we get here. (XLOG_HEAP2_CLEANUP_INFO
* records produce conflicts using a latestRemovedXid value
* for the entire VACUUM, so there is no need to produce our
* own conflict now.)
* scan by the time we get here. (heapam's XLOG_HEAP2_CLEAN
* and XLOG_HEAP2_CLEANUP_INFO records produce conflicts using
* a latestRemovedXid value for the pointed-to heap tuples, so
* there is no need to produce our own conflict now.)
*
* Backends with snapshots acquired after a VACUUM starts but
* before it finishes could have visibility cutoff with a
@@ -49,7 +49,6 @@
#include "access/parallel.h"
#include "access/relscan.h"
#include "access/table.h"
#include "access/tableam.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
@@ -556,6 +556,47 @@ btree_xlog_dedup(XLogReaderState *record)
UnlockReleaseBuffer(buf);
}
static void
btree_xlog_updates(Page page, OffsetNumber *updatedoffsets,
xl_btree_update *updates, int nupdated)
{
BTVacuumPosting vacposting;
IndexTuple origtuple;
ItemId itemid;
Size itemsz;
for (int i = 0; i < nupdated; i++)
{
itemid = PageGetItemId(page, updatedoffsets[i]);
origtuple = (IndexTuple) PageGetItem(page, itemid);
vacposting = palloc(offsetof(BTVacuumPostingData, deletetids) +
updates->ndeletedtids * sizeof(uint16));
vacposting->updatedoffset = updatedoffsets[i];
vacposting->itup = origtuple;
vacposting->ndeletedtids = updates->ndeletedtids;
memcpy(vacposting->deletetids,
(char *) updates + SizeOfBtreeUpdate,
updates->ndeletedtids * sizeof(uint16));
_bt_update_posting(vacposting);
/* Overwrite updated version of tuple */
itemsz = MAXALIGN(IndexTupleSize(vacposting->itup));
if (!PageIndexTupleOverwrite(page, updatedoffsets[i],
(Item) vacposting->itup, itemsz))
elog(PANIC, "failed to update partially dead item");
pfree(vacposting->itup);
pfree(vacposting);
/* advance to next xl_btree_update from array */
updates = (xl_btree_update *)
((char *) updates + SizeOfBtreeUpdate +
updates->ndeletedtids * sizeof(uint16));
}
}
static void
btree_xlog_vacuum(XLogReaderState *record)
{
@@ -589,41 +630,7 @@ btree_xlog_vacuum(XLogReaderState *record)
xlrec->nupdated *
sizeof(OffsetNumber));
for (int i = 0; i < xlrec->nupdated; i++)
{
BTVacuumPosting vacposting;
IndexTuple origtuple;
ItemId itemid;
Size itemsz;
itemid = PageGetItemId(page, updatedoffsets[i]);
origtuple = (IndexTuple) PageGetItem(page, itemid);
vacposting = palloc(offsetof(BTVacuumPostingData, deletetids) +
updates->ndeletedtids * sizeof(uint16));
vacposting->updatedoffset = updatedoffsets[i];
vacposting->itup = origtuple;
vacposting->ndeletedtids = updates->ndeletedtids;
memcpy(vacposting->deletetids,
(char *) updates + SizeOfBtreeUpdate,
updates->ndeletedtids * sizeof(uint16));
_bt_update_posting(vacposting);
/* Overwrite updated version of tuple */
itemsz = MAXALIGN(IndexTupleSize(vacposting->itup));
if (!PageIndexTupleOverwrite(page, updatedoffsets[i],
(Item) vacposting->itup, itemsz))
elog(PANIC, "failed to update partially dead item");
pfree(vacposting->itup);
pfree(vacposting);
/* advance to next xl_btree_update from array */
updates = (xl_btree_update *)
((char *) updates + SizeOfBtreeUpdate +
updates->ndeletedtids * sizeof(uint16));
}
btree_xlog_updates(page, updatedoffsets, updates, xlrec->nupdated);
}
if (xlrec->ndeleted > 0)
@@ -675,7 +682,22 @@ btree_xlog_delete(XLogReaderState *record)
page = (Page) BufferGetPage(buffer);
PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
if (xlrec->nupdated > 0)
{
OffsetNumber *updatedoffsets;
xl_btree_update *updates;
updatedoffsets = (OffsetNumber *)
(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
updates = (xl_btree_update *) ((char *) updatedoffsets +
xlrec->nupdated *
sizeof(OffsetNumber));
btree_xlog_updates(page, updatedoffsets, updates, xlrec->nupdated);
}
if (xlrec->ndeleted > 0)
PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
/* Mark the page as not containing any LP_DEAD items */
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -63,8 +63,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_delete *xlrec = (xl_btree_delete *) rec;
appendStringInfo(buf, "latestRemovedXid %u; ndeleted %u",
xlrec->latestRemovedXid, xlrec->ndeleted);
appendStringInfo(buf, "latestRemovedXid %u; ndeleted %u; nupdated %u",
xlrec->latestRemovedXid, xlrec->ndeleted, xlrec->nupdated);
break;
}
case XLOG_BTREE_MARK_PAGE_HALFDEAD:
@@ -66,7 +66,7 @@ GetTableAmRoutine(Oid amhandler)
Assert(routine->tuple_tid_valid != NULL);
Assert(routine->tuple_get_latest_tid != NULL);
Assert(routine->tuple_satisfies_snapshot != NULL);
Assert(routine->compute_xid_horizon_for_tuples != NULL);
Assert(routine->index_delete_tuples != NULL);
Assert(routine->tuple_insert != NULL);
@@ -166,9 +166,8 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
extern void simple_heap_update(Relation relation, ItemPointer otid,
HeapTuple tup);
extern TransactionId heap_compute_xid_horizon_for_tuples(Relation rel,
ItemPointerData *items,
int nitems);
extern TransactionId heap_index_delete_tuples(Relation rel,
TM_IndexDeleteOp *delstate);
/* in heap/pruneheap.c */
struct GlobalVisState;
@@ -17,6 +17,7 @@
#include "access/amapi.h"
#include "access/itup.h"
#include "access/sdir.h"
#include "access/tableam.h"
#include "access/xlogreader.h"
#include "catalog/pg_am_d.h"
#include "catalog/pg_index.h"
@@ -168,7 +169,7 @@ typedef struct BTMetaPageData
/*
* MaxTIDsPerBTreePage is an upper bound on the number of heap TIDs
* that may be stored on a btree leaf page. It is used to size the
* per-page temporary buffers used by index scans.
* per-page temporary buffers.
*
* Note: we don't bother considering per-tuple overheads here to keep
* things simple (value is based on how many elements a single array of
@@ -766,8 +767,9 @@ typedef struct BTDedupStateData
typedef BTDedupStateData *BTDedupState;
/*
* BTVacuumPostingData is state that represents how to VACUUM a posting list
* tuple when some (though not all) of its TIDs are to be deleted.
* BTVacuumPostingData is state that represents how to VACUUM (or delete) a
* posting list tuple when some (though not all) of its TIDs are to be
* deleted.
*
* Convention is that itup field is the original posting list tuple on input,
* and palloc()'d final tuple used to overwrite existing tuple on output.
@@ -1031,6 +1033,8 @@ extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
extern void _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel,
IndexTuple newitem, Size newitemsz,
bool checkingunique);
extern bool _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
Size newitemsz);
extern void _bt_dedup_start_pending(BTDedupState state, IndexTuple base,
OffsetNumber baseoff);
extern bool _bt_dedup_save_htid(BTDedupState state, IndexTuple itup);
@@ -1045,7 +1049,8 @@ extern IndexTuple _bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
* prototypes for functions in nbtinsert.c
*/
extern bool _bt_doinsert(Relation rel, IndexTuple itup,
IndexUniqueCheck checkUnique, Relation heapRel);
IndexUniqueCheck checkUnique, bool indexUnchanged,
Relation heapRel);
extern void _bt_finish_split(Relation rel, Buffer lbuf, BTStack stack);
extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child);
@@ -1083,9 +1088,9 @@ extern bool _bt_page_recyclable(Page page);
extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *deletable, int ndeletable,
BTVacuumPosting *updatable, int nupdatable);
extern void _bt_delitems_delete(Relation rel, Buffer buf,
OffsetNumber *deletable, int ndeletable,
Relation heapRel);
extern void _bt_delitems_delete_check(Relation rel, Buffer buf,
Relation heapRel,
TM_IndexDeleteOp *delstate);
extern uint32 _bt_pagedel(Relation rel, Buffer leafbuf,
TransactionId *oldestBtpoXact);
@@ -176,24 +176,6 @@ typedef struct xl_btree_dedup
#define SizeOfBtreeDedup (offsetof(xl_btree_dedup, nintervals) + sizeof(uint16))
/*
* This is what we need to know about delete of individual leaf index tuples.
* The WAL record can represent deletion of any number of index tuples on a
* single index page when *not* executed by VACUUM. Deletion of a subset of
* the TIDs within a posting list tuple is not supported.
*
* Backup Blk 0: index page
*/
typedef struct xl_btree_delete
{
TransactionId latestRemovedXid;
uint32 ndeleted;
/* DELETED TARGET OFFSET NUMBERS FOLLOW */
} xl_btree_delete;
#define SizeOfBtreeDelete (offsetof(xl_btree_delete, ndeleted) + sizeof(uint32))
/*
* This is what we need to know about page reuse within btree. This record
* only exists to generate a conflict point for Hot Standby.
@@ -211,31 +193,30 @@ typedef struct xl_btree_reuse_page
#define SizeOfBtreeReusePage (sizeof(xl_btree_reuse_page))
/*
* This is what we need to know about which TIDs to remove from an individual
* posting list tuple during vacuuming. An array of these may appear at the
* end of xl_btree_vacuum records.
*/
typedef struct xl_btree_update
{
uint16 ndeletedtids;
/* POSTING LIST uint16 OFFSETS TO A DELETED TID FOLLOW */
} xl_btree_update;
#define SizeOfBtreeUpdate (offsetof(xl_btree_update, ndeletedtids) + sizeof(uint16))
/*
* This is what we need to know about a VACUUM of a leaf page. The WAL record
* can represent deletion of any number of index tuples on a single index page
* when executed by VACUUM. It can also support "updates" of index tuples,
* which is how deletes of a subset of TIDs contained in an existing posting
* list tuple are implemented. (Updates are only used when there will be some
* remaining TIDs once VACUUM finishes; otherwise the posting list tuple can
* just be deleted).
* xl_btree_vacuum and xl_btree_delete records describe deletion of index
* tuples on a leaf page. The former variant is used by VACUUM, while the
* latter variant is used by the ad-hoc deletions that sometimes take place
* when btinsert() is called.
*
* The records are very similar. The only difference is that xl_btree_delete
* has to include a latestRemovedXid field to generate recovery conflicts.
* (VACUUM operations can just rely on earlier conflicts generated during
* pruning of the table whose TIDs the to-be-deleted index tuples point to.
* There are also small differences between each REDO routine that we don't go
* into here.)
*
* xl_btree_vacuum and xl_btree_delete both represent deletion of any number
* of index tuples on a single leaf page using page offset numbers. Both also
* support "updates" of index tuples, which is how deletes of a subset of TIDs
* contained in an existing posting list tuple are implemented.
*
* Updated posting list tuples are represented using xl_btree_update metadata.
* The REDO routine uses each xl_btree_update (plus its corresponding original
* index tuple from the target leaf page) to generate the final updated tuple.
* The REDO routines each use the xl_btree_update entries (plus each
* corresponding original index tuple from the target leaf page) to generate
* the final updated tuple.
*
* Updates are only used when there will be some remaining TIDs left by the
* REDO routine. Otherwise the posting list tuple just gets deleted outright.
*/
typedef struct xl_btree_vacuum
{
@@ -244,11 +225,39 @@
/* DELETED TARGET OFFSET NUMBERS FOLLOW */
/* UPDATED TARGET OFFSET NUMBERS FOLLOW */
/* UPDATED TUPLES METADATA ARRAY FOLLOWS */
/* UPDATED TUPLES METADATA (xl_btree_update) ARRAY FOLLOWS */
} xl_btree_vacuum;
#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, nupdated) + sizeof(uint16))
typedef struct xl_btree_delete
{
TransactionId latestRemovedXid;
uint16 ndeleted;
uint16 nupdated;
/* DELETED TARGET OFFSET NUMBERS FOLLOW */
/* UPDATED TARGET OFFSET NUMBERS FOLLOW */
/* UPDATED TUPLES METADATA (xl_btree_update) ARRAY FOLLOWS */
} xl_btree_delete;
#define SizeOfBtreeDelete (offsetof(xl_btree_delete, nupdated) + sizeof(uint16))
/*
* The offsets that appear in xl_btree_update metadata are offsets into the
* posting list of the original tuple, not page offset numbers.  These are
* 0-based. The page offset number for the original posting list tuple comes
* from the main xl_btree_vacuum/xl_btree_delete record.
*/
typedef struct xl_btree_update
{
uint16 ndeletedtids;
/* POSTING LIST uint16 OFFSETS TO A DELETED TID FOLLOW */
} xl_btree_update;
#define SizeOfBtreeUpdate (offsetof(xl_btree_update, ndeletedtids) + sizeof(uint16))
/*
* This is what we need to know about marking an empty subtree for deletion.
* The target identifies the tuple removed from the parent page (note that we
@@ -128,6 +128,106 @@ typedef struct TM_FailureData
bool traversed;
} TM_FailureData;
/*
* State used when calling table_index_delete_tuples().
*
* Represents the status of table tuples, referenced by table TID and taken by
* index AM from index tuples. State consists of high level parameters of the
* deletion operation, plus two mutable palloc()'d arrays for information
* about the status of individual table tuples. These are conceptually one
* single array. Using two arrays keeps the TM_IndexDelete struct small,
* which makes sorting the first array (the deltids array) fast.
*
* Some index AM callers perform simple index tuple deletion (by specifying
* bottomup = false), and include only known-dead deltids. These known-dead
* entries are all marked knowndeletable = true directly (typically these are
* TIDs from LP_DEAD-marked index tuples), but that isn't strictly required.
*
* Callers that specify bottomup = true are "bottom-up index deletion"
* callers. The considerations for the tableam are more subtle with these
* callers because they ask the tableam to perform highly speculative work,
* and might only expect the tableam to check a small fraction of all entries.
* Caller is not allowed to specify knowndeletable = true for any entry
* because everything is highly speculative. Bottom-up caller provides
* context and hints to tableam -- see comments below for details on how index
* AMs and tableams should coordinate during bottom-up index deletion.
*
* Simple index deletion callers may ask the tableam to perform speculative
* work, too. This is a little like bottom-up deletion, but not too much.
* The tableam will only perform speculative work when it's practically free
* to do so in passing for the simple deletion caller (while always performing
* whatever work is needed to enable knowndeletable/LP_DEAD index tuples to
* be deleted within the index AM).  This is the real reason why it's possible
* for a simple index deletion caller to specify knowndeletable = false up front
* (this means "check if it's possible for me to delete corresponding index
* tuple when it's cheap to do so in passing"). The index AM should only
* include "extra" entries for index tuples whose TIDs point to a table block
* that tableam is expected to have to visit anyway (in the event of a block
* orientated tableam). The tableam isn't strictly obligated to check these
* "extra" TIDs, but a block-based AM should always manage to do so in
* practice.
*
* The final contents of the deltids/status arrays are interesting to callers
* that ask tableam to perform speculative work (i.e. when _any_ items have
* knowndeletable set to false up front). These index AM callers will
* naturally need to consult final state to determine which index tuples are
* in fact deletable.
*
* The index AM can keep track of which index tuple relates to which deltid by
* setting idxoffnum (and/or relying on each entry being uniquely identifiable
* using tid), which is important when the final contents of the array will
* need to be interpreted -- the array can shrink from initial size after
* tableam processing and/or have entries in a new order (tableam may sort
* deltids array for its own reasons). Bottom-up callers may find that final
* ndeltids is 0 on return from call to tableam, in which case no index tuple
* deletions are possible. Simple deletion callers can rely on any entries
* they know to be deletable appearing in the final array as deletable.
*/
typedef struct TM_IndexDelete
{
ItemPointerData tid; /* table TID from index tuple */
int16 id; /* Offset into TM_IndexStatus array */
} TM_IndexDelete;
typedef struct TM_IndexStatus
{
OffsetNumber idxoffnum; /* Index am page offset number */
bool knowndeletable; /* Currently known to be deletable? */
/* Bottom-up index deletion specific fields follow */
bool promising; /* Promising (duplicate) index tuple? */
int16 freespace; /* Space freed in index if deleted */
} TM_IndexStatus;
/*
* Index AM/tableam coordination is central to the design of bottom-up index
* deletion. The index AM provides hints about where to look to the tableam
* by marking some entries as "promising". Index AM does this with duplicate
* index tuples that are strongly suspected to be old versions left behind by
* UPDATEs that did not logically modify indexed values. Index AM may find it
* helpful to only mark entries as promising when they're thought to have been
* affected by such an UPDATE in the recent past.
*
* Bottom-up index deletion casts a wide net at first, usually by including
* all TIDs on a target index page. It is up to the tableam to worry about
* the cost of checking transaction status information. The tableam is in
* control, but needs careful guidance from the index AM. Index AM requests
* that bottomupfreespace target be met, while tableam measures progress
* towards that goal by tallying the per-entry freespace value for known
* deletable entries. (All !bottomup callers can just set these space related
* fields to zero.)
*/
typedef struct TM_IndexDeleteOp
{
bool bottomup; /* Bottom-up (not simple) deletion? */
int bottomupfreespace; /* Bottom-up space target */
/* Mutable per-TID information follows (index AM initializes entries) */
int ndeltids; /* Current # of deltids/status elements */
TM_IndexDelete *deltids;
TM_IndexStatus *status;
} TM_IndexDeleteOp;
/* "options" flag bits for table_tuple_insert */
/* TABLE_INSERT_SKIP_WAL was 0x0001; RelationNeedsWAL() now governs */
#define TABLE_INSERT_SKIP_FSM 0x0002
@@ -342,10 +442,9 @@ typedef struct TableAmRoutine
TupleTableSlot *slot,
Snapshot snapshot);
/* see table_compute_xid_horizon_for_tuples() */
TransactionId (*compute_xid_horizon_for_tuples) (Relation rel,
ItemPointerData *items,
int nitems);
/* see table_index_delete_tuples() */
TransactionId (*index_delete_tuples) (Relation rel,
TM_IndexDeleteOp *delstate);
/* ------------------------------------------------------------------------
@@ -1122,16 +1221,23 @@ table_tuple_satisfies_snapshot(Relation rel, TupleTableSlot *slot,
}
/*
* Compute the newest xid among the tuples pointed to by items. This is used
* to compute what snapshots to conflict with when replaying WAL records for
* page-level index vacuums.
* Determine which index tuples are safe to delete based on their table TID.
*
* Determines which entries from index AM caller's TM_IndexDeleteOp state
* point to vacuumable table tuples. Entries that are found by tableam to be
* vacuumable are naturally safe for index AM to delete, and so get directly
* marked as deletable. See comments above TM_IndexDelete and comments above
* TM_IndexDeleteOp for full details.
*
* Returns a latestRemovedXid transaction ID that caller generally places in
* its index deletion WAL record. This might be used during subsequent REDO
* of the WAL record when in Hot Standby mode -- a recovery conflict for the
* index deletion operation might be required on the standby.
*/
static inline TransactionId
table_compute_xid_horizon_for_tuples(Relation rel,
ItemPointerData *items,
int nitems)
table_index_delete_tuples(Relation rel, TM_IndexDeleteOp *delstate)
{
return rel->rd_tableam->compute_xid_horizon_for_tuples(rel, items, nitems);
return rel->rd_tableam->index_delete_tuples(rel, delstate);
}
@@ -31,7 +31,7 @@
/*
* Each page of XLOG file has a header like this:
*/
#define XLOG_PAGE_MAGIC 0xD108 /* can be used as WAL version indicator */
#define XLOG_PAGE_MAGIC 0xD109 /* can be used as WAL version indicator */
typedef struct XLogPageHeaderData
{