Commit dad1539a authored by Andres Freund's avatar Andres Freund

Fix possible HOT corruption when RECENTLY_DEAD changes to DEAD while pruning.

Since dc7420c2 the horizon used for pruning is determined "lazily". A more
accurate horizon is built on-demand, rather than in GetSnapshotData(). If a
horizon computation is triggered between two HeapTupleSatisfiesVacuum() calls
for the same tuple, the result can change from RECENTLY_DEAD to DEAD.

heap_page_prune() can process the same tid multiple times (once following an
update chain, once "directly"). When the result of HeapTupleSatisfiesVacuum()
of a tuple changes from RECENTLY_DEAD during the first access, to DEAD in the
second, the "tuple is DEAD and doesn't chain to anything else" path in
heap_prune_chain() can end up marking the target of a LP_REDIRECT ItemId
unused.

Initially not easily visible,
Once the target of a LP_REDIRECT ItemId is marked unused, a new tuple version
can reuse it. At that point the corruption may become visible, as index
entries pointing to the "original" redirect item, now point to a unrelated
tuple.

To fix, compute HTSV for all tuples on a page only once. This fixes the entire
class of problems of HTSV changing inside heap_page_prune(). However,
visibility changes can obviously still occur between HTSV checks inside
heap_page_prune() and outside (e.g. in lazy_scan_prune()).

The computation of HTSV is now done in bulk, in heap_page_prune(), rather than
on-demand in heap_prune_chain(). Besides being a bit simpler, it also is
faster: Memory accesses can happen sequentially, rather than in the order of
HOT chains.

There are other causes of HeapTupleSatisfiesVacuum() results changing between
two visibility checks for the same tuple, even before dc7420c2. E.g.
HEAPTUPLE_INSERT_IN_PROGRESS can change to HEAPTUPLE_DEAD when a transaction
aborts between the two checks. None of the these other visibility status
changes are known to cause corruption, but heap_page_prune()'s approach makes
it hard to be confident.

A patch implementing a more fundamental redesign of heap_page_prune(), which
fixes this bug and simplifies pruning substantially, has been proposed by
Peter Geoghegan in
https://postgr.es/m/CAH2-WzmNk6V6tqzuuabxoxM8HJRaWU6h12toaS-bqYcLiht16A@mail.gmail.com

However, that redesign is larger change than desirable for backpatching. As
the new design still benefits from the batched visibility determination
introduced in this commit, it makes sense to commit this narrower fix to 14
and master, and then commit Peter's improvement in master.

The precise sequence required to trigger the bug is complicated and hard to do
exercise in an isolation test (until we have wait points). Due to that the
isolation test initially posted at
https://postgr.es/m/20211119003623.d3jusiytzjqwb62p%40alap3.anarazel.de
and updated in
https://postgr.es/m/20211122175914.ayk6gg6nvdwuhrzb%40alap3.anarazel.de
isn't committable.

A followup commit will introduce additional assertions, to detect problems
like this more easily.

Bug: #17255
Reported-By: default avatarAlexander Lakhin <exclusion@gmail.com>
Debugged-By: default avatarAndres Freund <andres@anarazel.de>
Debugged-By: default avatarPeter Geoghegan <pg@bowt.ie>
Author: Andres Freund <andres@andres@anarazel.de>
Reviewed-By: default avatarPeter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/20211122175914.ayk6gg6nvdwuhrzb@alap3.anarazel.de
Backpatch: 14-, the oldest branch containing dc7420c2
parent ad5b6f24
...@@ -56,11 +56,30 @@ typedef struct ...@@ -56,11 +56,30 @@ typedef struct
OffsetNumber redirected[MaxHeapTuplesPerPage * 2]; OffsetNumber redirected[MaxHeapTuplesPerPage * 2];
OffsetNumber nowdead[MaxHeapTuplesPerPage]; OffsetNumber nowdead[MaxHeapTuplesPerPage];
OffsetNumber nowunused[MaxHeapTuplesPerPage]; OffsetNumber nowunused[MaxHeapTuplesPerPage];
/* marked[i] is true if item i is entered in one of the above arrays */
/*
* marked[i] is true if item i is entered in one of the above arrays.
*
* This needs to be MaxHeapTuplesPerPage + 1 long as FirstOffsetNumber is
* 1. Otherwise every access would need to subtract 1.
*/
bool marked[MaxHeapTuplesPerPage + 1]; bool marked[MaxHeapTuplesPerPage + 1];
/*
* Tuple visibility is only computed once for each tuple, for correctness
* and efficiency reasons; see comment in heap_page_prune() for
* details. This is of type int8[,] intead of HTSV_Result[], so we can use
* -1 to indicate no visibility has been computed, e.g. for LP_DEAD items.
*
* Same indexing as ->marked.
*/
int8 htsv[MaxHeapTuplesPerPage + 1];
} PruneState; } PruneState;
/* Local functions */ /* Local functions */
static HTSV_Result heap_prune_satisfies_vacuum(PruneState *prstate,
HeapTuple tup,
Buffer buffer);
static int heap_prune_chain(Buffer buffer, static int heap_prune_chain(Buffer buffer,
OffsetNumber rootoffnum, OffsetNumber rootoffnum,
PruneState *prstate); PruneState *prstate);
...@@ -228,6 +247,7 @@ heap_page_prune(Relation relation, Buffer buffer, ...@@ -228,6 +247,7 @@ heap_page_prune(Relation relation, Buffer buffer,
OffsetNumber offnum, OffsetNumber offnum,
maxoff; maxoff;
PruneState prstate; PruneState prstate;
HeapTupleData tup;
/* /*
* Our strategy is to scan the page and make lists of items to change, * Our strategy is to scan the page and make lists of items to change,
...@@ -250,8 +270,60 @@ heap_page_prune(Relation relation, Buffer buffer, ...@@ -250,8 +270,60 @@ heap_page_prune(Relation relation, Buffer buffer,
prstate.nredirected = prstate.ndead = prstate.nunused = 0; prstate.nredirected = prstate.ndead = prstate.nunused = 0;
memset(prstate.marked, 0, sizeof(prstate.marked)); memset(prstate.marked, 0, sizeof(prstate.marked));
/* Scan the page */
maxoff = PageGetMaxOffsetNumber(page); maxoff = PageGetMaxOffsetNumber(page);
tup.t_tableOid = RelationGetRelid(prstate.rel);
/*
* Determine HTSV for all tuples.
*
* This is required for correctness to deal with cases where running HTSV
* twice could result in different results (e.g. RECENTLY_DEAD can turn to
* DEAD if another checked item causes GlobalVisTestIsRemovableFullXid()
* to update the horizon, INSERT_IN_PROGRESS can change to DEAD if the
* inserting transaction aborts, ...). That in turn could cause
* heap_prune_chain() to behave incorrectly if a tuple is reached twice,
* once directly via a heap_prune_chain() and once following a HOT chain.
*
* It's also good for performance. Most commonly tuples within a page are
* stored at decreasing offsets (while the items are stored at increasing
* offsets). When processing all tuples on a page this leads to reading
* memory at decreasing offsets within a page, with a variable stride.
* That's hard for CPU prefetchers to deal with. Processing the items in
* reverse order (and thus the tuples in increasing order) increases
* prefetching efficiency significantly / decreases the number of cache
* misses.
*/
for (offnum = maxoff;
offnum >= FirstOffsetNumber;
offnum = OffsetNumberPrev(offnum))
{
ItemId itemid = PageGetItemId(page, offnum);
HeapTupleHeader htup;
/* Nothing to do if slot doesn't contain a tuple */
if (!ItemIdIsNormal(itemid))
{
prstate.htsv[offnum] = -1;
continue;
}
htup = (HeapTupleHeader) PageGetItem(page, itemid);
tup.t_data = htup;
tup.t_len = ItemIdGetLength(itemid);
ItemPointerSet(&(tup.t_self), BufferGetBlockNumber(buffer), offnum);
/*
* Set the offset number so that we can display it along with any
* error that occurred while processing this tuple.
*/
if (off_loc)
*off_loc = offnum;
prstate.htsv[offnum] = heap_prune_satisfies_vacuum(&prstate, &tup,
buffer);
}
/* Scan the page */
for (offnum = FirstOffsetNumber; for (offnum = FirstOffsetNumber;
offnum <= maxoff; offnum <= maxoff;
offnum = OffsetNumberNext(offnum)) offnum = OffsetNumberNext(offnum))
...@@ -262,10 +334,7 @@ heap_page_prune(Relation relation, Buffer buffer, ...@@ -262,10 +334,7 @@ heap_page_prune(Relation relation, Buffer buffer,
if (prstate.marked[offnum]) if (prstate.marked[offnum])
continue; continue;
/* /* see preceding loop */
* Set the offset number so that we can display it along with any
* error that occurred while processing this tuple.
*/
if (off_loc) if (off_loc)
*off_loc = offnum; *off_loc = offnum;
...@@ -488,17 +557,14 @@ heap_prune_satisfies_vacuum(PruneState *prstate, HeapTuple tup, Buffer buffer) ...@@ -488,17 +557,14 @@ heap_prune_satisfies_vacuum(PruneState *prstate, HeapTuple tup, Buffer buffer)
* the HOT chain is pruned by removing all DEAD tuples at the start of the HOT * the HOT chain is pruned by removing all DEAD tuples at the start of the HOT
* chain. We also prune any RECENTLY_DEAD tuples preceding a DEAD tuple. * chain. We also prune any RECENTLY_DEAD tuples preceding a DEAD tuple.
* This is OK because a RECENTLY_DEAD tuple preceding a DEAD tuple is really * This is OK because a RECENTLY_DEAD tuple preceding a DEAD tuple is really
* DEAD, the OldestXmin test is just too coarse to detect it. * DEAD, our visibility test is just too coarse to detect it.
* *
* The root line pointer is redirected to the tuple immediately after the * The root line pointer is redirected to the tuple immediately after the
* latest DEAD tuple. If all tuples in the chain are DEAD, the root line * latest DEAD tuple. If all tuples in the chain are DEAD, the root line
* pointer is marked LP_DEAD. (This includes the case of a DEAD simple * pointer is marked LP_DEAD. (This includes the case of a DEAD simple
* tuple, which we treat as a chain of length 1.) * tuple, which we treat as a chain of length 1.)
* *
* OldestXmin is the cutoff XID used to identify dead tuples. * We don't actually change the page here. We just add entries to the arrays in
*
* We don't actually change the page here, except perhaps for hint-bit updates
* caused by HeapTupleSatisfiesVacuum. We just add entries to the arrays in
* prstate showing the changes to be made. Items to be redirected are added * prstate showing the changes to be made. Items to be redirected are added
* to the redirected[] array (two entries per redirection); items to be set to * to the redirected[] array (two entries per redirection); items to be set to
* LP_DEAD state are added to nowdead[]; and items to be set to LP_UNUSED * LP_DEAD state are added to nowdead[]; and items to be set to LP_UNUSED
...@@ -520,9 +586,6 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate) ...@@ -520,9 +586,6 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
OffsetNumber chainitems[MaxHeapTuplesPerPage]; OffsetNumber chainitems[MaxHeapTuplesPerPage];
int nchain = 0, int nchain = 0,
i; i;
HeapTupleData tup;
tup.t_tableOid = RelationGetRelid(prstate->rel);
rootlp = PageGetItemId(dp, rootoffnum); rootlp = PageGetItemId(dp, rootoffnum);
...@@ -531,12 +594,9 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate) ...@@ -531,12 +594,9 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
*/ */
if (ItemIdIsNormal(rootlp)) if (ItemIdIsNormal(rootlp))
{ {
Assert(prstate->htsv[rootoffnum] != -1);
htup = (HeapTupleHeader) PageGetItem(dp, rootlp); htup = (HeapTupleHeader) PageGetItem(dp, rootlp);
tup.t_data = htup;
tup.t_len = ItemIdGetLength(rootlp);
ItemPointerSet(&(tup.t_self), BufferGetBlockNumber(buffer), rootoffnum);
if (HeapTupleHeaderIsHeapOnly(htup)) if (HeapTupleHeaderIsHeapOnly(htup))
{ {
/* /*
...@@ -557,8 +617,8 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate) ...@@ -557,8 +617,8 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
* either here or while following a chain below. Whichever path * either here or while following a chain below. Whichever path
* gets there first will mark the tuple unused. * gets there first will mark the tuple unused.
*/ */
if (heap_prune_satisfies_vacuum(prstate, &tup, buffer) if (prstate->htsv[rootoffnum] == HEAPTUPLE_DEAD &&
== HEAPTUPLE_DEAD && !HeapTupleHeaderIsHotUpdated(htup)) !HeapTupleHeaderIsHotUpdated(htup))
{ {
heap_prune_record_unused(prstate, rootoffnum); heap_prune_record_unused(prstate, rootoffnum);
HeapTupleHeaderAdvanceLatestRemovedXid(htup, HeapTupleHeaderAdvanceLatestRemovedXid(htup,
...@@ -618,12 +678,9 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate) ...@@ -618,12 +678,9 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
break; break;
Assert(ItemIdIsNormal(lp)); Assert(ItemIdIsNormal(lp));
Assert(prstate->htsv[offnum] != -1);
htup = (HeapTupleHeader) PageGetItem(dp, lp); htup = (HeapTupleHeader) PageGetItem(dp, lp);
tup.t_data = htup;
tup.t_len = ItemIdGetLength(lp);
ItemPointerSet(&(tup.t_self), BufferGetBlockNumber(buffer), offnum);
/* /*
* Check the tuple XMIN against prior XMAX, if any * Check the tuple XMIN against prior XMAX, if any
*/ */
...@@ -641,7 +698,7 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate) ...@@ -641,7 +698,7 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
*/ */
tupdead = recent_dead = false; tupdead = recent_dead = false;
switch (heap_prune_satisfies_vacuum(prstate, &tup, buffer)) switch ((HTSV_Result) prstate->htsv[offnum])
{ {
case HEAPTUPLE_DEAD: case HEAPTUPLE_DEAD:
tupdead = true; tupdead = true;
......
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment