Fix possible HOT corruption when RECENTLY_DEAD changes to DEAD while pruning.
Since dc7420c2 the horizon used for pruning is determined "lazily". A more accurate horizon is built on-demand, rather than in GetSnapshotData(). If a horizon computation is triggered between two HeapTupleSatisfiesVacuum() calls for the same tuple, the result can change from RECENTLY_DEAD to DEAD. heap_page_prune() can process the same tid multiple times (once following an update chain, once "directly"). When the result of HeapTupleSatisfiesVacuum() of a tuple changes from RECENTLY_DEAD during the first access, to DEAD in the second, the "tuple is DEAD and doesn't chain to anything else" path in heap_prune_chain() can end up marking the target of a LP_REDIRECT ItemId unused. Initially not easily visible, Once the target of a LP_REDIRECT ItemId is marked unused, a new tuple version can reuse it. At that point the corruption may become visible, as index entries pointing to the "original" redirect item, now point to a unrelated tuple. To fix, compute HTSV for all tuples on a page only once. This fixes the entire class of problems of HTSV changing inside heap_page_prune(). However, visibility changes can obviously still occur between HTSV checks inside heap_page_prune() and outside (e.g. in lazy_scan_prune()). The computation of HTSV is now done in bulk, in heap_page_prune(), rather than on-demand in heap_prune_chain(). Besides being a bit simpler, it also is faster: Memory accesses can happen sequentially, rather than in the order of HOT chains. There are other causes of HeapTupleSatisfiesVacuum() results changing between two visibility checks for the same tuple, even before dc7420c2. E.g. HEAPTUPLE_INSERT_IN_PROGRESS can change to HEAPTUPLE_DEAD when a transaction aborts between the two checks. None of the these other visibility status changes are known to cause corruption, but heap_page_prune()'s approach makes it hard to be confident. A patch implementing a more fundamental redesign of heap_page_prune(), which fixes this bug and simplifies pruning substantially, has been proposed by Peter Geoghegan in https://postgr.es/m/CAH2-WzmNk6V6tqzuuabxoxM8HJRaWU6h12toaS-bqYcLiht16A@mail.gmail.com However, that redesign is larger change than desirable for backpatching. As the new design still benefits from the batched visibility determination introduced in this commit, it makes sense to commit this narrower fix to 14 and master, and then commit Peter's improvement in master. The precise sequence required to trigger the bug is complicated and hard to do exercise in an isolation test (until we have wait points). Due to that the isolation test initially posted at https://postgr.es/m/20211119003623.d3jusiytzjqwb62p%40alap3.anarazel.de and updated in https://postgr.es/m/20211122175914.ayk6gg6nvdwuhrzb%40alap3.anarazel.de isn't committable. A followup commit will introduce additional assertions, to detect problems like this more easily. Bug: #17255 Reported-By: Alexander Lakhin <exclusion@gmail.com> Debugged-By: Andres Freund <andres@anarazel.de> Debugged-By: Peter Geoghegan <pg@bowt.ie> Author: Andres Freund <andres@andres@anarazel.de> Reviewed-By: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/20211122175914.ayk6gg6nvdwuhrzb@alap3.anarazel.de Backpatch: 14-, the oldest branch containing dc7420c2
Showing
Please register or sign in to comment