Update FSM on WAL replay of page all-visible/frozen

We aren't very strict about keeping FSM up to date on WAL replay, because per-page freespace values aren't critical in replicas (can't write to heap in a replica; and if the replica is promoted, the values would be updated by VACUUM anyway). However, VACUUM since 9.6 can skip processing pages marked all-visible or all-frozen, and if such pages are recorded in FSM with wrong values, those values are blindly propagated to FSM's upper layers by VACUUM's FreeSpaceMapVacuum. (This rationale assumes that crashes are not very frequent, because those would cause outdated FSM to occur in the primary.) Even when the FSM is outdated in standby, things are not too bad normally, because, most per-page FSM values will be zero (other than those propagated with the base-backup that created the standby); only once the remaining free space is less than 0.2*BLCKSZ the per-page value is maintained by WAL replay of heap ins/upd/del. However, if wal_log_hints=on causes complete FSM pages to be propagated to a standby via full-page images, many too-optimistic per-page values can end up being registered in the standby. Incorrect per-page values aren't critical in most cases, since an inserter that is given a page that doesn't actually contain the claimed free space will update FSM with the correct value, and retry until it finds a usable page. However, if there are many such updates to do, an inserter can spend a long time doing them before a usable page is found; in a heavily trafficked insert-only table with many concurrent inserters this has been observed to cause several second stalls, causing visible application malfunction. To fix this problem, it seems sufficient to have heap_xlog_visible (replay of setting all-visible and all-frozen VM bits for a heap page) update the FSM value for the page being processed. This fixes the per-page counters together with making the page skippable to vacuum, so when vacuum does FreeSpaceMapVacuum, the values propagated to FSM upper layers are the correct ones, avoiding the problem. While at it, apply the same fix to heap_xlog_clean (replay of tuple removal by HOT pruning and vacuum). This makes any space freed by the cleaning available earlier than the next vacuum in the promoted replica. Backpatch to 9.6, where this problem was diagnosed on an insert-only table with all-frozen pages, which were introduced as a concept in that release. Theoretically it could apply with all-visible pages to older branches, but there's been no report of that and it doesn't backpatch cleanly anyway. Author: Álvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/20180802172857.5skoexsilnjvgruk@alvherre.pgsql

Update FSM on WAL replay of page all-visible/frozen
We aren't very strict about keeping FSM up to date on WAL replay, because per-page freespace values aren't critical in replicas (can't write to heap in a replica; and if the replica is promoted, the values would be updated by VACUUM anyway). However, VACUUM since 9.6 can skip processing pages marked all-visible or all-frozen, and if such pages are recorded in FSM with wrong values, those values are blindly propagated to FSM's upper layers by VACUUM's FreeSpaceMapVacuum. (This rationale assumes that crashes are not very frequent, because those would cause outdated FSM to occur in the primary.) Even when the FSM is outdated in standby, things are not too bad normally, because, most per-page FSM values will be zero (other than those propagated with the base-backup that created the standby); only once the remaining free space is less than 0.2*BLCKSZ the per-page value is maintained by WAL replay of heap ins/upd/del. However, if wal_log_hints=on causes complete FSM pages to be propagated to a standby via full-page images, many too-optimistic per-page values can end up being registered in the standby. Incorrect per-page values aren't critical in most cases, since an inserter that is given a page that doesn't actually contain the claimed free space will update FSM with the correct value, and retry until it finds a usable page. However, if there are many such updates to do, an inserter can spend a long time doing them before a usable page is found; in a heavily trafficked insert-only table with many concurrent inserters this has been observed to cause several second stalls, causing visible application malfunction. To fix this problem, it seems sufficient to have heap_xlog_visible (replay of setting all-visible and all-frozen VM bits for a heap page) update the FSM value for the page being processed. This fixes the per-page counters together with making the page skippable to vacuum, so when vacuum does FreeSpaceMapVacuum, the values propagated to FSM upper layers are the correct ones, avoiding the problem. While at it, apply the same fix to heap_xlog_clean (replay of tuple removal by HOT pruning and vacuum). This makes any space freed by the cleaning available earlier than the next vacuum in the promoted replica. Backpatch to 9.6, where this problem was diagnosed on an insert-only table with all-frozen pages, which were introduced as a concept in that release. Theoretically it could apply with all-visible pages to older branches, but there's been no report of that and it doesn't backpatch cleanly anyway. Author: Álvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/20180802172857.5skoexsilnjvgruk@alvherre.pgsql
ab7dbd68 · Alvaro Herrera · cc4f6b77 · ab7dbd68
Commit ab7dbd68 authored Aug 15, 2018 by Alvaro Herrera
Hide whitespace changes
Inline Side-by-side

Showing with 40 additions and 12 deletions

src/backend/access/heap/heapam.c src/backend/access/heap/heapam.c +40 -12

No files found.
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -8172,7 +8172,7 @@ heap_xlog_cleanup_info(XLogReaderState *record)
 }
 /*
- * Handles HEAP2_CLEAN record type
+ * Handles XLOG_HEAP2_CLEAN record type
 */
 static void
 heap_xlog_clean(XLogReaderState *record)
@@ -8180,7 +8180,6 @@ heap_xlog_clean(XLogReaderState *record)
 	XLogRecPtr	lsn = record->EndRecPtr;
 	xl_heap_clean *xlrec = (xl_heap_clean *) XLogRecGetData(record);
 	Buffer		buffer;
-	Size		freespace = 0;
 	RelFileNode rnode;
 	BlockNumber blkno;
 	XLogRedoAction action;
@@ -8232,8 +8231,6 @@ heap_xlog_clean(XLogReaderState *record)
 								nowdead, ndead,
 								nowunused, nunused);
-		freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
 		/*
 		 * Note: we don't worry about updating the page's prunability hints.
 		 * At worst this will cause an extra prune cycle to occur soon.
@@ -8242,18 +8239,24 @@ heap_xlog_clean(XLogReaderState *record)
 		PageSetLSN(page, lsn);
 		MarkBufferDirty(buffer);
 	}
 	if (BufferIsValid(buffer))
+	{
+		Size		freespace = PageGetHeapFreeSpace(BufferGetPage(buffer));
 		UnlockReleaseBuffer(buffer);
-	/*
+		/*
-	 * Update the FSM as well.
+		 * After cleaning records from a page, it's useful to update the FSM
-	 *
+		 * about it, as it may cause the page become target for insertions
-	 * XXX: Don't do this if the page was restored from full page image. We
+		 * later even if vacuum decides not to visit it (which is possible if
-	 * don't bother to update the FSM in that case, it doesn't need to be
+		 * gets marked all-visible.)
-	 * totally accurate anyway.
+		 *
-	 */
+		 * Do this regardless of a full-page image being applied, since the
-	if (action == BLK_NEEDS_REDO)
+		 * FSM data is not in the page anyway.
+		 */
 		XLogRecordPageWithFreeSpace(rnode, blkno, freespace);
+	}
 }
 /*
@@ -8326,9 +8329,34 @@ heap_xlog_visible(XLogReaderState *record)
 		 * wal_log_hints enabled.)
 		 */
 	}
 	if (BufferIsValid(buffer))
+	{
+		Size		space = PageGetFreeSpace(BufferGetPage(buffer));
 		UnlockReleaseBuffer(buffer);
+		/*
+		 * Since FSM is not WAL-logged and only updated heuristically, it
+		 * easily becomes stale in standbys.  If the standby is later promoted
+		 * and runs VACUUM, it will skip updating individual free space
+		 * figures for pages that became all-visible (or all-frozen, depending
+		 * on the vacuum mode,) which is troublesome when FreeSpaceMapVacuum
+		 * propagates too optimistic free space values to upper FSM layers;
+		 * later inserters try to use such pages only to find out that they
+		 * are unusable.  This can cause long stalls when there are many such
+		 * pages.
+		 *
+		 * Forestall those problems by updating FSM's idea about a page that
+		 * is becoming all-visible or all-frozen.
+		 *
+		 * Do this regardless of a full-page image being applied, since the
+		 * FSM data is not in the page anyway.
+		 */
+		if (xlrec->flags & VISIBILITYMAP_VALID_BITS)
+			XLogRecordPageWithFreeSpace(rnode, blkno, space);
+	}
 	/*
 	 * Even if we skipped the heap page update due to the LSN interlock, it's
 	 * still safe to update the visibility map.  Any WAL record that clears