Commit 17118825 authored by Tom Lane

Fix transient clobbering of shared buffers during WAL replay.

RestoreBkpBlocks was in the habit of zeroing and refilling the target
buffer, which was perfectly safe when the code was written but is unsafe
during Hot Standby operation.  The reason is that we have coding rules
that allow backends to continue accessing a tuple in a heap relation while
holding only a pin on its buffer.  Such a backend could see transiently
zeroed data, if WAL replay had occasion to change other data on the page.
This has been shown to be the cause of bug #6425 from Duncan Rance (who
deserves kudos for developing a sufficiently-reproducible test case) as
well as Bridget Frey's re-report of bug #6200.  It most likely explains the
original report as well, though we don't yet have confirmation of that.

To fix, change the code so that only bytes that are supposed to change will
change, even transiently.  This actually saves cycles in RestoreBkpBlocks,
since it's not writing the same bytes twice.
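
The idea can be sketched outside the PostgreSQL tree. The following is a minimal illustration, not the actual RestoreBkpBlocks code: DEMO_PAGE_SIZE and demo_restore_page are hypothetical names standing in for BLCKSZ and the restore logic, and the backup image is assumed to omit an all-zero "hole" in the middle of the page. Each byte of the target is stored exactly once, with its final value, so bytes whose content does not change are never seen as zero.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_PAGE_SIZE 8192     /* hypothetical stand-in for BLCKSZ */

/*
 * Restore a page image whose all-zero hole was left out of the backup.
 * blk holds DEMO_PAGE_SIZE - hole_length bytes: the data before the hole
 * followed directly by the data after it.
 *
 * The page is not zeroed as a whole first; only the hole itself is
 * zero-filled, so every byte is written once with its final value.
 */
static void
demo_restore_page(char *page, const char *blk,
                  size_t hole_offset, size_t hole_length)
{
    assert(hole_offset + hole_length <= DEMO_PAGE_SIZE);

    /* data before the hole */
    memcpy(page, blk, hole_offset);
    /* the hole is the only region that is zero-filled */
    memset(page + hole_offset, 0, hole_length);
    /* data after the hole */
    memcpy(page + hole_offset + hole_length,
           blk + hole_offset,
           DEMO_PAGE_SIZE - (hole_offset + hole_length));
}
```

Compared with zeroing the whole page and then copying, this version also touches each byte once rather than twice, which is the cycle saving mentioned above.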

Also fix seq_redo, which has the same disease, though it has to work a bit
harder to meet the requirement.
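
The seq_redo approach (build the new page contents in private memory, then copy them over the shared buffer in one pass) can be sketched as follows. This is a simplified illustration under stated assumptions, not the PostgreSQL code: demo_rebuild_page, DEMO_PAGE_SIZE and DEMO_MAGIC are hypothetical, the page layout is invented for the example, and malloc stands in for palloc as the source of suitably aligned workspace.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_PAGE_SIZE 8192     /* hypothetical page size */
#define DEMO_MAGIC 0x1717u      /* hypothetical marker, not SEQ_MAGIC */

/*
 * Rebuild a page from scratch without transiently trashing the shared
 * copy. All initialization happens in a private workspace; the shared
 * page only ever receives final byte values via a single memcpy.
 *
 * Returns 0 on success, -1 if the workspace cannot be allocated.
 */
static int
demo_rebuild_page(char *shared_page, int64_t last_value)
{
    uint32_t magic = DEMO_MAGIC;
    char *local = malloc(DEMO_PAGE_SIZE);

    if (local == NULL)
        return -1;

    /* reinitialize the private copy (rough analogue of PageInit) */
    memset(local, 0, DEMO_PAGE_SIZE);
    /* invented layout: payload at the start, magic at the very end */
    memcpy(local, &last_value, sizeof(last_value));
    memcpy(local + DEMO_PAGE_SIZE - sizeof(magic), &magic, sizeof(magic));

    /* publish: bytes that are identical in old and new stay identical */
    memcpy(shared_page, local, DEMO_PAGE_SIZE);

    free(local);
    return 0;
}
```

The cost is one extra page-sized allocation and copy per replayed record, which is the "work a bit harder" part of the fix.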

So far as I can tell, no other WAL replay routines have this type of bug.
In particular, the index-related replay routines, which would certainly be
broken if they had to meet the same standard, are not at risk because we
do not have coding rules that allow access to an index page when not
holding a buffer lock on it.

Back-patch to 9.0 where Hot Standby was added.
parent ee68a441
@@ -3716,9 +3716,9 @@ RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup)
         }
         else
         {
-            /* must zero-fill the hole */
-            MemSet((char *) page, 0, BLCKSZ);
             memcpy((char *) page, blk, bkpb.hole_offset);
+            /* must zero-fill the hole */
+            MemSet((char *) page + bkpb.hole_offset, 0, bkpb.hole_length);
             memcpy((char *) page + (bkpb.hole_offset + bkpb.hole_length),
                    blk + bkpb.hole_offset,
                    BLCKSZ - (bkpb.hole_offset + bkpb.hole_length));
......
@@ -1521,6 +1521,7 @@ seq_redo(XLogRecPtr lsn, XLogRecord *record)
     uint8       info = record->xl_info & ~XLR_INFO_MASK;
     Buffer      buffer;
     Page        page;
+    Page        localpage;
     char       *item;
     Size        itemsz;
     xl_seq_rec *xlrec = (xl_seq_rec *) XLogRecGetData(record);
@@ -1536,23 +1537,37 @@ seq_redo(XLogRecPtr lsn, XLogRecord *record)
     Assert(BufferIsValid(buffer));
     page = (Page) BufferGetPage(buffer);
-    /* Always reinit the page and reinstall the magic number */
-    /* See comments in DefineSequence */
-    PageInit((Page) page, BufferGetPageSize(buffer), sizeof(sequence_magic));
-    sm = (sequence_magic *) PageGetSpecialPointer(page);
+    /*
+     * We must always reinit the page and reinstall the magic number (see
+     * comments in fill_seq_with_data).  However, since this WAL record type
+     * is also used for updating sequences, it's possible that a hot-standby
+     * backend is examining the page concurrently; so we mustn't transiently
+     * trash the buffer.  The solution is to build the correct new page
+     * contents in local workspace and then memcpy into the buffer.  Then
+     * only bytes that are supposed to change will change, even transiently.
+     * We must palloc the local page for alignment reasons.
+     */
+    localpage = (Page) palloc(BufferGetPageSize(buffer));
+    PageInit(localpage, BufferGetPageSize(buffer), sizeof(sequence_magic));
+    sm = (sequence_magic *) PageGetSpecialPointer(localpage);
     sm->magic = SEQ_MAGIC;
     item = (char *) xlrec + sizeof(xl_seq_rec);
     itemsz = record->xl_len - sizeof(xl_seq_rec);
     itemsz = MAXALIGN(itemsz);
-    if (PageAddItem(page, (Item) item, itemsz,
+    if (PageAddItem(localpage, (Item) item, itemsz,
                     FirstOffsetNumber, false, false) == InvalidOffsetNumber)
         elog(PANIC, "seq_redo: failed to add item to page");
-    PageSetLSN(page, lsn);
-    PageSetTLI(page, ThisTimeLineID);
+    PageSetLSN(localpage, lsn);
+    PageSetTLI(localpage, ThisTimeLineID);
+    memcpy(page, localpage, BufferGetPageSize(buffer));
     MarkBufferDirty(buffer);
     UnlockReleaseBuffer(buffer);
+    pfree(localpage);
 }
 void
......