Commit efc16ea5 authored by Simon Riggs

Allow read only connections during recovery, known as Hot Standby.

Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read-only queries. Recovery must enter a consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict with, and in some cases deadlock with, queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though they introduce four new types of WAL record.

New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port-specific behaviours have been used, though primary testing has been on Linux only so far.

This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required.

Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit.

Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.
parent 78a09145
<!-- $PostgreSQL: pgsql/doc/src/sgml/config.sgml,v 1.239 2009/12/19 01:32:31 sriggs Exp $ -->
<chapter Id="runtime-config">
 <title>Server Configuration</title>
...@@ -376,6 +376,12 @@ SET ENABLE_SEQSCAN TO OFF;
        allows.  See <xref linkend="sysvipc"> for information on how to
        adjust those parameters, if necessary.
       </para>
<para>
When running a standby server, you must set this parameter to the
same or higher value than on the master server. Otherwise, queries
will not be allowed in the standby server.
</para>
      </listitem>
     </varlistentry>
...@@ -826,6 +832,12 @@ SET ENABLE_SEQSCAN TO OFF;
        allows.  See <xref linkend="sysvipc"> for information on how to
        adjust those parameters, if necessary.
       </para>
<para>
When running a standby server, you must set this parameter to the
same or higher value than on the master server. Otherwise, queries
will not be allowed in the standby server.
</para>
      </listitem>
     </varlistentry>
...@@ -1733,6 +1745,51 @@ archive_command = 'copy "%p" "C:\\server\\archivedir\\%f"'  # Windows
   </variablelist>
  </sect2>
<sect2 id="runtime-config-standby">
<title>Standby Servers</title>
<variablelist>
<varlistentry id="recovery-connections" xreflabel="recovery_connections">
<term><varname>recovery_connections</varname> (<type>boolean</type>)</term>
<listitem>
<para>
        This parameter has two roles.  During recovery, it specifies whether
        you can connect and run queries, enabling <xref linkend="hot-standby">.
        During normal running, it specifies whether additional information is
        written to WAL to allow recovery connections on a standby server that
        reads WAL data generated by this server.  The default value is
        <literal>on</literal>.  There is thought to be little measurable
        difference in performance from using this feature, so feedback is
        welcome if any production impacts are noticeable.  It is likely that
        this parameter will be removed in later releases.  This parameter
        can only be set at server start.
</para>
</listitem>
</varlistentry>
<varlistentry id="max-standby-delay" xreflabel="max_standby_delay">
<term><varname>max_standby_delay</varname> (<type>string</type>)</term>
<listitem>
<para>
        When the server acts as a standby, this parameter specifies a wait
        policy for queries that conflict with incoming data changes.  Valid
        settings are -1, meaning wait forever, or a wait time of 0 or more
        seconds.  If a conflict occurs, the server will delay up to this
        amount of time before it begins trying to resolve things less
        amicably, as described in <xref linkend="hot-standby-conflict">.
        Typically, this parameter makes sense only during replication, so
        when performing an archive recovery to recover from data loss a
        setting of 0 is recommended.  The default is 30 seconds.
This parameter can only be set in the <filename>postgresql.conf</>
file or on the server command line.
</para>
</listitem>
</varlistentry>
</variablelist>
</sect2>
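The two standby parameters above can be sketched together in a minimal, illustrative <filename>postgresql.conf</> fragment (values are examples only):

```
# Illustrative standby settings (see parameter descriptions above)
recovery_connections = on   # allow read-only connections during recovery
max_standby_delay = 30      # wait up to 30s before cancelling conflicting queries
```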
 </sect1>
 <sect1 id="runtime-config-query">
...@@ -4161,6 +4218,29 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
      </listitem>
     </varlistentry>
<varlistentry id="guc-vacuum-defer-cleanup-age" xreflabel="vacuum_defer_cleanup_age">
<term><varname>vacuum_defer_cleanup_age</varname> (<type>integer</type>)</term>
<indexterm>
<primary><varname>vacuum_defer_cleanup_age</> configuration parameter</primary>
</indexterm>
<listitem>
<para>
        Specifies the number of transactions by which <command>VACUUM</> and
        <acronym>HOT</> updates will defer cleanup of dead row versions.  The
        default is 0 transactions, meaning that dead row versions will be
        removed as soon as possible.  You may wish to set this to a non-zero
        value when planning or maintaining a <xref linkend="hot-standby">
        configuration; the recommended value is <literal>0</> unless you have
        a clear reason to increase it.  The purpose of the parameter is to
        allow an approximate time delay to be specified before cleanup
        occurs.  However, there is no direct link with any specific time
        delay, so the results will be application- and installation-specific,
        as well as variable over time, depending on the rate of write
        transactions.
</para>
</listitem>
</varlistentry>
    <varlistentry id="guc-bytea-output" xreflabel="bytea_output">
     <term><varname>bytea_output</varname> (<type>enum</type>)</term>
     <indexterm>
...@@ -4689,6 +4769,12 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
        allows.  See <xref linkend="sysvipc"> for information on how to
        adjust those parameters, if necessary.
       </para>
<para>
When running a standby server, you must set this parameter to the
same or higher value than on the master server. Otherwise, queries
will not be allowed in the standby server.
</para>
      </listitem>
     </varlistentry>
...@@ -5546,6 +5632,32 @@ plruby.use_strict = true        # generates error: unknown class name
      </listitem>
     </varlistentry>
<varlistentry id="guc-trace-recovery-messages" xreflabel="trace_recovery_messages">
<term><varname>trace_recovery_messages</varname> (<type>string</type>)</term>
<indexterm>
<primary><varname>trace_recovery_messages</> configuration parameter</primary>
</indexterm>
<listitem>
<para>
        Controls which message levels are written to the server log
        for system modules needed for recovery processing.  This allows
        the user to override the normal setting of
        <varname>log_min_messages</>, but only for specific messages.
        This is intended for use in debugging Hot Standby.
        Valid values are <literal>DEBUG5</>, <literal>DEBUG4</>,
        <literal>DEBUG3</>, <literal>DEBUG2</>, <literal>DEBUG1</>,
        <literal>INFO</>, <literal>NOTICE</>, <literal>WARNING</>,
        <literal>ERROR</>, <literal>LOG</>, <literal>FATAL</>, and
        <literal>PANIC</>.  Each level includes all the levels that
        follow it.  The later the level, the fewer messages are sent
        to the log.  The default is <literal>WARNING</>.  Note that
        <literal>LOG</> has a different rank here than in
        <varname>client_min_messages</>.
        This parameter can only be set in <filename>postgresql.conf</>.
</para>
</listitem>
</varlistentry>
    <varlistentry id="guc-zero-damaged-pages" xreflabel="zero_damaged_pages">
     <term><varname>zero_damaged_pages</varname> (<type>boolean</type>)</term>
     <indexterm>
......
<!-- $PostgreSQL: pgsql/doc/src/sgml/func.sgml,v 1.494 2009/12/19 01:32:31 sriggs Exp $ -->
<chapter id="functions">
 <title>Functions and Operators</title>
...@@ -13132,6 +13132,38 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
    <xref linkend="continuous-archiving">.
   </para>
<indexterm>
<primary>pg_is_in_recovery</primary>
</indexterm>
<para>
The functions shown in <xref
linkend="functions-recovery-info-table"> provide information
about the current status of Hot Standby.
    These functions may be executed both during recovery and in normal running.
</para>
<table id="functions-recovery-info-table">
<title>Recovery Information Functions</title>
<tgroup cols="3">
<thead>
<row><entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry>
</row>
</thead>
<tbody>
<row>
<entry>
<literal><function>pg_is_in_recovery</function>()</literal>
</entry>
<entry><type>bool</type></entry>
<entry>True if recovery is still in progress.
</entry>
</row>
</tbody>
</tgroup>
</table>
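As a usage sketch, the new function can be called from any session; on a standby during recovery it reports true, on a normal primary false:

```sql
-- Check whether this server is currently in recovery (Hot Standby)
SELECT pg_is_in_recovery();
```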
   <para>
    The functions shown in <xref linkend="functions-admin-dbsize"> calculate
    the disk space usage of database objects.
......
<!-- $PostgreSQL: pgsql/doc/src/sgml/ref/checkpoint.sgml,v 1.17 2009/12/19 01:32:31 sriggs Exp $ -->
<refentry id="sql-checkpoint">
 <refmeta>
...@@ -42,6 +42,11 @@ CHECKPOINT
   <xref linkend="wal"> for more information about the WAL system.
  </para>
<para>
If executed during recovery, the <command>CHECKPOINT</command> command
will force a restartpoint rather than writing a new checkpoint.
</para>
  <para>
   Only superusers can call <command>CHECKPOINT</command>.  The command is
   not intended for use during normal operation.
......
...@@ -8,7 +8,7 @@
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * IDENTIFICATION
 *	  $PostgreSQL: pgsql/src/backend/access/gin/ginxlog.c,v 1.20 2009/12/19 01:32:31 sriggs Exp $
 *-------------------------------------------------------------------------
 */
#include "postgres.h"
...@@ -621,6 +621,10 @@ gin_redo(XLogRecPtr lsn, XLogRecord *record)
{
	uint8		info = record->xl_info & ~XLR_INFO_MASK;
/*
* GIN indexes do not require any conflict processing.
*/
	RestoreBkpBlocks(lsn, record, false);

	topCtx = MemoryContextSwitchTo(opCtx);
......
...@@ -8,7 +8,7 @@
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * IDENTIFICATION
 *	  $PostgreSQL: pgsql/src/backend/access/gist/gistxlog.c,v 1.33 2009/12/19 01:32:32 sriggs Exp $
 *-------------------------------------------------------------------------
 */
#include "postgres.h"
...@@ -396,6 +396,12 @@ gist_redo(XLogRecPtr lsn, XLogRecord *record)
	uint8		info = record->xl_info & ~XLR_INFO_MASK;
	MemoryContext oldCxt;
/*
	 * GIST indexes do not require any conflict processing.  NB: If we ever
	 * implement an optimization similar to the one we have in b-tree, and
	 * remove killed tuples outside VACUUM, we'll need to handle that here.
*/
	RestoreBkpBlocks(lsn, record, false);

	oldCxt = MemoryContextSwitchTo(opCtx);
......
...@@ -8,7 +8,7 @@
 *
 *
 * IDENTIFICATION
 *	  $PostgreSQL: pgsql/src/backend/access/heap/heapam.c,v 1.279 2009/12/19 01:32:32 sriggs Exp $
 *
 *
 * INTERFACE ROUTINES
...@@ -59,6 +59,7 @@
#include "storage/lmgr.h"
#include "storage/procarray.h"
#include "storage/smgr.h"
#include "storage/standby.h"
#include "utils/datum.h"
#include "utils/inval.h"
#include "utils/lsyscache.h"
...@@ -248,8 +249,11 @@ heapgetpage(HeapScanDesc scan, BlockNumber page)
	/*
	 * If the all-visible flag indicates that all tuples on the page are
	 * visible to everyone, we can skip the per-tuple visibility tests.
	 * But not in hot standby mode. A tuple that's already visible to all
	 * transactions in the master might still be invisible to a read-only
	 * transaction in the standby.
	 */
	all_visible = PageIsAllVisible(dp) && !snapshot->takenDuringRecovery;

	for (lineoff = FirstOffsetNumber, lpp = PageGetItemId(dp, lineoff);
		 lineoff <= lines;
...@@ -3769,6 +3773,60 @@ heap_restrpos(HeapScanDesc scan)
	}
}
/*
* If 'tuple' contains any XID greater than latestRemovedXid, update
* latestRemovedXid to the greatest one found.
*/
void
HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
TransactionId *latestRemovedXid)
{
TransactionId xmin = HeapTupleHeaderGetXmin(tuple);
TransactionId xmax = HeapTupleHeaderGetXmax(tuple);
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
if (tuple->t_infomask & HEAP_MOVED_OFF ||
tuple->t_infomask & HEAP_MOVED_IN)
{
if (TransactionIdPrecedes(*latestRemovedXid, xvac))
*latestRemovedXid = xvac;
}
if (TransactionIdPrecedes(*latestRemovedXid, xmax))
*latestRemovedXid = xmax;
if (TransactionIdPrecedes(*latestRemovedXid, xmin))
*latestRemovedXid = xmin;
Assert(TransactionIdIsValid(*latestRemovedXid));
}
/*
* Perform XLogInsert to register a heap cleanup info message. These
* messages are sent once per VACUUM and are required because
* of the phasing of removal operations during a lazy VACUUM.
* see comments for vacuum_log_cleanup_info().
*/
XLogRecPtr
log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
{
xl_heap_cleanup_info xlrec;
XLogRecPtr recptr;
XLogRecData rdata;
xlrec.node = rnode;
xlrec.latestRemovedXid = latestRemovedXid;
rdata.data = (char *) &xlrec;
rdata.len = SizeOfHeapCleanupInfo;
rdata.buffer = InvalidBuffer;
rdata.next = NULL;
recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_CLEANUP_INFO, &rdata);
return recptr;
}
/*
 * Perform XLogInsert for a heap-clean operation.  Caller must already
 * have modified the buffer and marked it dirty.
...@@ -3776,13 +3834,17 @@ heap_restrpos(HeapScanDesc scan)
 * Note: prior to Postgres 8.3, the entries in the nowunused[] array were
 * zero-based tuple indexes.  Now they are one-based like other uses
 * of OffsetNumber.
*
* We also include latestRemovedXid, which is the greatest XID present in
* the removed tuples. That allows recovery processing to cancel or wait
* for long standby queries that can still see these tuples.
 */
XLogRecPtr
log_heap_clean(Relation reln, Buffer buffer,
			   OffsetNumber *redirected, int nredirected,
			   OffsetNumber *nowdead, int ndead,
			   OffsetNumber *nowunused, int nunused,
			   TransactionId latestRemovedXid, bool redirect_move)
{
	xl_heap_clean xlrec;
	uint8		info;
...@@ -3794,6 +3856,7 @@ log_heap_clean(Relation reln, Buffer buffer,
	xlrec.node = reln->rd_node;
	xlrec.block = BufferGetBlockNumber(buffer);
	xlrec.latestRemovedXid = latestRemovedXid;
	xlrec.nredirected = nredirected;
	xlrec.ndead = ndead;
...@@ -4067,6 +4130,33 @@ log_newpage(RelFileNode *rnode, ForkNumber forkNum, BlockNumber blkno,
	return recptr;
}
/*
* Handles CLEANUP_INFO
*/
static void
heap_xlog_cleanup_info(XLogRecPtr lsn, XLogRecord *record)
{
xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) XLogRecGetData(record);
if (InHotStandby)
{
VirtualTransactionId *backends;
backends = GetConflictingVirtualXIDs(xlrec->latestRemovedXid,
InvalidOid,
true);
ResolveRecoveryConflictWithVirtualXIDs(backends,
"VACUUM index cleanup",
CONFLICT_MODE_ERROR);
}
	/*
	 * The actual operation is a no-op.  The record type exists to provide a
	 * means for conflict processing to occur before we begin index vacuum
	 * actions; see vacuumlazy.c and also comments in btvacuumpage().
	 */
}
/*
 * Handles CLEAN and CLEAN_MOVE record types
 */
...@@ -4085,12 +4175,31 @@ heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record, bool clean_move)
	int			nunused;
	Size		freespace;
	/*
	 * We're about to remove tuples.  In Hot Standby mode, ensure that there
	 * are no queries running for which the removed tuples are still visible.
	 */
if (InHotStandby)
{
VirtualTransactionId *backends;
backends = GetConflictingVirtualXIDs(xlrec->latestRemovedXid,
InvalidOid,
true);
ResolveRecoveryConflictWithVirtualXIDs(backends,
"VACUUM heap cleanup",
CONFLICT_MODE_ERROR);
}
RestoreBkpBlocks(lsn, record, true);
	if (record->xl_info & XLR_BKP_BLOCK_1)
		return;

	buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block, RBM_NORMAL);
	if (!BufferIsValid(buffer))
		return;
	LockBufferForCleanup(buffer);
	page = (Page) BufferGetPage(buffer);

	if (XLByteLE(lsn, PageGetLSN(page)))
...@@ -4145,12 +4254,40 @@ heap_xlog_freeze(XLogRecPtr lsn, XLogRecord *record)
	Buffer		buffer;
	Page		page;
	/*
	 * In Hot Standby mode, ensure that there are no queries running which
	 * still consider the frozen xids as running.
	 */
if (InHotStandby)
{
VirtualTransactionId *backends;
/*
* XXX: Using cutoff_xid is overly conservative. Even if cutoff_xid
* is recent enough to conflict with a backend, the actual values
* being frozen might not be. With a typical vacuum_freeze_min_age
* setting in the ballpark of millions of transactions, it won't make
* a difference, but it might if you run a manual VACUUM FREEZE.
* Typically the cutoff is much earlier than any recently deceased
* tuple versions removed by this vacuum, so don't worry too much.
*/
backends = GetConflictingVirtualXIDs(cutoff_xid,
InvalidOid,
true);
ResolveRecoveryConflictWithVirtualXIDs(backends,
"VACUUM heap freeze",
CONFLICT_MODE_ERROR);
}
RestoreBkpBlocks(lsn, record, false);
	if (record->xl_info & XLR_BKP_BLOCK_1)
		return;

	buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block, RBM_NORMAL);
	if (!BufferIsValid(buffer))
		return;
	LockBufferForCleanup(buffer);
	page = (Page) BufferGetPage(buffer);

	if (XLByteLE(lsn, PageGetLSN(page)))
...@@ -4740,6 +4877,11 @@ heap_redo(XLogRecPtr lsn, XLogRecord *record)
{
	uint8		info = record->xl_info & ~XLR_INFO_MASK;
/*
* These operations don't overwrite MVCC data so no conflict
* processing is required. The ones in heap2 rmgr do.
*/
	RestoreBkpBlocks(lsn, record, false);

	switch (info & XLOG_HEAP_OPMASK)
...@@ -4778,20 +4920,25 @@ heap2_redo(XLogRecPtr lsn, XLogRecord *record)
{
	uint8		info = record->xl_info & ~XLR_INFO_MASK;
/*
* Note that RestoreBkpBlocks() is called after conflict processing
* within each record type handling function.
*/
	switch (info & XLOG_HEAP_OPMASK)
	{
		case XLOG_HEAP2_FREEZE:
			heap_xlog_freeze(lsn, record);
			break;
		case XLOG_HEAP2_CLEAN:
			heap_xlog_clean(lsn, record, false);
			break;
		case XLOG_HEAP2_CLEAN_MOVE:
			heap_xlog_clean(lsn, record, true);
			break;
case XLOG_HEAP2_CLEANUP_INFO:
heap_xlog_cleanup_info(lsn, record);
break;
		default:
			elog(PANIC, "heap2_redo: unknown op code %u", info);
	}
...@@ -4921,17 +5068,26 @@ heap2_desc(StringInfo buf, uint8 xl_info, char *rec)
	{
		xl_heap_clean *xlrec = (xl_heap_clean *) rec;
		appendStringInfo(buf, "clean: rel %u/%u/%u; blk %u remxid %u",
						 xlrec->node.spcNode, xlrec->node.dbNode,
						 xlrec->node.relNode, xlrec->block,
						 xlrec->latestRemovedXid);
	}
	else if (info == XLOG_HEAP2_CLEAN_MOVE)
	{
		xl_heap_clean *xlrec = (xl_heap_clean *) rec;

		appendStringInfo(buf, "clean_move: rel %u/%u/%u; blk %u remxid %u",
						 xlrec->node.spcNode, xlrec->node.dbNode,
						 xlrec->node.relNode, xlrec->block,
						 xlrec->latestRemovedXid);
}
else if (info == XLOG_HEAP2_CLEANUP_INFO)
{
xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) rec;
appendStringInfo(buf, "cleanup info: remxid %u",
xlrec->latestRemovedXid);
	}
	else
		appendStringInfo(buf, "UNKNOWN");
......
...@@ -8,7 +8,7 @@
 *
 *
 * IDENTIFICATION
 *	  $PostgreSQL: pgsql/src/backend/access/heap/pruneheap.c,v 1.19 2009/12/19 01:32:32 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
...@@ -30,6 +30,7 @@
typedef struct
{
	TransactionId new_prune_xid;	/* new prune hint value for page */
	TransactionId latestRemovedXid;		/* latest xid to be removed by this prune */
	int			nredirected;	/* numbers of entries in arrays below */
	int			ndead;
	int			nunused;
...@@ -84,6 +85,14 @@ heap_page_prune_opt(Relation relation, Buffer buffer, TransactionId OldestXmin)
	if (!PageIsPrunable(page, OldestXmin))
		return;
/*
* We can't write WAL in recovery mode, so there's no point trying to
* clean the page. The master will likely issue a cleaning WAL record
* soon anyway, so this is no particular loss.
*/
if (RecoveryInProgress())
return;
	/*
	 * We prune when a previous UPDATE failed to find enough space on the page
	 * for a new tuple version, or when free space falls below the relation's
...@@ -176,6 +185,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
	 * of our working state.
	 */
	prstate.new_prune_xid = InvalidTransactionId;
	prstate.latestRemovedXid = InvalidTransactionId;
	prstate.nredirected = prstate.ndead = prstate.nunused = 0;
	memset(prstate.marked, 0, sizeof(prstate.marked));
...@@ -257,7 +267,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
								prstate.redirected, prstate.nredirected,
								prstate.nowdead, prstate.ndead,
								prstate.nowunused, prstate.nunused,
								prstate.latestRemovedXid, redirect_move);
		PageSetLSN(BufferGetPage(buffer), recptr);
		PageSetTLI(BufferGetPage(buffer), ThisTimeLineID);
...@@ -395,6 +405,8 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
			 == HEAPTUPLE_DEAD && !HeapTupleHeaderIsHotUpdated(htup))
		{
			heap_prune_record_unused(prstate, rootoffnum);
			HeapTupleHeaderAdvanceLatestRemovedXid(htup,
												   &prstate->latestRemovedXid);
			ndeleted++;
		}
...@@ -520,7 +532,11 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
	 * find another DEAD tuple is a fairly unusual corner case.)
	 */
	if (tupdead)
{
		latestdead = offnum;
		HeapTupleHeaderAdvanceLatestRemovedXid(htup,
											   &prstate->latestRemovedXid);
	}
	else if (!recent_dead)
		break;
......
...@@ -8,7 +8,7 @@
 *
 *
 * IDENTIFICATION
 *	  $PostgreSQL: pgsql/src/backend/access/index/genam.c,v 1.78 2009/12/19 01:32:32 sriggs Exp $
 *
 * NOTES
 *	  many of the old access method routines have been turned into
...@@ -91,8 +91,19 @@ RelationGetIndexScan(Relation indexRelation,
	else
		scan->keyData = NULL;
/*
* During recovery we ignore killed tuples and don't bother to kill them
* either. We do this because the xmin on the primary node could easily
* be later than the xmin on the standby node, so that what the primary
* thinks is killed is supposed to be visible on standby. So for correct
* MVCC for queries during recovery we must ignore these hints and check
* all tuples. Do *not* set ignore_killed_tuples to true when running
* in a transaction that was started during recovery.
* xactStartedInRecovery should not be altered by index AMs.
*/
	scan->kill_prior_tuple = false;
	scan->xactStartedInRecovery = TransactionStartedDuringRecovery();
	scan->ignore_killed_tuples = !scan->xactStartedInRecovery;
	scan->opaque = NULL;
......
...@@ -8,7 +8,7 @@
 *
 *
 * IDENTIFICATION
 *	  $PostgreSQL: pgsql/src/backend/access/index/indexam.c,v 1.116 2009/12/19 01:32:32 sriggs Exp $
 *
 * INTERFACE ROUTINES
 *		index_open		- open an index relation by relation OID
...@@ -455,8 +455,11 @@ index_getnext(IndexScanDesc scan, ScanDirection direction) ...@@ -455,8 +455,11 @@ index_getnext(IndexScanDesc scan, ScanDirection direction)
/* /*
* If we scanned a whole HOT chain and found only dead tuples, * If we scanned a whole HOT chain and found only dead tuples,
* tell index AM to kill its entry for that TID. * tell index AM to kill its entry for that TID. We do not do
* this when in recovery because it may violate MVCC to do so.
 * See the comments in RelationGetIndexScan().
*/ */
if (!scan->xactStartedInRecovery)
scan->kill_prior_tuple = scan->xs_hot_dead; scan->kill_prior_tuple = scan->xs_hot_dead;
/* /*
......
$PostgreSQL: pgsql/src/backend/access/nbtree/README,v 1.20 2008/03/21 13:23:27 momjian Exp $ $PostgreSQL: pgsql/src/backend/access/nbtree/README,v 1.21 2009/12/19 01:32:32 sriggs Exp $
Btree Indexing Btree Indexing
============== ==============
...@@ -401,6 +401,33 @@ of the WAL entry.) If the parent page becomes half-dead but is not ...@@ -401,6 +401,33 @@ of the WAL entry.) If the parent page becomes half-dead but is not
immediately deleted due to a subsequent crash, there is no loss of immediately deleted due to a subsequent crash, there is no loss of
consistency, and the empty page will be picked up by the next VACUUM. consistency, and the empty page will be picked up by the next VACUUM.
Scans during Recovery
---------------------
The btree index type can be safely used during recovery. During recovery
we have at most one writer and potentially many readers. In that
situation the locking requirements can be relaxed and we do not need
double locking during block splits. Each WAL record makes changes to a
single level of the btree using the correct locking sequence and so
is safe for concurrent readers. Some readers may observe a block split
in progress as they descend the tree, but they will simply move right
onto the correct page.
During recovery all index scans start with ignore_killed_tuples = false
and we never set kill_prior_tuple. We do this because the oldest xmin
on the standby server can be older than the oldest xmin on the master
server, which means tuples can be marked as killed even when they are
still visible on the standby. We don't WAL-log tuple killed bits, but
they can still appear on the standby because of full-page writes. So
we must always ignore them on the standby, and that means it's not worth
setting them either.
Note that we talk about scans that are started during recovery. We go to
a little trouble to allow a scan to start during recovery and end during
normal running after recovery has completed. This is a key capability
because it allows running applications to continue while the standby
changes state into a normally running server.
Other Things That Are Handy to Know Other Things That Are Handy to Know
----------------------------------- -----------------------------------
......
...@@ -8,7 +8,7 @@ ...@@ -8,7 +8,7 @@
* *
* *
* IDENTIFICATION * IDENTIFICATION
* $PostgreSQL: pgsql/src/backend/access/nbtree/nbtinsert.c,v 1.174 2009/10/02 21:14:04 tgl Exp $ * $PostgreSQL: pgsql/src/backend/access/nbtree/nbtinsert.c,v 1.175 2009/12/19 01:32:32 sriggs Exp $
* *
*------------------------------------------------------------------------- *-------------------------------------------------------------------------
*/ */
...@@ -2025,7 +2025,7 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer) ...@@ -2025,7 +2025,7 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer)
} }
if (ndeletable > 0) if (ndeletable > 0)
_bt_delitems(rel, buffer, deletable, ndeletable); _bt_delitems(rel, buffer, deletable, ndeletable, false, 0);
/* /*
* Note: if we didn't find any LP_DEAD items, then the page's * Note: if we didn't find any LP_DEAD items, then the page's
......
...@@ -9,7 +9,7 @@ ...@@ -9,7 +9,7 @@
* *
* *
* IDENTIFICATION * IDENTIFICATION
* $PostgreSQL: pgsql/src/backend/access/nbtree/nbtpage.c,v 1.113 2009/05/05 19:02:22 tgl Exp $ * $PostgreSQL: pgsql/src/backend/access/nbtree/nbtpage.c,v 1.114 2009/12/19 01:32:33 sriggs Exp $
* *
* NOTES * NOTES
* Postgres btree pages look like ordinary relation pages. The opaque * Postgres btree pages look like ordinary relation pages. The opaque
...@@ -653,18 +653,32 @@ _bt_page_recyclable(Page page) ...@@ -653,18 +653,32 @@ _bt_page_recyclable(Page page)
* *
* This routine assumes that the caller has pinned and locked the buffer. * This routine assumes that the caller has pinned and locked the buffer.
* Also, the given itemnos *must* appear in increasing order in the array. * Also, the given itemnos *must* appear in increasing order in the array.
*
* We record VACUUMs and b-tree deletes differently in WAL. InHotStandby
* we need to be able to pin all of the blocks in the btree in physical
* order when replaying the effects of a VACUUM, just as we do for the
 * original VACUUM itself. lastBlockVacuumed allows us to tell which
 * intermediate blocks VACUUM left entirely unchanged, but which must
 * still be scanned during replay. We always write a WAL record
* for the last block in the index, whether or not it contained any items
* to be removed. This allows us to scan right up to end of index to
* ensure correct locking.
*/ */
void void
_bt_delitems(Relation rel, Buffer buf, _bt_delitems(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems) OffsetNumber *itemnos, int nitems, bool isVacuum,
BlockNumber lastBlockVacuumed)
{ {
Page page = BufferGetPage(buf); Page page = BufferGetPage(buf);
BTPageOpaque opaque; BTPageOpaque opaque;
Assert(isVacuum || lastBlockVacuumed == 0);
/* No ereport(ERROR) until changes are logged */ /* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION(); START_CRIT_SECTION();
/* Fix the page */ /* Fix the page */
if (nitems > 0)
PageIndexMultiDelete(page, itemnos, nitems); PageIndexMultiDelete(page, itemnos, nitems);
/* /*
...@@ -688,15 +702,36 @@ _bt_delitems(Relation rel, Buffer buf, ...@@ -688,15 +702,36 @@ _bt_delitems(Relation rel, Buffer buf,
/* XLOG stuff */ /* XLOG stuff */
if (!rel->rd_istemp) if (!rel->rd_istemp)
{ {
xl_btree_delete xlrec;
XLogRecPtr recptr; XLogRecPtr recptr;
XLogRecData rdata[2]; XLogRecData rdata[2];
xlrec.node = rel->rd_node; if (isVacuum)
xlrec.block = BufferGetBlockNumber(buf); {
xl_btree_vacuum xlrec_vacuum;
xlrec_vacuum.node = rel->rd_node;
xlrec_vacuum.block = BufferGetBlockNumber(buf);
rdata[0].data = (char *) &xlrec; xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
rdata[0].data = (char *) &xlrec_vacuum;
rdata[0].len = SizeOfBtreeVacuum;
}
else
{
xl_btree_delete xlrec_delete;
xlrec_delete.node = rel->rd_node;
xlrec_delete.block = BufferGetBlockNumber(buf);
/*
* XXX: We would like to set an accurate latestRemovedXid, but
* there is no easy way of obtaining a useful value. So we punt
* and store InvalidTransactionId, which forces the standby to
* wait for/cancel all currently running transactions.
*/
xlrec_delete.latestRemovedXid = InvalidTransactionId;
rdata[0].data = (char *) &xlrec_delete;
rdata[0].len = SizeOfBtreeDelete; rdata[0].len = SizeOfBtreeDelete;
}
rdata[0].buffer = InvalidBuffer; rdata[0].buffer = InvalidBuffer;
rdata[0].next = &(rdata[1]); rdata[0].next = &(rdata[1]);
...@@ -719,6 +754,9 @@ _bt_delitems(Relation rel, Buffer buf, ...@@ -719,6 +754,9 @@ _bt_delitems(Relation rel, Buffer buf,
rdata[1].buffer_std = true; rdata[1].buffer_std = true;
rdata[1].next = NULL; rdata[1].next = NULL;
if (isVacuum)
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM, rdata);
else
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DELETE, rdata); recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DELETE, rdata);
PageSetLSN(page, recptr); PageSetLSN(page, recptr);
......
...@@ -12,7 +12,7 @@ ...@@ -12,7 +12,7 @@
* Portions Copyright (c) 1994, Regents of the University of California * Portions Copyright (c) 1994, Regents of the University of California
* *
* IDENTIFICATION * IDENTIFICATION
* $PostgreSQL: pgsql/src/backend/access/nbtree/nbtree.c,v 1.172 2009/07/29 20:56:18 tgl Exp $ * $PostgreSQL: pgsql/src/backend/access/nbtree/nbtree.c,v 1.173 2009/12/19 01:32:33 sriggs Exp $
* *
*------------------------------------------------------------------------- *-------------------------------------------------------------------------
*/ */
...@@ -57,7 +57,8 @@ typedef struct ...@@ -57,7 +57,8 @@ typedef struct
IndexBulkDeleteCallback callback; IndexBulkDeleteCallback callback;
void *callback_state; void *callback_state;
BTCycleId cycleid; BTCycleId cycleid;
BlockNumber lastUsedPage; BlockNumber lastBlockVacuumed; /* last blkno reached by Vacuum scan */
BlockNumber lastUsedPage; /* blkno of last non-recyclable page */
BlockNumber totFreePages; /* true total # of free pages */ BlockNumber totFreePages; /* true total # of free pages */
MemoryContext pagedelcontext; MemoryContext pagedelcontext;
} BTVacState; } BTVacState;
...@@ -629,6 +630,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats, ...@@ -629,6 +630,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
vstate.callback = callback; vstate.callback = callback;
vstate.callback_state = callback_state; vstate.callback_state = callback_state;
vstate.cycleid = cycleid; vstate.cycleid = cycleid;
vstate.lastBlockVacuumed = BTREE_METAPAGE; /* Initialise at first block */
vstate.lastUsedPage = BTREE_METAPAGE; vstate.lastUsedPage = BTREE_METAPAGE;
vstate.totFreePages = 0; vstate.totFreePages = 0;
...@@ -705,6 +707,32 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats, ...@@ -705,6 +707,32 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
num_pages = new_pages; num_pages = new_pages;
} }
/*
* InHotStandby we need to scan right up to the end of the index for
* correct locking, so we may need to write a WAL record for the final
* block in the index if it was not vacuumed. It's possible that VACUUMing
* has actually removed zeroed pages at the end of the index so we need to
 * take care to issue the record for the last actual block and not for the
* last block that was scanned. Ignore empty indexes.
*/
if (XLogStandbyInfoActive() &&
num_pages > 1 && vstate.lastBlockVacuumed < (num_pages - 1))
{
Buffer buf;
/*
* We can't use _bt_getbuf() here because it always applies
* _bt_checkpage(), which will barf on an all-zero page. We want to
* recycle all-zero pages, not fail. Also, we want to use a nondefault
* buffer access strategy.
*/
buf = ReadBufferExtended(rel, MAIN_FORKNUM, num_pages - 1, RBM_NORMAL,
info->strategy);
LockBufferForCleanup(buf);
_bt_delitems(rel, buf, NULL, 0, true, vstate.lastBlockVacuumed);
_bt_relbuf(rel, buf);
}
MemoryContextDelete(vstate.pagedelcontext); MemoryContextDelete(vstate.pagedelcontext);
/* update statistics */ /* update statistics */
...@@ -847,6 +875,26 @@ restart: ...@@ -847,6 +875,26 @@ restart:
itup = (IndexTuple) PageGetItem(page, itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum)); PageGetItemId(page, offnum));
htup = &(itup->t_tid); htup = &(itup->t_tid);
/*
* During Hot Standby we currently assume that XLOG_BTREE_VACUUM
* records do not produce conflicts. That is only true as long
* as the callback function depends only upon whether the index
* tuple refers to heap tuples removed in the initial heap scan.
* When vacuum starts it derives a value of OldestXmin. Backends
* taking later snapshots could have a RecentGlobalXmin with a
* later xid than the vacuum's OldestXmin, so it is possible that
* row versions deleted after OldestXmin could be marked as killed
* by other backends. The callback function *could* look at the
* index tuple state in isolation and decide to delete the index
* tuple, though currently it does not. If it ever did, we would
* need to reconsider whether XLOG_BTREE_VACUUM records should
* cause conflicts. If they did cause conflicts they would be
* fairly harsh conflicts, since we haven't yet worked out a way
* to pass a useful value for latestRemovedXid on the
* XLOG_BTREE_VACUUM records. This applies to *any* type of index
* that marks index tuples as killed.
*/
if (callback(htup, callback_state)) if (callback(htup, callback_state))
deletable[ndeletable++] = offnum; deletable[ndeletable++] = offnum;
} }
...@@ -858,7 +906,19 @@ restart: ...@@ -858,7 +906,19 @@ restart:
*/ */
if (ndeletable > 0) if (ndeletable > 0)
{ {
_bt_delitems(rel, buf, deletable, ndeletable); BlockNumber lastBlockVacuumed = BufferGetBlockNumber(buf);
_bt_delitems(rel, buf, deletable, ndeletable, true, vstate->lastBlockVacuumed);
/*
 * Keep track of the block number of the last block vacuumed, so
 * that WAL replay can scan the intervening blocks as well. This
 * provides concurrency protection and allows btrees to be used
 * while in recovery.
*/
if (lastBlockVacuumed > vstate->lastBlockVacuumed)
vstate->lastBlockVacuumed = lastBlockVacuumed;
stats->tuples_removed += ndeletable; stats->tuples_removed += ndeletable;
/* must recompute maxoff */ /* must recompute maxoff */
maxoff = PageGetMaxOffsetNumber(page); maxoff = PageGetMaxOffsetNumber(page);
......
...@@ -8,7 +8,7 @@ ...@@ -8,7 +8,7 @@
* Portions Copyright (c) 1994, Regents of the University of California * Portions Copyright (c) 1994, Regents of the University of California
* *
* IDENTIFICATION * IDENTIFICATION
* $PostgreSQL: pgsql/src/backend/access/nbtree/nbtxlog.c,v 1.55 2009/06/11 14:48:54 momjian Exp $ * $PostgreSQL: pgsql/src/backend/access/nbtree/nbtxlog.c,v 1.56 2009/12/19 01:32:33 sriggs Exp $
* *
*------------------------------------------------------------------------- *-------------------------------------------------------------------------
*/ */
...@@ -16,7 +16,11 @@ ...@@ -16,7 +16,11 @@
#include "access/nbtree.h" #include "access/nbtree.h"
#include "access/transam.h" #include "access/transam.h"
#include "access/xact.h"
#include "storage/bufmgr.h" #include "storage/bufmgr.h"
#include "storage/procarray.h"
#include "storage/standby.h"
#include "miscadmin.h"
/* /*
* We must keep track of expected insertions due to page splits, and apply * We must keep track of expected insertions due to page splits, and apply
...@@ -458,6 +462,97 @@ btree_xlog_split(bool onleft, bool isroot, ...@@ -458,6 +462,97 @@ btree_xlog_split(bool onleft, bool isroot,
xlrec->leftsib, xlrec->rightsib, isroot); xlrec->leftsib, xlrec->rightsib, isroot);
} }
static void
btree_xlog_vacuum(XLogRecPtr lsn, XLogRecord *record)
{
xl_btree_vacuum *xlrec;
Buffer buffer;
Page page;
BTPageOpaque opaque;
xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
/*
 * If queries might be active then we need to ensure that every block
 * between lastBlockVacuumed and the current block, if any, is unpinned.
* This ensures that every block in the index is touched during VACUUM as
* required to ensure scans work correctly.
*/
if (standbyState == STANDBY_SNAPSHOT_READY &&
(xlrec->lastBlockVacuumed + 1) != xlrec->block)
{
BlockNumber blkno = xlrec->lastBlockVacuumed + 1;
for (; blkno < xlrec->block; blkno++)
{
/*
* XXX we don't actually need to read the block, we
* just need to confirm it is unpinned. If we had a special call
* into the buffer manager we could optimise this so that
* if the block is not in shared_buffers we confirm it as unpinned.
*
* Another simple optimization would be to check if there's any
* backends running; if not, we could just skip this.
*/
buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, blkno, RBM_NORMAL);
if (BufferIsValid(buffer))
{
LockBufferForCleanup(buffer);
UnlockReleaseBuffer(buffer);
}
}
}
/*
* If the block was restored from a full page image, nothing more to do.
* The RestoreBkpBlocks() call already pinned and took cleanup lock on
* it. XXX: Perhaps we should call RestoreBkpBlocks() *after* the loop
* above, to make the disk access more sequential.
*/
if (record->xl_info & XLR_BKP_BLOCK_1)
return;
/*
* Like in btvacuumpage(), we need to take a cleanup lock on every leaf
* page. See nbtree/README for details.
*/
buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block, RBM_NORMAL);
if (!BufferIsValid(buffer))
return;
LockBufferForCleanup(buffer);
page = (Page) BufferGetPage(buffer);
if (XLByteLE(lsn, PageGetLSN(page)))
{
UnlockReleaseBuffer(buffer);
return;
}
if (record->xl_len > SizeOfBtreeVacuum)
{
OffsetNumber *unused;
OffsetNumber *unend;
unused = (OffsetNumber *) ((char *) xlrec + SizeOfBtreeVacuum);
unend = (OffsetNumber *) ((char *) xlrec + record->xl_len);
if ((unend - unused) > 0)
PageIndexMultiDelete(page, unused, unend - unused);
}
/*
* Mark the page as not containing any LP_DEAD items --- see comments in
* _bt_delitems().
*/
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
opaque->btpo_flags &= ~BTP_HAS_GARBAGE;
PageSetLSN(page, lsn);
PageSetTLI(page, ThisTimeLineID);
MarkBufferDirty(buffer);
UnlockReleaseBuffer(buffer);
}
static void static void
btree_xlog_delete(XLogRecPtr lsn, XLogRecord *record) btree_xlog_delete(XLogRecPtr lsn, XLogRecord *record)
{ {
...@@ -470,6 +565,11 @@ btree_xlog_delete(XLogRecPtr lsn, XLogRecord *record) ...@@ -470,6 +565,11 @@ btree_xlog_delete(XLogRecPtr lsn, XLogRecord *record)
return; return;
xlrec = (xl_btree_delete *) XLogRecGetData(record); xlrec = (xl_btree_delete *) XLogRecGetData(record);
/*
* We don't need to take a cleanup lock to apply these changes.
* See nbtree/README for details.
*/
buffer = XLogReadBuffer(xlrec->node, xlrec->block, false); buffer = XLogReadBuffer(xlrec->node, xlrec->block, false);
if (!BufferIsValid(buffer)) if (!BufferIsValid(buffer))
return; return;
...@@ -714,7 +814,43 @@ btree_redo(XLogRecPtr lsn, XLogRecord *record) ...@@ -714,7 +814,43 @@ btree_redo(XLogRecPtr lsn, XLogRecord *record)
{ {
uint8 info = record->xl_info & ~XLR_INFO_MASK; uint8 info = record->xl_info & ~XLR_INFO_MASK;
RestoreBkpBlocks(lsn, record, false); /*
* Btree delete records can conflict with standby queries. You might
* think that vacuum records would conflict as well, but we've handled
* that already. XLOG_HEAP2_CLEANUP_INFO records provide the highest xid
* cleaned by the vacuum of the heap and so we can resolve any conflicts
	 * just once when that record arrives. After that we know that no conflicts
* exist from individual btree vacuum records on that index.
*/
if (InHotStandby)
{
if (info == XLOG_BTREE_DELETE)
{
xl_btree_delete *xlrec = (xl_btree_delete *) XLogRecGetData(record);
VirtualTransactionId *backends;
/*
* XXX Currently we put everybody on death row, because
* currently _bt_delitems() supplies InvalidTransactionId.
* This can be fairly painful, so providing a better value
* here is worth some thought and possibly some effort to
* improve.
*/
backends = GetConflictingVirtualXIDs(xlrec->latestRemovedXid,
InvalidOid,
true);
ResolveRecoveryConflictWithVirtualXIDs(backends,
"b-tree delete",
CONFLICT_MODE_ERROR);
}
}
/*
	 * Vacuum needs to pin and take a cleanup lock on every leaf page;
	 * a regular exclusive lock is enough for all other purposes.
*/
RestoreBkpBlocks(lsn, record, (info == XLOG_BTREE_VACUUM));
switch (info) switch (info)
{ {
...@@ -739,6 +875,9 @@ btree_redo(XLogRecPtr lsn, XLogRecord *record) ...@@ -739,6 +875,9 @@ btree_redo(XLogRecPtr lsn, XLogRecord *record)
case XLOG_BTREE_SPLIT_R_ROOT: case XLOG_BTREE_SPLIT_R_ROOT:
btree_xlog_split(false, true, lsn, record); btree_xlog_split(false, true, lsn, record);
break; break;
case XLOG_BTREE_VACUUM:
btree_xlog_vacuum(lsn, record);
break;
case XLOG_BTREE_DELETE: case XLOG_BTREE_DELETE:
btree_xlog_delete(lsn, record); btree_xlog_delete(lsn, record);
break; break;
...@@ -843,13 +982,24 @@ btree_desc(StringInfo buf, uint8 xl_info, char *rec) ...@@ -843,13 +982,24 @@ btree_desc(StringInfo buf, uint8 xl_info, char *rec)
xlrec->level, xlrec->firstright); xlrec->level, xlrec->firstright);
break; break;
} }
case XLOG_BTREE_VACUUM:
{
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
appendStringInfo(buf, "vacuum: rel %u/%u/%u; blk %u, lastBlockVacuumed %u",
xlrec->node.spcNode, xlrec->node.dbNode,
xlrec->node.relNode, xlrec->block,
xlrec->lastBlockVacuumed);
break;
}
case XLOG_BTREE_DELETE: case XLOG_BTREE_DELETE:
{ {
xl_btree_delete *xlrec = (xl_btree_delete *) rec; xl_btree_delete *xlrec = (xl_btree_delete *) rec;
appendStringInfo(buf, "delete: rel %u/%u/%u; blk %u", appendStringInfo(buf, "delete: rel %u/%u/%u; blk %u, latestRemovedXid %u",
xlrec->node.spcNode, xlrec->node.dbNode, xlrec->node.spcNode, xlrec->node.dbNode,
xlrec->node.relNode, xlrec->block); xlrec->node.relNode, xlrec->block,
xlrec->latestRemovedXid);
break; break;
} }
case XLOG_BTREE_DELETE_PAGE: case XLOG_BTREE_DELETE_PAGE:
......
$PostgreSQL: pgsql/src/backend/access/transam/README,v 1.12 2008/10/20 19:18:18 alvherre Exp $ $PostgreSQL: pgsql/src/backend/access/transam/README,v 1.13 2009/12/19 01:32:33 sriggs Exp $
The Transaction System The Transaction System
====================== ======================
...@@ -649,3 +649,34 @@ fsync it down to disk without any sort of interlock, as soon as it finishes ...@@ -649,3 +649,34 @@ fsync it down to disk without any sort of interlock, as soon as it finishes
the bulk update. However, all these paths are designed to write data that the bulk update. However, all these paths are designed to write data that
no other transaction can see until after T1 commits. The situation is thus no other transaction can see until after T1 commits. The situation is thus
not different from ordinary WAL-logged updates. not different from ordinary WAL-logged updates.
Transaction Emulation during Recovery
-------------------------------------
During Recovery we replay transaction changes in the order they occurred.
As part of this replay we emulate some transactional behaviour, so that
read only backends can take MVCC snapshots. We do this by maintaining a
list of XIDs belonging to transactions that are being replayed, so that
each transaction that has recorded WAL records for database writes exists
in the array until it commits. Further details are given in comments in
procarray.c.
Many actions write no WAL records at all, for example read only transactions.
These have no effect on MVCC in recovery and we can pretend they never
occurred at all. Subtransaction commit does not write a WAL record either
and has very little effect, since lock waiters need to wait for the
parent transaction to complete.
Not all transactional behaviour is emulated, for example we do not insert
a transaction entry into the lock table, nor do we maintain the transaction
stack in memory. Clog entries are made normally. Multixact is not maintained
because its purpose is to record tuple-level locks that an application has
requested to prevent write locks. Since write locks cannot be obtained at all
during recovery, there is never any conflict and so there is no reason to
update multixact.
Subtrans is maintained during recovery but the details of the transaction
tree are ignored and all subtransactions reference the top-level TransactionId
directly. Since commit is atomic this provides correct lock wait behaviour
yet simplifies emulation of subtransactions considerably.
Further details on locking mechanics in recovery are given in comments
with the Lock rmgr code.
...@@ -26,7 +26,7 @@ ...@@ -26,7 +26,7 @@
* Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California * Portions Copyright (c) 1994, Regents of the University of California
* *
* $PostgreSQL: pgsql/src/backend/access/transam/clog.c,v 1.53 2009/06/11 14:48:54 momjian Exp $ * $PostgreSQL: pgsql/src/backend/access/transam/clog.c,v 1.54 2009/12/19 01:32:33 sriggs Exp $
* *
*------------------------------------------------------------------------- *-------------------------------------------------------------------------
*/ */
...@@ -574,7 +574,7 @@ ExtendCLOG(TransactionId newestXact) ...@@ -574,7 +574,7 @@ ExtendCLOG(TransactionId newestXact)
LWLockAcquire(CLogControlLock, LW_EXCLUSIVE); LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
/* Zero the page and make an XLOG entry about it */ /* Zero the page and make an XLOG entry about it */
ZeroCLOGPage(pageno, true); ZeroCLOGPage(pageno, !InRecovery);
LWLockRelease(CLogControlLock); LWLockRelease(CLogControlLock);
} }
......
...@@ -42,7 +42,7 @@ ...@@ -42,7 +42,7 @@
* Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California * Portions Copyright (c) 1994, Regents of the University of California
* *
* $PostgreSQL: pgsql/src/backend/access/transam/multixact.c,v 1.32 2009/11/23 09:58:36 heikki Exp $ * $PostgreSQL: pgsql/src/backend/access/transam/multixact.c,v 1.33 2009/12/19 01:32:33 sriggs Exp $
* *
*------------------------------------------------------------------------- *-------------------------------------------------------------------------
*/ */
...@@ -59,6 +59,7 @@ ...@@ -59,6 +59,7 @@
#include "storage/backendid.h" #include "storage/backendid.h"
#include "storage/lmgr.h" #include "storage/lmgr.h"
#include "storage/procarray.h" #include "storage/procarray.h"
#include "utils/builtins.h"
#include "utils/memutils.h" #include "utils/memutils.h"
...@@ -220,7 +221,6 @@ static MultiXactId GetNewMultiXactId(int nxids, MultiXactOffset *offset); ...@@ -220,7 +221,6 @@ static MultiXactId GetNewMultiXactId(int nxids, MultiXactOffset *offset);
static MultiXactId mXactCacheGetBySet(int nxids, TransactionId *xids); static MultiXactId mXactCacheGetBySet(int nxids, TransactionId *xids);
static int mXactCacheGetById(MultiXactId multi, TransactionId **xids); static int mXactCacheGetById(MultiXactId multi, TransactionId **xids);
static void mXactCachePut(MultiXactId multi, int nxids, TransactionId *xids); static void mXactCachePut(MultiXactId multi, int nxids, TransactionId *xids);
static int xidComparator(const void *arg1, const void *arg2);
#ifdef MULTIXACT_DEBUG #ifdef MULTIXACT_DEBUG
static char *mxid_to_string(MultiXactId multi, int nxids, TransactionId *xids); static char *mxid_to_string(MultiXactId multi, int nxids, TransactionId *xids);
...@@ -1221,27 +1221,6 @@ mXactCachePut(MultiXactId multi, int nxids, TransactionId *xids) ...@@ -1221,27 +1221,6 @@ mXactCachePut(MultiXactId multi, int nxids, TransactionId *xids)
MXactCache = entry; MXactCache = entry;
} }
/*
* xidComparator
* qsort comparison function for XIDs
*
* We don't need to use wraparound comparison for XIDs, and indeed must
* not do so since that does not respect the triangle inequality! Any
* old sort order will do.
*/
static int
xidComparator(const void *arg1, const void *arg2)
{
TransactionId xid1 = *(const TransactionId *) arg1;
TransactionId xid2 = *(const TransactionId *) arg2;
if (xid1 > xid2)
return 1;
if (xid1 < xid2)
return -1;
return 0;
}
#ifdef MULTIXACT_DEBUG #ifdef MULTIXACT_DEBUG
static char * static char *
mxid_to_string(MultiXactId multi, int nxids, TransactionId *xids) mxid_to_string(MultiXactId multi, int nxids, TransactionId *xids)
...@@ -2051,11 +2030,18 @@ multixact_redo(XLogRecPtr lsn, XLogRecord *record) ...@@ -2051,11 +2030,18 @@ multixact_redo(XLogRecPtr lsn, XLogRecord *record)
if (TransactionIdPrecedes(max_xid, xids[i])) if (TransactionIdPrecedes(max_xid, xids[i]))
max_xid = xids[i]; max_xid = xids[i];
} }
	/*
	 * We don't expect anyone else to modify nextXid, hence the startup
	 * process doesn't need to hold a lock while checking this. We still
	 * acquire the lock to modify it, though.
	 */
if (TransactionIdFollowsOrEquals(max_xid, if (TransactionIdFollowsOrEquals(max_xid,
ShmemVariableCache->nextXid)) ShmemVariableCache->nextXid))
{ {
LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
ShmemVariableCache->nextXid = max_xid; ShmemVariableCache->nextXid = max_xid;
TransactionIdAdvance(ShmemVariableCache->nextXid); TransactionIdAdvance(ShmemVariableCache->nextXid);
LWLockRelease(XidGenLock);
} }
} }
else else
......
...@@ -79,3 +79,10 @@ ...@@ -79,3 +79,10 @@
# #
# #
#--------------------------------------------------------------------------- #---------------------------------------------------------------------------
# HOT STANDBY PARAMETERS
#---------------------------------------------------------------------------
#
# If you want to enable read-only connections during recovery, enable
# recovery_connections in postgresql.conf
#
#---------------------------------------------------------------------------
...@@ -3,7 +3,7 @@ ...@@ -3,7 +3,7 @@
* *
* Resource managers definition * Resource managers definition
* *
* $PostgreSQL: pgsql/src/backend/access/transam/rmgr.c,v 1.27 2008/11/19 10:34:50 heikki Exp $ * $PostgreSQL: pgsql/src/backend/access/transam/rmgr.c,v 1.28 2009/12/19 01:32:33 sriggs Exp $
*/ */
#include "postgres.h" #include "postgres.h"
...@@ -21,6 +21,7 @@ ...@@ -21,6 +21,7 @@
#include "commands/sequence.h" #include "commands/sequence.h"
#include "commands/tablespace.h" #include "commands/tablespace.h"
#include "storage/freespace.h" #include "storage/freespace.h"
#include "storage/standby.h"
const RmgrData RmgrTable[RM_MAX_ID + 1] = { const RmgrData RmgrTable[RM_MAX_ID + 1] = {
...@@ -32,7 +33,7 @@ const RmgrData RmgrTable[RM_MAX_ID + 1] = { ...@@ -32,7 +33,7 @@ const RmgrData RmgrTable[RM_MAX_ID + 1] = {
{"Tablespace", tblspc_redo, tblspc_desc, NULL, NULL, NULL}, {"Tablespace", tblspc_redo, tblspc_desc, NULL, NULL, NULL},
{"MultiXact", multixact_redo, multixact_desc, NULL, NULL, NULL}, {"MultiXact", multixact_redo, multixact_desc, NULL, NULL, NULL},
{"Reserved 7", NULL, NULL, NULL, NULL, NULL}, {"Reserved 7", NULL, NULL, NULL, NULL, NULL},
{"Reserved 8", NULL, NULL, NULL, NULL, NULL}, {"Standby", standby_redo, standby_desc, NULL, NULL, NULL},
{"Heap2", heap2_redo, heap2_desc, NULL, NULL, NULL}, {"Heap2", heap2_redo, heap2_desc, NULL, NULL, NULL},
{"Heap", heap_redo, heap_desc, NULL, NULL, NULL}, {"Heap", heap_redo, heap_desc, NULL, NULL, NULL},
{"Btree", btree_redo, btree_desc, btree_xlog_startup, btree_xlog_cleanup, btree_safe_restartpoint}, {"Btree", btree_redo, btree_desc, btree_xlog_startup, btree_xlog_cleanup, btree_safe_restartpoint},
......
...@@ -22,7 +22,7 @@ ...@@ -22,7 +22,7 @@
* Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California * Portions Copyright (c) 1994, Regents of the University of California
* *
* $PostgreSQL: pgsql/src/backend/access/transam/subtrans.c,v 1.24 2009/01/01 17:23:36 momjian Exp $ * $PostgreSQL: pgsql/src/backend/access/transam/subtrans.c,v 1.25 2009/12/19 01:32:33 sriggs Exp $
* *
*------------------------------------------------------------------------- *-------------------------------------------------------------------------
*/ */
...@@ -68,15 +68,19 @@ static bool SubTransPagePrecedes(int page1, int page2); ...@@ -68,15 +68,19 @@ static bool SubTransPagePrecedes(int page1, int page2);
/* /*
* Record the parent of a subtransaction in the subtrans log. * Record the parent of a subtransaction in the subtrans log.
*
* In some cases we may need to overwrite an existing value.
*/ */
void void
SubTransSetParent(TransactionId xid, TransactionId parent) SubTransSetParent(TransactionId xid, TransactionId parent, bool overwriteOK)
{ {
int pageno = TransactionIdToPage(xid); int pageno = TransactionIdToPage(xid);
int entryno = TransactionIdToEntry(xid); int entryno = TransactionIdToEntry(xid);
int slotno; int slotno;
TransactionId *ptr; TransactionId *ptr;
Assert(TransactionIdIsValid(parent));
LWLockAcquire(SubtransControlLock, LW_EXCLUSIVE); LWLockAcquire(SubtransControlLock, LW_EXCLUSIVE);
slotno = SimpleLruReadPage(SubTransCtl, pageno, true, xid); slotno = SimpleLruReadPage(SubTransCtl, pageno, true, xid);
...@@ -84,7 +88,8 @@ SubTransSetParent(TransactionId xid, TransactionId parent) ...@@ -84,7 +88,8 @@ SubTransSetParent(TransactionId xid, TransactionId parent)
ptr += entryno; ptr += entryno;
/* Current state should be 0 */ /* Current state should be 0 */
Assert(*ptr == InvalidTransactionId); Assert(*ptr == InvalidTransactionId ||
(*ptr == parent && overwriteOK));
*ptr = parent; *ptr = parent;
......
This diff is collapsed.
@@ -8,7 +8,7 @@
  *
  *
  * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/access/transam/twophase_rmgr.c,v 1.10 2009/11/23 09:58:36 heikki Exp $
+ *	  $PostgreSQL: pgsql/src/backend/access/transam/twophase_rmgr.c,v 1.11 2009/12/19 01:32:33 sriggs Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -19,14 +19,12 @@
 #include "commands/async.h"
 #include "pgstat.h"
 #include "storage/lock.h"
-#include "utils/inval.h"

 const TwoPhaseCallback twophase_recover_callbacks[TWOPHASE_RM_MAX_ID + 1] =
 {
 	NULL,						/* END ID */
 	lock_twophase_recover,		/* Lock */
-	NULL,						/* Inval */
 	NULL,						/* notify/listen */
 	NULL,						/* pgstat */
 	multixact_twophase_recover	/* MultiXact */
@@ -36,7 +34,6 @@ const TwoPhaseCallback twophase_postcommit_callbacks[TWOPHASE_RM_MAX_ID + 1] =
 {
 	NULL,						/* END ID */
 	lock_twophase_postcommit,	/* Lock */
-	inval_twophase_postcommit,	/* Inval */
 	notify_twophase_postcommit, /* notify/listen */
 	pgstat_twophase_postcommit, /* pgstat */
 	multixact_twophase_postcommit		/* MultiXact */
@@ -46,8 +43,16 @@ const TwoPhaseCallback twophase_postabort_callbacks[TWOPHASE_RM_MAX_ID + 1] =
 {
 	NULL,						/* END ID */
 	lock_twophase_postabort,	/* Lock */
-	NULL,						/* Inval */
 	NULL,						/* notify/listen */
 	pgstat_twophase_postabort,	/* pgstat */
 	multixact_twophase_postabort	/* MultiXact */
 };
+
+const TwoPhaseCallback twophase_standby_recover_callbacks[TWOPHASE_RM_MAX_ID + 1] =
+{
+	NULL,						/* END ID */
+	lock_twophase_standby_recover,		/* Lock */
+	NULL,						/* notify/listen */
+	NULL,						/* pgstat */
+	NULL						/* MultiXact */
+};
@@ -13,7 +13,7 @@
  *
  *
  * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/commands/dbcommands.c,v 1.228 2009/11/12 02:46:16 tgl Exp $
+ *	  $PostgreSQL: pgsql/src/backend/commands/dbcommands.c,v 1.229 2009/12/19 01:32:34 sriggs Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -26,6 +26,7 @@
 #include "access/genam.h"
 #include "access/heapam.h"
+#include "access/transam.h"
 #include "access/xact.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
@@ -48,6 +49,7 @@
 #include "storage/ipc.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
+#include "storage/standby.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
@@ -1941,6 +1943,26 @@ dbase_redo(XLogRecPtr lsn, XLogRecord *record)
 		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);

+		if (InHotStandby)
+		{
+			VirtualTransactionId *database_users;
+
+			/*
+			 * Find all users connected to this database and ask them
+			 * politely to immediately kill their sessions before processing
+			 * the drop database record, after the usual grace period.
+			 * We don't wait for commit because drop database is
+			 * non-transactional.
+			 */
+			database_users = GetConflictingVirtualXIDs(InvalidTransactionId,
+													   xlrec->db_id,
+													   false);
+
+			ResolveRecoveryConflictWithVirtualXIDs(database_users,
+												   "drop database",
+												   CONFLICT_MODE_FATAL);
+		}
+
 		/* Drop pages for this database that are in the shared buffer cache */
 		DropDatabaseBuffers(xlrec->db_id);
...
@@ -8,7 +8,7 @@
  *
  *
  * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/commands/lockcmds.c,v 1.25 2009/06/11 14:48:56 momjian Exp $
+ *	  $PostgreSQL: pgsql/src/backend/commands/lockcmds.c,v 1.26 2009/12/19 01:32:34 sriggs Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -47,6 +47,16 @@ LockTableCommand(LockStmt *lockstmt)

 		reloid = RangeVarGetRelid(relation, false);

+		/*
+		 * During recovery we only accept these variations:
+		 * LOCK TABLE foo IN ACCESS SHARE MODE
+		 * LOCK TABLE foo IN ROW SHARE MODE
+		 * LOCK TABLE foo IN ROW EXCLUSIVE MODE
+		 * This test must match the restrictions defined in LockAcquire()
+		 */
+		if (lockstmt->mode > RowExclusiveLock)
+			PreventCommandDuringRecovery();
+
 		LockTableRecurse(reloid, relation,
 						 lockstmt->mode, lockstmt->nowait, recurse);
 	}
...
@@ -8,7 +8,7 @@
  *
  *
  * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/commands/sequence.c,v 1.162 2009/10/13 00:53:07 tgl Exp $
+ *	  $PostgreSQL: pgsql/src/backend/commands/sequence.c,v 1.163 2009/12/19 01:32:34 sriggs Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -458,6 +458,9 @@ nextval_internal(Oid relid)
 				rescnt = 0;
 	bool		logit = false;

+	/* nextval() writes to database and must be prevented during recovery */
+	PreventCommandDuringRecovery();
+
 	/* open and AccessShareLock sequence */
 	init_sequence(relid, &elm, &seqrel);
...
@@ -37,7 +37,7 @@
  *
  *
  * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/commands/tablespace.c,v 1.63 2009/11/10 18:53:38 tgl Exp $
+ *	  $PostgreSQL: pgsql/src/backend/commands/tablespace.c,v 1.64 2009/12/19 01:32:34 sriggs Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -50,6 +50,7 @@
 #include "access/heapam.h"
 #include "access/sysattr.h"
+#include "access/transam.h"
 #include "access/xact.h"
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
@@ -60,6 +61,8 @@
 #include "miscadmin.h"
 #include "postmaster/bgwriter.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
+#include "storage/standby.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
@@ -1317,12 +1320,59 @@ tblspc_redo(XLogRecPtr lsn, XLogRecord *record)
 	{
 		xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) XLogRecGetData(record);

+		/*
+		 * If we issued a WAL record for a drop tablespace it is
+		 * because there were no files in it at all. That means that
+		 * no permanent objects can exist in it at this point.
+		 *
+		 * It is possible for standby users to be using this tablespace
+		 * as a location for their temporary files, so if we fail to
+		 * remove all files then do conflict processing and try again,
+		 * if currently enabled.
+		 */
+		if (!remove_tablespace_directories(xlrec->ts_id, true))
+		{
+			VirtualTransactionId *temp_file_users;
+
+			/*
+			 * Standby users may be currently using this tablespace for
+			 * for their temporary files. We only care about current
+			 * users because temp_tablespace parameter will just ignore
+			 * tablespaces that no longer exist.
+			 *
+			 * Ask everybody to cancel their queries immediately so
+			 * we can ensure no temp files remain and we can remove the
+			 * tablespace. Nuke the entire site from orbit, it's the only
+			 * way to be sure.
+			 *
+			 * XXX: We could work out the pids of active backends
+			 * using this tablespace by examining the temp filenames in the
+			 * directory. We would then convert the pids into VirtualXIDs
+			 * before attempting to cancel them.
+			 *
+			 * We don't wait for commit because drop tablespace is
+			 * non-transactional.
+			 */
+			temp_file_users = GetConflictingVirtualXIDs(InvalidTransactionId,
+														InvalidOid,
+														false);
+			ResolveRecoveryConflictWithVirtualXIDs(temp_file_users,
+												   "drop tablespace",
+												   CONFLICT_MODE_ERROR);
+
+			/*
+			 * If we did recovery processing then hopefully the
+			 * backends who wrote temp files should have cleaned up and
+			 * exited by now. So lets recheck before we throw an error.
+			 * If !process_conflicts then this will just fail again.
+			 */
 			if (!remove_tablespace_directories(xlrec->ts_id, true))
 				ereport(ERROR,
 						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 						 errmsg("tablespace %u is not empty",
 								xlrec->ts_id)));
+		}
 	}
 	else
 		elog(PANIC, "tblspc_redo: unknown op code %u", info);
 }
...
@@ -13,7 +13,7 @@
  *
  *
  * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/commands/vacuum.c,v 1.398 2009/12/09 21:57:51 tgl Exp $
+ *	  $PostgreSQL: pgsql/src/backend/commands/vacuum.c,v 1.399 2009/12/19 01:32:34 sriggs Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -141,6 +141,7 @@ typedef struct VRelStats
 	/* vtlinks array for tuple chain following - sorted by new_tid */
 	int			num_vtlinks;
 	VTupleLink	vtlinks;
+	TransactionId latestRemovedXid;
 } VRelStats;

 /*----------------------------------------------------------------------
@@ -224,7 +225,7 @@ static void scan_heap(VRelStats *vacrelstats, Relation onerel,
 static bool repair_frag(VRelStats *vacrelstats, Relation onerel,
 			VacPageList vacuum_pages, VacPageList fraged_pages,
 			int nindexes, Relation *Irel);
-static void move_chain_tuple(Relation rel,
+static void move_chain_tuple(VRelStats *vacrelstats, Relation rel,
 				 Buffer old_buf, Page old_page, HeapTuple old_tup,
 				 Buffer dst_buf, Page dst_page, VacPage dst_vacpage,
 				 ExecContext ec, ItemPointer ctid, bool cleanVpd);
@@ -237,7 +238,7 @@ static void update_hint_bits(Relation rel, VacPageList fraged_pages,
 				 int num_moved);
 static void vacuum_heap(VRelStats *vacrelstats, Relation onerel,
 			VacPageList vacpagelist);
-static void vacuum_page(Relation onerel, Buffer buffer, VacPage vacpage);
+static void vacuum_page(VRelStats *vacrelstats, Relation onerel, Buffer buffer, VacPage vacpage);
 static void vacuum_index(VacPageList vacpagelist, Relation indrel,
 			 double num_tuples, int keep_tuples);
 static void scan_index(Relation indrel, double num_tuples);
@@ -1300,6 +1301,7 @@ full_vacuum_rel(Relation onerel, VacuumStmt *vacstmt)
 	vacrelstats->rel_tuples = 0;
 	vacrelstats->rel_indexed_tuples = 0;
 	vacrelstats->hasindex = false;
+	vacrelstats->latestRemovedXid = InvalidTransactionId;

 	/* scan the heap */
 	vacuum_pages.num_pages = fraged_pages.num_pages = 0;
@@ -1708,6 +1710,9 @@ scan_heap(VRelStats *vacrelstats, Relation onerel,
 			{
 				ItemId		lpp;

+				HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data,
+											  &vacrelstats->latestRemovedXid);
+
 				/*
 				 * Here we are building a temporary copy of the page with dead
 				 * tuples removed.  Below we will apply
@@ -2025,7 +2030,7 @@ repair_frag(VRelStats *vacrelstats, Relation onerel,
 				/* there are dead tuples on this page - clean them */
 				Assert(!isempty);
 				LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
-				vacuum_page(onerel, buf, last_vacuum_page);
+				vacuum_page(vacrelstats, onerel, buf, last_vacuum_page);
 				LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 			}
 			else
@@ -2514,7 +2519,7 @@ repair_frag(VRelStats *vacrelstats, Relation onerel,
 					tuple.t_data = (HeapTupleHeader) PageGetItem(Cpage, Citemid);
 					tuple_len = tuple.t_len = ItemIdGetLength(Citemid);

-					move_chain_tuple(onerel, Cbuf, Cpage, &tuple,
+					move_chain_tuple(vacrelstats, onerel, Cbuf, Cpage, &tuple,
 									 dst_buffer, dst_page, destvacpage,
 									 &ec, &Ctid, vtmove[ti].cleanVpd);

@@ -2600,7 +2605,7 @@ repair_frag(VRelStats *vacrelstats, Relation onerel,
 					dst_page = BufferGetPage(dst_buffer);
 					/* if this page was not used before - clean it */
 					if (!PageIsEmpty(dst_page) && dst_vacpage->offsets_used == 0)
-						vacuum_page(onerel, dst_buffer, dst_vacpage);
+						vacuum_page(vacrelstats, onerel, dst_buffer, dst_vacpage);
 				}
 				else
 					LockBuffer(dst_buffer, BUFFER_LOCK_EXCLUSIVE);
@@ -2753,7 +2758,7 @@ repair_frag(VRelStats *vacrelstats, Relation onerel,
 			HOLD_INTERRUPTS();
 			heldoff = true;
 			ForceSyncCommit();
-			(void) RecordTransactionCommit();
+			(void) RecordTransactionCommit(true);
 		}

 		/*
@@ -2781,7 +2786,7 @@ repair_frag(VRelStats *vacrelstats, Relation onerel,
 				LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 				page = BufferGetPage(buf);
 				if (!PageIsEmpty(page))
-					vacuum_page(onerel, buf, *curpage);
+					vacuum_page(vacrelstats, onerel, buf, *curpage);
 				UnlockReleaseBuffer(buf);
 			}
 		}
@@ -2917,7 +2922,7 @@ repair_frag(VRelStats *vacrelstats, Relation onerel,
 			recptr = log_heap_clean(onerel, buf,
 									NULL, 0, NULL, 0,
 									unused, uncnt,
-									false);
+									vacrelstats->latestRemovedXid, false);
 			PageSetLSN(page, recptr);
 			PageSetTLI(page, ThisTimeLineID);
 		}
@@ -2969,7 +2974,7 @@ repair_frag(VRelStats *vacrelstats, Relation onerel,
  * already too long and almost unreadable.
  */
 static void
-move_chain_tuple(Relation rel,
+move_chain_tuple(VRelStats *vacrelstats, Relation rel,
 				 Buffer old_buf, Page old_page, HeapTuple old_tup,
 				 Buffer dst_buf, Page dst_page, VacPage dst_vacpage,
 				 ExecContext ec, ItemPointer ctid, bool cleanVpd)
@@ -3027,7 +3032,7 @@ move_chain_tuple(Relation rel,
 		int			sv_offsets_used = dst_vacpage->offsets_used;

 		dst_vacpage->offsets_used = 0;
-		vacuum_page(rel, dst_buf, dst_vacpage);
+		vacuum_page(vacrelstats, rel, dst_buf, dst_vacpage);
 		dst_vacpage->offsets_used = sv_offsets_used;
 	}
@@ -3367,7 +3372,7 @@ vacuum_heap(VRelStats *vacrelstats, Relation onerel, VacPageList vacuum_pages)
 			buf = ReadBufferExtended(onerel, MAIN_FORKNUM, (*vacpage)->blkno,
 									 RBM_NORMAL, vac_strategy);
 			LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
-			vacuum_page(onerel, buf, *vacpage);
+			vacuum_page(vacrelstats, onerel, buf, *vacpage);
 			UnlockReleaseBuffer(buf);
 		}
 	}
@@ -3397,7 +3402,7 @@ vacuum_heap(VRelStats *vacrelstats, Relation onerel, VacPageList vacuum_pages)
  * Caller must hold pin and lock on buffer.
  */
 static void
-vacuum_page(Relation onerel, Buffer buffer, VacPage vacpage)
+vacuum_page(VRelStats *vacrelstats, Relation onerel, Buffer buffer, VacPage vacpage)
 {
 	Page		page = BufferGetPage(buffer);
 	int			i;
@@ -3426,7 +3431,7 @@ vacuum_page(Relation onerel, Buffer buffer, VacPage vacpage)
 		recptr = log_heap_clean(onerel, buffer,
 								NULL, 0, NULL, 0,
 								vacpage->offsets, vacpage->offsets_free,
-								false);
+								vacrelstats->latestRemovedXid, false);
 		PageSetLSN(page, recptr);
 		PageSetTLI(page, ThisTimeLineID);
 	}
...
@@ -29,7 +29,7 @@
  *
  *
  * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/commands/vacuumlazy.c,v 1.124 2009/11/16 21:32:06 tgl Exp $
+ *	  $PostgreSQL: pgsql/src/backend/commands/vacuumlazy.c,v 1.125 2009/12/19 01:32:34 sriggs Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -98,6 +98,7 @@ typedef struct LVRelStats
 	int			max_dead_tuples;	/* # slots allocated in array */
 	ItemPointer dead_tuples;	/* array of ItemPointerData */
 	int			num_index_scans;
+	TransactionId latestRemovedXid;
 } LVRelStats;

@@ -265,6 +266,34 @@ lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt,
 	return heldoff;
 }

+/*
+ * For Hot Standby we need to know the highest transaction id that will
+ * be removed by any change. VACUUM proceeds in a number of passes so
+ * we need to consider how each pass operates. The first phase runs
+ * heap_page_prune(), which can issue XLOG_HEAP2_CLEAN records as it
+ * progresses - these will have a latestRemovedXid on each record.
+ * In some cases this removes all of the tuples to be removed, though
+ * often we have dead tuples with index pointers so we must remember them
+ * for removal in phase 3. Index records for those rows are removed
+ * in phase 2 and index blocks do not have MVCC information attached.
+ * So before we can allow removal of any index tuples we need to issue
+ * a WAL record containing the latestRemovedXid of rows that will be
+ * removed in phase three. This allows recovery queries to block at the
+ * correct place, i.e. before phase two, rather than during phase three
+ * which would be after the rows have become inaccessible.
+ */
+static void
+vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
+{
+	/*
+	 * No need to log changes for temp tables, they do not contain
+	 * data visible on the standby server.
+	 */
+	if (rel->rd_istemp || !XLogArchivingActive())
+		return;
+
+	(void) log_heap_cleanup_info(rel->rd_node, vacrelstats->latestRemovedXid);
+}
 /*
  * lazy_scan_heap() -- scan an open heap relation
@@ -315,6 +344,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
 	nblocks = RelationGetNumberOfBlocks(onerel);
 	vacrelstats->rel_pages = nblocks;
 	vacrelstats->nonempty_pages = 0;
+	vacrelstats->latestRemovedXid = InvalidTransactionId;

 	lazy_space_alloc(vacrelstats, nblocks);
@@ -373,6 +403,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
 		if ((vacrelstats->max_dead_tuples - vacrelstats->num_dead_tuples) < MaxHeapTuplesPerPage &&
 			vacrelstats->num_dead_tuples > 0)
 		{
+			/* Log cleanup info before we touch indexes */
+			vacuum_log_cleanup_info(onerel, vacrelstats);
+
 			/* Remove index entries */
 			for (i = 0; i < nindexes; i++)
 				lazy_vacuum_index(Irel[i],
@@ -382,6 +415,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
 			lazy_vacuum_heap(onerel, vacrelstats);
 			/* Forget the now-vacuumed tuples, and press on */
 			vacrelstats->num_dead_tuples = 0;
+			vacrelstats->latestRemovedXid = InvalidTransactionId;
 			vacrelstats->num_index_scans++;
 		}
@@ -613,6 +647,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
 			if (tupgone)
 			{
 				lazy_record_dead_tuple(vacrelstats, &(tuple.t_self));
+				HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data,
+											 &vacrelstats->latestRemovedXid);
 				tups_vacuumed += 1;
 			}
 			else
@@ -661,6 +697,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
 			lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats);
 			/* Forget the now-vacuumed tuples, and press on */
 			vacrelstats->num_dead_tuples = 0;
+			vacrelstats->latestRemovedXid = InvalidTransactionId;
 			vacuumed_pages++;
 		}
@@ -724,6 +761,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
 		/* XXX put a threshold on min number of tuples here? */
 		if (vacrelstats->num_dead_tuples > 0)
 		{
+			/* Log cleanup info before we touch indexes */
+			vacuum_log_cleanup_info(onerel, vacrelstats);
+
 			/* Remove index entries */
 			for (i = 0; i < nindexes; i++)
 				lazy_vacuum_index(Irel[i],
@@ -868,7 +908,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 		recptr = log_heap_clean(onerel, buffer,
 								NULL, 0, NULL, 0,
 								unused, uncnt,
-								false);
+								vacrelstats->latestRemovedXid, false);
 		PageSetLSN(page, recptr);
 		PageSetTLI(page, ThisTimeLineID);
 	}
...
@@ -37,7 +37,7 @@
  *
  *
  * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/postmaster/postmaster.c,v 1.596 2009/09/08 17:08:36 tgl Exp $
+ *	  $PostgreSQL: pgsql/src/backend/postmaster/postmaster.c,v 1.597 2009/12/19 01:32:34 sriggs Exp $
  *
  * NOTES
  *
@@ -245,8 +245,9 @@ static bool RecoveryError = false;		/* T if WAL recovery failed */
  * When archive recovery is finished, the startup process exits with exit
  * code 0 and we switch to PM_RUN state.
  *
- * Normal child backends can only be launched when we are in PM_RUN state.
- * (We also allow it in PM_WAIT_BACKUP state, but only for superusers.)
+ * Normal child backends can only be launched when we are in PM_RUN or
+ * PM_RECOVERY_CONSISTENT state. (We also allow launch of normal
+ * child backends in PM_WAIT_BACKUP state, but only for superusers.)
  * In other states we handle connection requests by launching "dead_end"
  * child processes, which will simply send the client an error message and
  * quit.  (We track these in the BackendList so that we can know when they
@@ -1868,7 +1869,7 @@ static enum CAC_state
 canAcceptConnections(void)
 {
 	/*
-	 * Can't start backends when in startup/shutdown/recovery state.
+	 * Can't start backends when in startup/shutdown/inconsistent recovery state.
 	 *
 	 * In state PM_WAIT_BACKUP only superusers can connect (this must be
 	 * allowed so that a superuser can end online backup mode); we return
@@ -1882,9 +1883,11 @@ canAcceptConnections(void)
 		return CAC_SHUTDOWN;	/* shutdown is pending */
 	if (!FatalError &&
 		(pmState == PM_STARTUP ||
-		 pmState == PM_RECOVERY ||
-		 pmState == PM_RECOVERY_CONSISTENT))
+		 pmState == PM_RECOVERY))
 		return CAC_STARTUP; /* normal startup */
+	if (!FatalError &&
+		pmState == PM_RECOVERY_CONSISTENT)
+		return CAC_OK;		/* connection OK during recovery */
 	return CAC_RECOVERY;	/* else must be crash recovery */
 }
@@ -4003,9 +4006,8 @@ sigusr1_handler(SIGNAL_ARGS)
 		Assert(PgStatPID == 0);
 		PgStatPID = pgstat_start();

-		/* XXX at this point we could accept read-only connections */
-		ereport(DEBUG1,
-				(errmsg("database system is in consistent recovery mode")));
+		ereport(LOG,
+				(errmsg("database system is ready to accept read only connections")));

 		pmState = PM_RECOVERY_CONSISTENT;
 	}
...
 #
 # Makefile for storage/ipc
 #
-# $PostgreSQL: pgsql/src/backend/storage/ipc/Makefile,v 1.22 2009/07/31 20:26:23 tgl Exp $
+# $PostgreSQL: pgsql/src/backend/storage/ipc/Makefile,v 1.23 2009/12/19 01:32:35 sriggs Exp $
 #

 subdir = src/backend/storage/ipc
@@ -16,6 +16,6 @@ endif
 endif

 OBJS = ipc.o ipci.o pmsignal.o procarray.o procsignal.o shmem.o shmqueue.o \
-	sinval.o sinvaladt.o
+	sinval.o sinvaladt.o standby.o

 include $(top_srcdir)/src/backend/common.mk
...@@ -8,7 +8,7 @@ ...@@ -8,7 +8,7 @@
* *
* *
* IDENTIFICATION * IDENTIFICATION
* $PostgreSQL: pgsql/src/backend/storage/ipc/sinvaladt.c,v 1.79 2009/07/31 20:26:23 tgl Exp $ * $PostgreSQL: pgsql/src/backend/storage/ipc/sinvaladt.c,v 1.80 2009/12/19 01:32:35 sriggs Exp $
* *
*------------------------------------------------------------------------- *-------------------------------------------------------------------------
*/ */
...@@ -144,6 +144,13 @@ typedef struct ProcState ...@@ -144,6 +144,13 @@ typedef struct ProcState
bool resetState; /* backend needs to reset its state */ bool resetState; /* backend needs to reset its state */
bool signaled; /* backend has been sent catchup signal */ bool signaled; /* backend has been sent catchup signal */
/*
* Backend only sends invalidations, never receives them. This only makes sense
* for Startup process during recovery because it doesn't maintain a relcache,
* yet it fires inval messages to allow query backends to see schema changes.
*/
bool sendOnly; /* backend only sends, never receives */
/* /*
* Next LocalTransactionId to use for each idle backend slot. We keep * Next LocalTransactionId to use for each idle backend slot. We keep
* this here because it is indexed by BackendId and it is convenient to * this here because it is indexed by BackendId and it is convenient to
...@@ -249,7 +256,7 @@ CreateSharedInvalidationState(void) ...@@ -249,7 +256,7 @@ CreateSharedInvalidationState(void)
* Initialize a new backend to operate on the sinval buffer * Initialize a new backend to operate on the sinval buffer
*/ */
void void
SharedInvalBackendInit(void) SharedInvalBackendInit(bool sendOnly)
{ {
int index; int index;
ProcState *stateP = NULL; ProcState *stateP = NULL;
...@@ -308,6 +315,7 @@ SharedInvalBackendInit(void) ...@@ -308,6 +315,7 @@ SharedInvalBackendInit(void)
stateP->nextMsgNum = segP->maxMsgNum; stateP->nextMsgNum = segP->maxMsgNum;
stateP->resetState = false; stateP->resetState = false;
stateP->signaled = false; stateP->signaled = false;
stateP->sendOnly = sendOnly;
LWLockRelease(SInvalWriteLock); LWLockRelease(SInvalWriteLock);
@@ -579,7 +587,9 @@ SICleanupQueue(bool callerHasWriteLock, int minFree)
 	/*
 	 * Recompute minMsgNum = minimum of all backends' nextMsgNum, identify the
 	 * furthest-back backend that needs signaling (if any), and reset any
-	 * backends that are too far back.
+	 * backends that are too far back.  Note that because we ignore sendOnly
+	 * backends here it is possible for them to keep sending messages without
+	 * a problem even when they are the only active backend.
 	 */
 	min = segP->maxMsgNum;
 	minsig = min - SIG_THRESHOLD;
@@ -591,7 +601,7 @@ SICleanupQueue(bool callerHasWriteLock, int minFree)
 		int			n = stateP->nextMsgNum;

 		/* Ignore if inactive or already in reset state */
-		if (stateP->procPid == 0 || stateP->resetState)
+		if (stateP->procPid == 0 || stateP->resetState || stateP->sendOnly)
 			continue;

 		/*
...
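The effect of the sendOnly flag on queue cleanup can be illustrated with a simplified, standalone sketch. This is not the real sinvaladt.c code: the ProcState struct here is a hypothetical reduction to just the fields the scan uses, and the queue bookkeeping is elided.

```c
#include <stdbool.h>

/* Reduced stand-in for sinvaladt.c's ProcState (hypothetical field subset). */
typedef struct ProcState
{
	int		procPid;		/* 0 means an unused slot */
	int		nextMsgNum;		/* next message number to read */
	bool	resetState;		/* backend needs to reset its state */
	bool	sendOnly;		/* backend only sends, never receives */
} ProcState;

/*
 * Recompute the minimum of all backends' nextMsgNum, as SICleanupQueue does,
 * skipping inactive, already-reset, and send-only backends.  Because a
 * send-only backend (the Startup process) never reads the queue, its stale
 * nextMsgNum must not hold back truncation of the message queue.
 */
static int
ComputeMinMsgNum(const ProcState *procs, int nprocs, int maxMsgNum)
{
	int		min = maxMsgNum;

	for (int i = 0; i < nprocs; i++)
	{
		const ProcState *stateP = &procs[i];

		if (stateP->procPid == 0 || stateP->resetState || stateP->sendOnly)
			continue;
		if (stateP->nextMsgNum < min)
			min = stateP->nextMsgNum;
	}
	return min;
}
```

Note that the send-only slot below would otherwise pin the queue at message 2; because it is skipped, cleanup can advance to the slowest *reading* backend instead.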
-$PostgreSQL: pgsql/src/backend/storage/lmgr/README,v 1.24 2008/03/21 13:23:28 momjian Exp $
+$PostgreSQL: pgsql/src/backend/storage/lmgr/README,v 1.25 2009/12/19 01:32:35 sriggs Exp $

 Locking Overview
 ================
@@ -517,3 +517,27 @@ interfere with each other.

 User locks are always held as session locks, so that they are not released at
 transaction end.  They must be released explicitly by the application --- but
 they are released automatically when a backend terminates.
+
+Locking during Hot Standby
+--------------------------
+
+The Startup process is the only backend that can make changes during
+recovery; all other backends are read only.  As a result, the Startup
+process does not acquire locks on relations or objects except when the lock
+level is AccessExclusiveLock.
+
+Regular backends are only allowed to take locks on relations or objects
+at RowExclusiveLock or lower.  This ensures that they do not conflict with
+each other or with the Startup process, unless AccessExclusiveLocks are
+requested by one of the backends.
+
+Deadlocks involving AccessExclusiveLocks are not possible, so we need
+not be concerned that a user-initiated deadlock can prevent recovery from
+progressing.
+
+AccessExclusiveLocks on the primary or master node generate WAL records
+that are then applied by the Startup process.  Locks are released at end
+of transaction just as they are in normal processing.  These locks are
+held by the Startup process, acting as a proxy for the backends that
+originally acquired them.  Again, these locks cannot conflict with one
+another, so the Startup process cannot deadlock with itself either.
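The admission rules the README describes can be sketched as a small predicate. This is an illustrative simplification, not the real lock.c logic: the LOCKMODE enum below is a hypothetical stand-in for PostgreSQL's numeric lock levels, and the function name is invented for this example.

```c
#include <stdbool.h>

/* Hypothetical enumeration mirroring PostgreSQL's lock-mode ordering. */
typedef enum
{
	AccessShareLock = 1,
	RowShareLock = 2,
	RowExclusiveLock = 3,
	ShareUpdateExclusiveLock = 4,
	ShareLock = 5,
	ShareRowExclusiveLock = 6,
	ExclusiveLock = 7,
	AccessExclusiveLock = 8
} LOCKMODE;

/*
 * Sketch of the Hot Standby locking rules described above: the Startup
 * process takes only AccessExclusiveLock (replayed from WAL), while regular
 * read-only backends may take RowExclusiveLock or lower.  Consequently the
 * two groups can only conflict when an AccessExclusiveLock is involved.
 */
static bool
LockModeAllowedDuringRecovery(LOCKMODE mode, bool is_startup_process)
{
	if (is_startup_process)
		return mode == AccessExclusiveLock;
	return mode <= RowExclusiveLock;
}
```

Because every lock a regular backend may take is weaker than ShareLock, two regular backends can never block each other; only the Startup process's proxied AccessExclusiveLocks can conflict with queries.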
@@ -8,7 +8,7 @@
  *
  *
  * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/storage/lmgr/proc.c,v 1.209 2009/08/31 19:41:00 tgl Exp $
+ *	  $PostgreSQL: pgsql/src/backend/storage/lmgr/proc.c,v 1.210 2009/12/19 01:32:36 sriggs Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -318,6 +318,7 @@ InitProcess(void)
 	MyProc->waitProcLock = NULL;
 	for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
 		SHMQueueInit(&(MyProc->myProcLocks[i]));
+	MyProc->recoveryConflictMode = 0;

 	/*
 	 * We might be reusing a semaphore that belonged to a failed process.  So
@@ -374,6 +375,11 @@ InitProcessPhase2(void)
  * to the ProcArray or the sinval messaging mechanism, either.  They also
  * don't get a VXID assigned, since this is only useful when we actually
  * hold lockmgr locks.
+ *
+ * The Startup process, however, uses locks but never waits for them in the
+ * normal backend sense.  It also takes part in sinval messaging as a
+ * sendOnly process, so it never reads messages from the sinval queue.  So
+ * the Startup process does have a VXID and does show up in pg_locks.
  */
 void
 InitAuxiliaryProcess(void)
@@ -461,6 +467,24 @@ InitAuxiliaryProcess(void)
 	on_shmem_exit(AuxiliaryProcKill, Int32GetDatum(proctype));
 }

+/*
+ * Record the PID and PGPROC structure for the Startup process, for use in
+ * ProcSendSignal().  See comments there for further explanation.
+ */
+void
+PublishStartupProcessInformation(void)
+{
+	/* use volatile pointer to prevent code rearrangement */
+	volatile PROC_HDR *procglobal = ProcGlobal;
+
+	SpinLockAcquire(ProcStructLock);
+
+	procglobal->startupProc = MyProc;
+	procglobal->startupProcPid = MyProcPid;
+
+	SpinLockRelease(ProcStructLock);
+}
+
 /*
  * Check whether there are at least N free PGPROC objects.
  *
@@ -1289,7 +1313,31 @@ ProcWaitForSignal(void)
 void
 ProcSendSignal(int pid)
 {
-	PGPROC	   *proc = BackendPidGetProc(pid);
+	PGPROC	   *proc = NULL;
+
+	if (RecoveryInProgress())
+	{
+		/* use volatile pointer to prevent code rearrangement */
+		volatile PROC_HDR *procglobal = ProcGlobal;
+
+		SpinLockAcquire(ProcStructLock);
+
+		/*
+		 * Check to see whether it is the Startup process we wish to signal.
+		 * This call is made by the buffer manager when it wishes to wake up a
+		 * process that has been waiting for a buffer pin so it can obtain a
+		 * cleanup lock using LockBufferForCleanup().  Startup is not a normal
+		 * backend, so BackendPidGetProc() will not find it.  So we remember
+		 * the information for this special case.
+		 */
+		if (pid == procglobal->startupProcPid)
+			proc = procglobal->startupProc;
+
+		SpinLockRelease(ProcStructLock);
+	}
+
+	if (proc == NULL)
+		proc = BackendPidGetProc(pid);

 	if (proc != NULL)
 		PGSemaphoreUnlock(&proc->sem);
...
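The lookup order in ProcSendSignal can be sketched as a standalone function. This is an illustrative simplification under stated assumptions: locking is omitted, the struct types are minimal stand-ins for PGPROC and PROC_HDR, and the function and parameter names are invented for this example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-ins for PGPROC and the shared ProcGlobal header. */
typedef struct PGPROC
{
	int		pid;
} PGPROC;

typedef struct PROC_HDR
{
	PGPROC *startupProc;		/* set by PublishStartupProcessInformation */
	int		startupProcPid;
} PROC_HDR;

/*
 * Sketch of ProcSendSignal's lookup order: during recovery, first check
 * whether the target is the Startup process, which does not appear in the
 * regular backend array; otherwise fall back to the normal pid-based
 * lookup, passed in here as a function pointer standing in for
 * BackendPidGetProc().
 */
static PGPROC *
FindProcToSignal(int pid, bool in_recovery, PROC_HDR *procglobal,
				 PGPROC *(*backend_pid_get_proc) (int))
{
	PGPROC	   *proc = NULL;

	if (in_recovery && pid == procglobal->startupProcPid)
		proc = procglobal->startupProc;
	if (proc == NULL)
		proc = backend_pid_get_proc(pid);
	return proc;
}
```

The design point is the fallback: the special case only adds a check, so signalling normal backends behaves exactly as before, both in and out of recovery.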