Commit efc16ea5 authored by Simon Riggs

Allow read only connections during recovery, known as Hot Standby.

Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read-only queries. Recovery must enter a consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict with, and in some cases deadlock with, queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though they introduce four new types of WAL record.

New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port-specific behaviours have been used, though primary testing has been on Linux only so far.

This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required.

Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit.

Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.
parent 78a09145
<!-- $PostgreSQL: pgsql/doc/src/sgml/config.sgml,v 1.239 2009/12/19 01:32:31 sriggs Exp $ -->
<chapter Id="runtime-config">
 <title>Server Configuration</title>
...@@ -376,6 +376,12 @@ SET ENABLE_SEQSCAN TO OFF;
        allows.  See <xref linkend="sysvipc"> for information on how to
        adjust those parameters, if necessary.
       </para>
<para>
When running a standby server, you must set this parameter to the
same or higher value than on the master server. Otherwise, queries
will not be allowed in the standby server.
</para>
      </listitem>
     </varlistentry>
...@@ -826,6 +832,12 @@ SET ENABLE_SEQSCAN TO OFF;
        allows.  See <xref linkend="sysvipc"> for information on how to
        adjust those parameters, if necessary.
       </para>
<para>
When running a standby server, you must set this parameter to the
same or higher value than on the master server. Otherwise, queries
will not be allowed in the standby server.
</para>
      </listitem>
     </varlistentry>
...@@ -1733,6 +1745,51 @@ archive_command = 'copy "%p" "C:\\server\\archivedir\\%f"'  # Windows
   </variablelist>
  </sect2>
<sect2 id="runtime-config-standby">
<title>Standby Servers</title>
<variablelist>
<varlistentry id="recovery-connections" xreflabel="recovery_connections">
<term><varname>recovery_connections</varname> (<type>boolean</type>)</term>
<listitem>
<para>
        This parameter has two roles.  During recovery, it specifies whether
        you can connect and run queries, enabling <xref linkend="hot-standby">.
        During normal running, it specifies whether additional information is
        written to WAL to allow recovery connections on a standby server that
        reads WAL data generated by this server.  The default value is
        <literal>on</literal>.  There is thought to be little measurable
        difference in performance from using this feature, so feedback is
        welcome if any production impacts are noticeable.  It is likely that
        this parameter will be removed in later releases.  This parameter
        can only be set at server start.
</para>
</listitem>
</varlistentry>
<varlistentry id="max-standby-delay" xreflabel="max_standby_delay">
<term><varname>max_standby_delay</varname> (<type>string</type>)</term>
<listitem>
<para>
        When the server acts as a standby, this parameter specifies a wait
        policy for queries that conflict with incoming data changes.  Valid
        settings are -1, meaning wait forever, or a wait time of 0 or more
        seconds.  If a conflict occurs, the server will delay up to this
        amount of time before it begins trying to resolve things less
        amicably, as described in <xref linkend="hot-standby-conflict">.
        Typically, this parameter makes sense only during replication, so
        when performing an archive recovery to recover from data loss a
        setting of 0 is recommended.  The default is 30 seconds.
This parameter can only be set in the <filename>postgresql.conf</>
file or on the server command line.
</para>
</listitem>
</varlistentry>
</variablelist>
</sect2>
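The two standby parameters above can be sketched together in a minimal, illustrative <filename>postgresql.conf</> fragment (values are examples only):

```
# Illustrative standby settings (see parameter descriptions above)
recovery_connections = on   # allow read-only connections during recovery
max_standby_delay = 30      # wait up to 30s before cancelling conflicting queries
```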
 </sect1>
 <sect1 id="runtime-config-query">
...@@ -4161,6 +4218,29 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
      </listitem>
     </varlistentry>
<varlistentry id="guc-vacuum-defer-cleanup-age" xreflabel="vacuum_defer_cleanup_age">
<term><varname>vacuum_defer_cleanup_age</varname> (<type>integer</type>)</term>
<indexterm>
<primary><varname>vacuum_defer_cleanup_age</> configuration parameter</primary>
</indexterm>
<listitem>
<para>
        Specifies the number of transactions by which <command>VACUUM</> and
        <acronym>HOT</> updates will defer cleanup of dead row versions.  The
        default is 0 transactions, meaning that dead row versions will be
        removed as soon as possible.  You may wish to set this to a non-zero
        value when planning or maintaining a <xref linkend="hot-standby">
        configuration; the recommended value is <literal>0</> unless you have
        a clear reason to increase it.  The purpose of the parameter is to
        allow an approximate time delay to be specified before cleanup
        occurs.  However, there is no direct link with any specific time
        delay, so the results will be application- and installation-specific,
        as well as variable over time, depending on the rate of write
        transactions.
</para>
</listitem>
</varlistentry>
    <varlistentry id="guc-bytea-output" xreflabel="bytea_output">
     <term><varname>bytea_output</varname> (<type>enum</type>)</term>
     <indexterm>
...@@ -4689,6 +4769,12 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
        allows.  See <xref linkend="sysvipc"> for information on how to
        adjust those parameters, if necessary.
       </para>
<para>
When running a standby server, you must set this parameter to the
same or higher value than on the master server. Otherwise, queries
will not be allowed in the standby server.
</para>
      </listitem>
     </varlistentry>
...@@ -5546,6 +5632,32 @@ plruby.use_strict = true        # generates error: unknown class name
      </listitem>
     </varlistentry>
<varlistentry id="guc-trace-recovery-messages" xreflabel="trace_recovery_messages">
<term><varname>trace_recovery_messages</varname> (<type>string</type>)</term>
<indexterm>
<primary><varname>trace_recovery_messages</> configuration parameter</primary>
</indexterm>
<listitem>
<para>
        Controls which message levels are written to the server log
        for system modules needed for recovery processing.  This allows
        the user to override the normal setting of
        <varname>log_min_messages</>, but only for specific messages.
        This is intended for use in debugging Hot Standby.
        Valid values are <literal>DEBUG5</>, <literal>DEBUG4</>,
        <literal>DEBUG3</>, <literal>DEBUG2</>, <literal>DEBUG1</>,
        <literal>INFO</>, <literal>NOTICE</>, <literal>WARNING</>,
        <literal>ERROR</>, <literal>LOG</>, <literal>FATAL</>, and
        <literal>PANIC</>.  Each level includes all the levels that
        follow it.  The later the level, the fewer messages are sent
        to the log.  The default is <literal>WARNING</>.  Note that
        <literal>LOG</> has a different rank here than in
        <varname>client_min_messages</>.
        This parameter can only be set in <filename>postgresql.conf</>.
</para>
</listitem>
</varlistentry>
    <varlistentry id="guc-zero-damaged-pages" xreflabel="zero_damaged_pages">
     <term><varname>zero_damaged_pages</varname> (<type>boolean</type>)</term>
     <indexterm>
......
<!-- $PostgreSQL: pgsql/doc/src/sgml/func.sgml,v 1.494 2009/12/19 01:32:31 sriggs Exp $ -->
<chapter id="functions">
 <title>Functions and Operators</title>
...@@ -13132,6 +13132,38 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
    <xref linkend="continuous-archiving">.
   </para>
<indexterm>
<primary>pg_is_in_recovery</primary>
</indexterm>
<para>
The functions shown in <xref
linkend="functions-recovery-info-table"> provide information
about the current status of Hot Standby.
    These functions may be executed both during recovery and in normal running.
</para>
<table id="functions-recovery-info-table">
<title>Recovery Information Functions</title>
<tgroup cols="3">
<thead>
<row><entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry>
</row>
</thead>
<tbody>
<row>
<entry>
<literal><function>pg_is_in_recovery</function>()</literal>
</entry>
<entry><type>bool</type></entry>
<entry>True if recovery is still in progress.
</entry>
</row>
</tbody>
</tgroup>
</table>
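As a usage sketch, the new function can be called from any session; on a standby during recovery it reports true, on a normal primary false:

```sql
-- Check whether this server is currently in recovery (Hot Standby)
SELECT pg_is_in_recovery();
```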
   <para>
    The functions shown in <xref linkend="functions-admin-dbsize"> calculate
    the disk space usage of database objects.
......
<!-- $PostgreSQL: pgsql/doc/src/sgml/ref/checkpoint.sgml,v 1.17 2009/12/19 01:32:31 sriggs Exp $ -->
<refentry id="sql-checkpoint">
 <refmeta>
...@@ -42,6 +42,11 @@ CHECKPOINT
   <xref linkend="wal"> for more information about the WAL system.
  </para>
<para>
If executed during recovery, the <command>CHECKPOINT</command> command
will force a restartpoint rather than writing a new checkpoint.
</para>
  <para>
   Only superusers can call <command>CHECKPOINT</command>.  The command is
   not intended for use during normal operation.
......
...@@ -8,7 +8,7 @@
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * IDENTIFICATION
 *	  $PostgreSQL: pgsql/src/backend/access/gin/ginxlog.c,v 1.20 2009/12/19 01:32:31 sriggs Exp $
 *-------------------------------------------------------------------------
 */
#include "postgres.h"
...@@ -621,6 +621,10 @@ gin_redo(XLogRecPtr lsn, XLogRecord *record)
{
	uint8		info = record->xl_info & ~XLR_INFO_MASK;
/*
* GIN indexes do not require any conflict processing.
*/
	RestoreBkpBlocks(lsn, record, false);

	topCtx = MemoryContextSwitchTo(opCtx);
......
...@@ -8,7 +8,7 @@
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * IDENTIFICATION
 *	  $PostgreSQL: pgsql/src/backend/access/gist/gistxlog.c,v 1.33 2009/12/19 01:32:32 sriggs Exp $
 *-------------------------------------------------------------------------
 */
#include "postgres.h"
...@@ -396,6 +396,12 @@ gist_redo(XLogRecPtr lsn, XLogRecord *record)
	uint8		info = record->xl_info & ~XLR_INFO_MASK;
	MemoryContext oldCxt;
/*
	 * GIST indexes do not require any conflict processing.  NB: If we ever
	 * implement an optimization similar to the one we have in b-tree, and
	 * remove killed tuples outside VACUUM, we'll need to handle that here.
*/
	RestoreBkpBlocks(lsn, record, false);

	oldCxt = MemoryContextSwitchTo(opCtx);
......
...@@ -8,7 +8,7 @@
 *
 *
 * IDENTIFICATION
 *	  $PostgreSQL: pgsql/src/backend/access/heap/heapam.c,v 1.279 2009/12/19 01:32:32 sriggs Exp $
 *
 *
 * INTERFACE ROUTINES
...@@ -59,6 +59,7 @@
#include "storage/lmgr.h"
#include "storage/procarray.h"
#include "storage/smgr.h"
#include "storage/standby.h"
#include "utils/datum.h"
#include "utils/inval.h"
#include "utils/lsyscache.h"
...@@ -248,8 +249,11 @@ heapgetpage(HeapScanDesc scan, BlockNumber page)
	/*
	 * If the all-visible flag indicates that all tuples on the page are
	 * visible to everyone, we can skip the per-tuple visibility tests.
	 * But not in hot standby mode. A tuple that's already visible to all
	 * transactions in the master might still be invisible to a read-only
	 * transaction in the standby.
	 */
	all_visible = PageIsAllVisible(dp) && !snapshot->takenDuringRecovery;

	for (lineoff = FirstOffsetNumber, lpp = PageGetItemId(dp, lineoff);
		 lineoff <= lines;
...@@ -3769,6 +3773,60 @@ heap_restrpos(HeapScanDesc scan)
	}
}
/*
* If 'tuple' contains any XID greater than latestRemovedXid, update
* latestRemovedXid to the greatest one found.
*/
void
HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
TransactionId *latestRemovedXid)
{
TransactionId xmin = HeapTupleHeaderGetXmin(tuple);
TransactionId xmax = HeapTupleHeaderGetXmax(tuple);
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
if (tuple->t_infomask & HEAP_MOVED_OFF ||
tuple->t_infomask & HEAP_MOVED_IN)
{
if (TransactionIdPrecedes(*latestRemovedXid, xvac))
*latestRemovedXid = xvac;
}
if (TransactionIdPrecedes(*latestRemovedXid, xmax))
*latestRemovedXid = xmax;
if (TransactionIdPrecedes(*latestRemovedXid, xmin))
*latestRemovedXid = xmin;
Assert(TransactionIdIsValid(*latestRemovedXid));
}
/*
* Perform XLogInsert to register a heap cleanup info message. These
* messages are sent once per VACUUM and are required because
* of the phasing of removal operations during a lazy VACUUM.
* see comments for vacuum_log_cleanup_info().
*/
XLogRecPtr
log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
{
xl_heap_cleanup_info xlrec;
XLogRecPtr recptr;
XLogRecData rdata;
xlrec.node = rnode;
xlrec.latestRemovedXid = latestRemovedXid;
rdata.data = (char *) &xlrec;
rdata.len = SizeOfHeapCleanupInfo;
rdata.buffer = InvalidBuffer;
rdata.next = NULL;
recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_CLEANUP_INFO, &rdata);
return recptr;
}
/*
 * Perform XLogInsert for a heap-clean operation.  Caller must already
 * have modified the buffer and marked it dirty.
...@@ -3776,13 +3834,17 @@ heap_restrpos(HeapScanDesc scan)
 * Note: prior to Postgres 8.3, the entries in the nowunused[] array were
 * zero-based tuple indexes.  Now they are one-based like other uses
 * of OffsetNumber.
*
* We also include latestRemovedXid, which is the greatest XID present in
* the removed tuples. That allows recovery processing to cancel or wait
* for long standby queries that can still see these tuples.
 */
XLogRecPtr
log_heap_clean(Relation reln, Buffer buffer,
			   OffsetNumber *redirected, int nredirected,
			   OffsetNumber *nowdead, int ndead,
			   OffsetNumber *nowunused, int nunused,
			   TransactionId latestRemovedXid, bool redirect_move)
{
	xl_heap_clean xlrec;
	uint8		info;
...@@ -3794,6 +3856,7 @@ log_heap_clean(Relation reln, Buffer buffer,
	xlrec.node = reln->rd_node;
	xlrec.block = BufferGetBlockNumber(buffer);
	xlrec.latestRemovedXid = latestRemovedXid;
	xlrec.nredirected = nredirected;
	xlrec.ndead = ndead;
...@@ -4067,6 +4130,33 @@ log_newpage(RelFileNode *rnode, ForkNumber forkNum, BlockNumber blkno,
	return recptr;
}
/*
* Handles CLEANUP_INFO
*/
static void
heap_xlog_cleanup_info(XLogRecPtr lsn, XLogRecord *record)
{
xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) XLogRecGetData(record);
if (InHotStandby)
{
VirtualTransactionId *backends;
backends = GetConflictingVirtualXIDs(xlrec->latestRemovedXid,
InvalidOid,
true);
ResolveRecoveryConflictWithVirtualXIDs(backends,
"VACUUM index cleanup",
CONFLICT_MODE_ERROR);
}
	/*
	 * The actual operation is a no-op.  The record type exists to provide a
	 * means for conflict processing to occur before we begin index vacuum
	 * actions; see vacuumlazy.c and also comments in btvacuumpage().
	 */
}
/*
 * Handles CLEAN and CLEAN_MOVE record types
 */
...@@ -4085,12 +4175,31 @@ heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record, bool clean_move)
	int			nunused;
	Size		freespace;
	/*
	 * We're about to remove tuples.  In Hot Standby mode, ensure that there
	 * are no queries running for which the removed tuples are still visible.
	 */
if (InHotStandby)
{
VirtualTransactionId *backends;
backends = GetConflictingVirtualXIDs(xlrec->latestRemovedXid,
InvalidOid,
true);
ResolveRecoveryConflictWithVirtualXIDs(backends,
"VACUUM heap cleanup",
CONFLICT_MODE_ERROR);
}
RestoreBkpBlocks(lsn, record, true);
	if (record->xl_info & XLR_BKP_BLOCK_1)
		return;

	buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block, RBM_NORMAL);
	if (!BufferIsValid(buffer))
		return;
	LockBufferForCleanup(buffer);
	page = (Page) BufferGetPage(buffer);

	if (XLByteLE(lsn, PageGetLSN(page)))
...@@ -4145,12 +4254,40 @@ heap_xlog_freeze(XLogRecPtr lsn, XLogRecord *record)
	Buffer		buffer;
	Page		page;
	/*
	 * In Hot Standby mode, ensure that there are no queries running which
	 * still consider the frozen xids as running.
	 */
if (InHotStandby)
{
VirtualTransactionId *backends;
/*
* XXX: Using cutoff_xid is overly conservative. Even if cutoff_xid
* is recent enough to conflict with a backend, the actual values
* being frozen might not be. With a typical vacuum_freeze_min_age
* setting in the ballpark of millions of transactions, it won't make
* a difference, but it might if you run a manual VACUUM FREEZE.
* Typically the cutoff is much earlier than any recently deceased
* tuple versions removed by this vacuum, so don't worry too much.
*/
backends = GetConflictingVirtualXIDs(cutoff_xid,
InvalidOid,
true);
ResolveRecoveryConflictWithVirtualXIDs(backends,
"VACUUM heap freeze",
CONFLICT_MODE_ERROR);
}
RestoreBkpBlocks(lsn, record, false);
	if (record->xl_info & XLR_BKP_BLOCK_1)
		return;

	buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block, RBM_NORMAL);
	if (!BufferIsValid(buffer))
		return;
	LockBufferForCleanup(buffer);
	page = (Page) BufferGetPage(buffer);

	if (XLByteLE(lsn, PageGetLSN(page)))
...@@ -4740,6 +4877,11 @@ heap_redo(XLogRecPtr lsn, XLogRecord *record)
{
	uint8		info = record->xl_info & ~XLR_INFO_MASK;
/*
* These operations don't overwrite MVCC data so no conflict
* processing is required. The ones in heap2 rmgr do.
*/
	RestoreBkpBlocks(lsn, record, false);

	switch (info & XLOG_HEAP_OPMASK)
...@@ -4778,20 +4920,25 @@ heap2_redo(XLogRecPtr lsn, XLogRecord *record)
{
	uint8		info = record->xl_info & ~XLR_INFO_MASK;
/*
* Note that RestoreBkpBlocks() is called after conflict processing
* within each record type handling function.
*/
	switch (info & XLOG_HEAP_OPMASK)
	{
		case XLOG_HEAP2_FREEZE:
			heap_xlog_freeze(lsn, record);
			break;
		case XLOG_HEAP2_CLEAN:
			heap_xlog_clean(lsn, record, false);
			break;
		case XLOG_HEAP2_CLEAN_MOVE:
			heap_xlog_clean(lsn, record, true);
			break;
case XLOG_HEAP2_CLEANUP_INFO:
heap_xlog_cleanup_info(lsn, record);
break;
		default:
			elog(PANIC, "heap2_redo: unknown op code %u", info);
	}
...@@ -4921,17 +5068,26 @@ heap2_desc(StringInfo buf, uint8 xl_info, char *rec)
	{
		xl_heap_clean *xlrec = (xl_heap_clean *) rec;
		appendStringInfo(buf, "clean: rel %u/%u/%u; blk %u remxid %u",
						 xlrec->node.spcNode, xlrec->node.dbNode,
						 xlrec->node.relNode, xlrec->block,
						 xlrec->latestRemovedXid);
	}
	else if (info == XLOG_HEAP2_CLEAN_MOVE)
	{
		xl_heap_clean *xlrec = (xl_heap_clean *) rec;

		appendStringInfo(buf, "clean_move: rel %u/%u/%u; blk %u remxid %u",
						 xlrec->node.spcNode, xlrec->node.dbNode,
						 xlrec->node.relNode, xlrec->block,
						 xlrec->latestRemovedXid);
}
else if (info == XLOG_HEAP2_CLEANUP_INFO)
{
xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) rec;
appendStringInfo(buf, "cleanup info: remxid %u",
xlrec->latestRemovedXid);
	}
	else
		appendStringInfo(buf, "UNKNOWN");
......
...@@ -8,7 +8,7 @@
 *
 *
 * IDENTIFICATION
 *	  $PostgreSQL: pgsql/src/backend/access/heap/pruneheap.c,v 1.19 2009/12/19 01:32:32 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
...@@ -30,6 +30,7 @@
typedef struct
{
	TransactionId new_prune_xid;	/* new prune hint value for page */
	TransactionId latestRemovedXid;		/* latest xid to be removed by this prune */
	int			nredirected;	/* numbers of entries in arrays below */
	int			ndead;
	int			nunused;
...@@ -84,6 +85,14 @@ heap_page_prune_opt(Relation relation, Buffer buffer, TransactionId OldestXmin)
	if (!PageIsPrunable(page, OldestXmin))
		return;
/*
* We can't write WAL in recovery mode, so there's no point trying to
* clean the page. The master will likely issue a cleaning WAL record
* soon anyway, so this is no particular loss.
*/
if (RecoveryInProgress())
return;
	/*
	 * We prune when a previous UPDATE failed to find enough space on the page
	 * for a new tuple version, or when free space falls below the relation's
...@@ -176,6 +185,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
	 * of our working state.
	 */
	prstate.new_prune_xid = InvalidTransactionId;
	prstate.latestRemovedXid = InvalidTransactionId;
	prstate.nredirected = prstate.ndead = prstate.nunused = 0;
	memset(prstate.marked, 0, sizeof(prstate.marked));
...@@ -257,7 +267,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
								prstate.redirected, prstate.nredirected,
								prstate.nowdead, prstate.ndead,
								prstate.nowunused, prstate.nunused,
								prstate.latestRemovedXid, redirect_move);
		PageSetLSN(BufferGetPage(buffer), recptr);
		PageSetTLI(BufferGetPage(buffer), ThisTimeLineID);
...@@ -395,6 +405,8 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
			 == HEAPTUPLE_DEAD && !HeapTupleHeaderIsHotUpdated(htup))
		{
			heap_prune_record_unused(prstate, rootoffnum);
			HeapTupleHeaderAdvanceLatestRemovedXid(htup,
												   &prstate->latestRemovedXid);
			ndeleted++;
		}
...@@ -520,7 +532,11 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
	 * find another DEAD tuple is a fairly unusual corner case.)
	 */
	if (tupdead)
{
		latestdead = offnum;
		HeapTupleHeaderAdvanceLatestRemovedXid(htup,
											   &prstate->latestRemovedXid);
	}
	else if (!recent_dead)
		break;
......
...@@ -8,7 +8,7 @@
 *
 *
 * IDENTIFICATION
 *	  $PostgreSQL: pgsql/src/backend/access/index/genam.c,v 1.78 2009/12/19 01:32:32 sriggs Exp $
 *
 * NOTES
 *	  many of the old access method routines have been turned into
...@@ -91,8 +91,19 @@ RelationGetIndexScan(Relation indexRelation,
	else
		scan->keyData = NULL;
/*
* During recovery we ignore killed tuples and don't bother to kill them
* either. We do this because the xmin on the primary node could easily
* be later than the xmin on the standby node, so that what the primary
* thinks is killed is supposed to be visible on standby. So for correct
* MVCC for queries during recovery we must ignore these hints and check
* all tuples. Do *not* set ignore_killed_tuples to true when running
* in a transaction that was started during recovery.
* xactStartedInRecovery should not be altered by index AMs.
*/
	scan->kill_prior_tuple = false;
	scan->xactStartedInRecovery = TransactionStartedDuringRecovery();
	scan->ignore_killed_tuples = !scan->xactStartedInRecovery;
	scan->opaque = NULL;
......
...@@ -8,7 +8,7 @@
 *
 *
 * IDENTIFICATION
 *	  $PostgreSQL: pgsql/src/backend/access/index/indexam.c,v 1.116 2009/12/19 01:32:32 sriggs Exp $
 *
 * INTERFACE ROUTINES
 *		index_open		- open an index relation by relation OID
...@@ -455,8 +455,11 @@ index_getnext(IndexScanDesc scan, ScanDirection direction) ...@@ -455,8 +455,11 @@ index_getnext(IndexScanDesc scan, ScanDirection direction)
/* /*
* If we scanned a whole HOT chain and found only dead tuples, * If we scanned a whole HOT chain and found only dead tuples,
* tell index AM to kill its entry for that TID. * tell index AM to kill its entry for that TID. We do not do
* this when in recovery because it may violate MVCC to do so.
 * See the comments in RelationGetIndexScan().
*/ */
if (!scan->xactStartedInRecovery)
scan->kill_prior_tuple = scan->xs_hot_dead; scan->kill_prior_tuple = scan->xs_hot_dead;
/* /*
......
$PostgreSQL: pgsql/src/backend/access/nbtree/README,v 1.20 2008/03/21 13:23:27 momjian Exp $ $PostgreSQL: pgsql/src/backend/access/nbtree/README,v 1.21 2009/12/19 01:32:32 sriggs Exp $
Btree Indexing Btree Indexing
============== ==============
...@@ -401,6 +401,33 @@ of the WAL entry.) If the parent page becomes half-dead but is not ...@@ -401,6 +401,33 @@ of the WAL entry.) If the parent page becomes half-dead but is not
immediately deleted due to a subsequent crash, there is no loss of immediately deleted due to a subsequent crash, there is no loss of
consistency, and the empty page will be picked up by the next VACUUM. consistency, and the empty page will be picked up by the next VACUUM.
Scans during Recovery
---------------------
The btree index type can be safely used during recovery. During recovery
we have at most one writer and potentially many readers. In that
situation the locking requirements can be relaxed and we do not need
double locking during block splits. Each WAL record makes changes to a
single level of the btree using the correct locking sequence and so
is safe for concurrent readers. Some readers may observe a block split
in progress as they descend the tree, but they will simply move right
onto the correct page.
During recovery all index scans start with ignore_killed_tuples = false
and we never set kill_prior_tuple. We do this because the oldest xmin
on the standby server can be older than the oldest xmin on the master
server, which means tuples can be marked as killed even when they are
still visible on the standby. We don't WAL-log tuple killed bits, but
they can still appear on the standby because of full-page writes. So
we must always ignore them on the standby, and that means it's not worth
setting them either.
Note that we talk about scans that are started during recovery. We go to
a little trouble to allow a scan to start during recovery and end during
normal running after recovery has completed. This is a key capability
because it allows running applications to continue while the standby
changes state into a normally running server.
Other Things That Are Handy to Know Other Things That Are Handy to Know
----------------------------------- -----------------------------------
......
...@@ -8,7 +8,7 @@ ...@@ -8,7 +8,7 @@
* *
* *
* IDENTIFICATION * IDENTIFICATION
* $PostgreSQL: pgsql/src/backend/access/nbtree/nbtinsert.c,v 1.174 2009/10/02 21:14:04 tgl Exp $ * $PostgreSQL: pgsql/src/backend/access/nbtree/nbtinsert.c,v 1.175 2009/12/19 01:32:32 sriggs Exp $
* *
*------------------------------------------------------------------------- *-------------------------------------------------------------------------
*/ */
...@@ -2025,7 +2025,7 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer) ...@@ -2025,7 +2025,7 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer)
} }
if (ndeletable > 0) if (ndeletable > 0)
_bt_delitems(rel, buffer, deletable, ndeletable); _bt_delitems(rel, buffer, deletable, ndeletable, false, 0);
/* /*
* Note: if we didn't find any LP_DEAD items, then the page's * Note: if we didn't find any LP_DEAD items, then the page's
......
...@@ -9,7 +9,7 @@ ...@@ -9,7 +9,7 @@
* *
* *
* IDENTIFICATION * IDENTIFICATION
* $PostgreSQL: pgsql/src/backend/access/nbtree/nbtpage.c,v 1.113 2009/05/05 19:02:22 tgl Exp $ * $PostgreSQL: pgsql/src/backend/access/nbtree/nbtpage.c,v 1.114 2009/12/19 01:32:33 sriggs Exp $
* *
* NOTES * NOTES
* Postgres btree pages look like ordinary relation pages. The opaque * Postgres btree pages look like ordinary relation pages. The opaque
...@@ -653,18 +653,32 @@ _bt_page_recyclable(Page page) ...@@ -653,18 +653,32 @@ _bt_page_recyclable(Page page)
* *
* This routine assumes that the caller has pinned and locked the buffer. * This routine assumes that the caller has pinned and locked the buffer.
* Also, the given itemnos *must* appear in increasing order in the array. * Also, the given itemnos *must* appear in increasing order in the array.
*
* We record VACUUMs and b-tree deletes differently in WAL. InHotStandby
* we need to be able to pin all of the blocks in the btree in physical
* order when replaying the effects of a VACUUM, just as we do for the
 * original VACUUM itself. lastBlockVacuumed allows us to tell which
 * intermediate blocks VACUUM left entirely unchanged, but which must
 * still be scanned during replay. We always write a WAL record
* for the last block in the index, whether or not it contained any items
* to be removed. This allows us to scan right up to end of index to
* ensure correct locking.
*/ */
void void
_bt_delitems(Relation rel, Buffer buf, _bt_delitems(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems) OffsetNumber *itemnos, int nitems, bool isVacuum,
BlockNumber lastBlockVacuumed)
{ {
Page page = BufferGetPage(buf); Page page = BufferGetPage(buf);
BTPageOpaque opaque; BTPageOpaque opaque;
Assert(isVacuum || lastBlockVacuumed == 0);
/* No ereport(ERROR) until changes are logged */ /* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION(); START_CRIT_SECTION();
/* Fix the page */ /* Fix the page */
if (nitems > 0)
PageIndexMultiDelete(page, itemnos, nitems); PageIndexMultiDelete(page, itemnos, nitems);
/* /*
...@@ -688,15 +702,36 @@ _bt_delitems(Relation rel, Buffer buf, ...@@ -688,15 +702,36 @@ _bt_delitems(Relation rel, Buffer buf,
/* XLOG stuff */ /* XLOG stuff */
if (!rel->rd_istemp) if (!rel->rd_istemp)
{ {
xl_btree_delete xlrec;
XLogRecPtr recptr; XLogRecPtr recptr;
XLogRecData rdata[2]; XLogRecData rdata[2];
xlrec.node = rel->rd_node; if (isVacuum)
xlrec.block = BufferGetBlockNumber(buf); {
xl_btree_vacuum xlrec_vacuum;
xlrec_vacuum.node = rel->rd_node;
xlrec_vacuum.block = BufferGetBlockNumber(buf);
rdata[0].data = (char *) &xlrec; xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
rdata[0].data = (char *) &xlrec_vacuum;
rdata[0].len = SizeOfBtreeVacuum;
}
else
{
xl_btree_delete xlrec_delete;
xlrec_delete.node = rel->rd_node;
xlrec_delete.block = BufferGetBlockNumber(buf);
/*
* XXX: We would like to set an accurate latestRemovedXid, but
* there is no easy way of obtaining a useful value. So we punt
* and store InvalidTransactionId, which forces the standby to
* wait for/cancel all currently running transactions.
*/
xlrec_delete.latestRemovedXid = InvalidTransactionId;
rdata[0].data = (char *) &xlrec_delete;
rdata[0].len = SizeOfBtreeDelete; rdata[0].len = SizeOfBtreeDelete;
}
rdata[0].buffer = InvalidBuffer; rdata[0].buffer = InvalidBuffer;
rdata[0].next = &(rdata[1]); rdata[0].next = &(rdata[1]);
...@@ -719,6 +754,9 @@ _bt_delitems(Relation rel, Buffer buf, ...@@ -719,6 +754,9 @@ _bt_delitems(Relation rel, Buffer buf,
rdata[1].buffer_std = true; rdata[1].buffer_std = true;
rdata[1].next = NULL; rdata[1].next = NULL;
if (isVacuum)
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM, rdata);
else
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DELETE, rdata); recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DELETE, rdata);
PageSetLSN(page, recptr); PageSetLSN(page, recptr);
......
...@@ -12,7 +12,7 @@ ...@@ -12,7 +12,7 @@
* Portions Copyright (c) 1994, Regents of the University of California * Portions Copyright (c) 1994, Regents of the University of California
* *
* IDENTIFICATION * IDENTIFICATION
* $PostgreSQL: pgsql/src/backend/access/nbtree/nbtree.c,v 1.172 2009/07/29 20:56:18 tgl Exp $ * $PostgreSQL: pgsql/src/backend/access/nbtree/nbtree.c,v 1.173 2009/12/19 01:32:33 sriggs Exp $
* *
*------------------------------------------------------------------------- *-------------------------------------------------------------------------
*/ */
...@@ -57,7 +57,8 @@ typedef struct ...@@ -57,7 +57,8 @@ typedef struct
IndexBulkDeleteCallback callback; IndexBulkDeleteCallback callback;
void *callback_state; void *callback_state;
BTCycleId cycleid; BTCycleId cycleid;
BlockNumber lastUsedPage; BlockNumber lastBlockVacuumed; /* last blkno reached by Vacuum scan */
BlockNumber lastUsedPage; /* blkno of last non-recyclable page */
BlockNumber totFreePages; /* true total # of free pages */ BlockNumber totFreePages; /* true total # of free pages */
MemoryContext pagedelcontext; MemoryContext pagedelcontext;
} BTVacState; } BTVacState;
...@@ -629,6 +630,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats, ...@@ -629,6 +630,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
vstate.callback = callback; vstate.callback = callback;
vstate.callback_state = callback_state; vstate.callback_state = callback_state;
vstate.cycleid = cycleid; vstate.cycleid = cycleid;
vstate.lastBlockVacuumed = BTREE_METAPAGE; /* Initialise at first block */
vstate.lastUsedPage = BTREE_METAPAGE; vstate.lastUsedPage = BTREE_METAPAGE;
vstate.totFreePages = 0; vstate.totFreePages = 0;
...@@ -705,6 +707,32 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats, ...@@ -705,6 +707,32 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
num_pages = new_pages; num_pages = new_pages;
} }
/*
* InHotStandby we need to scan right up to the end of the index for
* correct locking, so we may need to write a WAL record for the final
* block in the index if it was not vacuumed. It's possible that VACUUMing
* has actually removed zeroed pages at the end of the index so we need to
 * take care to issue the record for the last actual block and not for the
* last block that was scanned. Ignore empty indexes.
*/
if (XLogStandbyInfoActive() &&
num_pages > 1 && vstate.lastBlockVacuumed < (num_pages - 1))
{
Buffer buf;
/*
* We can't use _bt_getbuf() here because it always applies
* _bt_checkpage(), which will barf on an all-zero page. We want to
* recycle all-zero pages, not fail. Also, we want to use a nondefault
* buffer access strategy.
*/
buf = ReadBufferExtended(rel, MAIN_FORKNUM, num_pages - 1, RBM_NORMAL,
info->strategy);
LockBufferForCleanup(buf);
_bt_delitems(rel, buf, NULL, 0, true, vstate.lastBlockVacuumed);
_bt_relbuf(rel, buf);
}
MemoryContextDelete(vstate.pagedelcontext); MemoryContextDelete(vstate.pagedelcontext);
/* update statistics */ /* update statistics */
...@@ -847,6 +875,26 @@ restart: ...@@ -847,6 +875,26 @@ restart:
itup = (IndexTuple) PageGetItem(page, itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum)); PageGetItemId(page, offnum));
htup = &(itup->t_tid); htup = &(itup->t_tid);
/*
* During Hot Standby we currently assume that XLOG_BTREE_VACUUM
* records do not produce conflicts. That is only true as long
* as the callback function depends only upon whether the index
* tuple refers to heap tuples removed in the initial heap scan.
* When vacuum starts it derives a value of OldestXmin. Backends
* taking later snapshots could have a RecentGlobalXmin with a
* later xid than the vacuum's OldestXmin, so it is possible that
* row versions deleted after OldestXmin could be marked as killed
* by other backends. The callback function *could* look at the
* index tuple state in isolation and decide to delete the index
* tuple, though currently it does not. If it ever did, we would
* need to reconsider whether XLOG_BTREE_VACUUM records should
* cause conflicts. If they did cause conflicts they would be
* fairly harsh conflicts, since we haven't yet worked out a way
* to pass a useful value for latestRemovedXid on the
* XLOG_BTREE_VACUUM records. This applies to *any* type of index
* that marks index tuples as killed.
*/
if (callback(htup, callback_state)) if (callback(htup, callback_state))
deletable[ndeletable++] = offnum; deletable[ndeletable++] = offnum;
} }
...@@ -858,7 +906,19 @@ restart: ...@@ -858,7 +906,19 @@ restart:
*/ */
if (ndeletable > 0) if (ndeletable > 0)
{ {
_bt_delitems(rel, buf, deletable, ndeletable); BlockNumber lastBlockVacuumed = BufferGetBlockNumber(buf);
_bt_delitems(rel, buf, deletable, ndeletable, true, vstate->lastBlockVacuumed);
/*
 * Keep track of the block number of the last block vacuumed, so
 * that WAL replay can scan the intervening blocks as well. This
 * provides concurrency protection and allows btrees to be used
 * while in recovery.
*/
if (lastBlockVacuumed > vstate->lastBlockVacuumed)
vstate->lastBlockVacuumed = lastBlockVacuumed;
stats->tuples_removed += ndeletable; stats->tuples_removed += ndeletable;
/* must recompute maxoff */ /* must recompute maxoff */
maxoff = PageGetMaxOffsetNumber(page); maxoff = PageGetMaxOffsetNumber(page);
......
...@@ -8,7 +8,7 @@ ...@@ -8,7 +8,7 @@
* Portions Copyright (c) 1994, Regents of the University of California * Portions Copyright (c) 1994, Regents of the University of California
* *
* IDENTIFICATION * IDENTIFICATION
* $PostgreSQL: pgsql/src/backend/access/nbtree/nbtxlog.c,v 1.55 2009/06/11 14:48:54 momjian Exp $ * $PostgreSQL: pgsql/src/backend/access/nbtree/nbtxlog.c,v 1.56 2009/12/19 01:32:33 sriggs Exp $
* *
*------------------------------------------------------------------------- *-------------------------------------------------------------------------
*/ */
...@@ -16,7 +16,11 @@ ...@@ -16,7 +16,11 @@
#include "access/nbtree.h" #include "access/nbtree.h"
#include "access/transam.h" #include "access/transam.h"
#include "access/xact.h"
#include "storage/bufmgr.h" #include "storage/bufmgr.h"
#include "storage/procarray.h"
#include "storage/standby.h"
#include "miscadmin.h"
/* /*
* We must keep track of expected insertions due to page splits, and apply * We must keep track of expected insertions due to page splits, and apply
...@@ -458,6 +462,97 @@ btree_xlog_split(bool onleft, bool isroot, ...@@ -458,6 +462,97 @@ btree_xlog_split(bool onleft, bool isroot,
xlrec->leftsib, xlrec->rightsib, isroot); xlrec->leftsib, xlrec->rightsib, isroot);
} }
static void
btree_xlog_vacuum(XLogRecPtr lsn, XLogRecord *record)
{
xl_btree_vacuum *xlrec;
Buffer buffer;
Page page;
BTPageOpaque opaque;
xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
/*
 * If queries might be active then we need to ensure that every block
 * between lastBlockVacuumed and the current block, if any, is unpinned.
* This ensures that every block in the index is touched during VACUUM as
* required to ensure scans work correctly.
*/
if (standbyState == STANDBY_SNAPSHOT_READY &&
(xlrec->lastBlockVacuumed + 1) != xlrec->block)
{
BlockNumber blkno = xlrec->lastBlockVacuumed + 1;
for (; blkno < xlrec->block; blkno++)
{
/*
* XXX we don't actually need to read the block, we
* just need to confirm it is unpinned. If we had a special call
* into the buffer manager we could optimise this so that
* if the block is not in shared_buffers we confirm it as unpinned.
*
* Another simple optimization would be to check if there's any
* backends running; if not, we could just skip this.
*/
buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, blkno, RBM_NORMAL);
if (BufferIsValid(buffer))
{
LockBufferForCleanup(buffer);
UnlockReleaseBuffer(buffer);
}
}
}
/*
* If the block was restored from a full page image, nothing more to do.
* The RestoreBkpBlocks() call already pinned and took cleanup lock on
* it. XXX: Perhaps we should call RestoreBkpBlocks() *after* the loop
* above, to make the disk access more sequential.
*/
if (record->xl_info & XLR_BKP_BLOCK_1)
return;
/*
* Like in btvacuumpage(), we need to take a cleanup lock on every leaf
* page. See nbtree/README for details.
*/
buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block, RBM_NORMAL);
if (!BufferIsValid(buffer))
return;
LockBufferForCleanup(buffer);
page = (Page) BufferGetPage(buffer);
if (XLByteLE(lsn, PageGetLSN(page)))
{
UnlockReleaseBuffer(buffer);
return;
}
if (record->xl_len > SizeOfBtreeVacuum)
{
OffsetNumber *unused;
OffsetNumber *unend;
unused = (OffsetNumber *) ((char *) xlrec + SizeOfBtreeVacuum);
unend = (OffsetNumber *) ((char *) xlrec + record->xl_len);
if ((unend - unused) > 0)
PageIndexMultiDelete(page, unused, unend - unused);
}
/*
* Mark the page as not containing any LP_DEAD items --- see comments in
* _bt_delitems().
*/
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
opaque->btpo_flags &= ~BTP_HAS_GARBAGE;
PageSetLSN(page, lsn);
PageSetTLI(page, ThisTimeLineID);
MarkBufferDirty(buffer);
UnlockReleaseBuffer(buffer);
}
static void static void
btree_xlog_delete(XLogRecPtr lsn, XLogRecord *record) btree_xlog_delete(XLogRecPtr lsn, XLogRecord *record)
{ {
...@@ -470,6 +565,11 @@ btree_xlog_delete(XLogRecPtr lsn, XLogRecord *record) ...@@ -470,6 +565,11 @@ btree_xlog_delete(XLogRecPtr lsn, XLogRecord *record)
return; return;
xlrec = (xl_btree_delete *) XLogRecGetData(record); xlrec = (xl_btree_delete *) XLogRecGetData(record);
/*
* We don't need to take a cleanup lock to apply these changes.
* See nbtree/README for details.
*/
buffer = XLogReadBuffer(xlrec->node, xlrec->block, false); buffer = XLogReadBuffer(xlrec->node, xlrec->block, false);
if (!BufferIsValid(buffer)) if (!BufferIsValid(buffer))
return; return;
...@@ -714,7 +814,43 @@ btree_redo(XLogRecPtr lsn, XLogRecord *record) ...@@ -714,7 +814,43 @@ btree_redo(XLogRecPtr lsn, XLogRecord *record)
{ {
uint8 info = record->xl_info & ~XLR_INFO_MASK; uint8 info = record->xl_info & ~XLR_INFO_MASK;
RestoreBkpBlocks(lsn, record, false); /*
* Btree delete records can conflict with standby queries. You might
* think that vacuum records would conflict as well, but we've handled
* that already. XLOG_HEAP2_CLEANUP_INFO records provide the highest xid
* cleaned by the vacuum of the heap and so we can resolve any conflicts
	 * just once when that record arrives. After that we know that no conflicts
* exist from individual btree vacuum records on that index.
*/
if (InHotStandby)
{
if (info == XLOG_BTREE_DELETE)
{
xl_btree_delete *xlrec = (xl_btree_delete *) XLogRecGetData(record);
VirtualTransactionId *backends;
/*
* XXX Currently we put everybody on death row, because
* currently _bt_delitems() supplies InvalidTransactionId.
* This can be fairly painful, so providing a better value
* here is worth some thought and possibly some effort to
* improve.
*/
backends = GetConflictingVirtualXIDs(xlrec->latestRemovedXid,
InvalidOid,
true);
ResolveRecoveryConflictWithVirtualXIDs(backends,
"b-tree delete",
CONFLICT_MODE_ERROR);
}
}
/*
	 * Vacuum needs to pin and take a cleanup lock on every leaf page;
	 * a regular exclusive lock is enough for all other purposes.
*/
RestoreBkpBlocks(lsn, record, (info == XLOG_BTREE_VACUUM));
switch (info) switch (info)
{ {
...@@ -739,6 +875,9 @@ btree_redo(XLogRecPtr lsn, XLogRecord *record) ...@@ -739,6 +875,9 @@ btree_redo(XLogRecPtr lsn, XLogRecord *record)
case XLOG_BTREE_SPLIT_R_ROOT: case XLOG_BTREE_SPLIT_R_ROOT:
btree_xlog_split(false, true, lsn, record); btree_xlog_split(false, true, lsn, record);
break; break;
case XLOG_BTREE_VACUUM:
btree_xlog_vacuum(lsn, record);
break;
case XLOG_BTREE_DELETE: case XLOG_BTREE_DELETE:
btree_xlog_delete(lsn, record); btree_xlog_delete(lsn, record);
break; break;
...@@ -843,13 +982,24 @@ btree_desc(StringInfo buf, uint8 xl_info, char *rec) ...@@ -843,13 +982,24 @@ btree_desc(StringInfo buf, uint8 xl_info, char *rec)
xlrec->level, xlrec->firstright); xlrec->level, xlrec->firstright);
break; break;
} }
case XLOG_BTREE_VACUUM:
{
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
appendStringInfo(buf, "vacuum: rel %u/%u/%u; blk %u, lastBlockVacuumed %u",
xlrec->node.spcNode, xlrec->node.dbNode,
xlrec->node.relNode, xlrec->block,
xlrec->lastBlockVacuumed);
break;
}
case XLOG_BTREE_DELETE: case XLOG_BTREE_DELETE:
{ {
xl_btree_delete *xlrec = (xl_btree_delete *) rec; xl_btree_delete *xlrec = (xl_btree_delete *) rec;
appendStringInfo(buf, "delete: rel %u/%u/%u; blk %u", appendStringInfo(buf, "delete: rel %u/%u/%u; blk %u, latestRemovedXid %u",
xlrec->node.spcNode, xlrec->node.dbNode, xlrec->node.spcNode, xlrec->node.dbNode,
xlrec->node.relNode, xlrec->block); xlrec->node.relNode, xlrec->block,
xlrec->latestRemovedXid);
break; break;
} }
case XLOG_BTREE_DELETE_PAGE: case XLOG_BTREE_DELETE_PAGE:
......
$PostgreSQL: pgsql/src/backend/access/transam/README,v 1.12 2008/10/20 19:18:18 alvherre Exp $ $PostgreSQL: pgsql/src/backend/access/transam/README,v 1.13 2009/12/19 01:32:33 sriggs Exp $
The Transaction System The Transaction System
====================== ======================
...@@ -649,3 +649,34 @@ fsync it down to disk without any sort of interlock, as soon as it finishes ...@@ -649,3 +649,34 @@ fsync it down to disk without any sort of interlock, as soon as it finishes
the bulk update. However, all these paths are designed to write data that the bulk update. However, all these paths are designed to write data that
no other transaction can see until after T1 commits. The situation is thus no other transaction can see until after T1 commits. The situation is thus
not different from ordinary WAL-logged updates. not different from ordinary WAL-logged updates.
Transaction Emulation during Recovery
-------------------------------------
During Recovery we replay transaction changes in the order they occurred.
As part of this replay we emulate some transactional behaviour, so that
read only backends can take MVCC snapshots. We do this by maintaining a
list of XIDs belonging to transactions that are being replayed, so that
each transaction that has recorded WAL records for database writes exists
in the array until it commits. Further details are given in comments in
procarray.c.
Many actions write no WAL records at all, for example read only transactions.
These have no effect on MVCC in recovery and we can pretend they never
occurred at all. Subtransaction commit does not write a WAL record either
and has very little effect, since lock waiters need to wait for the
parent transaction to complete.
Not all transactional behaviour is emulated, for example we do not insert
a transaction entry into the lock table, nor do we maintain the transaction
stack in memory. Clog entries are made normally. Multixact is not maintained
because its purpose is to record tuple-level locks that an application has
requested to prevent write locks. Since write locks cannot be obtained at all
during recovery, there is never any conflict and so there is no reason to
update multixact.
Subtrans is maintained during recovery but the details of the transaction
tree are ignored and all subtransactions reference the top-level TransactionId
directly. Since commit is atomic this provides correct lock wait behaviour
yet simplifies emulation of subtransactions considerably.
Further details on locking mechanics in recovery are given in comments
with the Lock rmgr code.
...@@ -26,7 +26,7 @@ ...@@ -26,7 +26,7 @@
* Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California * Portions Copyright (c) 1994, Regents of the University of California
* *
* $PostgreSQL: pgsql/src/backend/access/transam/clog.c,v 1.53 2009/06/11 14:48:54 momjian Exp $ * $PostgreSQL: pgsql/src/backend/access/transam/clog.c,v 1.54 2009/12/19 01:32:33 sriggs Exp $
* *
*------------------------------------------------------------------------- *-------------------------------------------------------------------------
*/ */
...@@ -574,7 +574,7 @@ ExtendCLOG(TransactionId newestXact) ...@@ -574,7 +574,7 @@ ExtendCLOG(TransactionId newestXact)
LWLockAcquire(CLogControlLock, LW_EXCLUSIVE); LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
/* Zero the page and make an XLOG entry about it */ /* Zero the page and make an XLOG entry about it */
ZeroCLOGPage(pageno, true); ZeroCLOGPage(pageno, !InRecovery);
LWLockRelease(CLogControlLock); LWLockRelease(CLogControlLock);
} }
......
...@@ -42,7 +42,7 @@ ...@@ -42,7 +42,7 @@
* Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California * Portions Copyright (c) 1994, Regents of the University of California
* *
* $PostgreSQL: pgsql/src/backend/access/transam/multixact.c,v 1.32 2009/11/23 09:58:36 heikki Exp $ * $PostgreSQL: pgsql/src/backend/access/transam/multixact.c,v 1.33 2009/12/19 01:32:33 sriggs Exp $
* *
*------------------------------------------------------------------------- *-------------------------------------------------------------------------
*/ */
...@@ -59,6 +59,7 @@ ...@@ -59,6 +59,7 @@
#include "storage/backendid.h" #include "storage/backendid.h"
#include "storage/lmgr.h" #include "storage/lmgr.h"
#include "storage/procarray.h" #include "storage/procarray.h"
#include "utils/builtins.h"
#include "utils/memutils.h" #include "utils/memutils.h"
...@@ -220,7 +221,6 @@ static MultiXactId GetNewMultiXactId(int nxids, MultiXactOffset *offset); ...@@ -220,7 +221,6 @@ static MultiXactId GetNewMultiXactId(int nxids, MultiXactOffset *offset);
static MultiXactId mXactCacheGetBySet(int nxids, TransactionId *xids); static MultiXactId mXactCacheGetBySet(int nxids, TransactionId *xids);
static int mXactCacheGetById(MultiXactId multi, TransactionId **xids); static int mXactCacheGetById(MultiXactId multi, TransactionId **xids);
static void mXactCachePut(MultiXactId multi, int nxids, TransactionId *xids); static void mXactCachePut(MultiXactId multi, int nxids, TransactionId *xids);
static int xidComparator(const void *arg1, const void *arg2);
#ifdef MULTIXACT_DEBUG #ifdef MULTIXACT_DEBUG
static char *mxid_to_string(MultiXactId multi, int nxids, TransactionId *xids); static char *mxid_to_string(MultiXactId multi, int nxids, TransactionId *xids);
...@@ -1221,27 +1221,6 @@ mXactCachePut(MultiXactId multi, int nxids, TransactionId *xids) ...@@ -1221,27 +1221,6 @@ mXactCachePut(MultiXactId multi, int nxids, TransactionId *xids)
MXactCache = entry; MXactCache = entry;
} }
/*
* xidComparator
* qsort comparison function for XIDs
*
* We don't need to use wraparound comparison for XIDs, and indeed must
* not do so since that does not respect the triangle inequality! Any
* old sort order will do.
*/
static int
xidComparator(const void *arg1, const void *arg2)
{
TransactionId xid1 = *(const TransactionId *) arg1;
TransactionId xid2 = *(const TransactionId *) arg2;
if (xid1 > xid2)
return 1;
if (xid1 < xid2)
return -1;
return 0;
}
#ifdef MULTIXACT_DEBUG #ifdef MULTIXACT_DEBUG
static char * static char *
mxid_to_string(MultiXactId multi, int nxids, TransactionId *xids) mxid_to_string(MultiXactId multi, int nxids, TransactionId *xids)
...@@ -2051,11 +2030,18 @@ multixact_redo(XLogRecPtr lsn, XLogRecord *record) ...@@ -2051,11 +2030,18 @@ multixact_redo(XLogRecPtr lsn, XLogRecord *record)
if (TransactionIdPrecedes(max_xid, xids[i])) if (TransactionIdPrecedes(max_xid, xids[i]))
max_xid = xids[i]; max_xid = xids[i];
} }
	/*
	 * We don't expect anyone else to modify nextXid, hence the startup
	 * process doesn't need to hold a lock while checking this. We still
	 * acquire the lock to modify it, though.
	 */
if (TransactionIdFollowsOrEquals(max_xid, if (TransactionIdFollowsOrEquals(max_xid,
ShmemVariableCache->nextXid)) ShmemVariableCache->nextXid))
{ {
LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
ShmemVariableCache->nextXid = max_xid; ShmemVariableCache->nextXid = max_xid;
TransactionIdAdvance(ShmemVariableCache->nextXid); TransactionIdAdvance(ShmemVariableCache->nextXid);
LWLockRelease(XidGenLock);
} }
} }
else else
......
...@@ -79,3 +79,10 @@ ...@@ -79,3 +79,10 @@
# #
# #
#--------------------------------------------------------------------------- #---------------------------------------------------------------------------
# HOT STANDBY PARAMETERS
#---------------------------------------------------------------------------
#
# If you want to enable read-only connections during recovery, enable
# recovery_connections in postgresql.conf
#
#---------------------------------------------------------------------------
...@@ -3,7 +3,7 @@ ...@@ -3,7 +3,7 @@
* *
* Resource managers definition * Resource managers definition
* *
* $PostgreSQL: pgsql/src/backend/access/transam/rmgr.c,v 1.27 2008/11/19 10:34:50 heikki Exp $ * $PostgreSQL: pgsql/src/backend/access/transam/rmgr.c,v 1.28 2009/12/19 01:32:33 sriggs Exp $
*/ */
#include "postgres.h" #include "postgres.h"
...@@ -21,6 +21,7 @@ ...@@ -21,6 +21,7 @@
#include "commands/sequence.h" #include "commands/sequence.h"
#include "commands/tablespace.h" #include "commands/tablespace.h"
#include "storage/freespace.h" #include "storage/freespace.h"
#include "storage/standby.h"
const RmgrData RmgrTable[RM_MAX_ID + 1] = { const RmgrData RmgrTable[RM_MAX_ID + 1] = {
...@@ -32,7 +33,7 @@ const RmgrData RmgrTable[RM_MAX_ID + 1] = { ...@@ -32,7 +33,7 @@ const RmgrData RmgrTable[RM_MAX_ID + 1] = {
{"Tablespace", tblspc_redo, tblspc_desc, NULL, NULL, NULL}, {"Tablespace", tblspc_redo, tblspc_desc, NULL, NULL, NULL},
{"MultiXact", multixact_redo, multixact_desc, NULL, NULL, NULL}, {"MultiXact", multixact_redo, multixact_desc, NULL, NULL, NULL},
{"Reserved 7", NULL, NULL, NULL, NULL, NULL}, {"Reserved 7", NULL, NULL, NULL, NULL, NULL},
{"Reserved 8", NULL, NULL, NULL, NULL, NULL}, {"Standby", standby_redo, standby_desc, NULL, NULL, NULL},
{"Heap2", heap2_redo, heap2_desc, NULL, NULL, NULL}, {"Heap2", heap2_redo, heap2_desc, NULL, NULL, NULL},
{"Heap", heap_redo, heap_desc, NULL, NULL, NULL}, {"Heap", heap_redo, heap_desc, NULL, NULL, NULL},
{"Btree", btree_redo, btree_desc, btree_xlog_startup, btree_xlog_cleanup, btree_safe_restartpoint}, {"Btree", btree_redo, btree_desc, btree_xlog_startup, btree_xlog_cleanup, btree_safe_restartpoint},
......
...@@ -22,7 +22,7 @@ ...@@ -22,7 +22,7 @@
* Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California * Portions Copyright (c) 1994, Regents of the University of California
* *
* $PostgreSQL: pgsql/src/backend/access/transam/subtrans.c,v 1.24 2009/01/01 17:23:36 momjian Exp $ * $PostgreSQL: pgsql/src/backend/access/transam/subtrans.c,v 1.25 2009/12/19 01:32:33 sriggs Exp $
* *
*------------------------------------------------------------------------- *-------------------------------------------------------------------------
*/ */
...@@ -68,15 +68,19 @@ static bool SubTransPagePrecedes(int page1, int page2); ...@@ -68,15 +68,19 @@ static bool SubTransPagePrecedes(int page1, int page2);
/* /*
* Record the parent of a subtransaction in the subtrans log. * Record the parent of a subtransaction in the subtrans log.
*
* In some cases we may need to overwrite an existing value.
*/ */
void void
SubTransSetParent(TransactionId xid, TransactionId parent) SubTransSetParent(TransactionId xid, TransactionId parent, bool overwriteOK)
{ {
int pageno = TransactionIdToPage(xid); int pageno = TransactionIdToPage(xid);
int entryno = TransactionIdToEntry(xid); int entryno = TransactionIdToEntry(xid);
int slotno; int slotno;
TransactionId *ptr; TransactionId *ptr;
Assert(TransactionIdIsValid(parent));
LWLockAcquire(SubtransControlLock, LW_EXCLUSIVE); LWLockAcquire(SubtransControlLock, LW_EXCLUSIVE);
slotno = SimpleLruReadPage(SubTransCtl, pageno, true, xid); slotno = SimpleLruReadPage(SubTransCtl, pageno, true, xid);
...@@ -84,7 +88,8 @@ SubTransSetParent(TransactionId xid, TransactionId parent) ...@@ -84,7 +88,8 @@ SubTransSetParent(TransactionId xid, TransactionId parent)
ptr += entryno; ptr += entryno;
/* Current state should be 0 */ /* Current state should be 0 */
Assert(*ptr == InvalidTransactionId); Assert(*ptr == InvalidTransactionId ||
(*ptr == parent && overwriteOK));
*ptr = parent; *ptr = parent;
......
This diff is collapsed.
@@ -8,7 +8,7 @@
  *
  *
  * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/access/transam/twophase_rmgr.c,v 1.10 2009/11/23 09:58:36 heikki Exp $
+ *	  $PostgreSQL: pgsql/src/backend/access/transam/twophase_rmgr.c,v 1.11 2009/12/19 01:32:33 sriggs Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -19,14 +19,12 @@
 #include "commands/async.h"
 #include "pgstat.h"
 #include "storage/lock.h"
-#include "utils/inval.h"

 const TwoPhaseCallback twophase_recover_callbacks[TWOPHASE_RM_MAX_ID + 1] =
 {
 	NULL,						/* END ID */
 	lock_twophase_recover,		/* Lock */
-	NULL,						/* Inval */
 	NULL,						/* notify/listen */
 	NULL,						/* pgstat */
 	multixact_twophase_recover	/* MultiXact */
@@ -36,7 +34,6 @@ const TwoPhaseCallback twophase_postcommit_callbacks[TWOPHASE_RM_MAX_ID + 1] =
 {
 	NULL,						/* END ID */
 	lock_twophase_postcommit,	/* Lock */
-	inval_twophase_postcommit,	/* Inval */
 	notify_twophase_postcommit, /* notify/listen */
 	pgstat_twophase_postcommit, /* pgstat */
 	multixact_twophase_postcommit		/* MultiXact */
@@ -46,8 +43,16 @@ const TwoPhaseCallback twophase_postabort_callbacks[TWOPHASE_RM_MAX_ID + 1] =
 {
 	NULL,						/* END ID */
 	lock_twophase_postabort,	/* Lock */
-	NULL,						/* Inval */
 	NULL,						/* notify/listen */
 	pgstat_twophase_postabort,	/* pgstat */
 	multixact_twophase_postabort	/* MultiXact */
 };
+
+const TwoPhaseCallback twophase_standby_recover_callbacks[TWOPHASE_RM_MAX_ID + 1] =
+{
+	NULL,						/* END ID */
+	lock_twophase_standby_recover,		/* Lock */
+	NULL,						/* notify/listen */
+	NULL,						/* pgstat */
+	NULL						/* MultiXact */
+};
@@ -13,7 +13,7 @@
  *
  *
  * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/commands/dbcommands.c,v 1.228 2009/11/12 02:46:16 tgl Exp $
+ *	  $PostgreSQL: pgsql/src/backend/commands/dbcommands.c,v 1.229 2009/12/19 01:32:34 sriggs Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -26,6 +26,7 @@
 #include "access/genam.h"
 #include "access/heapam.h"
+#include "access/transam.h"
 #include "access/xact.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
@@ -48,6 +49,7 @@
 #include "storage/ipc.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
+#include "storage/standby.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
@@ -1941,6 +1943,26 @@ dbase_redo(XLogRecPtr lsn, XLogRecord *record)
 		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);

+		if (InHotStandby)
+		{
+			VirtualTransactionId *database_users;
+
+			/*
+			 * Find all users connected to this database and ask them
+			 * politely to immediately kill their sessions before processing
+			 * the drop database record, after the usual grace period.
+			 * We don't wait for commit because drop database is
+			 * non-transactional.
+			 */
+			database_users = GetConflictingVirtualXIDs(InvalidTransactionId,
+													   xlrec->db_id,
+													   false);
+
+			ResolveRecoveryConflictWithVirtualXIDs(database_users,
+												   "drop database",
+												   CONFLICT_MODE_FATAL);
+		}
+
 		/* Drop pages for this database that are in the shared buffer cache */
 		DropDatabaseBuffers(xlrec->db_id);
...
@@ -8,7 +8,7 @@
  *
  *
  * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/commands/lockcmds.c,v 1.25 2009/06/11 14:48:56 momjian Exp $
+ *	  $PostgreSQL: pgsql/src/backend/commands/lockcmds.c,v 1.26 2009/12/19 01:32:34 sriggs Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -47,6 +47,16 @@ LockTableCommand(LockStmt *lockstmt)

 		reloid = RangeVarGetRelid(relation, false);

+		/*
+		 * During recovery we only accept these variations:
+		 * LOCK TABLE foo IN ACCESS SHARE MODE
+		 * LOCK TABLE foo IN ROW SHARE MODE
+		 * LOCK TABLE foo IN ROW EXCLUSIVE MODE
+		 * This test must match the restrictions defined in LockAcquire()
+		 */
+		if (lockstmt->mode > RowExclusiveLock)
+			PreventCommandDuringRecovery();
+
 		LockTableRecurse(reloid, relation,
 						 lockstmt->mode, lockstmt->nowait, recurse);
 	}
...
@@ -8,7 +8,7 @@
  *
  *
  * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/commands/sequence.c,v 1.162 2009/10/13 00:53:07 tgl Exp $
+ *	  $PostgreSQL: pgsql/src/backend/commands/sequence.c,v 1.163 2009/12/19 01:32:34 sriggs Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -458,6 +458,9 @@ nextval_internal(Oid relid)
 				rescnt = 0;
 	bool		logit = false;

+	/* nextval() writes to database and must be prevented during recovery */
+	PreventCommandDuringRecovery();
+
 	/* open and AccessShareLock sequence */
 	init_sequence(relid, &elm, &seqrel);
...
@@ -37,7 +37,7 @@
  *
  *
  * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/commands/tablespace.c,v 1.63 2009/11/10 18:53:38 tgl Exp $
+ *	  $PostgreSQL: pgsql/src/backend/commands/tablespace.c,v 1.64 2009/12/19 01:32:34 sriggs Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -50,6 +50,7 @@
 #include "access/heapam.h"
 #include "access/sysattr.h"
+#include "access/transam.h"
 #include "access/xact.h"
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
@@ -60,6 +61,8 @@
 #include "miscadmin.h"
 #include "postmaster/bgwriter.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
+#include "storage/standby.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
@@ -1317,12 +1320,59 @@ tblspc_redo(XLogRecPtr lsn, XLogRecord *record)
 	{
 		xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) XLogRecGetData(record);

+		/*
+		 * If we issued a WAL record for a drop tablespace it is
+		 * because there were no files in it at all. That means that
+		 * no permanent objects can exist in it at this point.
+		 *
+		 * It is possible for standby users to be using this tablespace
+		 * as a location for their temporary files, so if we fail to
+		 * remove all files then do conflict processing and try again,
+		 * if currently enabled.
+		 */
+		if (!remove_tablespace_directories(xlrec->ts_id, true))
+		{
+			VirtualTransactionId *temp_file_users;
+
+			/*
+			 * Standby users may be currently using this tablespace for
+			 * for their temporary files. We only care about current
+			 * users because temp_tablespace parameter will just ignore
+			 * tablespaces that no longer exist.
+			 *
+			 * Ask everybody to cancel their queries immediately so
+			 * we can ensure no temp files remain and we can remove the
+			 * tablespace. Nuke the entire site from orbit, it's the only
+			 * way to be sure.
+			 *
+			 * XXX: We could work out the pids of active backends
+			 * using this tablespace by examining the temp filenames in the
+			 * directory. We would then convert the pids into VirtualXIDs
+			 * before attempting to cancel them.
+			 *
+			 * We don't wait for commit because drop tablespace is
+			 * non-transactional.
+			 */
+			temp_file_users = GetConflictingVirtualXIDs(InvalidTransactionId,
+														InvalidOid,
+														false);
+			ResolveRecoveryConflictWithVirtualXIDs(temp_file_users,
+												   "drop tablespace",
+												   CONFLICT_MODE_ERROR);
+
+			/*
+			 * If we did recovery processing then hopefully the
+			 * backends who wrote temp files should have cleaned up and
+			 * exited by now. So lets recheck before we throw an error.
+			 * If !process_conflicts then this will just fail again.
+			 */
 			if (!remove_tablespace_directories(xlrec->ts_id, true))
 				ereport(ERROR,
 						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 						 errmsg("tablespace %u is not empty",
 								xlrec->ts_id)));
+		}
 	}
 	else
 		elog(PANIC, "tblspc_redo: unknown op code %u", info);
 }
...
@@ -13,7 +13,7 @@
  *
  *
  * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/commands/vacuum.c,v 1.398 2009/12/09 21:57:51 tgl Exp $
+ *	  $PostgreSQL: pgsql/src/backend/commands/vacuum.c,v 1.399 2009/12/19 01:32:34 sriggs Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -141,6 +141,7 @@ typedef struct VRelStats
 	/* vtlinks array for tuple chain following - sorted by new_tid */
 	int			num_vtlinks;
 	VTupleLink	vtlinks;
+	TransactionId latestRemovedXid;
 } VRelStats;

 /*----------------------------------------------------------------------
@@ -224,7 +225,7 @@ static void scan_heap(VRelStats *vacrelstats, Relation onerel,
 static bool repair_frag(VRelStats *vacrelstats, Relation onerel,
 			VacPageList vacuum_pages, VacPageList fraged_pages,
 			int nindexes, Relation *Irel);
-static void move_chain_tuple(Relation rel,
+static void move_chain_tuple(VRelStats *vacrelstats, Relation rel,
 				 Buffer old_buf, Page old_page, HeapTuple old_tup,
 				 Buffer dst_buf, Page dst_page, VacPage dst_vacpage,
 				 ExecContext ec, ItemPointer ctid, bool cleanVpd);
@@ -237,7 +238,7 @@ static void update_hint_bits(Relation rel, VacPageList fraged_pages,
 				 int num_moved);
 static void vacuum_heap(VRelStats *vacrelstats, Relation onerel,
 			VacPageList vacpagelist);
-static void vacuum_page(Relation onerel, Buffer buffer, VacPage vacpage);
+static void vacuum_page(VRelStats *vacrelstats, Relation onerel, Buffer buffer, VacPage vacpage);
 static void vacuum_index(VacPageList vacpagelist, Relation indrel,
 			 double num_tuples, int keep_tuples);
 static void scan_index(Relation indrel, double num_tuples);
@@ -1300,6 +1301,7 @@ full_vacuum_rel(Relation onerel, VacuumStmt *vacstmt)
 	vacrelstats->rel_tuples = 0;
 	vacrelstats->rel_indexed_tuples = 0;
 	vacrelstats->hasindex = false;
+	vacrelstats->latestRemovedXid = InvalidTransactionId;

 	/* scan the heap */
 	vacuum_pages.num_pages = fraged_pages.num_pages = 0;
@@ -1708,6 +1710,9 @@ scan_heap(VRelStats *vacrelstats, Relation onerel,
 			{
 				ItemId		lpp;

+				HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data,
+											  &vacrelstats->latestRemovedXid);
+
 				/*
 				 * Here we are building a temporary copy of the page with dead
 				 * tuples removed.  Below we will apply
@@ -2025,7 +2030,7 @@ repair_frag(VRelStats *vacrelstats, Relation onerel,
 				/* there are dead tuples on this page - clean them */
 				Assert(!isempty);
 				LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
-				vacuum_page(onerel, buf, last_vacuum_page);
+				vacuum_page(vacrelstats, onerel, buf, last_vacuum_page);
 				LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 			}
 			else
@@ -2514,7 +2519,7 @@ repair_frag(VRelStats *vacrelstats, Relation onerel,
 					tuple.t_data = (HeapTupleHeader) PageGetItem(Cpage, Citemid);
 					tuple_len = tuple.t_len = ItemIdGetLength(Citemid);

-					move_chain_tuple(onerel, Cbuf, Cpage, &tuple,
+					move_chain_tuple(vacrelstats, onerel, Cbuf, Cpage, &tuple,
 									 dst_buffer, dst_page, destvacpage,
 									 &ec, &Ctid, vtmove[ti].cleanVpd);

@@ -2600,7 +2605,7 @@ repair_frag(VRelStats *vacrelstats, Relation onerel,
 					dst_page = BufferGetPage(dst_buffer);
 					/* if this page was not used before - clean it */
 					if (!PageIsEmpty(dst_page) && dst_vacpage->offsets_used == 0)
-						vacuum_page(onerel, dst_buffer, dst_vacpage);
+						vacuum_page(vacrelstats, onerel, dst_buffer, dst_vacpage);
 				}
 				else
 					LockBuffer(dst_buffer, BUFFER_LOCK_EXCLUSIVE);
@@ -2753,7 +2758,7 @@ repair_frag(VRelStats *vacrelstats, Relation onerel,
 			HOLD_INTERRUPTS();
 			heldoff = true;
 			ForceSyncCommit();
-			(void) RecordTransactionCommit();
+			(void) RecordTransactionCommit(true);
 		}

 		/*
@@ -2781,7 +2786,7 @@ repair_frag(VRelStats *vacrelstats, Relation onerel,
 				LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 				page = BufferGetPage(buf);
 				if (!PageIsEmpty(page))
-					vacuum_page(onerel, buf, *curpage);
+					vacuum_page(vacrelstats, onerel, buf, *curpage);
 				UnlockReleaseBuffer(buf);
 			}
 		}
@@ -2917,7 +2922,7 @@ repair_frag(VRelStats *vacrelstats, Relation onerel,
 			recptr = log_heap_clean(onerel, buf,
 									NULL, 0, NULL, 0,
 									unused, uncnt,
-									false);
+									vacrelstats->latestRemovedXid, false);
 			PageSetLSN(page, recptr);
 			PageSetTLI(page, ThisTimeLineID);
 		}
@@ -2969,7 +2974,7 @@ repair_frag(VRelStats *vacrelstats, Relation onerel,
  * already too long and almost unreadable.
  */
 static void
-move_chain_tuple(Relation rel,
+move_chain_tuple(VRelStats *vacrelstats, Relation rel,
 				 Buffer old_buf, Page old_page, HeapTuple old_tup,
 				 Buffer dst_buf, Page dst_page, VacPage dst_vacpage,
 				 ExecContext ec, ItemPointer ctid, bool cleanVpd)
@@ -3027,7 +3032,7 @@ move_chain_tuple(Relation rel,
 		int			sv_offsets_used = dst_vacpage->offsets_used;

 		dst_vacpage->offsets_used = 0;
-		vacuum_page(rel, dst_buf, dst_vacpage);
+		vacuum_page(vacrelstats, rel, dst_buf, dst_vacpage);
 		dst_vacpage->offsets_used = sv_offsets_used;
 	}
@@ -3367,7 +3372,7 @@ vacuum_heap(VRelStats *vacrelstats, Relation onerel, VacPageList vacuum_pages)
 			buf = ReadBufferExtended(onerel, MAIN_FORKNUM, (*vacpage)->blkno,
 									 RBM_NORMAL, vac_strategy);
 			LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
-			vacuum_page(onerel, buf, *vacpage);
+			vacuum_page(vacrelstats, onerel, buf, *vacpage);
 			UnlockReleaseBuffer(buf);
 		}
 	}
@@ -3397,7 +3402,7 @@ vacuum_heap(VRelStats *vacrelstats, Relation onerel, VacPageList vacuum_pages)
  * Caller must hold pin and lock on buffer.
  */
 static void
-vacuum_page(Relation onerel, Buffer buffer, VacPage vacpage)
+vacuum_page(VRelStats *vacrelstats, Relation onerel, Buffer buffer, VacPage vacpage)
 {
 	Page		page = BufferGetPage(buffer);
 	int			i;
@@ -3426,7 +3431,7 @@ vacuum_page(Relation onerel, Buffer buffer, VacPage vacpage)
 		recptr = log_heap_clean(onerel, buffer,
 								NULL, 0, NULL, 0,
 								vacpage->offsets, vacpage->offsets_free,
-								false);
+								vacrelstats->latestRemovedXid, false);
 		PageSetLSN(page, recptr);
 		PageSetTLI(page, ThisTimeLineID);
 	}
...
@@ -29,7 +29,7 @@
  *
  *
  * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/commands/vacuumlazy.c,v 1.124 2009/11/16 21:32:06 tgl Exp $
+ *	  $PostgreSQL: pgsql/src/backend/commands/vacuumlazy.c,v 1.125 2009/12/19 01:32:34 sriggs Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -98,6 +98,7 @@ typedef struct LVRelStats
 	int			max_dead_tuples;	/* # slots allocated in array */
 	ItemPointer dead_tuples;	/* array of ItemPointerData */
 	int			num_index_scans;
+	TransactionId latestRemovedXid;
 } LVRelStats;

@@ -265,6 +266,34 @@ lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt,
 	return heldoff;
 }

+/*
+ * For Hot Standby we need to know the highest transaction id that will
+ * be removed by any change. VACUUM proceeds in a number of passes so
+ * we need to consider how each pass operates. The first phase runs
+ * heap_page_prune(), which can issue XLOG_HEAP2_CLEAN records as it
+ * progresses - these will have a latestRemovedXid on each record.
+ * In some cases this removes all of the tuples to be removed, though
+ * often we have dead tuples with index pointers so we must remember them
+ * for removal in phase 3. Index records for those rows are removed
+ * in phase 2 and index blocks do not have MVCC information attached.
+ * So before we can allow removal of any index tuples we need to issue
+ * a WAL record containing the latestRemovedXid of rows that will be
+ * removed in phase three. This allows recovery queries to block at the
+ * correct place, i.e. before phase two, rather than during phase three
+ * which would be after the rows have become inaccessible.
+ */
+static void
+vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
+{
+	/*
+	 * No need to log changes for temp tables, they do not contain
+	 * data visible on the standby server.
+	 */
+	if (rel->rd_istemp || !XLogArchivingActive())
+		return;
+
+	(void) log_heap_cleanup_info(rel->rd_node, vacrelstats->latestRemovedXid);
+}
 /*
  * lazy_scan_heap() -- scan an open heap relation
@@ -315,6 +344,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
 	nblocks = RelationGetNumberOfBlocks(onerel);
 	vacrelstats->rel_pages = nblocks;
 	vacrelstats->nonempty_pages = 0;
+	vacrelstats->latestRemovedXid = InvalidTransactionId;

 	lazy_space_alloc(vacrelstats, nblocks);
@@ -373,6 +403,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
 		if ((vacrelstats->max_dead_tuples - vacrelstats->num_dead_tuples) < MaxHeapTuplesPerPage &&
 			vacrelstats->num_dead_tuples > 0)
 		{
+			/* Log cleanup info before we touch indexes */
+			vacuum_log_cleanup_info(onerel, vacrelstats);
+
 			/* Remove index entries */
 			for (i = 0; i < nindexes; i++)
 				lazy_vacuum_index(Irel[i],
@@ -382,6 +415,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
 			lazy_vacuum_heap(onerel, vacrelstats);
 			/* Forget the now-vacuumed tuples, and press on */
 			vacrelstats->num_dead_tuples = 0;
+			vacrelstats->latestRemovedXid = InvalidTransactionId;
 			vacrelstats->num_index_scans++;
 		}
@@ -613,6 +647,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
 			if (tupgone)
 			{
 				lazy_record_dead_tuple(vacrelstats, &(tuple.t_self));
+				HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data,
+											 &vacrelstats->latestRemovedXid);
 				tups_vacuumed += 1;
 			}
 			else
@@ -661,6 +697,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
 			lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats);
 			/* Forget the now-vacuumed tuples, and press on */
 			vacrelstats->num_dead_tuples = 0;
+			vacrelstats->latestRemovedXid = InvalidTransactionId;
 			vacuumed_pages++;
 		}
@@ -724,6 +761,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
 		/* XXX put a threshold on min number of tuples here? */
 		if (vacrelstats->num_dead_tuples > 0)
 		{
+			/* Log cleanup info before we touch indexes */
+			vacuum_log_cleanup_info(onerel, vacrelstats);
+
 			/* Remove index entries */
 			for (i = 0; i < nindexes; i++)
 				lazy_vacuum_index(Irel[i],
@@ -868,7 +908,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 		recptr = log_heap_clean(onerel, buffer,
 								NULL, 0, NULL, 0,
 								unused, uncnt,
-								false);
+								vacrelstats->latestRemovedXid, false);
 		PageSetLSN(page, recptr);
 		PageSetTLI(page, ThisTimeLineID);
 	}
...
@@ -37,7 +37,7 @@
  *
  *
  * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/postmaster/postmaster.c,v 1.596 2009/09/08 17:08:36 tgl Exp $
+ *	  $PostgreSQL: pgsql/src/backend/postmaster/postmaster.c,v 1.597 2009/12/19 01:32:34 sriggs Exp $
  *
  * NOTES
  *
@@ -245,8 +245,9 @@ static bool RecoveryError = false;		/* T if WAL recovery failed */
  * When archive recovery is finished, the startup process exits with exit
  * code 0 and we switch to PM_RUN state.
  *
- * Normal child backends can only be launched when we are in PM_RUN state.
- * (We also allow it in PM_WAIT_BACKUP state, but only for superusers.)
+ * Normal child backends can only be launched when we are in PM_RUN or
+ * PM_RECOVERY_CONSISTENT state. (We also allow launch of normal
+ * child backends in PM_WAIT_BACKUP state, but only for superusers.)
  * In other states we handle connection requests by launching "dead_end"
  * child processes, which will simply send the client an error message and
  * quit.  (We track these in the BackendList so that we can know when they
@@ -1868,7 +1869,7 @@ static enum CAC_state
 canAcceptConnections(void)
 {
 	/*
-	 * Can't start backends when in startup/shutdown/recovery state.
+	 * Can't start backends when in startup/shutdown/inconsistent recovery state.
 	 *
 	 * In state PM_WAIT_BACKUP only superusers can connect (this must be
 	 * allowed so that a superuser can end online backup mode); we return
@@ -1882,9 +1883,11 @@ canAcceptConnections(void)
 		return CAC_SHUTDOWN;	/* shutdown is pending */
 	if (!FatalError &&
 		(pmState == PM_STARTUP ||
-		 pmState == PM_RECOVERY ||
-		 pmState == PM_RECOVERY_CONSISTENT))
+		 pmState == PM_RECOVERY))
 		return CAC_STARTUP; /* normal startup */
+	if (!FatalError &&
+		pmState == PM_RECOVERY_CONSISTENT)
+		return CAC_OK;		/* connection OK during recovery */
 	return CAC_RECOVERY;	/* else must be crash recovery */
 }
@@ -4003,9 +4006,8 @@ sigusr1_handler(SIGNAL_ARGS)
 		Assert(PgStatPID == 0);
 		PgStatPID = pgstat_start();

-		/* XXX at this point we could accept read-only connections */
-		ereport(DEBUG1,
-				(errmsg("database system is in consistent recovery mode")));
+		ereport(LOG,
+				(errmsg("database system is ready to accept read only connections")));

 		pmState = PM_RECOVERY_CONSISTENT;
 	}
...
 #
 # Makefile for storage/ipc
 #
-# $PostgreSQL: pgsql/src/backend/storage/ipc/Makefile,v 1.22 2009/07/31 20:26:23 tgl Exp $
+# $PostgreSQL: pgsql/src/backend/storage/ipc/Makefile,v 1.23 2009/12/19 01:32:35 sriggs Exp $
 #

 subdir = src/backend/storage/ipc
@@ -16,6 +16,6 @@ endif
 endif

 OBJS = ipc.o ipci.o pmsignal.o procarray.o procsignal.o shmem.o shmqueue.o \
-	sinval.o sinvaladt.o
+	sinval.o sinvaladt.o standby.o

 include $(top_srcdir)/src/backend/common.mk
...@@ -8,7 +8,7 @@ ...@@ -8,7 +8,7 @@
* *
* *
* IDENTIFICATION * IDENTIFICATION
* $PostgreSQL: pgsql/src/backend/storage/ipc/sinvaladt.c,v 1.79 2009/07/31 20:26:23 tgl Exp $ * $PostgreSQL: pgsql/src/backend/storage/ipc/sinvaladt.c,v 1.80 2009/12/19 01:32:35 sriggs Exp $
* *
*------------------------------------------------------------------------- *-------------------------------------------------------------------------
*/ */
...@@ -144,6 +144,13 @@ typedef struct ProcState ...@@ -144,6 +144,13 @@ typedef struct ProcState
bool resetState; /* backend needs to reset its state */ bool resetState; /* backend needs to reset its state */
bool signaled; /* backend has been sent catchup signal */ bool signaled; /* backend has been sent catchup signal */
/*
* Backend only sends invalidations, never receives them. This only makes sense
* for Startup process during recovery because it doesn't maintain a relcache,
* yet it fires inval messages to allow query backends to see schema changes.
*/
bool sendOnly; /* backend only sends, never receives */
/* /*
* Next LocalTransactionId to use for each idle backend slot. We keep * Next LocalTransactionId to use for each idle backend slot. We keep
* this here because it is indexed by BackendId and it is convenient to * this here because it is indexed by BackendId and it is convenient to
...@@ -249,7 +256,7 @@ CreateSharedInvalidationState(void) ...@@ -249,7 +256,7 @@ CreateSharedInvalidationState(void)
* Initialize a new backend to operate on the sinval buffer * Initialize a new backend to operate on the sinval buffer
*/ */
void void
SharedInvalBackendInit(void) SharedInvalBackendInit(bool sendOnly)
{ {
int index; int index;
ProcState *stateP = NULL; ProcState *stateP = NULL;
...@@ -308,6 +315,7 @@ SharedInvalBackendInit(void) ...@@ -308,6 +315,7 @@ SharedInvalBackendInit(void)
stateP->nextMsgNum = segP->maxMsgNum; stateP->nextMsgNum = segP->maxMsgNum;
stateP->resetState = false; stateP->resetState = false;
stateP->signaled = false; stateP->signaled = false;
stateP->sendOnly = sendOnly;
LWLockRelease(SInvalWriteLock); LWLockRelease(SInvalWriteLock);
@@ -579,7 +587,9 @@ SICleanupQueue(bool callerHasWriteLock, int minFree)
 	/*
 	 * Recompute minMsgNum = minimum of all backends' nextMsgNum, identify the
 	 * furthest-back backend that needs signaling (if any), and reset any
-	 * backends that are too far back.
+	 * backends that are too far back.  Note that because we ignore sendOnly
+	 * backends here it is possible for them to keep sending messages without
+	 * a problem even when they are the only active backend.
 	 */
 	min = segP->maxMsgNum;
 	minsig = min - SIG_THRESHOLD;
@@ -591,7 +601,7 @@ SICleanupQueue(bool callerHasWriteLock, int minFree)
 		int			n = stateP->nextMsgNum;

 		/* Ignore if inactive or already in reset state */
-		if (stateP->procPid == 0 || stateP->resetState)
+		if (stateP->procPid == 0 || stateP->resetState || stateP->sendOnly)
 			continue;

 		/*
...
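The effect of the sendOnly flag on queue cleanup can be illustrated with a simplified, standalone sketch. This is not the real sinvaladt.c code: the ProcState struct here is a hypothetical reduction to just the fields the scan uses, and the queue bookkeeping is elided.

```c
#include <stdbool.h>

/* Reduced stand-in for sinvaladt.c's ProcState (hypothetical field subset). */
typedef struct ProcState
{
	int		procPid;		/* 0 means an unused slot */
	int		nextMsgNum;		/* next message number to read */
	bool	resetState;		/* backend needs to reset its state */
	bool	sendOnly;		/* backend only sends, never receives */
} ProcState;

/*
 * Recompute the minimum of all backends' nextMsgNum, as SICleanupQueue does,
 * skipping inactive, already-reset, and send-only backends.  Because a
 * send-only backend (the Startup process) never reads the queue, its stale
 * nextMsgNum must not hold back truncation of the message queue.
 */
static int
ComputeMinMsgNum(const ProcState *procs, int nprocs, int maxMsgNum)
{
	int		min = maxMsgNum;

	for (int i = 0; i < nprocs; i++)
	{
		const ProcState *stateP = &procs[i];

		if (stateP->procPid == 0 || stateP->resetState || stateP->sendOnly)
			continue;
		if (stateP->nextMsgNum < min)
			min = stateP->nextMsgNum;
	}
	return min;
}
```

Note that the send-only slot below would otherwise pin the queue at message 2; because it is skipped, cleanup can advance to the slowest *reading* backend instead.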
-$PostgreSQL: pgsql/src/backend/storage/lmgr/README,v 1.24 2008/03/21 13:23:28 momjian Exp $
+$PostgreSQL: pgsql/src/backend/storage/lmgr/README,v 1.25 2009/12/19 01:32:35 sriggs Exp $

 Locking Overview
 ================
@@ -517,3 +517,27 @@ interfere with each other.

 User locks are always held as session locks, so that they are not released at
 transaction end.  They must be released explicitly by the application --- but
 they are released automatically when a backend terminates.
+
+Locking during Hot Standby
+--------------------------
+
+The Startup process is the only backend that can make changes during
+recovery; all other backends are read only.  As a result, the Startup
+process does not acquire locks on relations or objects except when the lock
+level is AccessExclusiveLock.
+
+Regular backends are only allowed to take locks on relations or objects
+at RowExclusiveLock or lower.  This ensures that they do not conflict with
+each other or with the Startup process, unless AccessExclusiveLocks are
+requested by one of the backends.
+
+Deadlocks involving AccessExclusiveLocks are not possible, so we need
+not be concerned that a user-initiated deadlock can prevent recovery from
+progressing.
+
+AccessExclusiveLocks on the primary or master node generate WAL records
+that are then applied by the Startup process.  Locks are released at end
+of transaction just as they are in normal processing.  These locks are
+held by the Startup process, acting as a proxy for the backends that
+originally acquired them.  Again, these locks cannot conflict with one
+another, so the Startup process cannot deadlock with itself either.
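The admission rules the README describes can be sketched as a small predicate. This is an illustrative simplification, not the real lock.c logic: the LOCKMODE enum below is a hypothetical stand-in for PostgreSQL's numeric lock levels, and the function name is invented for this example.

```c
#include <stdbool.h>

/* Hypothetical enumeration mirroring PostgreSQL's lock-mode ordering. */
typedef enum
{
	AccessShareLock = 1,
	RowShareLock = 2,
	RowExclusiveLock = 3,
	ShareUpdateExclusiveLock = 4,
	ShareLock = 5,
	ShareRowExclusiveLock = 6,
	ExclusiveLock = 7,
	AccessExclusiveLock = 8
} LOCKMODE;

/*
 * Sketch of the Hot Standby locking rules described above: the Startup
 * process takes only AccessExclusiveLock (replayed from WAL), while regular
 * read-only backends may take RowExclusiveLock or lower.  Consequently the
 * two groups can only conflict when an AccessExclusiveLock is involved.
 */
static bool
LockModeAllowedDuringRecovery(LOCKMODE mode, bool is_startup_process)
{
	if (is_startup_process)
		return mode == AccessExclusiveLock;
	return mode <= RowExclusiveLock;
}
```

Because every lock a regular backend may take is weaker than ShareLock, two regular backends can never block each other; only the Startup process's proxied AccessExclusiveLocks can conflict with queries.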
@@ -8,7 +8,7 @@
  *
  *
  * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/storage/lmgr/proc.c,v 1.209 2009/08/31 19:41:00 tgl Exp $
+ *	  $PostgreSQL: pgsql/src/backend/storage/lmgr/proc.c,v 1.210 2009/12/19 01:32:36 sriggs Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -318,6 +318,7 @@ InitProcess(void)
 	MyProc->waitProcLock = NULL;
 	for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
 		SHMQueueInit(&(MyProc->myProcLocks[i]));
+	MyProc->recoveryConflictMode = 0;

 	/*
 	 * We might be reusing a semaphore that belonged to a failed process.  So
@@ -374,6 +375,11 @@ InitProcessPhase2(void)
  * to the ProcArray or the sinval messaging mechanism, either.  They also
  * don't get a VXID assigned, since this is only useful when we actually
  * hold lockmgr locks.
+ *
+ * The Startup process, however, uses locks but never waits for them in the
+ * normal backend sense.  It also takes part in sinval messaging as a
+ * sendOnly process, so it never reads messages from the sinval queue.  So
+ * the Startup process does have a VXID and does show up in pg_locks.
  */
 void
 InitAuxiliaryProcess(void)
@@ -461,6 +467,24 @@ InitAuxiliaryProcess(void)
 	on_shmem_exit(AuxiliaryProcKill, Int32GetDatum(proctype));
 }

+/*
+ * Record the PID and PGPROC structure for the Startup process, for use in
+ * ProcSendSignal().  See comments there for further explanation.
+ */
+void
+PublishStartupProcessInformation(void)
+{
+	/* use volatile pointer to prevent code rearrangement */
+	volatile PROC_HDR *procglobal = ProcGlobal;
+
+	SpinLockAcquire(ProcStructLock);
+
+	procglobal->startupProc = MyProc;
+	procglobal->startupProcPid = MyProcPid;
+
+	SpinLockRelease(ProcStructLock);
+}
+
 /*
  * Check whether there are at least N free PGPROC objects.
  *
@@ -1289,7 +1313,31 @@ ProcWaitForSignal(void)
 void
 ProcSendSignal(int pid)
 {
-	PGPROC	   *proc = BackendPidGetProc(pid);
+	PGPROC	   *proc = NULL;
+
+	if (RecoveryInProgress())
+	{
+		/* use volatile pointer to prevent code rearrangement */
+		volatile PROC_HDR *procglobal = ProcGlobal;
+
+		SpinLockAcquire(ProcStructLock);
+
+		/*
+		 * Check to see whether it is the Startup process we wish to signal.
+		 * This call is made by the buffer manager when it wishes to wake up a
+		 * process that has been waiting for a buffer pin so it can obtain a
+		 * cleanup lock using LockBufferForCleanup().  Startup is not a normal
+		 * backend, so BackendPidGetProc() will not find it.  So we remember
+		 * the information for this special case.
+		 */
+		if (pid == procglobal->startupProcPid)
+			proc = procglobal->startupProc;
+
+		SpinLockRelease(ProcStructLock);
+	}
+
+	if (proc == NULL)
+		proc = BackendPidGetProc(pid);

 	if (proc != NULL)
 		PGSemaphoreUnlock(&proc->sem);
...
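The lookup order in ProcSendSignal can be sketched as a standalone function. This is an illustrative simplification under stated assumptions: locking is omitted, the struct types are minimal stand-ins for PGPROC and PROC_HDR, and the function and parameter names are invented for this example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-ins for PGPROC and the shared ProcGlobal header. */
typedef struct PGPROC
{
	int		pid;
} PGPROC;

typedef struct PROC_HDR
{
	PGPROC *startupProc;		/* set by PublishStartupProcessInformation */
	int		startupProcPid;
} PROC_HDR;

/*
 * Sketch of ProcSendSignal's lookup order: during recovery, first check
 * whether the target is the Startup process, which does not appear in the
 * regular backend array; otherwise fall back to the normal pid-based
 * lookup, passed in here as a function pointer standing in for
 * BackendPidGetProc().
 */
static PGPROC *
FindProcToSignal(int pid, bool in_recovery, PROC_HDR *procglobal,
				 PGPROC *(*backend_pid_get_proc) (int))
{
	PGPROC	   *proc = NULL;

	if (in_recovery && pid == procglobal->startupProcPid)
		proc = procglobal->startupProc;
	if (proc == NULL)
		proc = backend_pid_get_proc(pid);
	return proc;
}
```

The design point is the fallback: the special case only adds a check, so signalling normal backends behaves exactly as before, both in and out of recovery.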