Commit f11e8be3 authored by Robert Haas

Make commit_delay much smarter.

Instead of letting every backend participating in a group commit wait
independently, have the first one that becomes ready to flush WAL wait
for the configured delay, and let all the others wait just long enough
for that first process to complete its flush.  This greatly increases
the chances of being able to configure a commit_delay setting that
actually improves performance.
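
As a rough illustration of why one delayed leader can cut the number of flushes, here is a minimal sketch of a toy timing model in C. It is not PostgreSQL code: `count_flushes`, `arrival_us`, and `flush_cost_us` are hypothetical names, the model ignores `commit_siblings`, and it assumes a flush covers every request that has arrived by the time the flush begins.

```c
#include <stddef.h>

/*
 * Toy model, not PostgreSQL code: count the WAL flushes needed to make
 * n flush requests durable.
 *
 * arrival_us[] must be sorted ascending; each entry is the time at which a
 * backend becomes ready to flush.  delay_us plays the role of commit_delay
 * and must be >= 0.  flush_cost_us is the assumed duration of one flush.
 */
static size_t
count_flushes(const long *arrival_us, size_t n, long delay_us, long flush_cost_us)
{
    size_t flushes = 0;
    size_t i = 0;
    long   lock_free_at = 0;    /* time at which the previous flush completes */

    while (i < n)
    {
        /* The first uncovered backend becomes leader once the lock is free. */
        long start = arrival_us[i] > lock_free_at ? arrival_us[i] : lock_free_at;

        /* Only the leader sleeps for the configured delay. */
        long flush_begin = start + delay_us;

        /* Every request that arrived before the flush begins rides along. */
        while (i < n && arrival_us[i] <= flush_begin)
            i++;

        flushes++;
        lock_free_at = flush_begin + flush_cost_us;
    }
    return flushes;
}
```

Under this model, four requests arriving at 0, 100, 200, and 300 microseconds with a 50 microsecond flush need four flushes with no delay, two with a delay of 250, and one with a delay of 300. The numbers say nothing about real hardware; they only show the direction of the effect.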

As a side consequence of this change, commit_delay now affects all WAL
flushes, rather than just commits.  There was some discussion on
pgsql-hackers about whether to rename the GUC to, say, wal_flush_delay,
but in the absence of consensus I am leaving it alone for now.

Peter Geoghegan, with some changes, mostly to the documentation, by me.
parent f83b5999
@@ -1866,23 +1866,26 @@ SET ENABLE_SEQSCAN TO OFF;
 </indexterm>
 <listitem>
 <para>
-When the commit data for a transaction is flushed to disk, any
-additional commits ready at that time are also flushed out.
 <varname>commit_delay</varname> adds a time delay, set in
-microseconds, before a transaction attempts to
-flush the WAL buffer out to disk. A nonzero delay can allow more
-transactions to be committed with only one flush operation, if
-system load is high enough that additional transactions become
-ready to commit within the given interval. But the delay is
-just wasted if no other transactions become ready to
-commit. Therefore, the delay is only performed if at least
-<varname>commit_siblings</varname> other transactions are
-active at the instant that a server process has written its
-commit record.
-The default <varname>commit_delay</> is zero (no delay).
-Since all pending commit data will be written at every flush
-regardless of this setting, it is rare that adding delay
-by increasing this parameter will actually improve performance.
+microseconds, before a WAL flush is initiated. This can improve
+group commit throughput by allowing a larger number of transactions
+to commit via a single WAL flush, if system load is high enough
+that additional transactions become ready to commit within the
+given interval. However, it also increases latency by up to
+<varname>commit_delay</varname> microseconds for each WAL
+flush. Because the delay is just wasted if no other transactions
+become ready to commit, it is only performed if at least
+<varname>commit_siblings</varname> other transactions are active
+immediately before a flush would otherwise have been initiated.
+In <productname>PostgreSQL</> releases prior to 9.3,
+<varname>commit_delay</varname> behaved differently and was much
+less effective: it affected only commits, rather than all WAL flushes,
+and waited for the entire configured delay even if the WAL flush
+was completed sooner. Beginning in <productname>PostgreSQL</> 9.3,
+the first process that becomes ready to flush waits for the configured
+interval, while subsequent processes wait only until the leader
+completes the flush. The default <varname>commit_delay</> is zero
+(no delay).
 </para>
 </listitem>
 </varlistentry>
......
@@ -376,9 +376,7 @@
 <acronym>WAL</acronym> to disk, in the hope that a single flush
 executed by one such transaction can also serve other transactions
 committing at about the same time. Setting <varname>commit_delay</varname>
-can only help when there are many concurrently committing transactions,
-and it is difficult to tune it to a value that actually helps rather
-than hurt throughput.
+can only help when there are many concurrently committing transactions.
 </para>
 </sect1>
......
@@ -68,9 +68,6 @@ bool XactDeferrable;
 int synchronous_commit = SYNCHRONOUS_COMMIT_ON;
-int CommitDelay = 0; /* precommit delay in microseconds */
-int CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 /*
  * MyXactAccessedTempRel is set when a temporary relation is accessed.
  * We don't allow PREPARE TRANSACTION in that case. (This is global
@@ -1123,22 +1120,6 @@ RecordTransactionCommit(void)
 if ((wrote_xlog && synchronous_commit > SYNCHRONOUS_COMMIT_OFF) ||
     forceSyncCommit || nrels > 0)
 {
-    /*
-     * Synchronous commit case:
-     *
-     * Sleep before flush! So we can flush more than one commit records
-     * per single fsync. (The idea is some other backend may do the
-     * XLogFlush while we're sleeping. This needs work still, because on
-     * most Unixen, the minimum select() delay is 10msec or more, which is
-     * way too long.)
-     *
-     * We do not sleep if enableFsync is not turned on, nor if there are
-     * fewer than CommitSiblings other backends with active transactions.
-     */
-    if (CommitDelay > 0 && enableFsync &&
-        MinimumActiveBackends(CommitSiblings))
-        pg_usleep(CommitDelay);
     XLogFlush(XactLastRecEnd);
     /*
......
@@ -80,6 +80,8 @@ bool fullPageWrites = true;
 bool log_checkpoints = false;
 int sync_method = DEFAULT_SYNC_METHOD;
 int wal_level = WAL_LEVEL_MINIMAL;
+int CommitDelay = 0; /* precommit delay in microseconds */
+int CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 #ifdef WAL_DEBUG
 bool XLOG_DEBUG = false;
@@ -2098,34 +2100,49 @@ XLogFlush(XLogRecPtr record)
      */
     continue;
 }
-/* Got the lock */
+/* Got the lock; recheck whether request is satisfied */
 LogwrtResult = XLogCtl->LogwrtResult;
-if (!XLByteLE(record, LogwrtResult.Flush))
+if (XLByteLE(record, LogwrtResult.Flush))
+    break;
+/*
+ * Sleep before flush! By adding a delay here, we may give further
+ * backends the opportunity to join the backlog of group commit
+ * followers; this can significantly improve transaction throughput, at
+ * the risk of increasing transaction latency.
+ *
+ * We do not sleep if enableFsync is not turned on, nor if there are
+ * fewer than CommitSiblings other backends with active transactions.
+ */
+if (CommitDelay > 0 && enableFsync &&
+    MinimumActiveBackends(CommitSiblings))
+    pg_usleep(CommitDelay);
+/* try to write/flush later additions to XLOG as well */
+if (LWLockConditionalAcquire(WALInsertLock, LW_EXCLUSIVE))
 {
-    /* try to write/flush later additions to XLOG as well */
-    if (LWLockConditionalAcquire(WALInsertLock, LW_EXCLUSIVE))
-    {
-        XLogCtlInsert *Insert = &XLogCtl->Insert;
-        uint32 freespace = INSERT_FREESPACE(Insert);
-        if (freespace == 0) /* buffer is full */
-            WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
-        else
-        {
-            WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
-            WriteRqstPtr -= freespace;
-        }
-        LWLockRelease(WALInsertLock);
-        WriteRqst.Write = WriteRqstPtr;
-        WriteRqst.Flush = WriteRqstPtr;
-    }
+    XLogCtlInsert *Insert = &XLogCtl->Insert;
+    uint32 freespace = INSERT_FREESPACE(Insert);
+    if (freespace == 0) /* buffer is full */
+        WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
     else
     {
-        WriteRqst.Write = WriteRqstPtr;
-        WriteRqst.Flush = record;
+        WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
+        WriteRqstPtr -= freespace;
     }
-    XLogWrite(WriteRqst, false, false);
+    LWLockRelease(WALInsertLock);
+    WriteRqst.Write = WriteRqstPtr;
+    WriteRqst.Flush = WriteRqstPtr;
 }
+else
+{
+    WriteRqst.Write = WriteRqstPtr;
+    WriteRqst.Flush = record;
+}
+XLogWrite(WriteRqst, false, false);
 LWLockRelease(WALWriteLock);
 /* done */
 break;
......
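
The control flow that the XLogFlush hunk moves to (take the write lock, recheck whether the request is already satisfied, delay only if a flush is still needed, then flush everything inserted so far) can be sketched in isolation. The following is a simplified, single-lock illustration in C, not the PostgreSQL implementation: every name in it is hypothetical, the delay is simulated by adding to a counter rather than actually sleeping, and the `commit_siblings` and `enableFsync` checks are left out.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the shared WAL state; not PostgreSQL's types. */
static pthread_mutex_t write_lock = PTHREAD_MUTEX_INITIALIZER;
static long inserted_upto = 0;  /* end of the log inserted so far */
static long flushed_upto = 0;   /* end of the log known to be durable */
static int  flush_count = 0;    /* number of flushes actually performed */
static long slept_us = 0;       /* simulated time spent in the delay */

/*
 * Append a record of the given length and return its end position.
 * A real implementation would protect this with its own insert lock.
 */
static long
insert_record(long len)
{
    inserted_upto += len;
    return inserted_upto;
}

/* Make the log durable up to 'record'; returns true if this caller flushed. */
static bool
flush_record(long record, long delay_us)
{
    bool did_flush = false;

    pthread_mutex_lock(&write_lock);

    /* Got the lock; recheck whether someone else already covered us. */
    if (record > flushed_upto)
    {
        /* Only a caller that still has to flush pays the delay. */
        slept_us += delay_us;   /* stands in for an actual sleep */

        /* Flush later additions as well, not just our own record. */
        flushed_upto = inserted_upto;
        flush_count++;
        did_flush = true;
    }

    pthread_mutex_unlock(&write_lock);
    return did_flush;
}
```

If two records are inserted and the first one is flushed with a delay of 1000, the second record's flush request returns immediately without sleeping, because the first flush already covered it.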