Commit 88e98230 authored by Heikki Linnakangas

Replace checkpoint_segments with min_wal_size and max_wal_size.

Instead of having a single knob (checkpoint_segments) that both triggers
checkpoints and determines how many segments to recycle, these are now
separate concerns. There is still an internal variable called
CheckPointSegments, which triggers checkpoints. But it no longer determines
how many segments to recycle at a checkpoint. That is now auto-tuned by
keeping a moving average of the distance between checkpoints (in bytes),
and trying to keep that many segments in reserve. The advantage of this is
that you can set max_wal_size very high, but the system won't actually
consume that much space if there isn't any need for it. The min_wal_size
sets a floor for that; you can effectively disable the auto-tuning behavior
by setting min_wal_size equal to max_wal_size.
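
As a rough illustration of the auto-tuning described above, the following C
sketch keeps a moving average of the WAL bytes written between checkpoints
and turns it into a number of segments to keep in reserve. The function
names, the 0.90 smoothing weight, and the rounding are assumptions made for
this example and are not taken from the patch. The 16 MB segment size and
the limits of 5 and 8 segments (80 MB and 128 MB) match the defaults that
appear in this commit.

```c
#include <assert.h>

/* Default WAL segment size (16 MB); it can be changed at build time. */
#define SEG_SIZE_BYTES (16.0 * 1024.0 * 1024.0)

/*
 * Hypothetical helper: update the moving average of the distance between
 * checkpoints, in bytes. The estimate jumps up immediately when the latest
 * cycle used more WAL than expected, and decays slowly otherwise.
 */
double
update_distance_estimate(double estimate, double nbytes)
{
    const double weight = 0.90;     /* illustrative weight, an assumption */

    if (nbytes > estimate)
        return nbytes;
    return weight * estimate + (1.0 - weight) * nbytes;
}

/*
 * Hypothetical helper: how many segments to keep in reserve, given the
 * estimate and the min_wal_size / max_wal_size limits in segments.
 */
int
segments_to_reserve(double estimate_bytes, int min_segs, int max_segs)
{
    int         segs;

    assert(min_segs <= max_segs);

    /* round up so that a partially used segment still counts */
    segs = (int) (estimate_bytes / SEG_SIZE_BYTES) + 1;
    if (segs < min_segs)
        segs = min_segs;
    if (segs > max_segs)
        segs = max_segs;
    return segs;
}
```

When min_segs equals max_segs, the clamp returns the same value whatever the
estimate is, which corresponds to the "disable the auto-tuning" case
mentioned above.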

The max_wal_size setting is now the actual target size of WAL at which a
new checkpoint is triggered, instead of the distance between checkpoints.
Previously, you could calculate the actual WAL usage with the formula
"(2 + checkpoint_completion_target) * checkpoint_segments + 1". With this
patch, you set the desired WAL usage with max_wal_size, and the system
calculates the appropriate CheckPointSegments with the reverse of that
formula. That's a lot more intuitive for administrators to set.
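
To make the arithmetic concrete, here is a small C sketch of the old formula
and a literal inversion of it. Both function names are hypothetical, and the
inversion is only one plausible reading of "the reverse of that formula";
the code that actually performs the calculation is not shown in this diff
and may round or clamp differently.

```c
#include <assert.h>

/*
 * Old rule of thumb from the text above: expected WAL usage, in segments,
 * for a given checkpoint_segments and checkpoint_completion_target.
 */
double
wal_usage_segments(int checkpoint_segments, double completion_target)
{
    assert(completion_target >= 0.0 && completion_target <= 1.0);
    return (2.0 + completion_target) * checkpoint_segments + 1.0;
}

/*
 * Hypothetical inversion: given max_wal_size expressed in segments, solve
 * the formula for the number of segments between checkpoints. The result
 * is truncated and never drops below one segment (an assumption).
 */
int
checkpoint_segments_for(int max_wal_segments, double completion_target)
{
    int         segs;

    assert(completion_target >= 0.0 && completion_target <= 1.0);
    segs = (int) ((max_wal_segments - 1) / (2.0 + completion_target));
    return segs < 1 ? 1 : segs;
}
```

For example, the old default of checkpoint_segments = 3 with a completion
target of 0.5 gives 8.5 segments, or about 136 MB at 16 MB per segment, which
is close to the new max_wal_size default of 128 MB.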

Reviewed by Amit Kapila and Venkata Balaji N.
parent 0fec0003
...@@ -1325,7 +1325,7 @@ include_dir 'conf.d' ...@@ -1325,7 +1325,7 @@ include_dir 'conf.d'
40% of RAM to <varname>shared_buffers</varname> will work better than a 40% of RAM to <varname>shared_buffers</varname> will work better than a
smaller amount. Larger settings for <varname>shared_buffers</varname> smaller amount. Larger settings for <varname>shared_buffers</varname>
usually require a corresponding increase in usually require a corresponding increase in
<varname>checkpoint_segments</varname>, in order to spread out the <varname>max_wal_size</varname>, in order to spread out the
process of writing large quantities of new or changed data over a process of writing large quantities of new or changed data over a
longer period of time. longer period of time.
</para> </para>
...@@ -2394,18 +2394,20 @@ include_dir 'conf.d' ...@@ -2394,18 +2394,20 @@ include_dir 'conf.d'
<title>Checkpoints</title> <title>Checkpoints</title>
<variablelist> <variablelist>
<varlistentry id="guc-checkpoint-segments" xreflabel="checkpoint_segments"> <varlistentry id="guc-max-wal-size" xreflabel="max_wal_size">
<term><varname>checkpoint_segments</varname> (<type>integer</type>) <term><varname>max_wal_size</varname> (<type>integer</type>)</term>
<indexterm> <indexterm>
<primary><varname>checkpoint_segments</> configuration parameter</primary> <primary><varname>max_wal_size</> configuration parameter</primary>
</indexterm> </indexterm>
</term>
<listitem> <listitem>
<para> <para>
Maximum number of log file segments between automatic WAL Maximum size to let the WAL grow to between automatic WAL
checkpoints (each segment is normally 16 megabytes). The default checkpoints. This is a soft limit; WAL size can exceed
is three segments. Increasing this parameter can increase the <varname>max_wal_size</> under special circumstances, like
amount of time needed for crash recovery. under heavy load, a failing <varname>archive_command</>, or a high
<varname>wal_keep_segments</> setting. The default is 128 MB.
Increasing this parameter can increase the amount of time needed for
crash recovery.
This parameter can only be set in the <filename>postgresql.conf</> This parameter can only be set in the <filename>postgresql.conf</>
file or on the server command line. file or on the server command line.
</para> </para>
...@@ -2458,7 +2460,7 @@ include_dir 'conf.d' ...@@ -2458,7 +2460,7 @@ include_dir 'conf.d'
Write a message to the server log if checkpoints caused by Write a message to the server log if checkpoints caused by
the filling of checkpoint segment files happen closer together the filling of checkpoint segment files happen closer together
than this many seconds (which suggests that than this many seconds (which suggests that
<varname>checkpoint_segments</> ought to be raised). The default is <varname>max_wal_size</> ought to be raised). The default is
30 seconds (<literal>30s</>). Zero disables the warning. 30 seconds (<literal>30s</>). Zero disables the warning.
No warnings will be generated if <varname>checkpoint_timeout</varname> No warnings will be generated if <varname>checkpoint_timeout</varname>
is less than <varname>checkpoint_warning</varname>. is less than <varname>checkpoint_warning</varname>.
...@@ -2468,6 +2470,24 @@ include_dir 'conf.d' ...@@ -2468,6 +2470,24 @@ include_dir 'conf.d'
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry id="guc-min-wal-size" xreflabel="min_wal_size">
<term><varname>min_wal_size</varname> (<type>integer</type>)</term>
<indexterm>
<primary><varname>min_wal_size</> configuration parameter</primary>
</indexterm>
<listitem>
<para>
As long as WAL disk usage stays below this setting, old WAL files are
always recycled for future use at a checkpoint, rather than removed.
This can be used to ensure that enough WAL space is reserved to
handle spikes in WAL usage, for example when running large batch
jobs. The default is 80 MB.
This parameter can only be set in the <filename>postgresql.conf</>
file or on the server command line.
</para>
</listitem>
</varlistentry>
</variablelist> </variablelist>
</sect2> </sect2>
<sect2 id="runtime-config-wal-archiving"> <sect2 id="runtime-config-wal-archiving">
......
...@@ -1328,19 +1328,19 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse; ...@@ -1328,19 +1328,19 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
</para> </para>
</sect2> </sect2>
<sect2 id="populate-checkpoint-segments"> <sect2 id="populate-max-wal-size">
<title>Increase <varname>checkpoint_segments</varname></title> <title>Increase <varname>max_wal_size</varname></title>
<para> <para>
Temporarily increasing the <xref Temporarily increasing the <xref linkend="guc-max-wal-size">
linkend="guc-checkpoint-segments"> configuration variable can also configuration variable can also
make large data loads faster. This is because loading a large make large data loads faster. This is because loading a large
amount of data into <productname>PostgreSQL</productname> will amount of data into <productname>PostgreSQL</productname> will
cause checkpoints to occur more often than the normal checkpoint cause checkpoints to occur more often than the normal checkpoint
frequency (specified by the <varname>checkpoint_timeout</varname> frequency (specified by the <varname>checkpoint_timeout</varname>
configuration variable). Whenever a checkpoint occurs, all dirty configuration variable). Whenever a checkpoint occurs, all dirty
pages must be flushed to disk. By increasing pages must be flushed to disk. By increasing
<varname>checkpoint_segments</varname> temporarily during bulk <varname>max_wal_size</varname> temporarily during bulk
data loads, the number of checkpoints that are required can be data loads, the number of checkpoints that are required can be
reduced. reduced.
</para> </para>
...@@ -1445,7 +1445,7 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse; ...@@ -1445,7 +1445,7 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
<para> <para>
Set appropriate (i.e., larger than normal) values for Set appropriate (i.e., larger than normal) values for
<varname>maintenance_work_mem</varname> and <varname>maintenance_work_mem</varname> and
<varname>checkpoint_segments</varname>. <varname>max_wal_size</varname>.
</para> </para>
</listitem> </listitem>
<listitem> <listitem>
...@@ -1512,7 +1512,7 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse; ...@@ -1512,7 +1512,7 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
So when loading a data-only dump, it is up to you to drop and recreate So when loading a data-only dump, it is up to you to drop and recreate
indexes and foreign keys if you wish to use those techniques. indexes and foreign keys if you wish to use those techniques.
It's still useful to increase <varname>checkpoint_segments</varname> It's still useful to increase <varname>max_wal_size</varname>
while loading the data, but don't bother increasing while loading the data, but don't bother increasing
<varname>maintenance_work_mem</varname>; rather, you'd do that while <varname>maintenance_work_mem</varname>; rather, you'd do that while
manually recreating indexes and foreign keys afterwards. manually recreating indexes and foreign keys afterwards.
...@@ -1577,7 +1577,7 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse; ...@@ -1577,7 +1577,7 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
<listitem> <listitem>
<para> <para>
Increase <xref linkend="guc-checkpoint-segments"> and <xref Increase <xref linkend="guc-max-wal-size"> and <xref
linkend="guc-checkpoint-timeout"> ; this reduces the frequency linkend="guc-checkpoint-timeout"> ; this reduces the frequency
of checkpoints, but increases the storage requirements of of checkpoints, but increases the storage requirements of
<filename>/pg_xlog</>. <filename>/pg_xlog</>.
......
...@@ -472,9 +472,10 @@ ...@@ -472,9 +472,10 @@
<para> <para>
The server's checkpointer process automatically performs The server's checkpointer process automatically performs
a checkpoint every so often. A checkpoint is begun every <xref a checkpoint every so often. A checkpoint is begun every <xref
linkend="guc-checkpoint-segments"> log segments, or every <xref linkend="guc-checkpoint-timeout"> seconds, or if
linkend="guc-checkpoint-timeout"> seconds, whichever comes first. <xref linkend="guc-max-wal-size"> is about to be exceeded,
The default settings are 3 segments and 300 seconds (5 minutes), respectively. whichever comes first.
The default settings are 5 minutes and 128 MB, respectively.
If no WAL has been written since the previous checkpoint, new checkpoints If no WAL has been written since the previous checkpoint, new checkpoints
will be skipped even if <varname>checkpoint_timeout</> has passed. will be skipped even if <varname>checkpoint_timeout</> has passed.
(If WAL archiving is being used and you want to put a lower limit on how (If WAL archiving is being used and you want to put a lower limit on how
...@@ -486,8 +487,8 @@ ...@@ -486,8 +487,8 @@
</para> </para>
<para> <para>
Reducing <varname>checkpoint_segments</varname> and/or Reducing <varname>checkpoint_timeout</varname> and/or
<varname>checkpoint_timeout</varname> causes checkpoints to occur <varname>max_wal_size</varname> causes checkpoints to occur
more often. This allows faster after-crash recovery, since less work more often. This allows faster after-crash recovery, since less work
will need to be redone. However, one must balance this against the will need to be redone. However, one must balance this against the
increased cost of flushing dirty data pages more often. If increased cost of flushing dirty data pages more often. If
...@@ -510,11 +511,11 @@ ...@@ -510,11 +511,11 @@
parameter. If checkpoints happen closer together than parameter. If checkpoints happen closer together than
<varname>checkpoint_warning</> seconds, <varname>checkpoint_warning</> seconds,
a message will be output to the server log recommending increasing a message will be output to the server log recommending increasing
<varname>checkpoint_segments</varname>. Occasional appearance of such <varname>max_wal_size</varname>. Occasional appearance of such
a message is not cause for alarm, but if it appears often then the a message is not cause for alarm, but if it appears often then the
checkpoint control parameters should be increased. Bulk operations such checkpoint control parameters should be increased. Bulk operations such
as large <command>COPY</> transfers might cause a number of such warnings as large <command>COPY</> transfers might cause a number of such warnings
to appear if you have not set <varname>checkpoint_segments</> high to appear if you have not set <varname>max_wal_size</> high
enough. enough.
</para> </para>
...@@ -525,10 +526,10 @@ ...@@ -525,10 +526,10 @@
<xref linkend="guc-checkpoint-completion-target">, which is <xref linkend="guc-checkpoint-completion-target">, which is
given as a fraction of the checkpoint interval. given as a fraction of the checkpoint interval.
The I/O rate is adjusted so that the checkpoint finishes when the The I/O rate is adjusted so that the checkpoint finishes when the
given fraction of <varname>checkpoint_segments</varname> WAL segments given fraction of
have been consumed since checkpoint start, or the given fraction of <varname>checkpoint_timeout</varname> seconds have elapsed, or before
<varname>checkpoint_timeout</varname> seconds have elapsed, <varname>max_wal_size</varname> is exceeded, whichever is sooner.
whichever is sooner. With the default value of 0.5, With the default value of 0.5,
<productname>PostgreSQL</> can be expected to complete each checkpoint <productname>PostgreSQL</> can be expected to complete each checkpoint
in about half the time before the next checkpoint starts. On a system in about half the time before the next checkpoint starts. On a system
that's very close to maximum I/O throughput during normal operation, that's very close to maximum I/O throughput during normal operation,
...@@ -545,18 +546,35 @@ ...@@ -545,18 +546,35 @@
</para> </para>
<para> <para>
There will always be at least one WAL segment file, and will normally The number of WAL segment files in the <filename>pg_xlog</> directory depends on
not be more than (2 + <varname>checkpoint_completion_target</varname>) * <varname>checkpoint_segments</varname> + 1 <varname>min_wal_size</>, <varname>max_wal_size</> and
or <varname>checkpoint_segments</> + <xref linkend="guc-wal-keep-segments"> + 1 the amount of WAL generated in previous checkpoint cycles. When old log
files. Each segment file is normally 16 MB (though this size can be segment files are no longer needed, they are removed or recycled (that is,
altered when building the server). You can use this to estimate space renamed to become future segments in the numbered sequence). If, due to a
requirements for <acronym>WAL</acronym>. short-term peak of log output rate, <varname>max_wal_size</> is
Ordinarily, when old log segment files are no longer needed, they exceeded, the unneeded segment files will be removed until the system
are recycled (that is, renamed to become future segments in the numbered gets back under this limit. Below that limit, the system recycles enough
sequence). If, due to a short-term peak of log output rate, there WAL files to cover the estimated need until the next checkpoint, and
are more than 3 * <varname>checkpoint_segments</varname> + 1 removes the rest. The estimate is based on a moving average of the number
segment files, the unneeded segment files will be deleted instead of WAL files used in previous checkpoint cycles. The moving average
of recycled until the system gets back under this limit. is increased immediately if the actual usage exceeds the estimate, so it
accommodates peak usage rather than average usage to some extent.
<varname>min_wal_size</> puts a minimum on the amount of WAL files
recycled for future usage; that much WAL is always recycled for future use,
even if the system is idle and the WAL usage estimate suggests that little
WAL is needed.
</para>
<para>
Independently of <varname>max_wal_size</varname>,
<xref linkend="guc-wal-keep-segments"> + 1 most recent WAL files are
kept at all times. Also, if WAL archiving is used, old segments can not be
removed or recycled until they are archived. If WAL archiving cannot keep up
with the pace that WAL is generated, or if <varname>archive_command</varname>
fails repeatedly, old WAL files will accumulate in <filename>pg_xlog</>
until the situation is resolved. A slow or failed standby server that
uses a replication slot will have the same effect (see
<xref linkend="streaming-replication-slots">).
</para> </para>
<para> <para>
...@@ -571,9 +589,8 @@ ...@@ -571,9 +589,8 @@
master because restartpoints can only be performed at checkpoint records. master because restartpoints can only be performed at checkpoint records.
A restartpoint is triggered when a checkpoint record is reached if at A restartpoint is triggered when a checkpoint record is reached if at
least <varname>checkpoint_timeout</> seconds have passed since the last least <varname>checkpoint_timeout</> seconds have passed since the last
restartpoint. In standby mode, a restartpoint is also triggered if at restartpoint, or if WAL size is about to exceed
least <varname>checkpoint_segments</> log segments have been replayed <varname>max_wal_size</>.
since the last restartpoint.
</para> </para>
<para> <para>
......
...@@ -471,7 +471,7 @@ CheckpointerMain(void) ...@@ -471,7 +471,7 @@ CheckpointerMain(void)
"checkpoints are occurring too frequently (%d seconds apart)", "checkpoints are occurring too frequently (%d seconds apart)",
elapsed_secs, elapsed_secs,
elapsed_secs), elapsed_secs),
errhint("Consider increasing the configuration parameter \"checkpoint_segments\"."))); errhint("Consider increasing the configuration parameter \"max_wal_size\".")));
/* /*
* Initialize checkpointer-private variables used during * Initialize checkpointer-private variables used during
...@@ -749,11 +749,11 @@ IsCheckpointOnSchedule(double progress) ...@@ -749,11 +749,11 @@ IsCheckpointOnSchedule(double progress)
return false; return false;
/* /*
* Check progress against WAL segments written and checkpoint_segments. * Check progress against WAL segments written and CheckPointSegments.
* *
* We compare the current WAL insert location against the location * We compare the current WAL insert location against the location
* computed before calling CreateCheckPoint. The code in XLogInsert that * computed before calling CreateCheckPoint. The code in XLogInsert that
* actually triggers a checkpoint when checkpoint_segments is exceeded * actually triggers a checkpoint when CheckPointSegments is exceeded
* compares against RedoRecptr, so this is not completely accurate. * compares against RedoRecptr, so this is not completely accurate.
* However, it's good enough for our purposes, we're only calculating an * However, it's good enough for our purposes, we're only calculating an
* estimate anyway. * estimate anyway.
......
...@@ -685,6 +685,9 @@ typedef struct ...@@ -685,6 +685,9 @@ typedef struct
#if XLOG_BLCKSZ < 1024 || XLOG_BLCKSZ > (1024*1024) #if XLOG_BLCKSZ < 1024 || XLOG_BLCKSZ > (1024*1024)
#error XLOG_BLCKSZ must be between 1KB and 1MB #error XLOG_BLCKSZ must be between 1KB and 1MB
#endif #endif
#if XLOG_SEG_SIZE < (1024*1024) || XLOG_SEG_SIZE > (1024*1024*1024)
#error XLOG_SEG_SIZE must be between 1MB and 1GB
#endif
static const char *memory_units_hint = static const char *memory_units_hint =
gettext_noop("Valid units for this parameter are \"kB\", \"MB\", \"GB\", and \"TB\"."); gettext_noop("Valid units for this parameter are \"kB\", \"MB\", \"GB\", and \"TB\".");
...@@ -706,6 +709,11 @@ static const unit_conversion memory_unit_conversion_table[] = ...@@ -706,6 +709,11 @@ static const unit_conversion memory_unit_conversion_table[] =
{ "MB", GUC_UNIT_XBLOCKS, 1024 / (XLOG_BLCKSZ / 1024) }, { "MB", GUC_UNIT_XBLOCKS, 1024 / (XLOG_BLCKSZ / 1024) },
{ "kB", GUC_UNIT_XBLOCKS, -(XLOG_BLCKSZ / 1024) }, { "kB", GUC_UNIT_XBLOCKS, -(XLOG_BLCKSZ / 1024) },
{ "TB", GUC_UNIT_XSEGS, (1024*1024*1024) / (XLOG_SEG_SIZE / 1024) },
{ "GB", GUC_UNIT_XSEGS, (1024*1024) / (XLOG_SEG_SIZE / 1024) },
{ "MB", GUC_UNIT_XSEGS, -(XLOG_SEG_SIZE / (1024 * 1024)) },
{ "kB", GUC_UNIT_XSEGS, -(XLOG_SEG_SIZE / 1024) },
{ "" } /* end of table marker */ { "" } /* end of table marker */
}; };
...@@ -2146,15 +2154,27 @@ static struct config_int ConfigureNamesInt[] = ...@@ -2146,15 +2154,27 @@ static struct config_int ConfigureNamesInt[] =
}, },
{ {
{"checkpoint_segments", PGC_SIGHUP, WAL_CHECKPOINTS, {"min_wal_size", PGC_SIGHUP, WAL_CHECKPOINTS,
gettext_noop("Sets the maximum distance in log segments between automatic WAL checkpoints."), gettext_noop("Sets the minimum size to shrink the WAL to."),
NULL NULL,
GUC_UNIT_XSEGS
}, },
&CheckPointSegments, &min_wal_size,
3, 1, INT_MAX, 5, 2, INT_MAX,
NULL, NULL, NULL NULL, NULL, NULL
}, },
{
{"max_wal_size", PGC_SIGHUP, WAL_CHECKPOINTS,
gettext_noop("Sets the WAL size that triggers a checkpoint."),
NULL,
GUC_UNIT_XSEGS
},
&max_wal_size,
8, 2, INT_MAX,
NULL, assign_max_wal_size, NULL
},
{ {
{"checkpoint_timeout", PGC_SIGHUP, WAL_CHECKPOINTS, {"checkpoint_timeout", PGC_SIGHUP, WAL_CHECKPOINTS,
gettext_noop("Sets the maximum time between automatic WAL checkpoints."), gettext_noop("Sets the maximum time between automatic WAL checkpoints."),
......
...@@ -197,8 +197,9 @@ ...@@ -197,8 +197,9 @@
# - Checkpoints - # - Checkpoints -
#checkpoint_segments = 3 # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 5min # range 30s-1h #checkpoint_timeout = 5min # range 30s-1h
#max_wal_size = 128MB # in logfile segments
#min_wal_size = 80MB
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0 #checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
#checkpoint_warning = 30s # 0 disables #checkpoint_warning = 30s # 0 disables
......
...@@ -89,7 +89,8 @@ extern XLogRecPtr XactLastRecEnd; ...@@ -89,7 +89,8 @@ extern XLogRecPtr XactLastRecEnd;
extern bool reachedConsistency; extern bool reachedConsistency;
/* these variables are GUC parameters related to XLOG */ /* these variables are GUC parameters related to XLOG */
extern int CheckPointSegments; extern int min_wal_size;
extern int max_wal_size;
extern int wal_keep_segments; extern int wal_keep_segments;
extern int XLOGbuffers; extern int XLOGbuffers;
extern int XLogArchiveTimeout; extern int XLogArchiveTimeout;
...@@ -101,6 +102,8 @@ extern bool fullPageWrites; ...@@ -101,6 +102,8 @@ extern bool fullPageWrites;
extern bool wal_log_hints; extern bool wal_log_hints;
extern bool log_checkpoints; extern bool log_checkpoints;
extern int CheckPointSegments;
/* WAL levels */ /* WAL levels */
typedef enum WalLevel typedef enum WalLevel
{ {
...@@ -246,6 +249,9 @@ extern bool CheckPromoteSignal(void); ...@@ -246,6 +249,9 @@ extern bool CheckPromoteSignal(void);
extern void WakeupRecovery(void); extern void WakeupRecovery(void);
extern void SetWalWriterSleeping(bool sleeping); extern void SetWalWriterSleeping(bool sleeping);
extern void assign_max_wal_size(int newval, void *extra);
extern void assign_checkpoint_completion_target(double newval, void *extra);
/* /*
* Starting/stopping a base backup * Starting/stopping a base backup
*/ */
......
...@@ -207,6 +207,7 @@ typedef enum ...@@ -207,6 +207,7 @@ typedef enum
#define GUC_UNIT_KB 0x1000 /* value is in kilobytes */ #define GUC_UNIT_KB 0x1000 /* value is in kilobytes */
#define GUC_UNIT_BLOCKS 0x2000 /* value is in blocks */ #define GUC_UNIT_BLOCKS 0x2000 /* value is in blocks */
#define GUC_UNIT_XBLOCKS 0x3000 /* value is in xlog blocks */ #define GUC_UNIT_XBLOCKS 0x3000 /* value is in xlog blocks */
#define GUC_UNIT_XSEGS 0x4000 /* value is in xlog segments */
#define GUC_UNIT_MEMORY 0xF000 /* mask for KB, BLOCKS, XBLOCKS */ #define GUC_UNIT_MEMORY 0xF000 /* mask for KB, BLOCKS, XBLOCKS */
#define GUC_UNIT_MS 0x10000 /* value is in milliseconds */ #define GUC_UNIT_MS 0x10000 /* value is in milliseconds */
......