Commit df8b7bc9 authored by Tom Lane

Improve our mechanism for controlling the Linux out-of-memory killer.

Arrange for postmaster child processes to respond to two environment
variables, PG_OOM_ADJUST_FILE and PG_OOM_ADJUST_VALUE, to determine whether
they reset their OOM score adjustments and if so to what.  This is superior
to the previous design involving #ifdef's in several ways.  The behavior is
now available in a default build, and both ends of the adjustment --- the
original adjustment of the postmaster's level and the subsequent
readjustment by child processes --- can now be controlled in one place,
namely the postmaster launch script.  So it's no longer necessary for the
launch script to act on faith that the server was compiled with the
appropriate options.  In addition, if someone wants to use an OOM score
other than zero for the child processes, that doesn't take a recompile
anymore; and we no longer have to cater separately to the two different
historical kernel APIs for this adjustment.

Gurjeet Singh, somewhat revised by me
parent 96066198
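
As a rough illustration of what "controlled in one place" can look like, here is a minimal launch-wrapper sketch. The names `OOM_FILE`, `POSTMASTER_SCORE`, `CHILD_SCORE`, and `launch_protected` are hypothetical and are not part of PostgreSQL or of this commit; only `PG_OOM_ADJUST_FILE` and `PG_OOM_ADJUST_VALUE` come from the commit. It assumes the wrapper runs as root when a negative score is written to the real `/proc` file.

```shell
#!/bin/sh
# Hypothetical settings (assumptions for this sketch, not PostgreSQL names).
OOM_FILE="${OOM_FILE:-/proc/self/oom_score_adj}"   # file used for both adjustments
POSTMASTER_SCORE="${POSTMASTER_SCORE:--1000}"      # protects the postmaster
CHILD_SCORE="${CHILD_SCORE:-0}"                    # value the children reset to

# launch_protected CMD [ARGS...]
# Adjust the current shell's OOM score (inherited by CMD), then run CMD with
# the two environment variables that postmaster child processes consult.
launch_protected() {
    printf '%s\n' "$POSTMASTER_SCORE" > "$OOM_FILE" || return 1
    env PG_OOM_ADJUST_FILE="$OOM_FILE" \
        PG_OOM_ADJUST_VALUE="$CHILD_SCORE" \
        "$@"
}
```

A call such as `launch_protected postgres -D "$PGDATA"` would then start the server with both ends of the adjustment configured from the same script; the exact command line is an assumption and depends on the installation.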
@@ -43,14 +43,17 @@ PGLOG="$PGDATA/serverlog"
 # It's often a good idea to protect the postmaster from being killed by the
 # OOM killer (which will tend to preferentially kill the postmaster because
 # of the way it accounts for shared memory). Setting the OOM_SCORE_ADJ value
-# to -1000 will disable OOM kill altogether. If you enable this, you probably
-# want to compile PostgreSQL with "-DLINUX_OOM_SCORE_ADJ=0", so that
-# individual backends can still be killed by the OOM killer.
+# to -1000 will disable OOM kill altogether, which is a good thing for the
+# postmaster, but not so much for individual backends. If you enable this,
+# also uncomment the DAEMON_ENV line, which will instruct backends to set
+# their OOM adjustments back to the default setting of zero.
 #OOM_SCORE_ADJ=-1000
+#DAEMON_ENV="PG_OOM_ADJUST_FILE=/proc/self/oom_score_adj"
 # Older Linux kernels may not have /proc/self/oom_score_adj, but instead
 # /proc/self/oom_adj, which works similarly except the disable value is -17.
-# For such a system, enable this and compile with "-DLINUX_OOM_ADJ=0".
+# For such a system, uncomment these two lines instead.
 #OOM_ADJ=-17
+#DAEMON_ENV="PG_OOM_ADJUST_FILE=/proc/self/oom_adj"
 ## STOP EDITING HERE
@@ -84,7 +87,7 @@ case $1 in
 	echo -n "Starting PostgreSQL: "
 	test x"$OOM_SCORE_ADJ" != x && echo "$OOM_SCORE_ADJ" > /proc/self/oom_score_adj
 	test x"$OOM_ADJ" != x && echo "$OOM_ADJ" > /proc/self/oom_adj
-	su - $PGUSER -c "$DAEMON -D '$PGDATA' &" >>$PGLOG 2>&1
+	su - $PGUSER -c "$DAEMON_ENV $DAEMON -D '$PGDATA' &" >>$PGLOG 2>&1
 	echo "ok"
 	;;
   stop)
@@ -97,7 +100,7 @@ case $1 in
 	su - $PGUSER -c "$PGCTL stop -D '$PGDATA' -s -m fast -w"
 	test x"$OOM_SCORE_ADJ" != x && echo "$OOM_SCORE_ADJ" > /proc/self/oom_score_adj
 	test x"$OOM_ADJ" != x && echo "$OOM_ADJ" > /proc/self/oom_adj
-	su - $PGUSER -c "$DAEMON -D '$PGDATA' &" >>$PGLOG 2>&1
+	su - $PGUSER -c "$DAEMON_ENV $DAEMON -D '$PGDATA' &" >>$PGLOG 2>&1
 	echo "ok"
 	;;
   reload)
......
@@ -1275,7 +1275,7 @@ sysctl -w vm.overcommit_memory=2
 <para>
 Another approach, which can be used with or without altering
 <varname>vm.overcommit_memory</>, is to set the process-specific
-<varname>oom_score_adj</> value for the postmaster process to
+<firstterm>OOM score adjustment</> value for the postmaster process to
 <literal>-1000</>, thereby guaranteeing it will not be targeted by the OOM
 killer. The simplest way to do this is to execute
 <programlisting>
@@ -1284,20 +1284,28 @@ echo -1000 > /proc/self/oom_score_adj
 in the postmaster's startup script just before invoking the postmaster.
 Note that this action must be done as root, or it will have no effect;
 so a root-owned startup script is the easiest place to do it. If you
-do this, you may also wish to build <productname>PostgreSQL</>
-with <literal>-DLINUX_OOM_SCORE_ADJ=0</> added to <varname>CPPFLAGS</>.
-That will cause postmaster child processes to run with the normal
-<varname>oom_score_adj</> value of zero, so that the OOM killer can still
-target them at need.
+do this, you should also set these environment variables in the startup
+script before invoking the postmaster:
+<programlisting>
+export PG_OOM_ADJUST_FILE=/proc/self/oom_score_adj
+export PG_OOM_ADJUST_VALUE=0
+</programlisting>
+These settings will cause postmaster child processes to run with the
+normal OOM score adjustment of zero, so that the OOM killer can still
+target them at need. You could use some other value for
+<envar>PG_OOM_ADJUST_VALUE</> if you want the child processes to run
+with some other OOM score adjustment. (<envar>PG_OOM_ADJUST_VALUE</>
+can also be omitted, in which case it defaults to zero.) If you do not
+set <envar>PG_OOM_ADJUST_FILE</>, the child processes will run with the
+same OOM score adjustment as the postmaster, which is unwise since the
+whole point is to ensure that the postmaster has a preferential setting.
 </para>
 <para>
 Older Linux kernels do not offer <filename>/proc/self/oom_score_adj</>,
 but may have a previous version of the same functionality called
 <filename>/proc/self/oom_adj</>. This works the same except the disable
-value is <literal>-17</> not <literal>-1000</>. The corresponding
-build flag for <productname>PostgreSQL</> is
-<literal>-DLINUX_OOM_ADJ=0</>.
+value is <literal>-17</> not <literal>-1000</>.
 </para>
 <note>
......
@@ -31,6 +31,7 @@ pid_t
 fork_process(void)
 {
 	pid_t result;
+	const char *oomfilename;
 #ifdef LINUX_PROFILE
 	struct itimerval prof_itimer;
@@ -71,62 +72,40 @@ fork_process(void)
 		 * process sizes *including shared memory*. (This is unbelievably
 		 * stupid, but the kernel hackers seem uninterested in improving it.)
 		 * Therefore it's often a good idea to protect the postmaster by
-		 * setting its oom_score_adj value negative (which has to be done in a
-		 * root-owned startup script). If you just do that much, all child
-		 * processes will also be protected against OOM kill, which might not
-		 * be desirable. You can then choose to build with
-		 * LINUX_OOM_SCORE_ADJ #defined to 0, or to some other value that you
-		 * want child processes to adopt here.
+		 * setting its OOM score adjustment negative (which has to be done in
+		 * a root-owned startup script). Since the adjustment is inherited by
+		 * child processes, this would ordinarily mean that all the
+		 * postmaster's children are equally protected against OOM kill, which
+		 * is not such a good idea. So we provide this code to allow the
+		 * children to change their OOM score adjustments again. Both the
+		 * file name to write to and the value to write are controlled by
+		 * environment variables, which can be set by the same startup script
+		 * that did the original adjustment.
 		 */
-#ifdef LINUX_OOM_SCORE_ADJ
-		{
-			/*
-			 * Use open() not stdio, to ensure we control the open flags. Some
-			 * Linux security environments reject anything but O_WRONLY.
-			 */
-			int fd = open("/proc/self/oom_score_adj", O_WRONLY, 0);
-			/* We ignore all errors */
-			if (fd >= 0)
-			{
-				char buf[16];
-				int rc;
-				snprintf(buf, sizeof(buf), "%d\n", LINUX_OOM_SCORE_ADJ);
-				rc = write(fd, buf, strlen(buf));
-				(void) rc;
-				close(fd);
-			}
-		}
-#endif /* LINUX_OOM_SCORE_ADJ */
-		/*
-		 * Older Linux kernels have oom_adj not oom_score_adj. This works
-		 * similarly except with a different scale of adjustment values. If
-		 * it's necessary to build Postgres to work with either API, you can
-		 * define both LINUX_OOM_SCORE_ADJ and LINUX_OOM_ADJ.
-		 */
-#ifdef LINUX_OOM_ADJ
+		oomfilename = getenv("PG_OOM_ADJUST_FILE");
+		if (oomfilename != NULL)
 		{
 			/*
 			 * Use open() not stdio, to ensure we control the open flags. Some
 			 * Linux security environments reject anything but O_WRONLY.
 			 */
-			int fd = open("/proc/self/oom_adj", O_WRONLY, 0);
+			int fd = open(oomfilename, O_WRONLY, 0);
 			/* We ignore all errors */
 			if (fd >= 0)
 			{
-				char buf[16];
+				const char *oomvalue = getenv("PG_OOM_ADJUST_VALUE");
 				int rc;
-				snprintf(buf, sizeof(buf), "%d\n", LINUX_OOM_ADJ);
-				rc = write(fd, buf, strlen(buf));
+				if (oomvalue == NULL) /* supply a useful default */
+					oomvalue = "0";
+				rc = write(fd, oomvalue, strlen(oomvalue));
 				(void) rc;
 				close(fd);
 			}
 		}
-#endif /* LINUX_OOM_ADJ */
 		/*
 		 * Make sure processes do not share OpenSSL randomness state.
......
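
For experimenting with the child-side behavior without building the server, the logic of the new `fork_process()` code can be approximated in shell. This is only a sketch: `pg_oom_reset` is a hypothetical name, and unlike the C code's `open(..., O_WRONLY)` the shell redirection truncates the target, which makes no practical difference for `/proc` files but does for ordinary test files.

```shell
#!/bin/sh
# pg_oom_reset: hypothetical shell approximation of what each postmaster
# child does right after fork, according to the diff above.
pg_oom_reset() {
    # PG_OOM_ADJUST_FILE unset (or empty): leave the inherited adjustment alone.
    [ -n "${PG_OOM_ADJUST_FILE-}" ] || return 0
    # The C code opens without O_CREAT, so a missing or unwritable file is a
    # silently ignored error rather than something to create.
    [ -w "$PG_OOM_ADJUST_FILE" ] || return 0
    # PG_OOM_ADJUST_VALUE defaults to "0" only when it is unset.
    printf '%s' "${PG_OOM_ADJUST_VALUE-0}" > "$PG_OOM_ADJUST_FILE" 2>/dev/null
    # All errors are ignored, as in the C code.
    return 0
}
```

Pointing `PG_OOM_ADJUST_FILE` at a scratch file and calling `pg_oom_reset` shows which value a child would write for a given environment.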