    Make pg_ctl stop/restart/promote recheck postmaster aliveness. · 1e8c5cf7
    Tom Lane authored
    "pg_ctl stop/restart" checked that the postmaster PID is valid just
    once, as a side-effect of sending the stop signal, and then would
    wait-till-timeout for the postmaster.pid file to go away.  This
    neglects the case wherein the postmaster dies uncleanly after we
    signal it.  Similarly, once "pg_ctl promote" has sent the signal,
    it'd wait for the corresponding on-disk state change to occur
    even if the postmaster dies.
    
    I'm not sure how we've managed not to notice this problem, but it
    seems to explain slow execution of the 017_shm.pl test script on AIX
    since commit 4fdbf9af5, which added a speculative "pg_ctl stop" with
    the idea of making real sure that the postmaster isn't there.  In the
    test steps that kill-9 and then restart the postmaster, it's possible
    to get past the initial signal attempt before kill() stops working
    for the doomed postmaster.  If that happens, pg_ctl would wait until
    PGCTLTIMEOUT before giving up ... and the buildfarm's AIX members
    have that set very high.
    
    To fix, include a "kill(pid, 0)" test (similar to what
    postmaster_is_alive uses) in these wait loops, so that we'll
    give up immediately if the postmaster PID disappears.
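
    As a rough illustration of the idea (not the code in pg_ctl.c), a
    wait loop with an aliveness probe might look like the sketch below.
    The names wait_for_stop, pidfile_exists and WaitResult, the 100 ms
    polling interval, and the choice to treat only ESRCH as "process
    gone" are all assumptions made for this example.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <signal.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical result codes for this sketch. */
typedef enum
{
    WAIT_STOPPED,   /* pid file went away: clean shutdown */
    WAIT_DIED,      /* process vanished but pid file is still there */
    WAIT_TIMEOUT    /* neither happened within the timeout */
} WaitResult;

/* Return nonzero if the given path currently exists. */
int
pidfile_exists(const char *path)
{
    struct stat st;

    return stat(path, &st) == 0;
}

/*
 * Poll until the pid file disappears, the process disappears, or the
 * timeout expires.  Conditions are checked at least once, even when
 * timeout_secs is 0.
 */
WaitResult
wait_for_stop(pid_t pid, const char *pidfile, int timeout_secs)
{
    const struct timespec tick = {0, 100 * 1000 * 1000};    /* 100 ms */
    int         max_ticks = timeout_secs * 10;

    for (int elapsed = 0;; elapsed++)
    {
        if (!pidfile_exists(pidfile))
            return WAIT_STOPPED;

        /*
         * Signal 0 delivers nothing; it only reports whether the PID
         * could be signaled.  ESRCH means no such process exists.
         */
        if (kill(pid, 0) != 0 && errno == ESRCH)
        {
            /* Recheck the file once to narrow the race with a clean exit. */
            return pidfile_exists(pidfile) ? WAIT_DIED : WAIT_STOPPED;
        }

        if (elapsed >= max_ticks)
            return WAIT_TIMEOUT;

        nanosleep(&tick, NULL);
    }
}
```

    In this sketch a caller would treat WAIT_DIED as a reason to stop
    waiting at once rather than sleeping until the timeout, which is
    the behavior change described above.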
    
    While here, I chose to refactor those loops out of where they were.
    do_stop() and do_restart() can perfectly well share one copy of the
    wait-for-stop loop, and it seems desirable to put a similar function
    beside that for wait-for-promote.
    
    Back-patch to all supported versions, since pg_ctl's wait logic
    is substantially identical in all, and we're seeing the slow test
    behavior in all branches.
    
    Discussion: https://postgr.es/m/20220210023537.GA3222837@rfd.leadboat.com
pg_ctl.c 66.9 KB