    Fix postmaster state machine to handle dead_end child crashes better. · ee327823
    Tom Lane authored
    A report from Alvaro Herrera shows that if we're in PM_STARTUP
    state, and we spawn a dead_end child to reject some incoming
    connection request, and that child dies with an unexpected exit
    code, the postmaster does not respond well.  We correctly send
    SIGQUIT to the startup process, but then:
    
    * if the startup process exits with nonzero exit code, as expected,
    we took that to indicate a crash and aborted startup.
    
    * if the startup process exits with zero exit code, which is possible
    due to the inherent race condition, we'd advance to PM_RUN state
    which is fine --- but the code forgot that AbortStartTime would be
    nonzero in this situation.  We'd either die on the Asserts saying
    that it was zero, or perhaps misbehave later on.  (A quick look
    suggests that the only misbehavior might be busy-waiting due to
    DetermineSleepTime doing the wrong thing.)
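
One way a stale AbortStartTime could lead to busy-waiting is sketched below. This is only an illustration, not the postmaster's code: it assumes the sleep time is computed as a fixed grace period minus the time elapsed since the abort began, clamped at zero. The function name and both constants are hypothetical.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical constants, chosen only for illustration. */
#define SKETCH_KILL_AFTER_SECS    5   /* assumed grace period after SIGQUIT */
#define SKETCH_DEFAULT_SLEEP_SECS 60  /* assumed idle sleep time */

/*
 * Assumed model of a sleep-time computation: while an abort is pending
 * (abort_start_time != 0), sleep only until the grace period runs out.
 */
static long
sketch_sleep_secs(time_t abort_start_time, time_t now)
{
    if (abort_start_time != 0)
    {
        long left = SKETCH_KILL_AFTER_SECS - (long) (now - abort_start_time);

        /* Once the grace period has passed, this is always zero. */
        return (left > 0) ? left : 0;
    }
    return SKETCH_DEFAULT_SLEEP_SECS;
}
```

Under these assumptions, a timestamp that is never cleared makes every later call return zero, so the caller's wait loop would spin instead of sleeping.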
    
    To fix the first point, adjust the state-machine logic to recognize
    that a nonzero exit code is expected after sending SIGQUIT, and have
    it transition to a state where we can restart the startup process.
    To fix the second point, change the Asserts to clear the variable
    rather than just claiming it should be clear already.
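
The two fixes can be modeled roughly as follows. This is a minimal sketch under stated assumptions, not the code in postmaster.c: every type, field, and function name here is hypothetical, and the choice of target state after a signaled exit is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical, much-simplified stand-ins for the postmaster's state. */
typedef enum { SK_PM_STARTUP, SK_PM_RUN, SK_PM_WAIT_BACKENDS } SketchPMState;
typedef enum { SK_NOT_RUNNING, SK_RUNNING, SK_SIGNALED, SK_CRASHED } SketchStartupStatus;

typedef struct
{
    SketchPMState       pm_state;
    SketchStartupStatus startup_status;
    time_t              abort_start_time;   /* nonzero while an abort is pending */
    bool                startup_aborted;    /* true if startup was given up on */
} SketchPostmaster;

/* A child died unexpectedly: SIGQUIT the startup process and note the time. */
static void
sketch_child_crashed(SketchPostmaster *pm, time_t now)
{
    assert(pm != NULL);
    if (pm->startup_status == SK_RUNNING)
        pm->startup_status = SK_SIGNALED;
    pm->abort_start_time = now;
}

/* The startup process exited with the given exit code. */
static void
sketch_startup_exited(SketchPostmaster *pm, int exit_code)
{
    assert(pm != NULL);
    if (exit_code == 0)
    {
        /* Race: startup finished before acting on SIGQUIT. */
        pm->startup_status = SK_NOT_RUNNING;
        pm->abort_start_time = 0;   /* second fix: clear it, don't assert it */
        pm->pm_state = SK_PM_RUN;
    }
    else if (pm->startup_status == SK_SIGNALED)
    {
        /* First fix: a nonzero exit after SIGQUIT is expected, not a crash. */
        pm->startup_status = SK_NOT_RUNNING;
        if (pm->pm_state == SK_PM_STARTUP)
            pm->pm_state = SK_PM_WAIT_BACKENDS;   /* assumed restartable state */
    }
    else
    {
        /* Nonzero exit that we did not ask for: a genuine startup failure. */
        pm->startup_status = SK_CRASHED;
        pm->startup_aborted = true;
    }
}
```

In this model, only a nonzero exit that was not preceded by SIGQUIT aborts startup. The zero-exit path leaves the timestamp cleared regardless of what came before.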
    
    Perhaps we could improve this further by not treating a crash of
    a dead_end child as a reason for panicking the database.  However,
    since those child processes are connected to shared memory, that
    seems a bit risky.  There are few good reasons for a dead_end child
    to report failure anyway (the cause of this in Alvaro's report is
    quite unclear).  On balance, therefore, a minimal fix seems best.
    
    This is an oversight in commit 45811be9.  While that was back-patched,
    I'm hesitant to back-patch this change.  The lack of reasons for a
    dead_end child to fail suggests that the case should be very rare in
    the field, which squares with the lack of reports; so it seems like
    this might not be worth the risk of introducing new issues.  In any
    case we can let it bake awhile in HEAD before considering a back-patch.
    
    Discussion: https://postgr.es/m/20190615160950.GA31378@alvherre.pgsql
postmaster.c 181 KB