    Don't lose walreceiver start requests due to race condition in postmaster. · e5d494d7
    Tom Lane authored
    When a walreceiver dies, the startup process will notice that and send
    a PMSIGNAL_START_WALRECEIVER signal to the postmaster, asking for a new
    walreceiver to be launched.  There's a race condition, which at least
    in HEAD is very easy to hit, whereby the postmaster might see that
    signal before it processes the SIGCHLD from the walreceiver process.
    In that situation, sigusr1_handler() just dropped the start request
    on the floor, reasoning that it must be redundant.  Eventually, after
    10 seconds (WALRCV_STARTUP_TIMEOUT), the startup process would make a
    fresh request --- but that's a long time if the connection could have
    been re-established almost immediately.
    
    Fix it by setting a state flag inside the postmaster that we won't
    clear until we do launch a walreceiver.  In cases where that results
    in an extra walreceiver launch, it's up to the walreceiver to realize
    it's unwanted and go away --- but we have, and need, that logic anyway
    for the opposite race case.
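
    The following is a minimal toy model of that flag, not the actual
    postmaster.c code.  Every name in it (ToyPostmaster, start_requested,
    sticky, handle_start_request, and so on) is hypothetical and chosen only
    for illustration.  It replays the ordering described above, where the
    start request is seen before the walreceiver's exit has been processed,
    and counts how many walreceivers get launched with and without a sticky
    request flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for postmaster state; all names are illustrative only. */
typedef struct
{
    int  walreceiver_pid;   /* 0 = no walreceiver believed to be running */
    bool start_requested;   /* pending start request */
    bool sticky;            /* false models the old "drop the request" logic */
    int  launches;          /* number of walreceivers launched so far */
} ToyPostmaster;

/* Launch a walreceiver if one is wanted and none is believed running. */
static void
maybe_launch(ToyPostmaster *pm)
{
    if (pm->start_requested && pm->walreceiver_pid == 0)
    {
        pm->walreceiver_pid = 1000 + pm->launches;  /* fake child PID */
        pm->launches++;
        pm->start_requested = false;    /* cleared only on an actual launch */
        assert(pm->walreceiver_pid != 0);
    }
}

/* Models receipt of the start request from the startup process. */
static void
handle_start_request(ToyPostmaster *pm)
{
    /* Old logic: ignore the request if a walreceiver still seems alive. */
    if (pm->sticky || pm->walreceiver_pid == 0)
        pm->start_requested = true;
    maybe_launch(pm);
}

/* Models processing of the dead walreceiver's SIGCHLD. */
static void
handle_child_exit(ToyPostmaster *pm)
{
    pm->walreceiver_pid = 0;
    maybe_launch(pm);
}

/* Replay the race: the request arrives before the exit is processed. */
int
launches_when_request_beats_sigchld(bool sticky)
{
    ToyPostmaster pm = {
        .walreceiver_pid = 999,     /* dead child not yet reaped */
        .start_requested = false,
        .sticky = sticky,
        .launches = 0
    };

    handle_start_request(&pm);
    handle_child_exit(&pm);
    return pm.launches;
}
```

    In this model, launches_when_request_beats_sigchld(false) returns 0
    because the request is forgotten, while passing true returns 1 because
    the request survives until the exit has been processed.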
    
    I came across this through investigating unexpected delays in the
    src/test/recovery TAP tests: it manifests there in test cases where
    a master server is stopped and restarted while leaving streaming
    slaves active.
    
    This logic has been broken all along, so back-patch to all supported
    branches.
    
    Discussion: https://postgr.es/m/21344.1498494720@sss.pgh.pa.us
postmaster.c 176 KB