    Fix assorted race conditions in the new timeout infrastructure. · 16e1b7a1
    Tom Lane authored
    Prevent handle_sig_alarm from losing control partway through due to a query
    cancel (either an asynchronous SIGINT, or a cancel triggered by one of the
    timeout handler functions).  That would at least result in failure to
    schedule any required future interrupt, and might result in actual
    corruption of timeout.c's data structures, if the interrupt happened while
    we were updating those.
    
    We could still lose control if an asynchronous SIGINT arrives just as the
    function is entered.  This wouldn't break any data structures, but it would
    have the same effect as if the SIGALRM interrupt had been silently lost:
    we'd not fire any currently-due handlers, nor schedule any new interrupt.
    To forestall that scenario, forcibly reschedule any pending timer interrupt
    during AbortTransaction and AbortSubTransaction.  We can avoid any extra
    kernel call in most cases by not doing that until we've allowed
    LockErrorCleanup to kill the DEADLOCK_TIMEOUT and LOCK_TIMEOUT events.
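
    As a rough illustration of the rescheduling idea (not the code in
    timeout.c), the sketch below re-arms the interval timer for the nearest
    still-pending timeout and makes no kernel call at all when nothing is
    pending, which is presumably the common case once the lock-related
    timeouts have been cancelled.  Every name beginning with toy_ or TOY_ is
    hypothetical and exists only for this example.

```c
#define _XOPEN_SOURCE 700
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical toy model: delay in ms of each pending timeout, 0 = unused. */
#define TOY_MAX_TIMEOUTS 4
static long toy_pending_ms[TOY_MAX_TIMEOUTS];

/*
 * Re-arm ITIMER_REAL for the nearest pending timeout.
 * Returns the delay that was armed, or 0 if nothing was pending,
 * in which case setitimer() is not called at all.
 */
static long
toy_reschedule(void)
{
    long        nearest = 0;
    struct itimerval timer = {0};

    for (size_t i = 0; i < TOY_MAX_TIMEOUTS; i++)
    {
        if (toy_pending_ms[i] > 0 &&
            (nearest == 0 || toy_pending_ms[i] < nearest))
            nearest = toy_pending_ms[i];
    }

    if (nearest == 0)
        return 0;               /* nothing pending: skip the kernel call */

    /* One-shot timer: it_interval stays zero, only it_value is set. */
    timer.it_value.tv_sec = nearest / 1000;
    timer.it_value.tv_usec = (nearest % 1000) * 1000;
    setitimer(ITIMER_REAL, &timer, NULL);
    return nearest;
}
```

    Calling toy_reschedule() after the lock timeouts have been zeroed out of
    toy_pending_ms costs only the loop over the array; the setitimer() call
    happens only if some other timeout is still waiting.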
    
    Another hazard is that some platforms (at least Linux and *BSD) block a
    signal before calling its handler and then unblock it on return.  When we
    longjmp out of the handler, the unblock doesn't happen, and the signal is
    left blocked indefinitely.  Again, we can fix that by forcibly unblocking
    signals during AbortTransaction and AbortSubTransaction.
    
    These latter two problems do not manifest when the longjmp reaches
    postgres.c, because the error recovery code there kills all pending timeout
    events anyway, and it uses sigsetjmp(..., 1) so that the appropriate signal
    mask is restored.  So errors thrown outside any transaction should be OK
    already, and cleaning up in AbortTransaction and AbortSubTransaction should
    be enough to fix these issues.  (We're assuming that any code that catches
    a query cancel error and doesn't re-throw it will do at least a
    subtransaction abort to clean up; but that was pretty much required already
    by other subsystems.)
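
    The signal-mask behavior described in the last two paragraphs can be
    observed in isolation.  The sketch below is not PostgreSQL code; all
    toy_ names are hypothetical.  It jumps out of a SIGALRM handler and
    reports whether SIGALRM is still blocked afterwards, once with
    sigsetjmp(..., 0) and once with sigsetjmp(..., 1).  The first result
    depends on the platform blocking a signal while its handler runs, as
    the message says Linux and the BSDs do.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>

static sigjmp_buf toy_recovery_point;   /* hypothetical name */

/* Leave the handler by jumping, so the "unblock on return" step never runs. */
static void
toy_alarm_handler(int signo)
{
    (void) signo;
    siglongjmp(toy_recovery_point, 1);
}

/* Is SIGALRM in the current signal mask? */
static bool
toy_sigalrm_blocked(void)
{
    sigset_t    current;

    sigemptyset(&current);
    sigprocmask(SIG_BLOCK, NULL, &current);     /* query only */
    return sigismember(&current, SIGALRM) == 1;
}

/* The forcible unblock that the abort path would perform. */
static void
toy_unblock_sigalrm(void)
{
    sigset_t    set;

    sigemptyset(&set);
    sigaddset(&set, SIGALRM);
    sigprocmask(SIG_UNBLOCK, &set, NULL);
}

/*
 * Deliver SIGALRM, escape from its handler, and report whether the signal
 * is still blocked.  save_mask selects sigsetjmp(..., 1) versus (..., 0).
 */
static bool
toy_blocked_after_escape(bool save_mask)
{
    struct sigaction sa;

    sa.sa_handler = toy_alarm_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGALRM, &sa, NULL);
    toy_unblock_sigalrm();      /* start from a known state */

    if (sigsetjmp(toy_recovery_point, save_mask ? 1 : 0) == 0)
        raise(SIGALRM);         /* handler runs and jumps back here */

    return toy_sigalrm_blocked();
}
```

    On such platforms toy_blocked_after_escape(false) should return true
    until toy_unblock_sigalrm() is called, while
    toy_blocked_after_escape(true) should return false because siglongjmp
    restores the mask that sigsetjmp saved.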
    
    Lastly, ProcSleep should not clear the LOCK_TIMEOUT indicator flag when
    disabling that event: if a lock timeout interrupt happened after the lock
    was granted, the ensuing query cancel is still going to happen at the next
    CHECK_FOR_INTERRUPTS, and we want to report it as a lock timeout not a user
    cancel.
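
    A toy model of the indicator-flag point, with hypothetical toy_ names
    and no claim to match the real timeout.c interface: the flag records
    that the timeout fired, and whether the later cancel is reported as a
    lock timeout depends on whether disabling the event preserved it.

```c
#include <stdbool.h>

/* Hypothetical toy model of a single timeout event. */
typedef struct
{
    bool        active;         /* is the event currently scheduled? */
    bool        indicator;      /* has it fired since it was enabled? */
} ToyTimeout;

static ToyTimeout toy_lock_timeout;

static void
toy_enable(void)
{
    toy_lock_timeout.active = true;
    toy_lock_timeout.indicator = false;
}

/* What the timer interrupt would do when the deadline passes. */
static void
toy_fire(void)
{
    if (toy_lock_timeout.active)
    {
        toy_lock_timeout.active = false;
        toy_lock_timeout.indicator = true;
    }
}

/* Disable the event; optionally keep the record that it already fired. */
static void
toy_disable(bool keep_indicator)
{
    toy_lock_timeout.active = false;
    if (!keep_indicator)
        toy_lock_timeout.indicator = false;
}

/* At the next interrupt check: report a lock timeout or a user cancel? */
static bool
toy_cancel_is_lock_timeout(void)
{
    return toy_lock_timeout.indicator;
}
```

    With the sequence toy_enable(), toy_fire(), toy_disable(true), the
    pending cancel is still classified as a lock timeout; with
    toy_disable(false) the same cancel would be misreported as a user
    cancel, which is the behavior the message says to avoid.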
    
    Per reports from Dan Wood.
    
    Back-patch to 9.3 where the new timeout handling infrastructure was
    introduced.  We may at some point decide to back-patch the signal
    unblocking changes further, but I'll desist from that until we hear
    actual field complaints about it.