• Tom Lane's avatar
    Fix race condition during replication origin drop. · 8a906204
    Tom Lane authored
    replorigin_drop() misunderstood the API for condition variables: it
    had ConditionVariablePrepareToSleep and ConditionVariableCancelSleep
    inside its test-and-sleep loop, rather than outside the loop as
    intended.  The net effect is a narrow race-condition window wherein,
    if the process using a replication slot releases it immediately after
    replorigin_drop() releases the ReplicationOriginLock, replorigin_drop()
    would get into the condition variable's wait list too late and then
    wait indefinitely for a signal that won't come.
    
    Because there's a different CV for each replication slot, we can't
    just move the ConditionVariablePrepareToSleep call to above the
    test-and-sleep loop.  What we can do, in the wake of commit 13db3b93,
    is drop the ConditionVariablePrepareToSleep call entirely.  This fix
    depends on that commit because (at least in principle) the slot matching
    the target replication origin might move around, so that once in a blue
    moon successive loop iterations might involve different CVs.  We can now
    cope with such a scenario, at the cost of an extra trip through the
    retry loop.
    
    (There are ways we could fix this bug without depending on that commit,
    but they're all a lot more complicated than this way.)
    
    While at it, upgrade the rather skimpy comments in this function.
    
    Back-patch to v10 where this code came in.
    
    Discussion: https://postgr.es/m/19947.1515455433@sss.pgh.pa.us
    8a906204
origin.c 40 KB