    Fix ordering of XIDs in ProcArrayApplyRecoveryInfo · fb2f8e53
    Tomas Vondra authored
    Commit 8431e296 reworked ProcArrayApplyRecoveryInfo to sort XIDs
    before adding them to KnownAssignedXids. But the XIDs are sorted using
    xidComparator, which compares the XIDs simply as uint32 values, not
    logically. KnownAssignedXidsAdd() however expects XIDs in logical order,
    and calls TransactionIdFollowsOrEquals() to enforce that. If there are
    XIDs for which the two orderings disagree, an error is raised and the
    recovery fails/restarts.
    
    Hitting this issue is fairly easy: you just need two transactions, one
    started before the 4B limit (e.g. XID 4294967290), the other sometime
    after it (e.g. XID 1000). Logically 4294967290 precedes 1000, but
    xidComparator sorts 1000 first, so we try to add them in the opposite
    order, and KnownAssignedXidsAdd() fails with an error like this:
    
      ERROR: out-of-order XID insertion in KnownAssignedXids
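
    The disagreement between the two orderings can be reproduced outside
    the server. The following is a minimal standalone sketch, not the
    patch itself: raw_xid_cmp and logical_xid_cmp are hypothetical
    stand-ins for xidComparator and a logical comparator, and the logical
    one assumes both XIDs are normal XIDs, so the modulo-2^32 signed
    difference is meaningful.

```c
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;

/* Raw ordering: compare the XIDs as plain uint32 values. */
static int
raw_xid_cmp(const void *a, const void *b)
{
    TransactionId xa = *(const TransactionId *) a;
    TransactionId xb = *(const TransactionId *) b;

    if (xa > xb)
        return 1;
    if (xa < xb)
        return -1;
    return 0;
}

/*
 * Logical ordering (assumes normal XIDs only): the sign of the
 * modulo-2^32 difference decides which XID precedes the other.
 */
static int
logical_xid_cmp(const void *a, const void *b)
{
    TransactionId xa = *(const TransactionId *) a;
    TransactionId xb = *(const TransactionId *) b;
    int32_t     diff = (int32_t) (xa - xb);

    if (diff > 0)
        return 1;
    if (diff < 0)
        return -1;
    return 0;
}
```

    Sorting the pair {4294967290, 1000} with qsort() and raw_xid_cmp
    puts 1000 first, while logical_xid_cmp puts 4294967290 first, which
    is the order KnownAssignedXidsAdd() expects.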
    
    This only happens during replica startup, while processing RUNNING_XACTS
    records to build the snapshot. Once we reach STANDBY_SNAPSHOT_READY, we
    skip these records. So this does not affect already running replicas,
    but if you restart (or create) a replica while there are transactions
    with XIDs for which the two orderings disagree, you may hit this.
    
    Long-running transactions and frequent replica restarts increase the
    likelihood of hitting this issue. Once the replica gets into this state,
    it can't be started (even if the old transactions are terminated).
    
    Fixed by sorting the XIDs logically. This is fine because we're dealing
    with normal XIDs (they are XIDs assigned to backends) from the same
    wraparound epoch (otherwise the backends could not be running at the
    same time on the primary node). So there are no problems with the
    triangle inequality, which is the reason xidComparator compares raw
    values in the general case.
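
    To illustrate the triangle inequality concern, here is a small
    sketch (xid_precedes is a hypothetical helper using the same signed
    modulo-2^32 difference, and the three XID values are made up for the
    example). For XIDs spread across more than half of the 32-bit space,
    the logical ordering is not transitive, so it cannot be used as a
    general sort comparator; XIDs of concurrently running backends never
    span that far.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Logical "precedes" for normal XIDs: signed modulo-2^32 difference. */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Returns true if a precedes b, b precedes c, and yet c precedes a,
 * i.e. the three XIDs form a cycle under the logical ordering.
 */
static bool
xids_form_cycle(TransactionId a, TransactionId b, TransactionId c)
{
    return xid_precedes(a, b) && xid_precedes(b, c) && xid_precedes(c, a);
}
```

    With XIDs 3, 1500000003 and 3000000003 (each 1.5 billion apart),
    xids_form_cycle() returns true. With XIDs close together across the
    wraparound point, such as 4294967290, 1000 and 2000, it returns
    false and the ordering is consistent.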
    
    Investigation and root cause analysis by Abhijit Menon-Sen. Patch by me.
    
    This issue is present in all releases since 9.4; however, releases up
    to 9.6 are already EOL, so backpatch to 10 only.
    
    Reviewed-by: Abhijit Menon-Sen
    Reviewed-by: Alvaro Herrera
    Backpatch-through: 10
    Discussion: https://postgr.es/m/36b8a501-5d73-277c-4972-f58a4dce088a%40enterprisedb.com