    Fix race conditions in newly-added test. · 36a4ac20
    Heikki Linnakangas authored
    The buildfarm has been failing sporadically on the new test.  I was able to
    reproduce this by adding a random 0-10 s delay in the walreceiver, just
    before it connects to the primary. There's a race condition where node_3
    is promoted before it has fully caught up with node_1, leading to diverged
    timelines. When node_1 is later reconfigured as standby following node_3,
    it fails to catch up:
    
    LOG:  primary server contains no more WAL on requested timeline 1
    LOG:  new timeline 2 forked off current database system timeline 1 before current recovery point 0/30000A0
    
    That's the situation where you'd need to use pg_rewind, but in this case
    it happens while we are still setting up the actual pg_rewind scenario
    we want to test. Change the test so that it waits until node_3 is
    connected and fully caught up before promoting it, so that you get a
    clean, controlled failover.
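
The ordering described above (wait for the standby to connect and replay up to the primary's position, and only then promote) can be sketched as follows. This is a minimal illustration in Python with a simulated standby, not the Perl code of the actual test. The names `parse_lsn`, `wait_for_catchup` and `fake_replay_lsn` are hypothetical helpers invented for this sketch, and the intermediate LSN values are made up.

```python
import time


def parse_lsn(text):
    """Convert an LSN such as '0/30000A0' to a 64-bit integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wait_for_catchup(get_replay_lsn, target_lsn, timeout=10.0, interval=0.01):
    """Poll until the standby's replay LSN reaches target_lsn.

    get_replay_lsn is a hypothetical callable that returns the standby's
    replay position as text, or None while it has not connected yet.
    Returns True on success and False if the timeout expires.
    """
    target = parse_lsn(target_lsn)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        current = get_replay_lsn()
        if current is not None and parse_lsn(current) >= target:
            return True
        time.sleep(interval)
    return False


# Simulated standby: not connected at first, then replaying in steps.
# The intermediate positions are illustrative values only.
replay_positions = iter([None, "0/3000000", "0/3000060", "0/30000A0"])
last_seen = {"lsn": None}


def fake_replay_lsn():
    # Once the simulated positions run out, keep reporting the last one.
    last_seen["lsn"] = next(replay_positions, last_seen["lsn"])
    return last_seen["lsn"]


# Promotion would only be attempted once this is True.
caught_up = wait_for_catchup(fake_replay_lsn, "0/30000A0")
```

Promoting only after `caught_up` is true is what avoids the diverged timelines: the standby has replayed everything the old primary wrote, so the new timeline forks at or after the old primary's last position.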
    
    Also rewrite some of the comments, for clarity. The existing comments
    detailed what each step in the test did, but didn't give a good overview
    of the situation the steps were trying to create.
    
    For reasons I don't understand, the test setup had to be written slightly
    differently in 9.6 and 9.5 than in later versions. The 9.5/9.6 version
    needed node_1 to be reinitialized from backup, whereas in later versions
    it could be shut down and reconfigured to be a standby. But even 9.5 should
    support "clean switchover", where the primary makes sure that pending WAL is
    replicated to the standby on shutdown. It would be nice to figure out what's
    going on there, but that's independent of pg_rewind and the scenario that
    this test tests.
    
    Discussion: https://www.postgresql.org/message-id/b0a3b95b-82d2-6089-6892-40570f8c5e60%40iki.fi
008_min_recovery_point.pl 4.83 KB