• Heikki Linnakangas's avatar
    Fix race condition in B-tree page deletion. · efada2b8
    Heikki Linnakangas authored
    In short, we don't allow a page to be deleted if it's the rightmost child
    of its parent, but that situation can change after we check for it.
    
    Problem
    -------
    
    We check that the page to be deleted is not the rightmost child of its
    parent, and then lock its left sibling, the page itself, its right sibling,
    and the parent, in that order. However, if the parent page is split after
    the check but before acquiring the locks, the target page might become the
    rightmost child, if the split happens at the right place. That leads to an
    error in vacuum (I reproduced this by setting a breakpoint in debugger):
    
    ERROR:  failed to delete rightmost child 41 of block 3 in index "foo_pkey"
    
    We currently re-check that the page is still the rightmost child, and throw
    the above error if it's not. We could easily just give up rather than throw
    an error, but that approach doesn't scale to half-dead pages. To recap,
    although we don't normally allow deleting the rightmost child, if the page
    is the *only* child of its parent, we delete the child page and mark the
    parent page as half-dead in one atomic operation. But before we do that, we
    check that the parent can later be deleted, by checking that it in turn is
    not the rightmost child of the grandparent (potentially recursing all the
    way up to the root). But the same situation can arise there - the
    grandparent can be split while we're not holding the locks. We end up with
    a half-dead page that we cannot delete.
    
    To make things worse, the keyspace of the deleted page has already been
    transferred to its right sibling. As the README points out, the keyspace at
    the grandparent level is "out-of-whack" until the half-dead page is deleted,
    and if enough tuples with keys in the transferred keyspace are inserted, the
    page might get split and a downlink might be inserted into the grandparent
    that is out-of-order. That might not cause any serious problem if it's
    transient (as the README ponders), but is surely bad if it stays that way.
    
    Solution
    --------
    
    This patch changes the page deletion algorithm to avoid that problem. After
    checking that the topmost page in the chain of to-be-deleted pages is not
    the rightmost child of its parent, and then deleting the pages from bottom
    up, unlink the pages from top to bottom. This way, the intermediate stages
    are similar to the intermediate stages in page splitting, and there is no
    transient stage where the keyspace is "out-of-whack". The topmost page in
    the to-be-deleted chain doesn't have a downlink pointing to it, like a page
    split before the downlink has been inserted.
    
    This also allows us to get rid of the cleanup step after WAL recovery, if we
    crash during page deletion. The deletion will be continued at next VACUUM,
    but the tree is consistent for searches and insertions at every step.
    
    This bug is old, all supported versions are affected, but this patch is too
    big to back-patch (and changes the WAL record formats of related records).
    We have not heard any reports of the bug from users, so clearly it's not
    easy to bump into. Maybe backpatch later, after this has had some field
    testing.
    
    Reviewed by Kevin Grittner and Peter Geoghegan.
    efada2b8
README 33.2 KB