    Ensure cleanup of orphan archive status files · 6d8727f9
    Michael Paquier authored
    When a WAL segment is recycled, its ".ready" and ".done" status files
    are also removed automatically, but this removal is not durable.
    Hence, after a subsequent crash, a ".ready" status file may still be
    around while its corresponding segment is already gone.
    
    If the backend reaches such a state, the archive command would most
    likely complain about a nonexistent segment and keep retrying,
    causing WAL segments to bloat pg_wal/, potentially making Postgres
    crash hard when running out of space.
    
    As status files are removed after each individual segment, using
    durable_unlink() does not completely close the window either, as a crash
    could happen between the moment the WAL segment is recycled and the
    moment its status files are removed.  This also has some performance
    impact because of the additional fsync() calls needed to make the
    removal durable.  Doing the cleanup at recovery is not cost-free either,
    as this makes crash recovery potentially take longer than necessary.
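    
    To illustrate where the extra fsync() cost comes from, here is a
    minimal sketch of a durable file removal on a POSIX system.  The helper
    name remove_durably() and its signature are hypothetical and are not
    taken from the PostgreSQL sources; the sketch only shows the general
    technique of flushing the parent directory after unlink().

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not PostgreSQL code): remove "path" and flush its
 * parent directory so the removal survives a crash.
 * Returns 0 on success, -1 on failure.
 */
static int
remove_durably(const char *path, const char *parent_dir)
{
    int dirfd;
    int rc;

    if (unlink(path) != 0)
        return -1;

    /* The directory entry change is only durable once the directory is synced. */
    dirfd = open(parent_dir, O_RDONLY);
    if (dirfd < 0)
        return -1;

    rc = fsync(dirfd);          /* this is the additional fsync() per removal */
    close(dirfd);
    return rc;
}
```

    Even with such a helper, a crash between recycling the segment and
    calling it would still leave an orphan status file behind.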
    
    So, instead, as per an idea of Stephen Frost, make the archiver aware of
    orphan status files and remove them on the fly if the corresponding
    segment goes missing.  Removal failures follow a model close to what
    happens for WAL segments, where multiple attempts are made before giving
    up temporarily, and where a successful orphan removal makes the archiver
    move immediately to the next WAL segment considered ready to be
    archived.
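    
    The behavior described above can be sketched roughly as follows.  This
    is a simplified standalone illustration, not the actual pgarch.c code:
    the names check_orphan_status(), segment_missing(), OrphanCheck and
    ORPHAN_CLEANUP_RETRIES are hypothetical, and the retry count of 3 is an
    assumption chosen for the example.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

#define ORPHAN_CLEANUP_RETRIES 3    /* assumed retry budget, for illustration */

typedef enum
{
    SEGMENT_PRESENT,            /* segment exists, archive it as usual */
    ORPHAN_REMOVED,             /* orphan .ready file removed, go to next segment */
    ORPHAN_REMOVAL_FAILED       /* give up for now, try again in a later cycle */
} OrphanCheck;

/* Returns true only when the segment file is known not to exist. */
static bool
segment_missing(const char *wal_dir, const char *segname)
{
    char        path[1024];
    struct stat st;

    snprintf(path, sizeof(path), "%s/%s", wal_dir, segname);
    return stat(path, &st) != 0 && errno == ENOENT;
}

/* Decide what to do with the .ready status file of one segment. */
static OrphanCheck
check_orphan_status(const char *wal_dir, const char *segname)
{
    char        ready[1024];
    int         attempt;

    if (!segment_missing(wal_dir, segname))
        return SEGMENT_PRESENT;

    snprintf(ready, sizeof(ready), "%s/archive_status/%s.ready",
             wal_dir, segname);

    /* A real archiver would likely pause between attempts; omitted here. */
    for (attempt = 0; attempt < ORPHAN_CLEANUP_RETRIES; attempt++)
    {
        if (unlink(ready) == 0 || errno == ENOENT)
            return ORPHAN_REMOVED;
    }
    return ORPHAN_REMOVAL_FAILED;
}
```

    A caller looping over ready segments would continue straight to the
    next one on ORPHAN_REMOVED and back off on ORPHAN_REMOVAL_FAILED.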
    
    Author: Michael Paquier
    Reviewed-by: Nathan Bossart, Andres Freund, Stephen Frost, Kyotaro
    Horiguchi
    Discussion: https://postgr.es/m/20180928032827.GF1500@paquier.xyz
pgarch.c 19.2 KB