    Allow to trigger kernel writeback after a configurable number of writes. · 428b1d6b
    Andres Freund authored
    Currently writes to the main data files of postgres all go through the
    OS page cache. This means that some operating systems can end up
    collecting a large number of dirty buffers in their respective page
    caches.  When these dirty buffers are flushed to storage rapidly, be it
    because of fsync(), timeouts, or dirty ratios, latency for other reads
    and writes can increase massively.  This is the primary reason for
    regular massive stalls observed in real world scenarios and artificial
    benchmarks; on rotating disks stalls on the order of hundreds of seconds
    have been observed.
    
    On Linux it is possible to control this by significantly reducing the
    global dirty limits, which mitigates the above problem. But global
    configuration is rather problematic because it affects other
    applications; also, PostgreSQL itself doesn't always want this
    behavior, e.g. for temporary files it's undesirable.
    
    Several operating systems allow some control over the kernel page
    cache. Linux has sync_file_range(2), several POSIX systems have msync(2)
    and posix_fadvise(2). sync_file_range(2) is preferable because it
    requires no special setup, whereas msync() requires the to-be-flushed
    range to be mmap'ed. For the purpose of flushing dirty data
    posix_fadvise(2) is the worst alternative, as flushing dirty data is
    just a side-effect of POSIX_FADV_DONTNEED, which also removes the pages
    from the page cache.  Thus the feature is enabled by default only on
    Linux, but can be enabled on all systems that have any of the above
    APIs.
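
    A minimal sketch of the API preference described above, assuming a
    hypothetical helper named hint_writeback() (it is not a function from
    this patch). It prefers sync_file_range(2) where available and falls
    back to posix_fadvise(2), accepting that the latter also drops the
    pages from the cache. The msync(2) path is omitted because it needs
    the range to be mmap'ed first.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes sync_file_range() on Linux/glibc */
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * hint_writeback: hypothetical helper, named here for illustration only.
 *
 * Asks the kernel to start writing back dirty pages in the byte range
 * [offset, offset + nbytes) of fd.  Returns 0 on success (or when no
 * suitable API exists on this platform), -1 on failure.
 */
static int
hint_writeback(int fd, off_t offset, off_t nbytes)
{
    assert(offset >= 0 && nbytes >= 0);

#if defined(__linux__) && defined(SYNC_FILE_RANGE_WRITE)
    /* Starts asynchronous writeback; the pages stay in the page cache. */
    return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
#elif defined(POSIX_FADV_DONTNEED)
    /*
     * Writeback is only a side effect here: the pages are also evicted.
     * posix_fadvise() returns an error number rather than setting errno.
     */
    return posix_fadvise(fd, offset, nbytes, POSIX_FADV_DONTNEED) == 0 ? 0 : -1;
#else
    /* No usable API: treat the hint as a no-op. */
    (void) fd;
    (void) offset;
    (void) nbytes;
    return 0;
#endif
}
```

    A caller would invoke hint_writeback(fd, offset, nbytes) after a batch
    of write() calls. Because the call is only a hint, a failure can
    reasonably be logged and ignored rather than treated as fatal.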
    
    An implementation for Windows is desirable and likely possible, but
    this patch does not contain one.
    
    With the infrastructure added, writes made via checkpointer, bgwriter
    and normal user backends can be flushed after a configurable number of
    writes. Each of these sources of writes is controlled by a separate GUC,
    checkpointer_flush_after, bgwriter_flush_after and backend_flush_after
    respectively; they're separate because the appropriate number of writes
    between flushes differs for each, and because the performance
    considerations of controlled flushing for each of these are different.
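
    The "flush after a configurable number of writes" idea can be sketched
    as a small counter, one per write source. The type and function names
    below (WriteCounter, write_counter_note_write, count_flushes) are
    hypothetical and do not come from the patch; treating a threshold of 0
    as "disabled" is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Callback that performs the actual flush; illustrative signature. */
typedef void (*flush_fn) (void *arg, int nwrites);

/* Hypothetical per-source state: one instance per write source. */
typedef struct WriteCounter
{
    int         flush_after;    /* threshold in writes; 0 disables flushing */
    int         pending;        /* writes seen since the last flush */
    flush_fn    flush;          /* invoked once the threshold is reached */
    void       *flush_arg;      /* opaque argument passed to flush */
} WriteCounter;

static void
write_counter_init(WriteCounter *wc, int flush_after, flush_fn flush, void *arg)
{
    assert(wc != NULL && flush_after >= 0);
    wc->flush_after = flush_after;
    wc->pending = 0;
    wc->flush = flush;
    wc->flush_arg = arg;
}

/*
 * Record one completed write.  Returns 1 if this write reached the
 * threshold and a flush was issued, 0 otherwise.
 */
static int
write_counter_note_write(WriteCounter *wc)
{
    if (wc->flush_after == 0)
        return 0;               /* controlled flushing disabled */
    if (++wc->pending < wc->flush_after)
        return 0;               /* keep accumulating */
    if (wc->flush != NULL)
        wc->flush(wc->flush_arg, wc->pending);
    wc->pending = 0;
    return 1;
}

/* Example callback: just counts how many flushes were requested. */
static void
count_flushes(void *arg, int nwrites)
{
    (void) nwrites;
    *(int *) arg += 1;
}
```

    Keeping one counter per source mirrors the separate GUCs: each source
    can be given its own threshold, or have flushing turned off, without
    affecting the others.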
    
    A later patch will add checkpoint sorting - after that flushes from the
    checkpoint will almost always be desirable. Bgwriter flushes are most of
    the time going to be random writes, which are slow on lots of storage
    hardware. Flushing in backends works well if the storage and bgwriter
    can keep up, but if not it can have negative consequences.  This patch
    is likely to have negative performance consequences without checkpoint
    sorting, but unfortunately so does sorting without flush control.
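
    To illustrate why sorting and flushing interact, the sketch below sorts
    a batch of block numbers and counts how many contiguous ranges remain;
    each range could be covered by a single writeback request. This is an
    illustration under assumed names (cmp_block, count_flush_ranges), not
    the sorting implementation of the later patch.

```c
#include <assert.h>
#include <stdlib.h>

/* qsort comparator for unsigned block numbers, ascending. */
static int
cmp_block(const void *a, const void *b)
{
    unsigned    x = *(const unsigned *) a;
    unsigned    y = *(const unsigned *) b;

    return (x > y) - (x < y);
}

/*
 * Sort the block numbers in place and return the number of contiguous
 * ranges they form.  Adjacent and duplicate blocks fall into one range.
 */
static size_t
count_flush_ranges(unsigned *blocks, size_t n)
{
    size_t      ranges;
    size_t      i;

    if (n == 0)
        return 0;

    qsort(blocks, n, sizeof(blocks[0]), cmp_block);

    ranges = 1;
    for (i = 1; i < n; i++)
    {
        /* A gap of more than one block starts a new range. */
        if (blocks[i] - blocks[i - 1] > 1)
            ranges++;
    }
    return ranges;
}
```

    For example, the unsorted blocks 7, 3, 4, 9, 8, 5 would need up to six
    separate requests if flushed in arrival order, but form only two
    ranges (3-5 and 7-9) once sorted.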
    
    Discussion: alpine.DEB.2.10.1506011320000.28433@sto
    Author: Fabien Coelho and Andres Freund