Commit d8179b00 authored by Tom Lane

Fix fsync-at-startup code to not treat errors as fatal.

Commit 2ce439f3 introduced a rather serious
regression, namely that if its scan of the data directory came across any
un-fsync-able files, it would fail and thereby prevent database startup.
Worse yet, symlinks to such files also caused the problem, which meant that
crash restart was guaranteed to fail on certain common installations such
as older Debian.

After discussion, we agreed that (1) failure to start is worse than any
likely consequence of not fsync'ing, so all errors in this code are
treated as nonfatal; (2) we should not chase symlinks other than
those that are expected to exist, namely pg_xlog/ and tablespace links
under pg_tblspc/.  The latter restriction avoids possibly fsync'ing a
much larger part of the filesystem than intended, if the user has left
random symlinks hanging about in the data directory.

This commit takes care of that and also does some code beautification,
mainly moving the relevant code into fd.c, which seems a much better place
for it than xlog.c, and making sure that the conditional compilation for
the pre_sync_fname pass has something to do with whether pg_flush_data
works.
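
One way to tie the hint pass to whether the flush primitive can do anything is to derive a single macro from the feature tests and guard the pass with it. The sketch below is an illustration under that assumption, not the code in fd.c: `FLUSH_HINT_WORKS`, `run_sync_passes`, and the counting callbacks are hypothetical names, while the feature-test macros are the ones used in the removed xlog.c code.

```c
#include <fcntl.h>

/*
 * Hypothetical macro: defined only when a kernel flush hint is available,
 * using the same feature tests as the removed xlog.c conditional.
 */
#if defined(HAVE_SYNC_FILE_RANGE)
#define FLUSH_HINT_WORKS 1
#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
#define FLUSH_HINT_WORKS 1
#endif

typedef void (*path_action) (const char *path);

/* Stand-in actions that only count how often they are called. */
static int hint_calls = 0;
static int sync_calls = 0;

static void
count_hint(const char *path)
{
    (void) path;
    hint_calls++;
}

static void
count_sync(const char *path)
{
    (void) path;
    sync_calls++;
}

/*
 * Run the optional hint pass, then the mandatory sync pass.
 * Returns the number of passes performed over "root".
 */
static int
run_sync_passes(const char *root, path_action hint, path_action sync)
{
    int passes = 0;

#ifdef FLUSH_HINT_WORKS
    hint(root);                 /* worthwhile only if the hint has an effect */
    passes++;
#else
    (void) hint;                /* no usable hint: skip the extra traversal */
#endif
    sync(root);
    passes++;
    return passes;
}
```

With this arrangement a platform without a usable hint walks the tree once instead of twice.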

I also relocated the call site in xlog.c down a few lines; it seems a
bit silly to be doing this before ValidateXLOGDirectoryStructure().

The similar logic in initdb.c ought to be made to match this, but that
change is noncritical and will be dealt with separately.

Back-patch to all active branches, like the prior commit.

Abhijit Menon-Sen and Tom Lane
parent d5442cb2
@@ -866,8 +866,6 @@ static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
 static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
-static void fsync_pgdata(char *datadir);
 /*
  * Insert an XLOG record represented by an already-constructed chain of data
  * chunks. This is a low-level routine; to construct the WAL record header
@@ -5951,18 +5949,6 @@ StartupXLOG(void)
                 (errmsg("database system was interrupted; last known up at %s",
                         str_time(ControlFile->time))));
-    /*
-     * If we previously crashed, there might be data which we had written,
-     * intending to fsync it, but which we had not actually fsync'd yet.
-     * Therefore, a power failure in the near future might cause earlier
-     * unflushed writes to be lost, even though more recent data written to
-     * disk from here on would be persisted. To avoid that, fsync the entire
-     * data directory.
-     */
-    if (ControlFile->state != DB_SHUTDOWNED &&
-        ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
-        fsync_pgdata(data_directory);
     /* This is just to allow attaching to startup process with a debugger */
 #ifdef XLOG_REPLAY_DELAY
     if (ControlFile->state != DB_SHUTDOWNED)
@@ -5976,6 +5962,18 @@ StartupXLOG(void)
      */
     ValidateXLOGDirectoryStructure();
+    /*
+     * If we previously crashed, there might be data which we had written,
+     * intending to fsync it, but which we had not actually fsync'd yet.
+     * Therefore, a power failure in the near future might cause earlier
+     * unflushed writes to be lost, even though more recent data written to
+     * disk from here on would be persisted. To avoid that, fsync the entire
+     * data directory.
+     */
+    if (ControlFile->state != DB_SHUTDOWNED &&
+        ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
+        SyncDataDirectory();
     /*
      * Initialize on the assumption we want to recover to the latest timeline
      * that's active according to pg_control.
@@ -11602,31 +11600,3 @@ SetWalWriterSleeping(bool sleeping)
     XLogCtl->WalWriterSleeping = sleeping;
     SpinLockRelease(&XLogCtl->info_lck);
 }
-/*
- * Issue fsync recursively on PGDATA and all its contents.
- */
-static void
-fsync_pgdata(char *datadir)
-{
-    if (!enableFsync)
-        return;
-    /*
-     * If possible, hint to the kernel that we're soon going to fsync the data
-     * directory and its contents.
-     */
-#if defined(HAVE_SYNC_FILE_RANGE) || \
-    (defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED))
-    walkdir(datadir, pre_sync_fname);
-#endif
-    /*
-     * Now we do the fsync()s in the same order.
-     *
-     * It's important to fsync the destination directory itself as individual
-     * file fsyncs don't guarantee that the directory entry for the file is
-     * synced.
-     */
-    walkdir(datadir, fsync_fname);
-}
This diff is collapsed.
@@ -114,8 +114,7 @@ extern int pg_fsync_writethrough(int fd);
 extern int pg_fdatasync(int fd);
 extern int pg_flush_data(int fd, off_t offset, off_t amount);
 extern void fsync_fname(char *fname, bool isdir);
-extern void pre_sync_fname(char *fname, bool isdir);
-extern void walkdir(char *path, void (*action) (char *fname, bool isdir));
+extern void SyncDataDirectory(void);
 /* Filename components for OpenTemporaryFile */
 #define PG_TEMP_FILES_DIR "pgsql_tmp"