Commit 989be081 authored by Fujii Masao

Support multiple synchronous standby servers.

Previously synchronous replication offered only the ability to confirm
that all changes made by a transaction had been transferred to at most
one synchronous standby server.

This commit extends synchronous replication so that it supports multiple
synchronous standby servers. It enables users to consider one or more
standby servers as synchronous, and increase the level of transaction
durability by ensuring that transaction commits wait for replies from
all of those synchronous standbys.

Multiple synchronous standby servers are configured in
synchronous_standby_names which is extended to support new syntax of
'num_sync ( standby_name [ , ... ] )', where num_sync specifies
the number of synchronous standbys that transaction commits need to
wait for replies from and standby_name is the name of a standby
server.

The syntax of 'standby_name [ , ... ]' which was used in 9.5 or before
is also still supported. It's the same as new syntax with num_sync=1.

This commit doesn't include "quorum commit" feature which was discussed
in pgsql-hackers. Synchronous standbys are chosen based on their priorities.
synchronous_standby_names determines the priority of each standby for
being chosen as a synchronous standby. The standbys whose names appear
earlier in the list are given higher priority and will be considered as
synchronous. Other standby servers appearing later in this list
represent potential synchronous standbys.

The regression test for multiple synchronous standbys is not included
in this commit. It should come later.

Authors: Sawada Masahiko, Beena Emerson, Michael Paquier, Fujii Masao
Reviewed-By: Kyotaro Horiguchi, Amit Kapila, Robert Haas, Simon Riggs,
Amit Langote, Thomas Munro, Sameer Thakur, Suraj Kharage, Abhijit Menon-Sen,
Rajeev Rastogi

Many thanks to the various individuals who were involved in
discussing and developing this feature.
parent 2143f5e1
......@@ -2906,34 +2906,69 @@ include_dir 'conf.d'
</term>
<listitem>
<para>
Specifies a comma-separated list of standby names that can support
Specifies a list of standby names that can support
<firstterm>synchronous replication</>, as described in
<xref linkend="synchronous-replication">.
At any one time there will be at most one active synchronous standby;
There will be one or more active synchronous standbys;
transactions waiting for commit will be allowed to proceed after
this standby server confirms receipt of their data.
The synchronous standby will be the first standby named in this list
these standby servers confirm receipt of their data.
The synchronous standbys will be those whose names appear
earlier in this list, and
that is both currently connected and streaming data in real-time
(as shown by a state of <literal>streaming</literal> in the
<link linkend="monitoring-stats-views-table">
<literal>pg_stat_replication</></link> view).
Other standby servers appearing later in this list represent potential
synchronous standbys.
If the current synchronous standby disconnects for whatever reason,
synchronous standbys. If any of the current synchronous
standbys disconnects for whatever reason,
it will be replaced immediately with the next-highest-priority standby.
Specifying more than one standby name can allow very high availability.
</para>
<para>
This parameter specifies a list of standby servers by using
either of the following syntaxes:
<synopsis>
<replaceable class="parameter">num_sync</replaceable> ( <replaceable class="parameter">standby_name</replaceable> [, ...] )
<replaceable class="parameter">standby_name</replaceable> [, ...]
</synopsis>
where <replaceable class="parameter">num_sync</replaceable> is
the number of synchronous standbys that transactions need to
wait for replies from,
and <replaceable class="parameter">standby_name</replaceable>
is the name of a standby server. For example, a setting of
<literal>'3 (s1, s2, s3, s4)'</> makes transaction commits wait
until their WAL records are received by three higher priority standbys
chosen from standby servers <literal>s1</>, <literal>s2</>,
<literal>s3</> and <literal>s4</>.
</para>
<para>
The second syntax was used before <productname>PostgreSQL</>
version 9.6 and is still supported. It's the same as the first syntax
with <replaceable class="parameter">num_sync</replaceable>=1.
For example, both settings of <literal>'1 (s1, s2)'</> and
<literal>'s1, s2'</> have the same meaning; either <literal>s1</>
or <literal>s2</> is chosen as a synchronous standby.
</para>
<para>
The name of a standby server for this purpose is the
<varname>application_name</> setting of the standby, as set in the
<varname>primary_conninfo</> of the standby's WAL receiver. There is
no mechanism to enforce uniqueness. In case of duplicates one of the
matching standbys will be chosen to be the synchronous standby, though
matching standbys will be considered as higher priority, though
exactly which one is indeterminate.
The special entry <literal>*</> matches any
<varname>application_name</>, including the default application name
of <literal>walreceiver</>.
</para>
<note>
<para>
The <replaceable class="parameter">standby_name</replaceable>
must be enclosed in double quotes if a comma (<literal>,</>),
a double quote (<literal>"</>), <!-- " font-lock sanity -->
a left parenthesis (<literal>(</>), a right parenthesis (<literal>)</>)
or a space is used in the name of a standby server.
</para>
</note>
<para>
If no synchronous standby names are specified here, then synchronous
replication is not enabled and transaction commits will not wait for
......
......@@ -1027,10 +1027,12 @@ primary_slot_name = 'node_a_slot'
<para>
Synchronous replication offers the ability to confirm that all changes
made by a transaction have been transferred to one synchronous standby
server. This extends the standard level of durability
made by a transaction have been transferred to one or more synchronous
standby servers. This extends that standard level of durability
offered by a transaction commit. This level of protection is referred
to as 2-safe replication in computer science theory.
to as 2-safe replication in computer science theory, and group-1-safe
(group-safe and 1-safe) when <varname>synchronous_commit</> is set to
<literal>remote_write</>.
</para>
<para>
......@@ -1084,8 +1086,8 @@ primary_slot_name = 'node_a_slot'
In the case that <varname>synchronous_commit</> is set to
<literal>remote_apply</>, the standby sends reply messages when the commit
record is replayed, making the transaction visible.
If the standby is the first matching standby, as specified in
<varname>synchronous_standby_names</> on the primary, the reply
If the standby is chosen as the synchronous standby, from a priority
list of <varname>synchronous_standby_names</> on the primary, the reply
messages from that standby will be used to wake users waiting for
confirmation that the commit record has been received. These parameters
allow the administrator to specify which standby servers should be
......@@ -1126,6 +1128,40 @@ primary_slot_name = 'node_a_slot'
</sect3>
<sect3 id="synchronous-replication-multiple-standbys">
<title>Multiple Synchronous Standbys</title>
<para>
Synchronous replication supports one or more synchronous standby servers;
transactions will wait until all the standby servers which are considered
as synchronous confirm receipt of their data. The number of synchronous
standbys that transactions must wait for replies from is specified in
<varname>synchronous_standby_names</>. This parameter also specifies
a list of standby names, which determines the priority of each standby
for being chosen as a synchronous standby. The standbys whose names
appear earlier in the list are given higher priority and will be considered
as synchronous. Other standby servers appearing later in this list
represent potential synchronous standbys. If any of the current
synchronous standbys disconnects for whatever reason, it will be replaced
immediately with the next-highest-priority standby.
</para>
<para>
An example of <varname>synchronous_standby_names</> for multiple
synchronous standbys is:
<programlisting>
synchronous_standby_names = '2 (s1, s2, s3)'
</programlisting>
In this example, if four standby servers <literal>s1</>, <literal>s2</>,
<literal>s3</> and <literal>s4</> are running, the two standbys
<literal>s1</> and <literal>s2</> will be chosen as synchronous standbys
because their names appear early in the list of standby names.
<literal>s3</> is a potential synchronous standby and will take over
the role of synchronous standby when either of <literal>s1</> or
<literal>s2</> fails. <literal>s4</> is an asynchronous standby since
its name is not in the list.
</para>
</sect3>
<sect3 id="synchronous-replication-performance">
<title>Planning for Performance</title>
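The priority rule in the "Multiple Synchronous Standbys" text above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the code from syncrep.c (whose diff is not shown on this page): the function name `choose_sync_standbys` and the `connected` array are hypothetical, and "connected" stands in for "currently connected and streaming".

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper, not PostgreSQL code.
 *
 * connected[i] says whether the i-th name in the priority list
 * (index 0 = highest priority) is currently connected and streaming.
 * Up to num_sync indexes are written to chosen[], highest priority first.
 * Returns how many standbys were actually chosen, which may be fewer
 * than num_sync if too few listed standbys are available.
 */
size_t
choose_sync_standbys(const bool *connected, size_t nlisted,
                     size_t num_sync, size_t *chosen)
{
    size_t      nchosen = 0;

    for (size_t i = 0; i < nlisted && nchosen < num_sync; i++)
    {
        if (connected[i])
            chosen[nchosen++] = i;
    }
    return nchosen;
}
```

With the setting '2 (s1, s2, s3)' and all three standbys connected, the sketch picks indexes 0 and 1 (s1 and s2). If s1 disconnects, it picks 1 and 2, which matches the description of s3 taking over. A standby such as s4 that is not in the list is never a candidate because it has no index at all.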
......@@ -1171,19 +1207,21 @@ primary_slot_name = 'node_a_slot'
<title>Planning for High Availability</title>
<para>
Commits made when <varname>synchronous_commit</> is set to <literal>on</>,
<literal>remote_apply</> or <literal>remote_write</> will wait until the
synchronous standby responds. The response may never occur if the last, or
only, standby should crash.
<varname>synchronous_standby_names</> specifies the number and
names of synchronous standbys that transaction commits made when
<varname>synchronous_commit</> is set to <literal>on</>,
<literal>remote_apply</> or <literal>remote_write</> will wait for
responses from. Such transaction commits may never be completed
if any one of the synchronous standbys should crash.
</para>
<para>
The best solution for avoiding data loss is to ensure you don't lose
your last remaining synchronous standby. This can be achieved by naming multiple
The best solution for high availability is to ensure you keep as many
synchronous standbys as requested. This can be achieved by naming multiple
potential synchronous standbys using <varname>synchronous_standby_names</>.
The first named standby will be used as the synchronous standby. Standbys
listed after this will take over the role of synchronous standby if the
first one should fail.
The standbys whose names appear earlier in the list will be used as
synchronous standbys. Standbys listed after these will take over
the role of synchronous standby if one of the current ones should fail.
</para>
<para>
......@@ -1208,13 +1246,15 @@ primary_slot_name = 'node_a_slot'
they show as committed on the primary. The guarantee we offer is that
the application will not receive explicit acknowledgement of the
successful commit of a transaction until the WAL data is known to be
safely received by the standby.
safely received by all the synchronous standbys.
</para>
<para>
If you really do lose your last standby server then you should disable
<varname>synchronous_standby_names</> and reload the configuration file
on the primary server.
If you really cannot keep as many synchronous standbys as requested
then you should decrease the number of synchronous standbys that
transaction commits must wait for responses from
in <varname>synchronous_standby_names</> (or disable it) and
reload the configuration file on the primary server.
</para>
<para>
......
......@@ -203,7 +203,7 @@ distprep:
$(MAKE) -C parser gram.c gram.h scan.c
$(MAKE) -C bootstrap bootparse.c bootscanner.c
$(MAKE) -C catalog schemapg.h postgres.bki postgres.description postgres.shdescription
$(MAKE) -C replication repl_gram.c repl_scanner.c
$(MAKE) -C replication repl_gram.c repl_scanner.c syncrep_gram.c syncrep_scanner.c
$(MAKE) -C storage/lmgr lwlocknames.h
$(MAKE) -C utils fmgrtab.c fmgroids.h errcodes.h
$(MAKE) -C utils/misc guc-file.c
......@@ -320,6 +320,8 @@ maintainer-clean: distclean
catalog/postgres.shdescription \
replication/repl_gram.c \
replication/repl_scanner.c \
replication/syncrep_gram.c \
replication/syncrep_scanner.c \
storage/lmgr/lwlocknames.c \
storage/lmgr/lwlocknames.h \
utils/fmgroids.h \
......
/repl_gram.c
/repl_scanner.c
/syncrep_gram.c
/syncrep_scanner.c
......@@ -15,7 +15,7 @@ include $(top_builddir)/src/Makefile.global
override CPPFLAGS := -I. -I$(srcdir) $(CPPFLAGS)
OBJS = walsender.o walreceiverfuncs.o walreceiver.o basebackup.o \
repl_gram.o slot.o slotfuncs.o syncrep.o
repl_gram.o slot.o slotfuncs.o syncrep.o syncrep_gram.o
SUBDIRS = logical
......@@ -24,5 +24,10 @@ include $(top_srcdir)/src/backend/common.mk
# repl_scanner is compiled as part of repl_gram
repl_gram.o: repl_scanner.c
# repl_gram.c and repl_scanner.c are in the distribution tarball, so
# they are not cleaned here.
# syncrep_scanner is compiled as part of syncrep_gram
syncrep_gram.o: syncrep_scanner.c
syncrep_scanner.c: FLEXFLAGS = -CF -p
syncrep_scanner.c: FLEX_NO_BACKUP=yes
# repl_gram.c, repl_scanner.c, syncrep_gram.c and syncrep_scanner.c
# are in the distribution tarball, so they are not cleaned here.
%{
/*-------------------------------------------------------------------------
*
* syncrep_gram.y - Parser for synchronous_standby_names
*
* Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
*
* IDENTIFICATION
* src/backend/replication/syncrep_gram.y
*
*-------------------------------------------------------------------------
*/
#include "postgres.h"
#include "replication/syncrep.h"
#include "utils/formatting.h"
/* Result of the parsing is returned here */
SyncRepConfigData *syncrep_parse_result;
static SyncRepConfigData *create_syncrep_config(char *num_sync, List *members);
/*
* Bison doesn't allocate anything that needs to live across parser calls,
* so we can easily have it use palloc instead of malloc. This prevents
* memory leaks if we error out during parsing. Note this only works with
* bison >= 2.0. However, in bison 1.875 the default is to use alloca()
* if possible, so there's not really much problem anyhow, at least if
* you're building with gcc.
*/
#define YYMALLOC palloc
#define YYFREE pfree
%}
%expect 0
%name-prefix="syncrep_yy"
%union
{
char *str;
List *list;
SyncRepConfigData *config;
}
%token <str> NAME NUM
%type <config> result standby_config
%type <list> standby_list
%type <str> standby_name
%start result
%%
result:
standby_config { syncrep_parse_result = $1; }
;
standby_config:
standby_list { $$ = create_syncrep_config("1", $1); }
| NUM '(' standby_list ')' { $$ = create_syncrep_config($1, $3); }
;
standby_list:
standby_name { $$ = list_make1($1);}
| standby_list ',' standby_name { $$ = lappend($1, $3);}
;
standby_name:
NAME { $$ = $1; }
| NUM { $$ = $1; }
;
%%
static SyncRepConfigData *
create_syncrep_config(char *num_sync, List *members)
{
SyncRepConfigData *config =
(SyncRepConfigData *) palloc(sizeof(SyncRepConfigData));
config->num_sync = atoi(num_sync);
config->members = members;
return config;
}
#include "syncrep_scanner.c"
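The `standby_config` rule above distinguishes the two syntaxes only by whether a number is followed by an opening parenthesis; a bare list gets num_sync = 1. As a rough illustration of that decision, here is a hand-written C sketch. The function `implied_num_sync` is hypothetical and not part of the commit, and unlike the real parser it does no validation of the rest of the string and does not handle a double-quoted first name.

```c
#include <ctype.h>
#include <stdlib.h>

/*
 * Hypothetical helper, not PostgreSQL code.
 *
 * Returns the num_sync implied by a synchronous_standby_names value:
 * the leading number if the value looks like "N ( ... )", otherwise 1.
 * A leading number that is not followed by '(' is treated as a standby
 * name, as in the grammar's "standby_name: NUM" alternative.
 */
int
implied_num_sync(const char *setting)
{
    const char *p = setting;
    const char *digits;

    while (isspace((unsigned char) *p))
        p++;

    /* The scanner's NUM token is [1-9][0-9]*, so it cannot start with 0 */
    digits = p;
    if (*p < '1' || *p > '9')
        return 1;
    while (isdigit((unsigned char) *p))
        p++;

    while (isspace((unsigned char) *p))
        p++;

    if (*p == '(')
        return atoi(digits);
    return 1;
}
```

For example, '2 (s1, s2, s3)' yields 2, while both 's1, s2' and '1, 2' yield 1; in the last case the numbers are simply standby names.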
%{
/*-------------------------------------------------------------------------
*
* syncrep_scanner.l
* a lexical scanner for synchronous_standby_names
*
* Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
*
* IDENTIFICATION
* src/backend/replication/syncrep_scanner.l
*
*-------------------------------------------------------------------------
*/
#include "postgres.h"
#include "miscadmin.h"
#include "lib/stringinfo.h"
/*
* flex emits a yy_fatal_error() function that it calls in response to
* critical errors like malloc failure, file I/O errors, and detection of
* internal inconsistency. That function prints a message and calls exit().
* Mutate it to instead call ereport(FATAL), which terminates this process.
*
* The process that causes this fatal error should be terminated.
* Otherwise it has to abandon the new setting value of
* synchronous_standby_names and keep running with the previous one
* while the other processes switch to the new one.
* This inconsistency of the setting that each process is based on
* can cause a serious problem. Though it's basically not a good idea to
* use FATAL here because it can take down the postmaster,
* we should do that in order to avoid such an inconsistency.
*/
#undef fprintf
#define fprintf(file, fmt, msg) syncrep_flex_fatal(fmt, msg)
static void
syncrep_flex_fatal(const char *fmt, const char *msg)
{
ereport(FATAL, (errmsg_internal("%s", msg)));
}
/* Handles to the buffer that the lexer uses internally */
static YY_BUFFER_STATE scanbufhandle;
static StringInfoData xdbuf;
%}
%option 8bit
%option never-interactive
%option nounput
%option noinput
%option noyywrap
%option warn
%option prefix="syncrep_yy"
/*
* <xd> delimited identifiers (double-quoted identifiers)
*/
%x xd
space [ \t\n\r\f\v]
undquoted_start [^ ,\(\)\"]
undquoted_cont [^ ,\(\)]
undquoted_name {undquoted_start}{undquoted_cont}*
dquoted_name [^\"]+
/* Double-quoted string */
dquote \"
xdstart {dquote}
xddouble {dquote}{dquote}
xdstop {dquote}
xdinside {dquoted_name}
%%
{space}+ { /* ignore */ }
{xdstart} {
initStringInfo(&xdbuf);
BEGIN(xd);
}
<xd>{xddouble} {
appendStringInfoChar(&xdbuf, '\"');
}
<xd>{xdinside} {
appendStringInfoString(&xdbuf, yytext);
}
<xd>{xdstop} {
yylval.str = pstrdup(xdbuf.data);
pfree(xdbuf.data);
BEGIN(INITIAL);
return NAME;
}
"," { return ','; }
"(" { return '('; }
")" { return ')'; }
[1-9][0-9]* {
yylval.str = pstrdup(yytext);
return NUM;
}
{undquoted_name} {
yylval.str = pstrdup(yytext);
return NAME;
}
%%
void
yyerror(const char *message)
{
ereport(IsUnderPostmaster ? DEBUG2 : LOG,
(errcode(ERRCODE_SYNTAX_ERROR),
errmsg("%s at or near \"%s\"", message, yytext)));
}
void
syncrep_scanner_init(const char *str)
{
Size slen = strlen(str);
char *scanbuf;
/*
* Might be left over after ereport()
*/
if (YY_CURRENT_BUFFER)
yy_delete_buffer(YY_CURRENT_BUFFER);
/*
* Make a scan buffer with special termination needed by flex.
*/
scanbuf = (char *) palloc(slen + 2);
memcpy(scanbuf, str, slen);
scanbuf[slen] = scanbuf[slen + 1] = YY_END_OF_BUFFER_CHAR;
scanbufhandle = yy_scan_buffer(scanbuf, slen + 2);
}
void
syncrep_scanner_finish(void)
{
yy_delete_buffer(scanbufhandle);
scanbufhandle = NULL;
}
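The `<xd>` rules in the scanner above collect a double-quoted name and turn each doubled quote inside it into a single quote character. The same unquoting step can be sketched without flex as follows. The function `unquote_standby_name` and its buffer-based interface are assumptions made for this illustration; the scanner itself uses a StringInfo buffer and pstrdup.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not PostgreSQL code.
 *
 * src points just past the opening double quote.  The name is copied
 * into dst with each "" pair collapsed to a single ".  Returns the
 * number of bytes consumed from src, including the closing quote, or
 * 0 if the name is unterminated or does not fit in dst.
 */
size_t
unquote_standby_name(const char *src, char *dst, size_t dstsize)
{
    size_t      in = 0;
    size_t      out = 0;

    if (dstsize == 0)
        return 0;

    while (src[in] != '\0')
    {
        char        c = src[in];

        if (c == '"')
        {
            if (src[in + 1] != '"')
            {
                dst[out] = '\0';
                return in + 1;  /* closing quote consumed */
            }
            in++;               /* skip the first quote of a "" pair */
        }
        if (out + 1 >= dstsize)
            return 0;           /* no room for this byte plus terminator */
        dst[out++] = c;
        in++;
    }
    return 0;                   /* ran out of input: unterminated name */
}
```

Given the input `s 1", s2` (the text after an opening quote), the sketch stores `s 1` in dst and reports 4 bytes consumed, so a caller could resume scanning at the comma.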
......@@ -2751,7 +2751,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
Tuplestorestate *tupstore;
MemoryContext per_query_ctx;
MemoryContext oldcontext;
WalSnd *sync_standby;
List *sync_standbys;
int i;
/* check to see if caller supports us returning a tuplestore */
......@@ -2780,12 +2780,23 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
MemoryContextSwitchTo(oldcontext);
/*
* Get the currently active synchronous standby.
* Allocate and update the config data of synchronous replication,
* and then get the currently active synchronous standbys.
*/
SyncRepUpdateConfig();
LWLockAcquire(SyncRepLock, LW_SHARED);
sync_standby = SyncRepGetSynchronousStandby();
sync_standbys = SyncRepGetSyncStandbys(NULL);
LWLockRelease(SyncRepLock);
/*
* Free the previously-allocated config data because a backend
* no longer needs it. The next call of this function needs to
* allocate and update the config data newly because the setting
* of sync replication might be changed between the calls.
*/
SyncRepFreeConfig(SyncRepConfig);
SyncRepConfig = NULL;
for (i = 0; i < max_wal_senders; i++)
{
WalSnd *walsnd = &WalSndCtl->walsnds[i];
......@@ -2856,7 +2867,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
*/
if (priority == 0)
values[7] = CStringGetTextDatum("async");
else if (walsnd == sync_standby)
else if (list_member_int(sync_standbys, i))
values[7] = CStringGetTextDatum("sync");
else
values[7] = CStringGetTextDatum("potential");
......
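The hunk above changes how the sync_state column of pg_stat_replication is derived: a WAL sender is now reported as "sync" when it is a member of the chosen list rather than when it equals a single pointer. The three-way decision can be restated as a small stand-alone C function. The name `sync_state_label` and the boolean parameter are hypothetical; in the commit the membership test is `list_member_int(sync_standbys, i)`.

```c
#include <stdbool.h>

/*
 * Hypothetical helper, not PostgreSQL code.
 *
 * priority is the standby's sync priority (0 means its name does not
 * match synchronous_standby_names); is_chosen_sync says whether it is
 * among the currently chosen synchronous standbys.
 */
const char *
sync_state_label(int priority, bool is_chosen_sync)
{
    if (priority == 0)
        return "async";
    if (is_chosen_sync)
        return "sync";
    return "potential";
}
```

The order of the checks matters: the priority test comes first, so a standby with priority 0 is always reported as "async".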
......@@ -3448,7 +3448,7 @@ static struct config_string ConfigureNamesString[] =
{
{"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
gettext_noop("List of names of potential synchronous standbys."),
gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
NULL,
GUC_LIST_INPUT
},
......
......@@ -240,7 +240,7 @@
# These settings are ignored on a standby server.
#synchronous_standby_names = '' # standby servers that provide sync rep
# comma-separated list of application_name
# number of sync standbys and comma-separated list of application_name
# from standby(s); '*' = all
#vacuum_defer_cleanup_age = 0 # number of xacts by which cleanup is delayed
......
......@@ -32,6 +32,18 @@
#define SYNC_REP_WAITING 1
#define SYNC_REP_WAIT_COMPLETE 2
/*
* Struct for the configuration of synchronous replication.
*/
typedef struct SyncRepConfigData
{
int num_sync; /* number of sync standbys that we need to wait for */
List *members; /* list of names of potential sync standbys */
} SyncRepConfigData;
extern SyncRepConfigData *syncrep_parse_result;
extern SyncRepConfigData *SyncRepConfig;
/* user-settable parameters for synchronous replication */
extern char *SyncRepStandbyNames;
......@@ -45,14 +57,25 @@ extern void SyncRepCleanupAtProcExit(void);
extern void SyncRepInitConfig(void);
extern void SyncRepReleaseWaiters(void);
/* called by wal sender and user backend */
extern List *SyncRepGetSyncStandbys(bool *am_sync);
extern void SyncRepUpdateConfig(void);
extern void SyncRepFreeConfig(SyncRepConfigData *config);
/* called by checkpointer */
extern void SyncRepUpdateSyncStandbysDefined(void);
/* forward declaration to avoid pulling in walsender_private.h */
struct WalSnd;
extern struct WalSnd *SyncRepGetSynchronousStandby(void);
extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
extern void assign_synchronous_commit(int newval, void *extra);
/*
* Internal functions for parsing synchronous_standby_names grammar,
* in syncrep_gram.y and syncrep_scanner.l
*/
extern int syncrep_yyparse(void);
extern int syncrep_yylex(void);
extern void syncrep_yyerror(const char *str);
extern void syncrep_scanner_init(const char *query_string);
extern void syncrep_scanner_finish(void);
#endif /* _SYNCREP_H */
......@@ -156,7 +156,7 @@ sub mkvcbuild
'bootparse.y');
$postgres->AddFiles('src/backend/utils/misc', 'guc-file.l');
$postgres->AddFiles('src/backend/replication', 'repl_scanner.l',
'repl_gram.y');
'repl_gram.y', 'syncrep_scanner.l', 'syncrep_gram.y');
$postgres->AddDefine('BUILDING_DLL');
$postgres->AddLibrary('secur32.lib');
$postgres->AddLibrary('ws2_32.lib');
......