<!-- $PostgreSQL: pgsql/doc/src/sgml/high-availability.sgml,v 1.25 2007/11/12 17:02:14 momjian Exp $ -->

<chapter id="high-availability">
 <title>High Availability, Load Balancing, and Replication</title>

 <indexterm><primary>high availability</></>
 <indexterm><primary>failover</></>
 <indexterm><primary>replication</></>
 <indexterm><primary>load balancing</></>
 <indexterm><primary>clustering</></>
 <indexterm><primary>data partitioning</></>

 <para>
  Database servers can work together to allow a second server to
  take over quickly if the primary server fails (high
  availability), or to allow several computers to serve the same
  data (load balancing).  Ideally, database servers could work
  together seamlessly.  Web servers serving static web pages can
  be combined quite easily by merely load-balancing web requests
  to multiple machines.  In fact, read-only database servers can
  be combined relatively easily too.  Unfortunately, most database
  servers have a read/write mix of requests, and read/write servers
  are much harder to combine.  This is because, although read-only
  data needs to be placed on each server only once, a write to any
  server must be propagated to all servers so that future read
  requests to those servers return consistent results.
 </para>

 <para>
  This synchronization problem is the fundamental difficulty for
  servers working together.  Because no single solution eliminates
  the impact of this problem for all use cases, there are multiple
  solutions.  Each solution addresses the problem in a different
  way, and minimizes its impact for a specific workload.
 </para>

 <para>
  Some solutions deal with synchronization by allowing only one
  server to modify the data.  Servers that can modify data are
  called read/write or "master" servers.  Servers that can reply
  to read-only queries are called "slave" servers.  Servers that
  cannot be accessed until they are changed to master servers are
  called "standby" servers.
 </para>

 <para>
  Some solutions are synchronous,
  meaning that a data-modifying transaction is not considered
  committed until all servers have committed the transaction.  This
  guarantees that a failover will not lose any data and that all
  load-balanced servers will return consistent results no matter
  which server is queried.  In contrast, asynchronous solutions allow some
  delay between the time of a commit and its propagation to the other servers,
  opening the possibility that some transactions might be lost in
  the switch to a backup server, and that load-balanced servers
  might return slightly stale results.  Asynchronous communication
  is used when synchronous communication would be too slow.
 </para>

 <para>
  Solutions can also be categorized by their granularity.  Some solutions
  can deal only with an entire database server, while others allow control
  at the per-table or per-database level.
 </para>

 <para>
  Performance must be considered in any choice.  There is usually a
  tradeoff between functionality and
  performance.  For example, a full synchronous solution over a slow
  network might cut performance by more than half, while an asynchronous
  one might have a minimal performance impact.
 </para>

 <para>
  The remainder of this section outlines various failover, replication,
  and load balancing solutions.
 </para>

 <variablelist>

  <varlistentry>
   <term>Shared Disk Failover</term>
   <listitem>

    <para>
     Shared disk failover avoids synchronization overhead by having only one
     copy of the database.  It uses a single disk array that is shared by
     multiple servers.  If the main database server fails, the standby server
     is able to mount and start the database as though it were recovering from
     a database crash.  This allows rapid failover with no data loss.
    </para>

    <para>
     Shared hardware functionality is common in network storage devices.
     Using a network file system is also possible, though care must be
     taken that the file system has full <acronym>POSIX</> behavior (see <xref
     linkend="creating-cluster-nfs">).  One significant limitation of this
     method is that if the shared disk array fails or becomes corrupt, the
     primary and standby servers are both nonfunctional.  Another issue is
     that the standby server should never access the shared storage while
     the primary server is running.
    </para>

   </listitem>
  </varlistentry>

  <varlistentry>
   <term>File System (Block-Device) Replication</term>
   <listitem>

    <para>
     A modified version of shared hardware functionality is file system
     replication, where all changes to a file system are mirrored to a file
     system residing on another computer.  The only restriction is that
     the mirroring must be done in a way that ensures the standby server
     has a consistent copy of the file system &mdash; specifically, writes
     to the standby must be done in the same order as those on the master.
     <productname>DRBD</> is a popular file system replication solution
     for Linux.
    </para>

<!--
https://forge.continuent.org/pipermail/sequoia/2006-November/004070.html

Oracle RAC is a shared disk approach and just sends cache invalidations
to other nodes but not actual data. As the disk is shared, data is
only committed once to disk and there is a distributed locking
protocol to make nodes agree on a serializable transactional order.
-->

   </listitem>
  </varlistentry>

  <varlistentry>
   <term>Warm Standby Using Point-In-Time Recovery (<acronym>PITR</>)</term>
   <listitem>

    <para>
     A warm standby server (see <xref linkend="warm-standby">) can
     be kept current by reading a stream of write-ahead log (<acronym>WAL</>)
     records.  If the main server fails, the warm standby contains
     almost all of the data of the main server, and can be quickly
     made the new master database server.  This is asynchronous and
     can only be done for the entire database server.
    </para>
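
    <para>
     As a minimal sketch of one possible configuration (the archive
     directory and paths here are hypothetical), the primary ships
     completed <acronym>WAL</> segments using <varname>archive_command</>
     in <filename>postgresql.conf</>, while the standby continuously
     restores them using <varname>restore_command</> in
     <filename>recovery.conf</>:
    </para>

<programlisting>
# primary server: postgresql.conf
# %p is the path of the WAL segment to archive, %f is its file name
archive_mode = on
archive_command = 'cp %p /mnt/archive/%f'

# standby server: recovery.conf
# pg_standby (from contrib) waits for each segment to appear,
# keeping the standby in continuous recovery
restore_command = 'pg_standby /mnt/archive %f %p %r'
</programlisting>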
   </listitem>
  </varlistentry>

  <varlistentry>
   <term>Master-Slave Replication</term>
   <listitem>

    <para>
     A master-slave replication setup sends all data modification
     queries to the master server.  The master server asynchronously
     sends data changes to the slave server.  The slave can answer
     read-only queries while the master server is running.  The
     slave server is ideal for data warehouse queries.
    </para>

    <para>
     <productname>Slony-I</> is an example of this type of replication, with per-table
     granularity, and support for multiple slaves.  Because it
     updates the slave server asynchronously (in batches), data loss
     is possible during failover.
    </para>
   </listitem>
  </varlistentry>

  <varlistentry>
   <term>Statement-Based Replication Middleware</term>
   <listitem>

    <para>
     With statement-based replication middleware, a program intercepts
     every SQL query and sends it to one or all servers.  Each server
     operates independently.  Read-write queries are sent to all servers,
     while read-only queries can be sent to just one server, allowing
     the read workload to be distributed.
    </para>

    <para>
     If queries are simply broadcast unmodified, functions like
     <function>random()</>, <function>CURRENT_TIMESTAMP</>, and
     sequences would have different values on different servers.
     This is because each server operates independently, and because
     SQL queries are broadcast (and not actual modified rows).  If
     this is unacceptable, either the middleware or the application
     must query such values from a single server and then use those
     values in write queries.  Also, care must be taken that all
     transactions either commit or abort on all servers, perhaps
     using two-phase commit (<xref linkend="sql-prepare-transaction"
     endterm="sql-prepare-transaction-title"> and <xref
     linkend="sql-commit-prepared" endterm="sql-commit-prepared-title">.
     <productname>Pgpool</> and <productname>Sequoia</> are examples of
     this type of replication.
    </para>
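
    <para>
     As an illustration (the table and sequence names are hypothetical),
     the middleware can evaluate non-deterministic expressions on a
     single server and broadcast a rewritten statement that contains
     only constants:
    </para>

<programlisting>
-- statement issued by the application
INSERT INTO orders (id, placed_at)
    VALUES (nextval('orders_id_seq'), CURRENT_TIMESTAMP);

-- the middleware evaluates the volatile parts on one server only ...
SELECT nextval('orders_id_seq'), CURRENT_TIMESTAMP;

-- ... and broadcasts the rewritten statement to every server
INSERT INTO orders (id, placed_at)
    VALUES (1042, '2007-11-12 17:02:14');
</programlisting>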
   </listitem>
  </varlistentry>

  <varlistentry>
   <term>Asynchronous Multi-Master Replication</term>
   <listitem>

    <para>
     For servers that are not regularly connected, like laptops or
     remote servers, keeping data consistent among servers is a
     challenge.  Using asynchronous multi-master replication, each
     server works independently, and periodically communicates with
     the other servers to identify conflicting transactions.  The
     conflicts can be resolved by users or conflict resolution rules.
    </para>
   </listitem>
  </varlistentry>

  <varlistentry>
   <term>Synchronous Multi-Master Replication</term>
   <listitem>

    <para>
     In synchronous multi-master replication, each server can accept
     write requests, and modified data is transmitted from the
     original server to every other server before each transaction
     commits.  Heavy write activity can cause excessive locking,
     leading to poor performance.  In fact, write performance is
     often worse than that of a single server.  Read requests can
     be sent to any server.  Some implementations use shared disk
     to reduce the communication overhead.  Synchronous multi-master
     replication is best for mostly read workloads, though its big
     advantage is that any server can accept write requests &mdash;
     there is no need to partition workloads between master and
     slave servers, and because the data changes are sent from one
     server to another, there is no problem with non-deterministic
     functions like <function>random()</>.
    </para>

    <para>
     <productname>PostgreSQL</> does not offer this type of replication,
     though <productname>PostgreSQL</> two-phase commit (<xref
     linkend="sql-prepare-transaction"
     endterm="sql-prepare-transaction-title"> and <xref
     linkend="sql-commit-prepared" endterm="sql-commit-prepared-title">)
     can be used to implement this in application code or middleware.
    </para>
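
    <para>
     A minimal sketch of two-phase commit in SQL (the table and
     transaction identifier are hypothetical); application code or
     middleware would run the same steps on every server and issue
     <command>COMMIT PREPARED</> only after all servers have prepared
     successfully:
    </para>

<programlisting>
BEGIN;
UPDATE accounts SET balance = balance - 100.00 WHERE account_id = 1;
-- once prepared, the transaction survives a crash and keeps its locks
PREPARE TRANSACTION 'tx_1042';

-- after every server reports a successful prepare:
COMMIT PREPARED 'tx_1042';
-- if any server failed to prepare, instead issue:
-- ROLLBACK PREPARED 'tx_1042';
</programlisting>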
   </listitem>
  </varlistentry>

  <varlistentry>
   <term>Commercial Solutions</term>
   <listitem>

    <para>
     Because <productname>PostgreSQL</> is open source and easily
     extended, a number of companies have taken <productname>PostgreSQL</>
     and created commercial closed-source solutions with unique
     failover, replication, and load balancing capabilities.
    </para>
   </listitem>
  </varlistentry>

 </variablelist>

 <para>
  The table below (<xref linkend="high-availability-matrix">) summarizes
  the capabilities of the various solutions listed above.
 </para>

 <table id="high-availability-matrix">
  <title>High Availability, Load Balancing, and Replication Feature Matrix</title>
  <tgroup cols="9">
   <thead>
    <row>
     <entry>Feature</entry>
     <entry>Shared Disk Failover</entry>
     <entry>File System Replication</entry>
     <entry>Warm Standby Using PITR</entry>
     <entry>Master-Slave Replication</entry>
     <entry>Statement-Based Replication Middleware</entry>
     <entry>Asynchronous Multi-Master Replication</entry>
     <entry>Synchronous Multi-Master Replication</entry>
    </row>
   </thead>

   <tbody>

    <row>
     <entry>No special hardware required</entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
    </row>

    <row>
     <entry>Allows multiple master servers</entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
    </row>

    <row>
     <entry>No master server overhead</entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
    </row>

    <row>
     <entry>No waiting for multiple servers</entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
    </row>

    <row>
     <entry>Master failure will never lose data</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
    </row>

    <row>
     <entry>Slaves accept read-only queries</entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
    </row>

    <row>
     <entry>Per-table granularity</entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
    </row>

    <row>
     <entry>No conflict resolution necessary</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
    </row>

   </tbody>
  </tgroup>
 </table>

 <para>
  There are a few solutions that do not fit into the above categories:
 </para>

 <variablelist>

  <varlistentry>
   <term>Data Partitioning</term>
   <listitem>

    <para>
     Data partitioning splits tables into data sets.  Each set can
     be modified by only one server.  For example, data can be
     partitioned by offices, e.g. London and Paris, with a server
     in each office.  If queries combining London and Paris data
     are necessary, an application can query both servers, or
     master/slave replication can be used to keep a read-only copy
     of the other office's data on each server.
    </para>
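
    <para>
     A minimal sketch of the two-office example (the table names are
     hypothetical): each server is the only writer of its own office's
     table and holds a replicated, read-only copy of the other office's
     table:
    </para>

<programlisting>
-- on the London server: london_orders is writable here, while
-- paris_orders is a read-only copy maintained by replication
CREATE TABLE london_orders (order_id integer PRIMARY KEY, amount numeric);
CREATE TABLE paris_orders  (order_id integer PRIMARY KEY, amount numeric);

-- a query combining data from both offices
SELECT sum(amount)
  FROM (SELECT amount FROM london_orders
        UNION ALL
        SELECT amount FROM paris_orders) AS all_orders;
</programlisting>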
   </listitem>
  </varlistentry>

  <varlistentry>
   <term>Multi-Server Parallel Query Execution</term>
   <listitem>

    <para>
     Many of the above solutions allow multiple servers to handle multiple
     queries, but none allow a single query to use multiple servers to
     complete faster.  This solution allows multiple servers to work
     concurrently on a single query.  It is usually accomplished by
     splitting the data among servers and having each server execute its
     part of the query and return results to a central server where they
     are combined and returned to the user.  <productname>Pgpool-II</>
     has this capability.  Also, this can be implemented using the
     <productname>PL/Proxy</> toolset.
    </para>
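
    <para>
     A minimal sketch of the <productname>PL/Proxy</> approach (the
     cluster and function names are hypothetical): a proxy function
     runs the same query on every partition server, and the caller
     combines the partial results centrally:
    </para>

<programlisting>
-- defined on the proxy server; each partition provides a
-- matching office_totals() that scans only its local data
CREATE FUNCTION office_totals()
RETURNS SETOF numeric AS $$
    CLUSTER 'officecluster';  -- the set of partition databases
    RUN ON ALL;               -- execute on every partition
$$ LANGUAGE plproxy;

-- combine the per-partition results on the central server
SELECT sum(total) FROM office_totals() AS total;
</programlisting>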

   </listitem>
  </varlistentry>

 </variablelist>

</chapter>