1. 03 Feb, 2015 11 commits
    • Andres Freund's avatar
      Don't allow immediate interrupts during authentication anymore. · 6647248e
      Andres Freund authored
      We used to handle authentication_timeout by setting
      ImmediateInterruptOK to true during large parts of the authentication
      phase of a new connection.  While that happens to work acceptably in
      practice, it's not particularly nice and has ugly corner cases.
      
      Previous commits converted the FE/BE communication to use latches and
      implemented support for interrupt handling during both
      send/recv. Building on top of that work we can get rid of
      ImmediateInterruptOK during authentication, by immediately treating
      timeouts during authentication as a reason to die. As die interrupts
      are handled immediately during client communication that provides a
      sensibly quick reaction time to authentication timeout.
      
      Additionally add a few CHECK_FOR_INTERRUPTS() to some more complex
      authentication methods. More could be added, but this already should
      provides a reasonable coverage.
      
      While it this overall increases the maximum time till a timeout is
      reacted to, it greatly reduces complexity and increases
      reliability. That seems like a overall win. If the increase proves to
      be noticeable we can deal with those cases by moving to nonblocking
      network code and add interrupt checking there.
      
      Reviewed-By: Heikki Linnakangas
      6647248e
    • Tom Lane's avatar
      Remove unused "m" field in LSEG. · cec916f3
      Tom Lane authored
      This field has been unreferenced since 1998, and does not appear in lseg
      values stored on disk (since sizeof(lseg) is only 32 bytes according to
      pg_type).  There was apparently some idea of maintaining it just in values
      appearing in memory, but the bookkeeping required to make that work would
      surely far outweigh the cost of recalculating the line's slope when needed.
      Remove it to (a) simplify matters and (b) suppress some uninitialized-field
      whining from Coverity.
      cec916f3
    • Andres Freund's avatar
      Process 'die' interrupts while reading/writing from the client socket. · 4fe384bd
      Andres Freund authored
      Up to now it was impossible to terminate a backend that was trying to
      send/recv data to/from the client when the socket's buffer was already
      full/empty. While the send/recv calls itself might have gotten
      interrupted by signals on some platforms, we just immediately retried.
      
      That could lead to situations where a backend couldn't be terminated ,
      after a client died without the connection being closed, because it
      was blocked in send/recv.
      
      The problem was far more likely to be hit when sending data than when
      reading. That's because while reading a command from the client, and
      during authentication, we processed interrupts immediately . That
      primarily left COPY FROM STDIN as being problematic for recv.
      
      Change things so that that we process 'die' events immediately when
      the appropriate signal arrives. We can't sensibly react to query
      cancels at that point, because we might loose sync with the client as
      we could be in the middle of writing a message.
      
      We don't interrupt writes if the write buffer isn't full, as indicated
      by write() returning EWOULDBLOCK, as that would lead to fewer error
      messages reaching clients.
      
      Per discussion with Kyotaro HORIGUCHI and Heikki Linnakangas
      
      Discussion: 20140927191243.GD5423@alap3.anarazel.de
      4fe384bd
    • Andres Freund's avatar
      Introduce and use infrastructure for interrupt processing during client reads. · 4f85fde8
      Andres Freund authored
      Up to now large swathes of backend code ran inside signal handlers
      while reading commands from the client, to allow for speedy reaction to
      asynchronous events. Most prominently shared invalidation and NOTIFY
      handling. That means that complex code like the starting/stopping of
      transactions is run in signal handlers...  The required code was
      fragile and verbose, and is likely to contain bugs.
      
      That approach also severely limited what could be done while
      communicating with the client. As the read might be from within
      openssl it wasn't safely possible to trigger an error, e.g. to cancel
      a backend in idle-in-transaction state. We did that in some cases,
      namely fatal errors, nonetheless.
      
      Now that FE/BE communication in the backend employs non-blocking
      sockets and latches to block, we can quite simply interrupt reads from
      signal handlers by setting the latch. That allows us to signal an
      interrupted read, which is supposed to be retried after returning from
      within the ssl library.
      
      As signal handlers now only need to set the latch to guarantee timely
      interrupt processing, remove a fair amount of complicated & fragile
      code from async.c and sinval.c.
      
      We could now actually start to process some kinds of interrupts, like
      sinval ones, more often that before, but that seems better done
      separately.
      
      This work will hopefully allow to handle cases like being blocked by
      sending data, interrupting idle transactions and similar to be
      implemented without too much effort.  In addition to allowing getting
      rid of ImmediateInterruptOK, that is.
      
      Author: Andres Freund
      Reviewed-By: Heikki Linnakangas
      4f85fde8
    • Andres Freund's avatar
      Use a nonblocking socket for FE/BE communication and block using latches. · 387da188
      Andres Freund authored
      This allows to introduce more elaborate handling of interrupts while
      reading from a socket.  Currently some interrupt handlers have to do
      significant work from inside signal handlers, and it's very hard to
      correctly write code to do so.  Generic signal handler limitations,
      combined with the fact that we can't safely jump out of a signal
      handler while reading from the client have prohibited implementation
      of features like timeouts for idle-in-transaction.
      
      Additionally we use the latch code to wait in a couple places where we
      previously only had waiting code on windows as other platforms just
      busy looped.
      
      This can increase the number of systemcalls happening during FE/BE
      communication. Benchmarks so far indicate that the impact isn't very
      high, and there's room for optimization in the latch code. The chance
      of cleaning up the usage of latches gives us, seem to outweigh the
      risk of small performance regressions.
      
      This commit theoretically can't used without the next patch in the
      series, as WaitLatchOrSocket is not defined to be fully signal
      safe. As we already do that in some cases though, it seems better to
      keep the commits separate, so they're easier to understand.
      
      Author: Andres Freund
      Reviewed-By: Heikki Linnakangas
      387da188
    • Tom Lane's avatar
      Fix breakage in GEODEBUG debug code. · 778d498c
      Tom Lane authored
      LINE doesn't have an "m" field (anymore anyway).  Also fix unportable
      assumption that %x can print the result of pointer subtraction.
      
      In passing, improve single_decode() in minor ways:
      * Remove unnecessary leading-whitespace skip (strtod does that already).
      * Make GEODEBUG message more intelligible.
      * Remove entirely-useless test to see if strtod returned a silly pointer.
      * Don't bother computing trailing-whitespace skip unless caller wants
        an ending pointer.
      
      This has been broken since 261c7d4b.
      Although it's only debug code, might as well fix the 9.4 branch too.
      778d498c
    • Heikki Linnakangas's avatar
      Add API functions to libpq to interrogate SSL related stuff. · 91fa7b47
      Heikki Linnakangas authored
      This makes it possible to query for things like the SSL version and cipher
      used, without depending on OpenSSL functions or macros. That is a good
      thing if we ever get another SSL implementation.
      
      PQgetssl() still works, but it should be considered as deprecated as it
      only works with OpenSSL. In particular, PQgetSslInUse() should be used to
      check if a connection uses SSL, because as soon as we have another
      implementation, PQgetssl() will return NULL even if SSL is in use.
      91fa7b47
    • Heikki Linnakangas's avatar
      Refactor page compactifying code. · 809d9a26
      Heikki Linnakangas authored
      The logic to compact away removed tuples from page was duplicated with
      small differences in PageRepairFragmentation, PageIndexMultiDelete, and
      PageIndexDeleteNoCompact. Put it into a common function.
      
      Reviewed by Peter Geoghegan.
      809d9a26
    • Heikki Linnakangas's avatar
      Rephrase the documentation on pg_receivexlog --synchronous option. · 507627f5
      Heikki Linnakangas authored
      The old wording talked about a "sync command", meaining fsync(), but it
      was not very clear.
      507627f5
    • Heikki Linnakangas's avatar
      Fix typo in comment. · efba7a54
      Heikki Linnakangas authored
      Amit Langote
      efba7a54
    • Heikki Linnakangas's avatar
      Remove dead code. · 4eaafa04
      Heikki Linnakangas authored
      Commit 13629df5 changed metaphone() function to return an empty string on
      empty input, but it left the old error message in place. It's now dead code.
      
      Michael Paquier, per Coverity warning.
      4eaafa04
  2. 02 Feb, 2015 12 commits
    • Robert Haas's avatar
      Add new function BackgroundWorkerInitializeConnectionByOid. · 5d2f957f
      Robert Haas authored
      Sometimes it's useful for a background worker to be able to initialize
      its database connection by OID rather than by name, so provide a way
      to do that.
      5d2f957f
    • Tom Lane's avatar
      Last-minute updates for release notes. · 2488eff8
      Tom Lane authored
      Add entries for security issues.
      
      Security: CVE-2015-0241 through CVE-2015-0244
      2488eff8
    • Heikki Linnakangas's avatar
      Be more careful to not lose sync in the FE/BE protocol. · 2b3a8b20
      Heikki Linnakangas authored
      If any error occurred while we were in the middle of reading a protocol
      message from the client, we could lose sync, and incorrectly try to
      interpret a part of another message as a new protocol message. That will
      usually lead to an "invalid frontend message" error that terminates the
      connection. However, this is a security issue because an attacker might
      be able to deliberately cause an error, inject a Query message in what's
      supposed to be just user data, and have the server execute it.
      
      We were quite careful to not have CHECK_FOR_INTERRUPTS() calls or other
      operations that could ereport(ERROR) in the middle of processing a message,
      but a query cancel interrupt or statement timeout could nevertheless cause
      it to happen. Also, the V2 fastpath and COPY handling were not so careful.
      It's very difficult to recover in the V2 COPY protocol, so we will just
      terminate the connection on error. In practice, that's what happened
      previously anyway, as we lost protocol sync.
      
      To fix, add a new variable in pqcomm.c, PqCommReadingMsg, that is set
      whenever we're in the middle of reading a message. When it's set, we cannot
      safely ERROR out and continue running, because we might've read only part
      of a message. PqCommReadingMsg acts somewhat similarly to critical sections
      in that if an error occurs while it's set, the error handler will force the
      connection to be terminated, as if the error was FATAL. It's not
      implemented by promoting ERROR to FATAL in elog.c, like ERROR is promoted
      to PANIC in critical sections, because we want to be able to use
      PG_TRY/CATCH to recover and regain protocol sync. pq_getmessage() takes
      advantage of that to prevent an OOM error from terminating the connection.
      
      To prevent unnecessary connection terminations, add a holdoff mechanism
      similar to HOLD/RESUME_INTERRUPTS() that can be used hold off query cancel
      interrupts, but still allow die interrupts. The rules on which interrupts
      are processed when are now a bit more complicated, so refactor
      ProcessInterrupts() and the calls to it in signal handlers so that the
      signal handlers always call it if ImmediateInterruptOK is set, and
      ProcessInterrupts() can decide to not do anything if the other conditions
      are not met.
      
      Reported by Emil Lenngren. Patch reviewed by Noah Misch and Andres Freund.
      Backpatch to all supported versions.
      
      Security: CVE-2015-0244
      2b3a8b20
    • Noah Misch's avatar
      Prevent Valgrind Memcheck errors around px_acquire_system_randomness(). · 59b91982
      Noah Misch authored
      This function uses uninitialized stack and heap buffers as supplementary
      entropy sources.  Mark them so Memcheck will not complain.  Back-patch
      to 9.4, where Valgrind Memcheck cooperation first appeared.
      
      Marko Tiikkaja
      59b91982
    • Noah Misch's avatar
      Cherry-pick security-relevant fixes from upstream imath library. · 8b59672d
      Noah Misch authored
      This covers alterations to buffer sizing and zeroing made between imath
      1.3 and imath 1.20.  Valgrind Memcheck identified the buffer overruns
      and reliance on uninitialized data; their exploit potential is unknown.
      Builds specifying --with-openssl are unaffected, because they use the
      OpenSSL BIGNUM facility instead of imath.  Back-patch to 9.0 (all
      supported versions).
      
      Security: CVE-2015-0243
      8b59672d
    • Noah Misch's avatar
      Fix buffer overrun after incomplete read in pullf_read_max(). · 1dc75515
      Noah Misch authored
      Most callers pass a stack buffer.  The ensuing stack smash can crash the
      server, and we have not ruled out the viability of attacks that lead to
      privilege escalation.  Back-patch to 9.0 (all supported versions).
      
      Marko Tiikkaja
      
      Security: CVE-2015-0243
      1dc75515
    • Bruce Momjian's avatar
      port/snprintf(): fix overflow and do padding · 29725b3d
      Bruce Momjian authored
      Prevent port/snprintf() from overflowing its local fixed-size
      buffer and pad to the desired number of digits with zeros, even
      if the precision is beyond the ability of the native sprintf().
      port/snprintf() is only used on systems that lack a native
      snprintf().
      
      Reported by Bruce Momjian. Patch by Tom Lane.	Backpatch to all
      supported versions.
      
      Security: CVE-2015-0242
      29725b3d
    • Bruce Momjian's avatar
      to_char(): prevent writing beyond the allocated buffer · 9241c84c
      Bruce Momjian authored
      Previously very long localized month and weekday strings could
      overflow the allocated buffers, causing a server crash.
      
      Reported and patch reviewed by Noah Misch.  Backpatch to all
      supported versions.
      
      Security: CVE-2015-0241
      9241c84c
    • Bruce Momjian's avatar
      to_char(): prevent accesses beyond the allocated buffer · 0150ab56
      Bruce Momjian authored
      Previously very long field masks for floats could access memory
      beyond the existing buffer allocated to hold the result.
      
      Reported by Andres Freund and Peter Geoghegan.	Backpatch to all
      supported versions.
      
      Security: CVE-2015-0241
      0150ab56
    • Tom Lane's avatar
      Doc: fix syntax description for psql's \setenv. · f9ee8ea1
      Tom Lane authored
      The variable name isn't optional --- looks like a copy-and-paste-o from
      the \set command, where it is.
      
      Dilip Kumar
      f9ee8ea1
    • Peter Eisentraut's avatar
      Translation updates · f8948616
      Peter Eisentraut authored
      Source-Git-URL: git://git.postgresql.org/git/pgtranslation/messages.git
      Source-Git-Hash: 19c72ea8d856d7b1d4f5d759a766c8206bf9ce53
      f8948616
    • Peter Eisentraut's avatar
      doc: Improve claim about location of pg_service.conf · 1332bbb3
      Peter Eisentraut authored
      The previous wording claimed that the file was always in /etc, but of
      course this varies with the installation layout.  Write instead that it
      can be found via `pg_config --sysconfdir`.  Even though this is still
      somewhat incorrect because it doesn't account of moved installations, it
      at least conveys that the location depends on the installation.
      1332bbb3
  3. 01 Feb, 2015 1 commit
  4. 31 Jan, 2015 3 commits
    • Tom Lane's avatar
      Fix documentation of psql's ECHO all mode. · b7d254c0
      Tom Lane authored
      "ECHO all" is ignored for interactive input, and has been for a very long
      time, though possibly not for as long as the documentation has claimed the
      opposite.  Fix that, and also note that empty lines aren't echoed, which
      while dubious is another longstanding behavior (it's embedded in our
      regression test files for one thing).  Per bug #12721 from Hans Ginzel.
      
      In HEAD, also improve the code comments in this area, and suppress an
      unnecessary fflush(stdout) when we're not echoing.  That would likely
      be safe to back-patch, but I'll not risk it mere hours before a release
      wrap.
      b7d254c0
    • Tom Lane's avatar
      First-draft release notes for 9.4.1 et al. · 77e9125e
      Tom Lane authored
      As usual, the release notes for older branches will be made by cutting
      these down, but put them up for community review first.
      
      Note: a significant fraction of these items don't apply to 9.4.1, only to
      older branches, because the fixes already appeared in 9.4.0.  These can be
      distinguished by noting the branch commits in the associated SGML comments.
      This will be adjusted tomorrow while copying items into the older
      release-X.Y.sgml files.  In a few cases I've made two separate entries with
      different wordings for 9.4 than for the equivalent commits in the older
      branches.
      77e9125e
    • Tom Lane's avatar
      Update time zone data files to tzdata release 2015a. · 08bd0c58
      Tom Lane authored
      DST law changes in Chile and Mexico (state of Quintana Roo).
      Historical changes for Iceland.
      08bd0c58
  5. 30 Jan, 2015 11 commits
    • Stephen Frost's avatar
      Policy documentation improvements · 3144e1b3
      Stephen Frost authored
      In ALTER POLICY, use 'check_expression' instead of 'expression' for the
      parameter, to match up with the recent CREATE POLICY change.
      
      In CREATE POLICY, frame the discussion as granting access to rows
      instead of limiting access to rows.  Further, clarify that the
      expression must return true for rows to be visible/allowed and that a
      false or NULL result will mean the row is not visible/allowed.
      
      Per discussion with Dean Rasheed and Robert.
      3144e1b3
    • Tom Lane's avatar
      Fix jsonb Unicode escape processing, and in consequence disallow \u0000. · 451d2808
      Tom Lane authored
      We've been trying to support \u0000 in JSON values since commit
      78ed8e03, and have introduced increasingly worse hacks to try to
      make it work, such as commit 0ad1a816.  However, it fundamentally
      can't work in the way envisioned, because the stored representation looks
      the same as for \\u0000 which is not the same thing at all.  It's also
      entirely bogus to output \u0000 when de-escaped output is called for.
      
      The right way to do this would be to store an actual 0x00 byte, and then
      throw error only if asked to produce de-escaped textual output.  However,
      getting to that point seems likely to take considerable work and may well
      never be practical in the 9.4.x series.
      
      To preserve our options for better behavior while getting rid of the nasty
      side-effects of 0ad1a816, revert that commit in toto and instead
      throw error if \u0000 is used in a context where it needs to be de-escaped.
      (These are the same contexts where non-ASCII Unicode escapes throw error
      if the database encoding isn't UTF8, so this behavior is by no means
      without precedent.)
      
      In passing, make both the \u0000 case and the non-ASCII Unicode case report
      ERRCODE_UNTRANSLATABLE_CHARACTER / "unsupported Unicode escape sequence"
      rather than claiming there's something wrong with the input syntax.
      
      Back-patch to 9.4, where we have to do something because 0ad1a816
      broke things for many cases having nothing to do with \u0000.  9.3 also has
      bogus behavior, but only for that specific escape value, so given the lack
      of field complaints it seems better to leave 9.3 alone.
      451d2808
    • Peter Eisentraut's avatar
      doc: Remove superfluous table column · e40d43f8
      Peter Eisentraut authored
      e40d43f8
    • Tom Lane's avatar
      Fix Coverity warning about contrib/pgcrypto's mdc_finish(). · a59ee881
      Tom Lane authored
      Coverity points out that mdc_finish returns a pointer to a local buffer
      (which of course is gone as soon as the function returns), leaving open
      a risk of misbehaviors possibly as bad as a stack overwrite.
      
      In reality, the only possible call site is in process_data_packets()
      which does not examine the returned pointer at all.  So there's no
      live bug, but nonetheless the code is confusing and risky.  Refactor
      to avoid the issue by letting process_data_packets() call mdc_finish()
      directly instead of going through the pullf_read() API.
      
      Although this is only cosmetic, it seems good to back-patch so that
      the logic in pgp-decrypt.c stays in sync across all branches.
      
      Marko Kreen
      a59ee881
    • Robert Haas's avatar
      Provide a way to supress the "out of memory" error when allocating. · bd4e2fd9
      Robert Haas authored
      Using the new interface MemoryContextAllocExtended, callers can
      specify MCXT_ALLOC_NO_OOM if they are prepared to handle a NULL
      return value.
      
      Michael Paquier, reviewed and somewhat revised by me.
      bd4e2fd9
    • Tom Lane's avatar
      Fix assorted oversights in range selectivity estimation. · 3d660d33
      Tom Lane authored
      calc_rangesel() failed outright when comparing range variables to empty
      constant ranges with < or >=, as a result of missing cases in a switch.
      It also produced a bogus estimate for > comparison to an empty range.
      
      On top of that, the >= and > cases were mislabeled throughout.  For
      nonempty constant ranges, they managed to produce the right answers
      anyway as a result of counterbalancing typos.
      
      Also, default_range_selectivity() omitted cases for elem <@ range,
      range &< range, and range &> range, so that rather dubious defaults
      were applied for these operators.
      
      In passing, rearrange the code in rangesel() so that the elem <@ range
      case is handled in a less opaque fashion.
      
      Report and patch by Emre Hasegeli, some additional work by me
      3d660d33
    • Heikki Linnakangas's avatar
      Fix query-duration memory leak with GIN rescans. · 68fa75f3
      Heikki Linnakangas authored
      The requiredEntries / additionalEntries arrays were not freed in
      freeScanKeys() like other per-key stuff.
      
      It's not obvious, but startScanKey() was only ever called after the keys
      have been initialized with ginNewScanKey(). That's why it doesn't need to
      worry about freeing existing arrays. The ginIsNewKey() test in gingetbitmap
      was never true, because ginrescan free's the existing keys, and it's not OK
      to call gingetbitmap twice in a row without calling ginrescan in between.
      To make that clear, remove the unnecessary ginIsNewKey(). And just to be
      extra sure that nothing funny happens if there is an existing key after all,
      call freeScanKeys() to free it if it exists. This makes the code more
      straightforward.
      
      (I'm seeing other similar leaks in testing a query that rescans an GIN index
      scan, but that's a different issue. This just fixes the obvious leak with
      those two arrays.)
      
      Backpatch to 9.4, where GIN fast scan was added.
      68fa75f3
    • Kevin Grittner's avatar
      Allow pg_dump to use jobs and serializable transactions together. · cff1bd2a
      Kevin Grittner authored
      Since 9.3, when the --jobs option was introduced, using it together
      with the --serializable-deferrable option generated multiple
      errors.  We can get correct behavior by allowing the connection
      which acquires the snapshot to use SERIALIZABLE, READ ONLY,
      DEFERRABLE and pass that to the workers running the other
      connections using REPEATABLE READ, READ ONLY.  This is a bit of a
      kluge since the SERIALIZABLE behavior is achieved by running some
      of the participating connections at a different isolation level,
      but it is a simple and safe change, suitable for back-patching.
      
      This will be followed by a proposal for a more invasive fix with
      some slight behavioral changes on just the master branch, based on
      suggestions from Andres Freund, but the kluge will be applied to
      master until something is agreed along those lines.
      
      Back-patched to 9.3, where the --jobs option was added.
      
      Based on report from Alexander Korotkov
      cff1bd2a
    • Stephen Frost's avatar
      Fix BuildIndexValueDescription for expressions · 32bf6ee6
      Stephen Frost authored
      In 804b6b6d we modified
      BuildIndexValueDescription to pay attention to which columns are visible
      to the user, but unfortunatley that commit neglected to consider indexes
      which are built on expressions.
      
      Handle error-reporting of violations of constraint indexes based on
      expressions by not returning any detail when the user does not have
      table-level SELECT rights.
      
      Backpatch to 9.0, as the prior commit was.
      
      Pointed out by Tom.
      32bf6ee6
    • Bruce Momjian's avatar
      doc: clarify libpq's 'verify-full' host name check · e7704305
      Bruce Momjian authored
      Report by David Guyot
      e7704305
    • Tom Lane's avatar
      Handle unexpected query results, especially NULLs, safely in connectby(). · 37507962
      Tom Lane authored
      connectby() didn't adequately check that the constructed SQL query returns
      what it's expected to; in fact, since commit 08c33c42 it wasn't
      checking that at all.  This could result in a null-pointer-dereference
      crash if the constructed query returns only one column instead of the
      expected two.  Less excitingly, it could also result in surprising data
      conversion failures if the constructed query returned values that were
      not I/O-conversion-compatible with the types specified by the query
      calling connectby().
      
      In all branches, insist that the query return at least two columns;
      this seems like a minimal sanity check that can't break any reasonable
      use-cases.
      
      In HEAD, insist that the constructed query return the types specified by
      the outer query, including checking for typmod incompatibility, which the
      code never did even before it got broken.  This is to hide the fact that
      the implementation does a conversion to text and back; someday we might
      want to improve that.
      
      In back branches, leave that alone, since adding a type check in a minor
      release is more likely to break things than make people happy.  Type
      inconsistencies will continue to work so long as the actual type and
      declared type are I/O representation compatible, and otherwise will fail
      the same way they used to.
      
      Also, in all branches, be on guard for NULL results from the constructed
      query, which formerly would cause null-pointer dereference crashes.
      We now print the row with the NULL but don't recurse down from it.
      
      In passing, get rid of the rather pointless idea that
      build_tuplestore_recursively() should return the same tuplestore that's
      passed to it.
      
      Michael Paquier, adjusted somewhat by me
      37507962
  6. 29 Jan, 2015 2 commits
    • Andres Freund's avatar
      Properly terminate the array returned by GetLockConflicts(). · 17792bfc
      Andres Freund authored
      GetLockConflicts() has for a long time not properly terminated the
      returned array. During normal processing the returned array is zero
      initialized which, while not pretty, is sufficient to be recognized as
      a invalid virtual transaction id. But the HotStandby case is more than
      aesthetically broken: The allocated (and reused) array is neither
      zeroed upon allocation, nor reinitialized, nor terminated.
      
      Not having a terminating element means that the end of the array will
      not be recognized and that recovery conflict handling will thus read
      ahead into adjacent memory. Only terminating when hitting memory
      content that looks like a invalid virtual transaction id.  Luckily
      this seems so far not have caused significant problems, besides making
      recovery conflict more expensive.
      
      Discussion: 20150127142713.GD29457@awork2.anarazel.de
      
      Backpatch into all supported branches.
      17792bfc
    • Andres Freund's avatar
      Align buffer descriptors to cache line boundaries. · ed127002
      Andres Freund authored
      Benchmarks has shown that aligning the buffer descriptor array to
      cache lines is important for scalability; especially on bigger,
      multi-socket, machines.
      
      Currently the array sometimes already happens to be aligned by
      happenstance, depending how large previous shared memory allocations
      were. That can lead to wildly varying performance results after minor
      configuration changes.
      
      In addition to aligning the start of descriptor array, also force the
      size of individual descriptors to be of a common cache line size (64
      bytes). That happens to already be the case on 64bit platforms, but
      this way we can change the struct BufferDesc more easily.
      
      As the alignment primarily matters in highly concurrent workloads
      which probably all are 64bit these days, and the space wastage of
      element alignment would be a bit more noticeable on 32bit systems, we
      don't force the stride to be cacheline sized on 32bit platforms for
      now. If somebody does actual performance testing, we can reevaluate
      that decision by changing the definition of BUFFERDESC_PADDED_SIZE.
      
      Discussion: 20140202151319.GD32123@awork2.anarazel.de
      
      Per discussion with Bruce Momjan, Tom Lane, Robert Haas, and Peter
      Geoghegan.
      ed127002