  1. 21 Oct, 2016 1 commit
    • Robert Haas's avatar
      postgres_fdw: Push down aggregates to remote servers. · 7012b132
      Robert Haas authored
      Now that the upper planner uses paths, and now that we have proper hooks
      to inject paths into the upper planning process, it's possible for
      foreign data wrappers to arrange to push aggregates to the remote side
      instead of fetching all of the rows and aggregating them locally.  This
      figures to be a massive win for performance, so teach postgres_fdw to
      do it.
      
      Jeevan Chalke and Ashutosh Bapat.  Reviewed by Ashutosh Bapat with
      additional testing by Prabhat Sahu.  Various mostly cosmetic changes
      by me.
      7012b132
  2. 20 Oct, 2016 6 commits
    • Tom Lane's avatar
      Fix EXPLAIN so that it doesn't emit invalid XML in corner cases. · 709e461b
      Tom Lane authored
      With track_io_timing = on, EXPLAIN (ANALYZE, BUFFERS) will emit fields
      named like "I/O Read Time".  The slash makes that invalid as an XML
      element name, so that adding FORMAT XML would produce invalid XML.
      
      We already have code in there to translate spaces to dashes, so let's
      generalize that to convert anything that isn't a valid XML name character,
      viz letters, digits, hyphens, underscores, and periods.  We could just
      reject slashes, which would run a bit faster.  But the fact that this went
      unnoticed for so long doesn't give me a warm feeling that we'd notice the
      next creative violation, so let's make it a permanent fix.
      
      Reported by Markus Winand, though this isn't his initial patch proposal.
      
      Back-patch to 9.2 where track_io_timing was added.  The problem is only
      latent in 9.1, so I don't feel a need to fix it there.
      
      Discussion: <E0BF6A45-68E8-45E6-918F-741FB332C6BB@winand.at>
      709e461b
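
The generalized name translation described in the commit above can be sketched roughly as follows. This is an illustrative Python sketch, not the C code in PostgreSQL's EXPLAIN implementation. The function name `xml_safe_name` is hypothetical, only ASCII letters are treated as letters, and the choice of a dash as the replacement character follows the existing space-to-dash behaviour mentioned in the message.

```python
import string

# Characters the commit message lists as valid in an XML name:
# letters, digits, hyphens, underscores, and periods.
# Assumption: only ASCII letters are considered here.
VALID_NAME_CHARS = set(string.ascii_letters + string.digits + "-_.")


def xml_safe_name(label):
    """Replace every character outside VALID_NAME_CHARS with a dash."""
    return "".join(ch if ch in VALID_NAME_CHARS else "-" for ch in label)


print(xml_safe_name("I/O Read Time"))  # prints I-O-Read-Time
```

Using an allow-list rather than rejecting only the slash means a future field name containing some other invalid character is handled without another fix, which is the reasoning the message gives.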
    • Tom Lane's avatar
      Sync our copy of the timezone library with IANA release tzcode2016h. · 5e21b681
      Tom Lane authored
      This absorbs a fix for a symlink-manipulation bug in zic that was
      introduced in 2016g.  It probably isn't interesting for our use-case,
      but I'm not quite sure, so let's update while we're at it.
      5e21b681
    • Tom Lane's avatar
      Update time zone data files to tzdata release 2016h. · d8fc45bd
      Tom Lane authored
      (Didn't I just do this?  Oh well.)
      
      DST law changes in Palestine.  Historical corrections for Turkey.
      Switch to numeric abbreviations for Asia/Colombo.
      d8fc45bd
    • Robert Haas's avatar
      Rename "pg_xlog" directory to "pg_wal". · f82ec32a
      Robert Haas authored
      "xlog" is not a particularly clear abbreviation for "write-ahead log",
      and it sometimes confuses users into believing that the contents of the
      "pg_xlog" directory are not critical data, leading to unpleasant
      consequences.  So, rename the directory to "pg_wal".
      
      This patch modifies pg_upgrade and pg_basebackup to understand both
      the old and new directory layouts; the former is necessary given the
      purpose of the tool, while the latter merely avoids an unnecessary
      backward-compatibility break.
      
      We may wish to consider renaming other programs, switches, and
      functions which still use the old "xlog" naming to also refer to
      "wal".  However, that's still under discussion, so let's do just this
      much for now.
      
      Discussion: CAB7nPqTeC-8+zux8_-4ZD46V7YPwooeFxgndfsq5Rg8ibLVm1A@mail.gmail.com
      
      Michael Paquier
      f82ec32a
    • Robert Haas's avatar
      Remove a comment which is now incorrect. · ec7db2b4
      Robert Haas authored
      Before 5d305d86, this comment was
      correct, but now it says we do something which we don't actually do.
      Accordingly, remove the comment.
      ec7db2b4
    • Tom Lane's avatar
      Another portability fix for tzcode2016g update. · 23ed2ba8
      Tom Lane authored
      clang points out that SIZE_MAX wouldn't fit into an int, which means
      this comparison is pretty useless.  Per report from Thomas Munro.
      23ed2ba8
  3. 19 Oct, 2016 11 commits
    • Tom Lane's avatar
      Windows portability fix. · ad90ac4d
      Tom Lane authored
      Per buildfarm.
      ad90ac4d
    • Tom Lane's avatar
      Sync our copy of the timezone library with IANA release tzcode2016g. · f3094920
      Tom Lane authored
      This is mostly to absorb some corner-case fixes in zic for year-2037
      timestamps.  The other changes that have been made are unlikely to affect
      our usage, but nonetheless we may as well take 'em.
      f3094920
    • Tom Lane's avatar
      Suppress "Factory" zone in pg_timezone_names view for tzdata >= 2016g. · a3215431
      Tom Lane authored
      IANA got rid of the really silly "abbreviation" and replaced it with one
      that's only moderately silly.  But it's still pointless, so keep on not
      showing it.
      a3215431
    • Tom Lane's avatar
      Update time zone data files to tzdata release 2016g. · ecbac3e6
      Tom Lane authored
      DST law changes in Turkey.  Historical corrections for America/Los_Angeles,
      Europe/Kirov, Europe/Moscow, Europe/Samara, and Europe/Ulyanovsk.
      Rename Asia/Rangoon to Asia/Yangon, with a backward compatibility link.
      
      The IANA crew continue their campaign to replace invented time zone
      abbreviations with numeric GMT offsets.  This update changes numerous zones
      in Antarctica and the former Soviet Union, for instance Antarctica/Casey
      now reports "+08" not "AWST" in the pg_timezone_names view.  I kept these
      abbreviations in the tznames/ data files, however, so that we will still
      accept them for input.  (We may want to start trimming those files someday,
      but today is not that day.)
      
      An exception is that since IANA no longer claims that "AMT" is in use
      in Armenia for GMT+4, I replaced it in the Default file with GMT-4,
      corresponding to Amazon Time which is in use in South America.  It may be
      that that meaning is also invented and IANA will drop it in a future
      update; but for now, it seems silly to give pride of place to a meaning
      not traceable to IANA over one that is.
      ecbac3e6
    • Peter Eisentraut's avatar
      Use pg_ctl promote -w in TAP tests · e5a9bcb5
      Peter Eisentraut authored
      Switch TAP tests to use the new wait mode of pg_ctl promote.  This
      allows avoiding extra logic with poll_query_until() to be sure that a
      promoted standby is ready for read-write queries.
      
      From: Michael Paquier <michael.paquier@gmail.com>
      e5a9bcb5
    • Peter Eisentraut's avatar
      initdb pg_basebackup: Rename --noxxx options to --no-xxx · 5d58c07a
      Peter Eisentraut authored
      --noclean and --nosync were the only options spelled without a hyphen,
      so change this for consistency with other options.  The options in
      pg_basebackup have not been in a release, so we just rename them.  For
      initdb, we retain the old variants.
      
      Vik Fearing and me
      5d58c07a
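
Keeping an old option spelling alongside a renamed one, as the commit above does for initdb, can be sketched like this. The sketch uses Python's `argparse` purely for illustration; initdb itself is a C program and does not use this code. The program name `initdb-sketch` and the function `build_parser` are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser that accepts both the new and the old spellings."""
    parser = argparse.ArgumentParser(prog="initdb-sketch")
    # The hyphenated spelling is listed first; the old one remains an alias.
    parser.add_argument("--no-clean", "--noclean", dest="no_clean",
                        action="store_true")
    parser.add_argument("--no-sync", "--nosync", dest="no_sync",
                        action="store_true")
    return parser


args = build_parser().parse_args(["--noclean", "--no-sync"])
print(args.no_clean, args.no_sync)  # prints True True
```

Both spellings set the same destination, so code that reads the parsed options does not need to know which one the user typed.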
    • Peter Eisentraut's avatar
      pg_ctl: Add long option for -o · caf936b0
      Peter Eisentraut authored
      Now all normally used options are covered by long options as well.
      caf936b0
    • Peter Eisentraut's avatar
      doc: Consistently use = sign in long options synopses · c709c607
      Peter Eisentraut authored
      This was already the predominant form in man pages and help output.
      c709c607
    • Peter Eisentraut's avatar
      pg_ctl: Add long options for -w and -W · 0be22457
      Peter Eisentraut authored
      From: Vik Fearing <vik@2ndquadrant.fr>
      0be22457
    • Heikki Linnakangas's avatar
      Fix WAL-logging of FSM and VM truncation. · 917dc7d2
      Heikki Linnakangas authored
      When a relation is truncated, it is important that the FSM is truncated as
      well. Otherwise, after recovery, the FSM can return a page that has been
      truncated away, leading to errors like:
      
      ERROR:  could not read block 28991 in file "base/16390/572026": read only 0
      of 8192 bytes
      
      We were using MarkBufferDirtyHint() to dirty the buffer holding the last
      remaining page of the FSM, but during recovery, that might in fact not
      dirty the page, and the FSM update might be lost.
      
      To fix, use the stronger MarkBufferDirty() function. MarkBufferDirty()
      requires us to do WAL-logging ourselves, to protect from a torn page, if
      checksumming is enabled.
      
      Also fix an oversight in visibilitymap_truncate: it also needs to WAL-log
      when checksumming is enabled.
      
      Analysis by Pavan Deolasee.
      
      Discussion: <CABOikdNr5vKucqyZH9s1Mh0XebLs_jRhKv6eJfNnD2wxTn=_9A@mail.gmail.com>
      917dc7d2
  4. 18 Oct, 2016 5 commits
    • Robert Haas's avatar
      Improve regression test coverage for hash indexes. · b801e120
      Robert Haas authored
      On my system, this improves coverage for src/backend/access/hash from
      61.3% of lines to 88.2% of lines, and from 83.5% of functions to 97.5%
      of functions, which is pretty good for 36 lines of tests.
      
      Mithun Cy, reviewed by Amit Kapila and Álvaro Herrera
      b801e120
    • Andres Freund's avatar
      Fix a few typos in simplehash.h. · 90d3da11
      Andres Freund authored
      Author: Erik Rijkers
      Discussion: <274e4c8ac545d6622735f97c1f6c354b@xs4all.nl>
      90d3da11
    • Robert Haas's avatar
      Fix typo in comment. · fca41acb
      Robert Haas authored
      Amit Langote
      fca41acb
    • Tom Lane's avatar
      Fix cidin() to handle values above 2^31 platform-independently. · 6f13a682
      Tom Lane authored
      CommandId is declared as uint32, and values up to 4G are indeed legal.
      cidout() handles them properly by treating the value as unsigned int.
      But cidin() was just using atoi(), which has platform-dependent behavior
      for values outside the range of signed int, as reported by Bart Lengkeek
      in bug #14379.  Use strtoul() instead, as xidin() does.
      
      In passing, make some purely cosmetic changes to make xidin/xidout
      look more like cidin/cidout; the former didn't have a monopoly on
      best practice IMO.
      
      Neither xidin nor cidin makes any attempt to throw an error for invalid input.
      I didn't change that here, and am not sure it's worth worrying about
      since neither is really a user-facing type.  The point is just to ensure
      that indubitably-valid inputs work as expected.
      
      It's been like this for a long time, so back-patch to all supported
      branches.
      
      Report: <20161018152550.1413.6439@wrigleys.postgresql.org>
      6f13a682
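
To illustrate why a signed parse is a poor fit for a uint32 value, the following Python sketch emulates an unsigned 32-bit parse next to two outcomes a signed-int parse could plausibly give for an out-of-range input. It is not the cidin() code. All function names are hypothetical, the two signed behaviours are assumptions chosen for illustration rather than a statement about any particular platform, and unlike cidin() Python's `int()` rejects non-numeric text.

```python
UINT32_MAX = 2**32 - 1
INT32_MAX = 2**31 - 1


def parse_unsigned32(text):
    """Parse decimal text and keep the low 32 bits as an unsigned value."""
    return int(text) & UINT32_MAX


def signed_parse_wrapping(text):
    """Assumed outcome 1: the value wraps around into the negative range."""
    value = int(text) & UINT32_MAX
    return value - 2**32 if value > INT32_MAX else value


def signed_parse_saturating(text):
    """Assumed outcome 2: the value is clamped at the largest signed int."""
    return min(int(text), INT32_MAX)


big = "3000000000"  # legal for a uint32, too large for a signed 32-bit int
print(parse_unsigned32(big))
print(signed_parse_wrapping(big))
print(signed_parse_saturating(big))
```

Only the unsigned parse round-trips the input; the two signed variants disagree with each other, which is the platform dependence the commit removes.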
    • Heikki Linnakangas's avatar
      Revert "Replace PostmasterRandom() with a stronger way of generating randomness." · faae1c91
      Heikki Linnakangas authored
      This reverts commit 9e083fd4. That was a
      few bricks shy of a load:
      
      * Query cancel stopped working
      * Buildfarm member pademelon stopped working, because the box doesn't have
        /dev/urandom nor /dev/random.
      
      This clearly needs some more discussion, and a quite different patch, so
      revert for now.
      faae1c91
  5. 17 Oct, 2016 4 commits
    • Robert Haas's avatar
      By default, set log_line_prefix = '%m [%p] '. · 7d3235ba
      Robert Haas authored
      This value might not be to everyone's taste; in particular, some
      people might prefer %t to %m, and others may want %u, %d, or other
      fields.  However, it's a vast improvement on the old default of ''.
      
      Christoph Berg
      7d3235ba
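
As a rough picture of what the new default prefix produces, the sketch below expands only the two escapes used in it: `%m` (timestamp with milliseconds) and `%p` (process ID). It is a Python illustration, not PostgreSQL's log formatting code; `expand_prefix` is a hypothetical name, and the timestamp and PID in the example are made up.

```python
from datetime import datetime, timezone


def expand_prefix(prefix, when, pid):
    """Expand %m and %p in a log_line_prefix-style string (sketch only)."""
    millis = when.microsecond // 1000
    stamp = (when.strftime("%Y-%m-%d %H:%M:%S")
             + f".{millis:03d} "
             + when.strftime("%Z"))
    return prefix.replace("%m", stamp).replace("%p", str(pid))


example_time = datetime(2016, 10, 17, 9, 30, 0, 123000, tzinfo=timezone.utc)
print(repr(expand_prefix("%m [%p] ", example_time, 4242)))
# prints '2016-10-17 09:30:00.123 UTC [4242] '
```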
    • Heikki Linnakangas's avatar
      Use OpenSSL EVP API for symmetric encryption in pgcrypto. · 5ff4a67f
      Heikki Linnakangas authored
      The old "low-level" API is deprecated, and doesn't support hardware
      acceleration. And this makes the code simpler, too.
      
      Discussion: <561274F1.1030000@iki.fi>
      5ff4a67f
    • Heikki Linnakangas's avatar
      Fix use-after-free around DISTINCT transition function calls. · d8589946
      Heikki Linnakangas authored
      Have tuplesort_gettupleslot() copy the contents of its current table slot
      as needed. This is based on an approach taken by tuplestore_gettupleslot().
      In the future, tuplesort_gettupleslot() may also be taught to avoid copying
      the tuple where caller can determine that that is safe (the
      tuplestore_gettupleslot() interface already offers this option to callers).
      
      Patch by Peter Geoghegan. Fixes bug #14344, reported by Regina Obe.
      
      Report: <20160929035538.20224.39628@wrigleys.postgresql.org>
      
      Backpatch-through: 9.6
      d8589946
    • Heikki Linnakangas's avatar
      Replace PostmasterRandom() with a stronger way of generating randomness. · 9e083fd4
      Heikki Linnakangas authored
      This adds a new routine, pg_strong_random() for generating random bytes,
      for use in both frontend and backend. At the moment, it's only used in
      the backend, but the upcoming SCRAM authentication patches need strong
      random numbers in libpq as well.
      
      pg_strong_random() is based on, and replaces, the existing implementation
      in pgcrypto. It can acquire strong random numbers from a number of sources,
      depending on what's available:
      - OpenSSL RAND_bytes(), if built with OpenSSL
      - On Windows, the native cryptographic functions are used
      - /dev/urandom
      - /dev/random
      
      Original patch by Magnus Hagander, with further work by Michael Paquier
      and me.
      
      Discussion: <CAB7nPqRy3krN8quR9XujMVVHYtXJ0_60nqgVc6oUk8ygyVkZsA@mail.gmail.com>
      9e083fd4
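
The "use whichever source is available" idea in the commit above can be sketched as a simple fallback chain. This Python sketch is not pg_strong_random(): the names `strong_random`, `read_device`, and `DEFAULT_SOURCES` are hypothetical, and `os.urandom` merely stands in for a library-provided source such as OpenSSL or the Windows native functions.

```python
import os


def read_device(path):
    """Return a source function that reads random bytes from a device file."""
    def reader(length):
        with open(path, "rb") as device:
            return device.read(length)
    return reader


def strong_random(length, sources):
    """Return `length` bytes from the first source that delivers them."""
    for source in sources:
        try:
            data = source(length)
        except (OSError, NotImplementedError):
            continue  # this source is unavailable here, try the next one
        if len(data) == length:
            return data
    raise RuntimeError("no strong random source available")


# Assumed ordering for the sketch: library call first, then device files.
DEFAULT_SOURCES = [os.urandom,
                   read_device("/dev/urandom"),
                   read_device("/dev/random")]

print(len(strong_random(16, DEFAULT_SOURCES)))  # prints 16
```

Raising an error when every source fails, rather than silently falling back to a weak generator, is a design choice of this sketch; a machine with neither device file and no library source would hit that path.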
  6. 15 Oct, 2016 1 commit
    • Andres Freund's avatar
      Use more efficient hashtable for execGrouping.c to speed up hash aggregation. · 5dfc1981
      Andres Freund authored
      The more efficient hashtable speeds up hash-aggregations with more than
      a few hundred groups significantly. Improvements of over 120% have been
      measured.
      
      Due to the different hash table, queries whose result order is not fully
      determined (e.g. GROUP BY without ORDER BY) may change their result
      order.
      
      The conversion is largely straight-forward, except that, due to the
      static element types of simplehash.h type hashes, the additional data
      some users store in elements (e.g. the per-group working data for hash
      aggregaters) is now stored in TupleHashEntryData->additional.  The
      meaning of BuildTupleHashTable's entrysize (renamed to additionalsize)
      has been changed to only be about the additionally stored size.  That
      size is only used for the initial sizing of the hash-table.
      
      Reviewed-By: Tomas Vondra
      Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
      5dfc1981
  7. 14 Oct, 2016 5 commits
    • Andres Freund's avatar
      Use more efficient hashtable for tidbitmap.c to speed up bitmap scans. · 75ae538b
      Andres Freund authored
      Use the new simplehash.h to speed up tidbitmap.c uses. For bitmap scan
      heavy queries speedups of over 100% have been measured. Both lossy and
      exact scans benefit, but the wins are bigger for mostly exact scans.
      
      The conversion is mostly trivial, except that tbm_lossify() now restarts
      lossifying at the point it previously stopped. Otherwise the hash table
      becomes unbalanced because the scan is done in hash-order, leaving the
      end of the hashtable more densely filled than the beginning. That caused
      performance issues with dynahash as well, but due to the open chaining
      they were less pronounced than with the linear addressing from
      simplehash.h.
      
      Reviewed-By: Tomas Vondra
      Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
      75ae538b
    • Andres Freund's avatar
      Add a macro templatized hashtable. · b30d3ea8
      Andres Freund authored
      dynahash.c hash tables aren't quite fast enough for some
      use-cases. There are several reasons for lacking performance:
      - the use of chaining for collision handling makes them cache
        inefficient, that's especially an issue when the tables get bigger.
      - as the element sizes for dynahash are only determined at runtime,
        offset computations are somewhat expensive
      - hash and element comparisons are indirect function calls, causing
        unnecessary pipeline stalls
      - its two-level structure has some benefits (somewhat natural
        partitioning), but increases the number of indirections
      To fix several of these the hash tables have to be adjusted to the
      individual use-case at compile-time. C unfortunately doesn't provide a
      good way to do compile-time code generation (like e.g. C++'s templates for
      all their weaknesses do).  Thus the somewhat ugly approach taken here is
      to allow for code generation using a macro-templatized header file,
      which generates functions and types based on a prefix and other
      parameters.
      
      Later patches use this infrastructure to use such hash tables for
      tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation,
      ...). In queries where these use up a large fraction of the time, this
      has been measured to lead to performance improvements of over 100%.
      
      There are other cases where this could be useful (e.g. catcache.c).
      
      The hash table design chosen is a variant of linear open-addressing. The
      biggest disadvantages of simple linear addressing schemes are highly
      variable lookup times due to clustering, and deletions leaving a lot of
      tombstones around.  To address these issues a variant of "robin hood"
      hashing is employed.  Robin hood hashing optimizes chaining lengths by
      moving elements close to their optimal bucket ("rich" elements), out of
      the way if a to-be-inserted element is further away from its optimal
      position (i.e. it's "poor").  While that can make insertions slower, the
      average lookup performance is a lot better, and higher fill factors can
      be used in a still performant manner.  To avoid tombstones - which
      normally solve the issue that a deleted node's presence is relevant to
      determine whether a lookup needs to continue looking or is done -
      buckets following a deleted element are shifted backwards, unless
      they're empty or already at their optimal position.
      
      There are further possible improvements that can be made to this
      implementation. Amongst others:
      - Use distance as a termination criterion during searches. This is
        generally a good idea, but I've been able to see the overhead of
        distance calculations in some cases.
      - Consider combining the 'empty' status into the hashvalue, and enforce
        storing the hashvalue. That could, in some cases, increase memory
        density and remove a few instructions.
      - Experiment further with the, very conservatively chosen, fillfactor.
      - Make maximum size of hashtable configurable, to allow storing very
        very large tables. That'd require 64bit hash values to be more common
        than now, though.
      - some smaller memcpy calls could be optimized to copy larger chunks
      But since the new implementation is already considerably faster than
      dynahash it seems sensible to start using it.
      
      Reviewed-By: Tomas Vondra
      Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
      b30d3ea8
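
The robin hood insertion and the tombstone-free deletion described in the commit above can be sketched in a few dozen lines. This is a minimal Python illustration of the technique, not simplehash.h: the class `RobinHoodTable` and its methods are hypothetical, the table has a fixed power-of-two size and never grows, one slot is always kept empty so that searches terminate, and lookups stop only at an empty slot (the distance-based termination is listed in the message as a possible later improvement).

```python
class RobinHoodTable:
    """Fixed-size linear-probing table with robin hood insertion."""

    def __init__(self, capacity=8):
        assert capacity > 1 and capacity & (capacity - 1) == 0
        self.mask = capacity - 1
        self.slots = [None] * capacity  # each slot: None or (key, value)
        self.count = 0

    def _ideal(self, key):
        return hash(key) & self.mask

    def _distance(self, key, pos):
        # How far `pos` is from the key's optimal bucket (with wraparound).
        return (pos - self._ideal(key)) & self.mask

    def _find(self, key):
        pos = self._ideal(key)
        while self.slots[pos] is not None:
            if self.slots[pos][0] == key:
                return pos
            pos = (pos + 1) & self.mask
        return None

    def get(self, key, default=None):
        pos = self._find(key)
        return default if pos is None else self.slots[pos][1]

    def insert(self, key, value):
        pos = self._find(key)
        if pos is not None:
            self.slots[pos] = (key, value)  # overwrite existing key
            return
        if self.count + 1 > self.mask:
            raise RuntimeError("table full (this sketch does not grow)")
        entry, pos, dist = (key, value), self._ideal(key), 0
        while self.slots[pos] is not None:
            resident = self.slots[pos]
            resident_dist = self._distance(resident[0], pos)
            if resident_dist < dist:
                # The resident is "richer": it yields its slot and is
                # carried onward in place of the entry being inserted.
                self.slots[pos], entry, dist = entry, resident, resident_dist
            pos = (pos + 1) & self.mask
            dist += 1
        self.slots[pos] = entry
        self.count += 1

    def delete(self, key):
        pos = self._find(key)
        if pos is None:
            return False
        nxt = (pos + 1) & self.mask
        # Shift followers back until an empty slot or an element that is
        # already at its optimal position, so no tombstone is needed.
        while (self.slots[nxt] is not None
               and self._distance(self.slots[nxt][0], nxt) > 0):
            self.slots[pos] = self.slots[nxt]
            pos, nxt = nxt, (nxt + 1) & self.mask
        self.slots[pos] = None
        self.count -= 1
        return True
```

For example, with small integer keys (which hash to themselves in CPython), inserting 0, 1, and then 8 into an 8-slot table makes key 8 displace key 1 from slot 1, because 8 is further from its optimal bucket; deleting key 0 then shifts both followers back one slot.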
    • Andres Freund's avatar
      Add likely/unlikely() branch hint macros. · aa3ca5e3
      Andres Freund authored
      These are useful for very hot code paths. Because it's easy to guess
      wrongly about likelihood, and because such likelihoods change over time,
      they should be used sparingly.
      
      Past tests have shown it'd be a good idea to use them in some places,
      e.g. in error checks around ereports that ERROR out, but that's work for
      later.
      
      Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
      aa3ca5e3
    • Tom Lane's avatar
      Fix assorted integer-overflow hazards in varbit.c. · 32fdf42c
      Tom Lane authored
      bitshiftright() and bitshiftleft() would recursively call each other
      infinitely if the user passed INT_MIN for the shift amount, due to integer
      overflow in negating the shift amount.  To fix, clamp to -VARBITMAXLEN.
      That doesn't change the results since any shift distance larger than the
      input bit string's length produces an all-zeroes result.
      
      Also fix some places that seemed inadequately paranoid about input typmods
      exceeding VARBITMAXLEN.  While a typmod accepted by anybit_typmodin() will
      certainly be much less than that, at least some of these spots are
      reachable with user-chosen integer values.
      
      Andreas Seltenreich and Tom Lane
      
      Discussion: <87d1j2zqtz.fsf@credativ.de>
      32fdf42c
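
The overflow and the clamping fix described in the commit above can be illustrated with an emulation of 32-bit arithmetic. This Python sketch is not the varbit.c code: bit strings are modelled as text, `neg32` emulates wraparound negation (in C the overflow is formally undefined), and `MAX_BITS` is a hypothetical stand-in for VARBITMAXLEN whose real value is not given in the message.

```python
INT_MIN = -2**31
MAX_BITS = 1000  # hypothetical stand-in for VARBITMAXLEN


def neg32(x):
    """Negate x with 32-bit two's-complement wraparound (emulated)."""
    result = (-x) & 0xFFFFFFFF
    return result - 2**32 if result >= 2**31 else result


def shift_left(bits, shift):
    if shift < 0:
        # Clamp before negating, so the negation cannot overflow.
        shift = max(shift, -MAX_BITS)
        return shift_right(bits, neg32(shift))
    if shift >= len(bits):
        return "0" * len(bits)
    return bits[shift:] + "0" * shift


def shift_right(bits, shift):
    if shift < 0:
        shift = max(shift, -MAX_BITS)
        return shift_left(bits, neg32(shift))
    if shift >= len(bits):
        return "0" * len(bits)
    return "0" * shift + bits[:len(bits) - shift]


print(neg32(INT_MIN) == INT_MIN)      # prints True: negation yields INT_MIN again
print(shift_right("1011", INT_MIN))   # prints 0000
```

Without the clamp, each function would hand the still-negative INT_MIN to the other indefinitely. With it, the shift distance becomes a large positive number, and any distance longer than the bit string yields all zeroes, so results are unchanged.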
    • Tatsuo Ishii's avatar
      Fix typo. · 13d3180f
      Tatsuo Ishii authored
      Confirmed by Michael Paquier.
      13d3180f
  8. 13 Oct, 2016 7 commits
    • Tom Lane's avatar
      Fix handling of pgstat counters for TRUNCATE in a prepared transaction. · 81e82a2b
      Tom Lane authored
      pgstat_twophase_postcommit is supposed to duplicate the math in
      AtEOXact_PgStat, but it had missed out the bit about clearing
      t_delta_live_tuples/t_delta_dead_tuples for a TRUNCATE.
      
      It's harder than you might think to replicate the issue here, because
      those counters would only be nonzero when a previous transaction in
      the same backend had added/deleted tuples in the truncated table,
      and those counts hadn't been sent to the stats collector yet.
      
      Evident oversight in commit d42358ef.  I've not added a regression
      test for this; we tried to add one in d42358ef, and had to revert it
      because it was too timing-sensitive for the buildfarm.
      
      Back-patch to 9.5 where d42358ef came in.
      
      Stas Kelvich
      
      Discussion: <EB57BF68-C06D-4737-BDDC-4BA778F4E62B@postgrespro.ru>
      81e82a2b
    • Tatsuo Ishii's avatar
      Fix typo. · b1ee762a
      Tatsuo Ishii authored
      Confirmed by Tom Lane.
      b1ee762a
    • Tom Lane's avatar
      Fix another bug in merging of inherited CHECK constraints. · 3cca13cb
      Tom Lane authored
      It's not good for an inherited child constraint to be marked connoinherit;
      that would result in the constraint not propagating to grandchild tables,
      if any are created later.  The code mostly prevented this from happening
      but there was one case that was missed.
      
      This is somewhat related to commit e55a946a, which also tightened checks
      on constraint merging.  Hence, back-patch to 9.2 like that one.  This isn't
      so much because there's a concrete feature-related reason to stop there,
      as to avoid having more distinct behaviors than we have to in this area.
      
      Amit Langote
      
      Discussion: <b28ee774-7009-313d-dd55-5bdd81242c41@lab.ntt.co.jp>
      3cca13cb
    • Tom Lane's avatar
      Remove dead code in pg_dump. · c08521eb
      Tom Lane authored
      I'm not sure if this provision for "pg_backup" behaving a bit differently
      from "pg_dump" ever did anything useful in a released version.  But it's
      definitely dead code now.
      
      Michael Paquier
      c08521eb
    • Tom Lane's avatar
      Try to find out the actual hugepage size when making a MAP_HUGETLB request. · cb775768
      Tom Lane authored
      Even if Linux's mmap() is okay with a partial-hugepage request, munmap()
      is not, as reported by Chris Richards.  Therefore it behooves us to try
      a bit harder to find out the actual hugepage size, instead of assuming
      that we can skate by with a guess.
      
      For the moment, just look into /proc/meminfo to find out the default
      hugepage size, and use that.  Later, on kernels that support requests
      for nondefault sizes, we might try to consider other alternatives.
      But that smells more like a new feature than a bug fix, especially if
      we want to provide any way for the DBA to control it, so leave it for
      another day.
      
      I set this up to allow easy addition of platform-specific code for
      non-Linux platforms, if needed; but right now there are no reports
      suggesting that we need to work harder on other platforms.
      
      Back-patch to 9.4 where hugepage support was introduced.
      
      Discussion: <31056.1476303954@sss.pgh.pa.us>
      cb775768
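
Reading the default huge page size from /proc/meminfo, as the commit above describes, might look roughly like this. The Python sketch is not the PostgreSQL code: the function names are hypothetical, the sample text is made up, and rounding the request up to a whole number of huge pages is my reading of why the actual size matters for munmap().

```python
import re


def parse_hugepage_size(meminfo_text):
    """Return the default huge page size in bytes, or None if not listed."""
    match = re.search(r"^Hugepagesize:\s+(\d+)\s+kB$", meminfo_text,
                      re.MULTILINE)
    return int(match.group(1)) * 1024 if match else None


def round_up_to_hugepage(request_bytes, hugepage_bytes):
    """Round a request up to a whole number of huge pages."""
    remainder = request_bytes % hugepage_bytes
    if remainder == 0:
        return request_bytes
    return request_bytes + hugepage_bytes - remainder


# Made-up excerpt in the /proc/meminfo format, for illustration only.
SAMPLE = ("MemTotal:       16384000 kB\n"
          "HugePages_Total:       0\n"
          "Hugepagesize:       2048 kB\n")

size = parse_hugepage_size(SAMPLE)
print(size, round_up_to_hugepage(5_000_000, size))  # prints 2097152 6291456
```

On a real Linux system the text would come from reading `/proc/meminfo`; a caller needs some fallback when the function returns None, which this sketch leaves open.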
    • Tom Lane's avatar
      Clean up handling of anonymous mmap'd shared-memory segment. · 15fc5e15
      Tom Lane authored
      Fix detaching of the mmap'd segment to have its own on_shmem_exit callback,
      rather than piggybacking on the one for detaching from the SysV segment.
      That was confusing, and given the distance between the two attach calls,
      it was trouble waiting to happen.
      
      Make the detaching calls idempotent by clearing AnonymousShmem to show
      we've already unmapped.  I spent quite a bit of time yesterday trying
      to find a path that would allow the munmap()'s to be done twice, and
      while I did not succeed, it seems silly that there's even a question.
      
      Make the #ifdef logic less confusing by separating "do we want to use
      anonymous shmem" from EXEC_BACKEND.  Even though there's no current
      scenario where those conditions are different, it is not helpful for
      different places in the same file to be testing EXEC_BACKEND for what
      are fundamentally different reasons.
      
      Don't do on_exit_reset() in StartBackgroundWorker().  At best that's
      useless (InitPostmasterChild would have done it already) and at worst
      it could zap some callback that's unrelated to shared memory.
      
      Improve comments, and simplify the huge_pages enablement logic slightly.
      
      Back-patch to 9.4 where hugepage support was introduced.
      Arguably this should go into 9.3 as well, but the code looks
      significantly different there, and I doubt it's worth the
      trouble of adapting the patch given I can't show a live bug.
      15fc5e15
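
The idempotent detach described in the commit above, where the handle is cleared so that a second call does nothing, can be sketched as follows. This is a Python illustration using an anonymous `mmap`, not the PostgreSQL shared-memory code; the class `AnonymousSegment` and the `unmap_calls` counter are hypothetical and exist only to make the behaviour observable.

```python
import mmap


class AnonymousSegment:
    """Owns one anonymous mapping and detaches from it at most once."""

    def __init__(self, size):
        self.mapping = mmap.mmap(-1, size)  # anonymous, not file-backed
        self.unmap_calls = 0

    def detach(self):
        if self.mapping is None:
            return  # already detached, nothing to do
        self.mapping.close()
        self.unmap_calls += 1
        # Clearing the handle is what makes later calls harmless.
        self.mapping = None


segment = AnonymousSegment(4096)
segment.detach()
segment.detach()
print(segment.unmap_calls)  # prints 1
```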
    • Tom Lane's avatar
      Fix pg_dumpall regression test to be locale-independent. · 0a4bf6b1
      Tom Lane authored
      The expected results in commit b4fc6457 seem to have been generated
      in a non-C locale, which just points up the fact that the ORDER BY
      clause was locale-sensitive.
      
      Per buildfarm.
      0a4bf6b1