1. 26 Feb, 2020 3 commits
    • Add deduplication to nbtree. · 0d861bbb
      Peter Geoghegan authored
      Deduplication reduces the storage overhead of duplicates in indexes that
      use the standard nbtree index access method.  The deduplication process
      is applied lazily, after the point where opportunistic deletion of
      LP_DEAD-marked index tuples occurs.  Deduplication is only applied at
      the point where a leaf page split would otherwise be required.  New
      posting list tuples are formed by merging together existing duplicate
      tuples.  The physical representation of the items on an nbtree leaf page
      is made more space efficient by deduplication, but the logical contents
      of the page are not changed.  Even unique indexes make use of
      deduplication as a way of controlling bloat from duplicates whose TIDs
      point to different versions of the same logical table row.
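
      As an illustration of the merge step described above, here is a
      minimal standalone C sketch (toy key/TID types and an arbitrary
      cap, not the real nbtree tuple formats): one posting tuple is
      formed per run of duplicates, storing the key once and
      accumulating the TIDs.

          /* Toy model of deduplication: merge key-sorted duplicate tuples
           * into posting tuples.  MAXTIDS is a stand-in for the real
           * on-page size limits. */
          #include <stdio.h>

          #define MAXTIDS 8

          typedef struct { int key; long tid; } IndexTuple;
          typedef struct { int key; long tids[MAXTIDS]; int ntids; } PostingTuple;

          static int
          deduplicate(const IndexTuple *in, int nin, PostingTuple *out)
          {
              int nout = 0;

              for (int i = 0; i < nin; i++)
              {
                  if (nout > 0 && out[nout - 1].key == in[i].key &&
                      out[nout - 1].ntids < MAXTIDS)
                  {
                      /* duplicate: absorb its TID into the current posting tuple */
                      out[nout - 1].tids[out[nout - 1].ntids++] = in[i].tid;
                  }
                  else
                  {
                      /* new key (or full posting tuple): start a new output tuple */
                      out[nout].key = in[i].key;
                      out[nout].tids[0] = in[i].tid;
                      out[nout].ntids = 1;
                      nout++;
                  }
              }
              return nout;
          }

          int
          main(void)
          {
              IndexTuple page[] = {{1, 100}, {1, 101}, {1, 102},
                                   {2, 200}, {3, 300}, {3, 301}};
              PostingTuple merged[6];
              int n = deduplicate(page, 6, merged);

              for (int i = 0; i < n; i++)
                  printf("key=%d ntids=%d\n", merged[i].key, merged[i].ntids);
              return 0;       /* 6 input tuples become 3 posting tuples */
          }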
      
      The lazy approach taken by nbtree has significant advantages over a GIN
      style eager approach.  Most individual inserts of index tuples have
      exactly the same overhead as before.  The extra overhead of
      deduplication is amortized across insertions, just like the overhead of
      page splits.  The key space of indexes works in the same way as it has
      since commit dd299df8 (the commit that made heap TID a tiebreaker
      column).
      
      Testing has shown that nbtree deduplication can generally make indexes
      with about 10 or 15 tuples per distinct key value some 2.5X - 4X
      smaller, even with single-column integer indexes (e.g., an index on a
      referencing column that accompanies a foreign key).  The final size of
      single-column nbtree indexes comes close to that of a similar
      contrib/btree_gin index, at least in cases where GIN's posting list
      compression isn't very effective.  This can significantly improve
      transaction throughput, and significantly reduce the cost of vacuuming
      indexes.
      
      A new index storage parameter (deduplicate_items) controls the use of
      deduplication.  The default setting is 'on', so all new B-Tree indexes
      automatically use deduplication where possible.  This decision will be
      reviewed at the end of the Postgres 13 beta period.
      
      There is a regression of approximately 2% of transaction throughput with
      synthetic workloads that consist of append-only inserts into a table
      with several non-unique indexes, where all indexes have few or no
      repeated values.  The underlying issue is that cycles are wasted on
      unsuccessful attempts at deduplicating items in non-unique indexes.
      There doesn't seem to be a way around it short of disabling
      deduplication entirely.  Note that deduplication of items in unique
      indexes is fairly well targeted in general, which avoids the problem
      there (we can use a special heuristic to trigger deduplication passes in
      unique indexes, since we're specifically targeting "version bloat").
      
      Bump XLOG_PAGE_MAGIC because xl_btree_vacuum changed.
      
      No bump in BTREE_VERSION, since the representation of posting list
      tuples works in a way that's backwards compatible with version 4 indexes
      (i.e. indexes built on PostgreSQL 12).  However, users must still
      REINDEX a pg_upgrade'd index to use deduplication, regardless of the
      Postgres version they've upgraded from.  This is the only way to set the
      new nbtree metapage flag indicating that deduplication is generally
      safe.
      
      Author: Anastasia Lubennikova, Peter Geoghegan
      Reviewed-By: Peter Geoghegan, Heikki Linnakangas
      Discussion:
          https://postgr.es/m/55E4051B.7020209@postgrespro.ru
          https://postgr.es/m/4ab6e2db-bcee-f4cf-0916-3a06e6ccbb55@postgrespro.ru
    • Add equalimage B-Tree support functions. · 612a1ab7
      Peter Geoghegan authored
      Invent the concept of a B-Tree equalimage ("equality implies image
      equality") support function, registered as support function 4.  This
      indicates whether it is safe (or not safe) to apply optimizations that
      assume that any two datums considered equal by an operator class's order
      method must be interchangeable without any loss of semantic information.
      This is static information about an operator class and a collation.
      
      Register an equalimage routine for almost all of the existing B-Tree
      opclasses.  We only need two trivial routines for all of the opclasses
      that are included with the core distribution.  There is one routine for
      opclasses that index non-collatable types (which returns 'true'
      unconditionally), plus another routine for collatable types (which
      returns 'true' when the collation is a deterministic collation).
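
      The decision rule of those two routines, as a standalone C sketch
      (the real functions use the backend's fmgr calling convention and
      look up the collation in the catalogs; names here are hypothetical):

          /* Support function 4 answers: does equality imply image
           * (i.e. bitwise) equality? */
          #include <stdbool.h>
          #include <stdio.h>

          /* Non-collatable opclasses: equal datums are always interchangeable. */
          static bool
          equalimage_noncollatable(void)
          {
              return true;
          }

          /* Collatable opclasses: safe only under a deterministic collation,
           * since a nondeterministic collation can treat visibly distinct
           * strings as equal. */
          static bool
          equalimage_collatable(bool collation_is_deterministic)
          {
              return collation_is_deterministic;
          }

          int
          main(void)
          {
              printf("integer opclass: %d\n", equalimage_noncollatable());
              printf("text, deterministic collation: %d\n",
                     equalimage_collatable(true));
              printf("text, nondeterministic collation: %d\n",
                     equalimage_collatable(false));
              return 0;
          }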
      
      This patch is infrastructure for an upcoming patch that adds B-Tree
      deduplication.
      
      Author: Peter Geoghegan, Anastasia Lubennikova
      Discussion: https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
    • Include error code in message from pg_upgrade · 4109bb5d
      Magnus Hagander authored
      In passing, also quote the filename in one message where it wasn't quoted.
      
      Author: Dagfinn Ilmari Mannsåker
      Discussion: https://postgr.es/m/87pne2w98h.fsf@wibble.ilmari.org
  2. 25 Feb, 2020 1 commit
  3. 24 Feb, 2020 9 commits
    • Fix compile failure. · 36390713
      Tom Lane authored
      I forgot that some compilers won't handle #if constructs within
      ereport() calls.  Duplicating most of the call is annoying but simple.
      Per buildfarm.
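
      Concretely, the hazard looks like the following stand-in (hypothetical
      macro and message text; ereport() itself needs the backend):

          /* Per C99 §6.10.3p11, preprocessing directives inside a macro's
           * argument list yield undefined behavior, so some compilers reject:
           *
           *     ereport_like(
           *     #ifdef HAVE_FEATURE
           *                  "with feature"
           *     #else
           *                  "without feature"
           *     #endif
           *         );
           *
           * The portable fix is to duplicate the whole call: */
          #include <stdio.h>

          #define ereport_like(msg) printf("%s\n", (msg))

          void
          report_feature(void)
          {
          #ifdef HAVE_FEATURE
              ereport_like("with feature");
          #else
              ereport_like("without feature");
          #endif
          }

          int
          main(void)
          {
              report_feature();
              return 0;
          }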
    • expression eval: Reduce number of steps for agg transition invocations. · 2742c450
      Andres Freund authored
      Do so by combining the various steps that are part of aggregate
      transition function invocation into one larger step. As some of the
      current steps are only necessary for some aggregates, have one variant
      of the aggregate transition step for each possible combination.
      
      To avoid further manual copies of code in the different transition
      step implementations, move most of the code into helper functions
      marked as "always inline".
      
      The benefit of this change is an increase in performance when
      aggregating lots of rows.  This comes partly from fewer indirect jumps
      (there are fewer steps to dispatch between), and partly from eliminating
      redundant setup code across steps.  This mainly benefits
      interpreted execution, but the code generated by JIT is also improved
      a bit.
      
      As a nice side-effect it also ends up making the code a bit simpler.
      
      A small additional optimization is removing the need to set
      aggstate->curaggcontext before calling ExecAggInitGroup, instead
      passing curaggcontext as an argument.  In contrast to other
      aggregate-related functions, it was only needed there to fetch a
      memory context to copy the transition value into.
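
      The shape of the change can be sketched with a standalone toy (not
      the executor's real step representation): each opcode variant is a
      thin switch case over one helper that is forced inline, so the
      compiler specializes away the unused branches.  The backend spells
      the attribute pg_attribute_always_inline; the GCC/Clang form is
      used here.

          #include <stdio.h>
          #include <stdbool.h>

          #define pg_always_inline inline __attribute__((always_inline))

          typedef struct { long trans; bool transnull; } AggState;

          /* One shared helper; checknull is a compile-time constant at
           * each call site, so each variant specializes. */
          static pg_always_inline void
          agg_trans(AggState *st, long newval, bool checknull)
          {
              if (checknull && st->transnull)
              {
                  st->trans = newval;     /* first non-null input initializes */
                  st->transnull = false;
                  return;
              }
              st->trans += newval;        /* stand-in transition function */
          }

          enum { OP_TRANS_STRICT, OP_TRANS, OP_DONE };

          static long
          run(const int *ops, const long *args, AggState *st)
          {
              for (int i = 0;; i++)
              {
                  switch (ops[i])
                  {
                      case OP_TRANS_STRICT: agg_trans(st, args[i], true);  break;
                      case OP_TRANS:        agg_trans(st, args[i], false); break;
                      case OP_DONE:         return st->trans;
                  }
              }
          }

          int
          main(void)
          {
              int  ops[]  = {OP_TRANS_STRICT, OP_TRANS, OP_TRANS, OP_DONE};
              long args[] = {10, 20, 30, 0};
              AggState st = {0, true};

              printf("result=%ld\n", run(ops, args, &st));   /* result=60 */
              return 0;
          }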
      
      Author: Andres Freund
      Discussion:
         https://postgr.es/m/20191023163849.sosqbfs5yenocez3@alap3.anarazel.de
         https://postgr.es/m/5c371df7cee903e8cd4c685f90c6c72086d3a2dc.camel@j-davis.com
    • Properly issue WAL record for CID of first catalog tuple in multi-insert · 7d672b76
      Michael Paquier authored
      Multi-insert for heap is not yet used actively for catalogs, but the
      code to support this case is in place for logical decoding.  The
      existing code forgot to issue an XLOG_HEAP2_NEW_CID record for the first
      tuple inserted, leading to failures when attempting to use multiple
      inserts for catalogs at decoding time.  This commit fixes the problem by
      WAL-logging the needed CID.
      
      This is not an active bug, so no back-patch is done.
      
      Author: Daniel Gustafsson
      Discussion: https://postgr.es/m/E0D4CC67-A1CF-4DF4-991D-B3AC2EB5FAE9@yesql.se
    • Account explicitly for long-lived FDs that are allocated outside fd.c. · 3d475515
      Tom Lane authored
      The comments in fd.c have long claimed that all file allocations should
      go through that module, but in reality that's not always practical.
      fd.c doesn't supply APIs for invoking some FD-producing syscalls like
      pipe() or epoll_create(); and the APIs it does supply for non-virtual
      FDs are mostly insistent on releasing those FDs at transaction end;
      and in some cases the actual open() call is in code that can't be made
      to use fd.c, such as libpq.
      
      This has led to a situation where, in a modern server, there are likely
      to be seven or so long-lived FDs per backend process that are not known
      to fd.c.  Since NUM_RESERVED_FDS is only 10, that meant we had *very*
      few spare FDs if max_files_per_process was >= the system ulimit and
      fd.c had opened all the files it thought it safely could.  The
      contrib/postgres_fdw regression test, in particular, could easily be
      made to fall over by running it under a restrictive ulimit.
      
      To improve matters, invent functions Acquire/Reserve/ReleaseExternalFD
      that allow outside callers to tell fd.c that they have or want to allocate
      an FD that's not directly managed by fd.c.  Add calls to track all the
      fixed FDs in a standard backend session, so that we are honestly
      guaranteeing that NUM_RESERVED_FDS FDs remain unused below the EMFILE
      limit in a backend's idle state.  The coding rules for these functions say
      that there's no need to call them in code that just allocates one FD over
      a fairly short interval; we can dip into NUM_RESERVED_FDS for such cases.
      That means that there aren't all that many places where we need to worry.
      But postgres_fdw and dblink must use this facility to account for
      long-lived FDs consumed by libpq connections.  There may be other places
      where it's worth doing such accounting, too, but this seems like enough
      to solve the immediate problem.
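
      A minimal sketch of the accounting idea, with toy counters standing
      in for fd.c's internals (the real functions also handle failure
      reporting and interact with the allocated-file pool):

          #include <stdbool.h>
          #include <stdio.h>

          static int max_safe_fds = 120;   /* stand-in for fd.c's computed budget */
          static int numExternalFDs = 0;

          /* Try to reserve room for one externally-opened FD; refuse past
           * the cap, so callers can report "out of file descriptors". */
          static bool
          acquire_external_fd(void)
          {
              if (numExternalFDs < max_safe_fds / 3)
              {
                  numExternalFDs++;
                  return true;
              }
              return false;
          }

          /* Forcibly take a slot anyway, for fixed session-lifetime FDs. */
          static void
          reserve_external_fd(void)
          {
              numExternalFDs++;
          }

          static void
          release_external_fd(void)
          {
              numExternalFDs--;
          }

          int
          main(void)
          {
              reserve_external_fd();              /* e.g., a fixed pipe FD */
              if (acquire_external_fd())
              {
                  /* ... open a libpq connection here ... */
                  release_external_fd();
              }
              printf("external FDs tracked: %d\n", numExternalFDs);
              return 0;
          }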
      
      Internally to fd.c, "external" FDs are limited to max_safe_fds/3 FDs.
      (Callers can choose to ignore this limit, but of course it's unwise
      to do so except for fixed file allocations.)  I also reduced the limit
      on "allocated" files to max_safe_fds/3 FDs (it had been max_safe_fds/2).
      Conceivably a smarter rule could be used here --- but in practice,
      on reasonable systems, max_safe_fds should be large enough that this
      isn't much of an issue, so KISS for now.  To avoid possible regression
      in the number of external or allocated files that can be opened,
      increase FD_MINFREE and the lower limit on max_files_per_process a
      little bit; we now insist that the effective "ulimit -n" be at least 64.
      
      This seems like pretty clearly a bug fix, but in view of the lack of
      field complaints, I'll refrain from risking a back-patch.
      
      Discussion: https://postgr.es/m/E1izCmM-0005pV-Co@gemulon.postgresql.org
    • Change client-side fsync_fname() to report errors fatally · 1420617b
      Peter Eisentraut authored
      Given all we have learned about fsync() error handling in the last few
      years, reporting an fsync() error non-fatally is not useful,
      unless you don't care much about the file, in which case you probably
      don't need to use fsync() in the first place.
      
      Change fsync_fname() and durable_rename() to exit(1) on fsync() errors
      other than those that we specifically chose to ignore.
      
      This affects initdb, pg_basebackup, pg_checksums, pg_dump, pg_dumpall,
      and pg_rewind.
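
      The intended behavior, sketched as a standalone frontend-style
      routine (simplified; the real code in src/common handles more
      platform quirks and error cases, and the ignorable-errno choices
      below are illustrative):

          #include <errno.h>
          #include <fcntl.h>
          #include <stdbool.h>
          #include <stdio.h>
          #include <stdlib.h>
          #include <string.h>
          #include <unistd.h>

          /* Sync one file or directory; any unexpected failure is fatal. */
          static void
          fsync_fname_or_die(const char *fname, bool isdir)
          {
              int fd = open(fname, isdir ? O_RDONLY : O_RDWR, 0);

              if (fd < 0)
              {
                  if (isdir && errno == EACCES)
                      return;     /* opening a directory can be disallowed */
                  fprintf(stderr, "could not open \"%s\": %s\n",
                          fname, strerror(errno));
                  exit(1);
              }
              if (fsync(fd) != 0)
              {
                  /* some kernels refuse fsync on directories; tolerate that */
                  if (isdir && (errno == EBADF || errno == EINVAL))
                  {
                      close(fd);
                      return;
                  }
                  fprintf(stderr, "could not fsync \"%s\": %s\n",
                          fname, strerror(errno));
                  exit(1);        /* the behavior change: fail hard, no warning */
              }
              close(fd);
          }

          int
          main(void)
          {
              fsync_fname_or_die(".", true);   /* sync the current directory */
              return 0;
          }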
      Reviewed-by: Michael Paquier <michael@paquier.xyz>
      Discussion: https://www.postgresql.org/message-id/flat/d239d1bd-aef0-ca7c-dc0a-da14bdcf0392%402ndquadrant.com
    • Adapt hashfn.c and hashutils.h for frontend use. · a91e2fa9
      Robert Haas authored
      hash_any() and its variants are defined to return Datum,
      which is a backend-only concept, but the underlying functions
      actually want to return uint32 and uint64, and only return Datum
      because it's convenient for callers who are using them to
      implement a hash function for some SQL datatype.
      
      However, changing these functions to return uint32 and uint64
      seems like it might lead to programming errors or back-patching
      difficulties, both because they are widely used and because
      failure to use UInt{32,64}GetDatum() might not provoke a
      compilation error. Instead, rename the existing functions as
      well as changing the return type, and add static inline wrappers
      for those callers that need the previous behavior.
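
      The pattern, as a standalone sketch with stand-in names and a
      stand-in hash function (the real worker is the Jenkins-based hash
      in hashfn.c, and Datum is the backend's abstract datum type):

          #include <stdint.h>
          #include <stdio.h>

          typedef uintptr_t Datum;                /* stand-in for the backend type */
          #define UInt32GetDatum(x) ((Datum) (x))

          /* Renamed worker: frontend-usable, returns plain uint32. */
          static uint32_t
          hash_bytes_sketch(const unsigned char *k, int keylen)
          {
              uint32_t h = 2166136261u;           /* FNV-1a as a stand-in */

              for (int i = 0; i < keylen; i++)
                  h = (h ^ k[i]) * 16777619u;
              return h;
          }

          /* Compatibility wrapper keeping the old Datum-returning shape;
           * being a distinct function, a caller can't silently forget the
           * Datum conversion the way it could with a bare return-type change. */
          static inline Datum
          hash_any_sketch(const unsigned char *k, int keylen)
          {
              return UInt32GetDatum(hash_bytes_sketch(k, keylen));
          }

          int
          main(void)
          {
              const unsigned char key[] = "abc";

              printf("%u\n", (unsigned) hash_bytes_sketch(key, 3));
              printf("%lu\n", (unsigned long) hash_any_sketch(key, 3));
              return 0;
          }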
      
      Although this commit adapts hashutils.h and hashfn.c so that they
      can be compiled as frontend code, it does not actually do
      anything that would cause them to be so compiled. That is left
      for another commit.
      
      Patch by me, reviewed by Suraj Kharage and Mark Dilger.
      
      Discussion: http://postgr.es/m/CA+TgmoaRiG4TXND8QuM6JXFRkM_1wL2ZNhzaUKsuec9-4yrkgw@mail.gmail.com
    • Put all the prototypes for hashfn.c into the same header file. · 9341c783
      Robert Haas authored
      Previously, some of the prototypes for functions in hashfn.c were
      in utils/hashutils.h and others were in utils/hsearch.h, but that
      is confusing and has no particular benefit.
      
      Patch by me, reviewed by Suraj Kharage and Mark Dilger.
      
      Discussion: http://postgr.es/m/CA+TgmoaRiG4TXND8QuM6JXFRkM_1wL2ZNhzaUKsuec9-4yrkgw@mail.gmail.com
    • Move bitmap_hash and bitmap_match to bitmapset.c. · 07b95c3d
      Robert Haas authored
      The closely-related function bms_hash_value is already defined in that
      file, and this change means that hashfn.c no longer needs to depend on
      nodes/bitmapset.h. That gets us closer to allowing use of the hash
      functions in hashfn.c in frontend code.
      
      Patch by me, reviewed by Suraj Kharage and Mark Dilger.
      
      Discussion: http://postgr.es/m/CA+TgmoaRiG4TXND8QuM6JXFRkM_1wL2ZNhzaUKsuec9-4yrkgw@mail.gmail.com
    • Add prefix checks in exclude lists for pg_rewind, pg_checksums and base backups · bf883b21
      Michael Paquier authored
      An instance of PostgreSQL crashing with bad timing could leave behind
      temporary pg_internal.init files, potentially causing failures when
      verifying checksums.  As the same exclusion lists are shared between
      pg_rewind, pg_checksums and basebackup.c, all those tools are extended
      with prefix checks to keep everything in sync, with dedicated checks
      added for pg_internal.init.
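
      The prefix-matching idea can be sketched as follows (the entry list
      and field names are illustrative, not the actual tables in
      basebackup.c): each entry says whether it matches exactly or as a
      prefix, so transient files such as pg_internal.init.<pid> are
      caught too.

          #include <stdbool.h>
          #include <stdio.h>
          #include <string.h>

          typedef struct { const char *name; bool match_prefix; } ExcludeEntry;

          static const ExcludeEntry excludeFiles[] = {
              {"pg_internal.init", true},     /* also pg_internal.init.123 */
              {"postmaster.pid",   false},    /* exact match only */
          };

          static bool
          is_excluded(const char *filename)
          {
              for (size_t i = 0;
                   i < sizeof(excludeFiles) / sizeof(excludeFiles[0]); i++)
              {
                  const ExcludeEntry *e = &excludeFiles[i];

                  if (e->match_prefix
                      ? strncmp(filename, e->name, strlen(e->name)) == 0
                      : strcmp(filename, e->name) == 0)
                      return true;
              }
              return false;
          }

          int
          main(void)
          {
              printf("%d %d %d\n",
                     is_excluded("pg_internal.init"),        /* 1 */
                     is_excluded("pg_internal.init.4321"),   /* 1 */
                     is_excluded("base/1/1234"));            /* 0 */
              return 0;
          }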
      
      Backpatch down to 11, where pg_checksums (pg_verify_checksums in 11) and
      checksum verification for base backups have been introduced.
      
      Reported-by: Michael Banck
      Author: Michael Paquier
      Reviewed-by: Kyotaro Horiguchi, David Steele
      Discussion: https://postgr.es/m/62031974fd8e941dd8351fbc8c7eff60d59c5338.camel@credativ.de
      Backpatch-through: 11
  4. 22 Feb, 2020 3 commits
  5. 21 Feb, 2020 16 commits
  6. 20 Feb, 2020 3 commits
  7. 19 Feb, 2020 5 commits
    • Doc: discourage use of partial indexes for poor-man's-partitioning. · 6a8e5605
      Tom Lane authored
      Creating a bunch of non-overlapping partial indexes is generally
      a bad idea, so add an example saying not to do that.
      
      Back-patch to v10.  Before that, the alternative of using (real)
      partitioning wasn't available, so that the tradeoff isn't quite
      so clear cut.
      
      Discussion: https://postgr.es/m/CAKVFrvFY-f7kgwMRMiPLbPYMmgjc8Y2jjUGK_Y0HVcYAmU6ymg@mail.gmail.com
    • Remove support for upgrading extensions from "unpackaged" state. · 70a77320
      Tom Lane authored
      Andres Freund pointed out that allowing non-superusers to run
      "CREATE EXTENSION ... FROM unpackaged" has security risks, since
      the unpackaged-to-1.0 scripts don't try to verify that the existing
      objects they're modifying are what they expect.  Just attaching such
      objects to an extension doesn't seem too dangerous, but some of them
      do more than that.
      
      We could have resolved this, perhaps, by still requiring superuser
      privilege to use the FROM option.  However, it's fair to ask just what
      we're accomplishing by continuing to lug the unpackaged-to-1.0 scripts
      forward.  None of them have received any real testing since 9.1 days,
      so they may not even work anymore (even assuming that one could still
      load the previous "loose" object definitions into a v13 database).
      And an installation that's trying to go from pre-9.1 to v13 or later
      in one jump is going to have worse compatibility problems than whether
      there's a trivial way to convert their contrib modules into extension
      style.
      
      Hence, let's just drop both those scripts and the core-code support
      for "CREATE EXTENSION ... FROM".
      
      Discussion: https://postgr.es/m/20200213233015.r6rnubcvl4egdh5r@alap3.anarazel.de
    • Fix typo · 2f9c46a3
      Peter Eisentraut authored
      Reported-by: Daniel Verite <daniel@manitou-mail.org>
    • Fix confusion about event trigger vs. plain function in plpgsql. · 761a5688
      Tom Lane authored
      The function hash table keys made by compute_function_hashkey() failed
      to distinguish event-trigger call context from regular call context.
      This meant that once we'd successfully made a hash entry for an event
      trigger (either by validation, or by normal use as an event trigger),
      an attempt to call the trigger function as a plain function would
      find this hash entry and thereby bypass the you-can't-do-that check in
      do_compile().  Thus we'd attempt to execute the function, leading to
      strange errors or even crashes, depending on function contents and
      server version.
      
      To fix, add an isEventTrigger field to PLpgSQL_func_hashkey,
      paralleling the longstanding infrastructure for regular triggers.
      This fits into what had been pad space, so there's no risk of an ABI
      break, even assuming that any third-party code is looking at these
      hash keys.  (I considered replacing isTrigger with a PLpgSQL_trigtype
      enum field, but felt that that carried some API/ABI risk.  Maybe we
      should change it in HEAD though.)
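
      A standalone sketch of the struct shape (fields beyond the two
      flags are omitted; the real key also carries the trigger relation
      OID and argument type info):

          #include <stdbool.h>
          #include <stdio.h>
          #include <string.h>

          typedef unsigned int Oid;

          typedef struct PLpgSQL_func_hashkey_sketch
          {
              Oid   funcOid;
              bool  isTrigger;        /* longstanding: called as a DML trigger? */
              bool  isEventTrigger;   /* new: called as an event trigger?
                                       * (occupies what was pad space) */
              /* ... trigger relation OID, argument types, etc. omitted ... */
          } PLpgSQL_func_hashkey_sketch;

          int
          main(void)
          {
              PLpgSQL_func_hashkey_sketch k1, k2;

              memset(&k1, 0, sizeof(k1));     /* hash keys are zeroed before use */
              memset(&k2, 0, sizeof(k2));
              k1.funcOid = k2.funcOid = 42;
              k2.isEventTrigger = true;       /* same function, different context */

              /* with the flag set, the two call contexts no longer collide */
              printf("distinct keys: %d\n", memcmp(&k1, &k2, sizeof(k1)) != 0);
              return 0;
          }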
      
      Per bug #16266 from Alexander Lakhin.  This has been broken since
      event triggers were invented, so back-patch to all supported branches.
      
      Discussion: https://postgr.es/m/16266-fcd7f838e97ba5d4@postgresql.org
    • Set gen_random_uuid() to volatile · 2ed19a48
      Peter Eisentraut authored
      It was set to immutable, which was a mistake in the initial
      commit (5925e554).
      Reported-by: hubert depesz lubaczewski <depesz@depesz.com>
      Discussion: https://www.postgresql.org/message-id/flat/20200218185452.GA8710%40depesz.com