1. 02 Apr, 2021 9 commits
    • Fujii Masao's avatar
      postgres_fdw: Add option to control whether to keep connections open. · b1be3074
      Fujii Masao authored
      This commit adds a new option keep_connections that controls
      whether postgres_fdw keeps the connections to the foreign server open
      so that the subsequent queries can re-use them. This option can only be
      specified for a foreign server. The default is on. If set to off,
      all connections to the foreign server will be discarded
      at the end of transaction. Closed connections will be re-established
      when they are necessary by future queries using a foreign table.
      
      This option is useful, for example, when users want to prevent
      the connections from eating up the foreign servers connections
      capacity.
      
      Author: Bharath Rupireddy
      Reviewed-by: Alexey Kondratov, Vignesh C, Fujii Masao
      Discussion: https://postgr.es/m/CALj2ACVvrp5=AVp2PupEm+nAC8S4buqR3fJMmaCoc7ftT0aD2A@mail.gmail.com
      b1be3074
    • Peter Eisentraut's avatar
    • Fujii Masao's avatar
      Fix pgstat_report_replslot() to use proper data types for its arguments. · 96bdb7e1
      Fujii Masao authored
      The caller of pgstat_report_replslot() passes int64 values to the function.
      Also the function stores those values in PgStat_Counter (i.e., int64) fields
      of PgStat_MsgReplSlot struct. But previously the function used "int" as
      the data types of some arguments for those values, which could lead to
      the overflow of values.
      
      To avoid this risk, this commit fixes pgstat_report_replslot() to use
      PgStat_Counter type for the arguments. Since they are the statistics counters,
      PgStat_Counter, the data type used for counters, is used for them
      instead of int64.
      
      Reported-by: Vignesh C
      Author: Vignesh C
      Reviewed-by: Jeevan Ladhe, Fujii Masao
      Discussion: https://postgr.es/m/CALDaNm080OpG=ZwOb0i8EyChH5SyHAMFWJCKaKTXmrfvJLbgaA@mail.gmail.com
      96bdb7e1
    • Michael Paquier's avatar
      doc: Clarify how to generate backup files with non-exclusive backups · 6fb66c26
      Michael Paquier authored
      The current instructions describing how to write the backup_label and
      tablespace_map files are confusing.  For example, opening a file in text
      mode on Windows and copy-pasting the file's contents would result in a
      failure at recovery because of the extra CRLF characters generated.  The
      documentation was not stating that clearly, and per discussion this is
      not considered as a supported scenario.
      
      This commit extends a bit the documentation to mention that it may be
      required to open the file in binary mode before writing its data.
      
      Reported-by: Wang Shenhao
      Author: David Steele
      Reviewed-by: Andrew Dunstan, Magnus Hagander
      Discussion: https://postgr.es/m/8373f61426074f2cb6be92e02f838389@G08CNEXMBPEKD06.g08.fujitsu.local
      Backpatch-through: 9.6
      6fb66c26
    • Fujii Masao's avatar
      98e5bd10
    • David Rowley's avatar
      Attempt to fix unstable Result Cache regression tests · a4fac4ff
      David Rowley authored
      force_parallel_mode = regress is causing a few more problems than I
      thought.  It seems that both the leader and the single worker can
      contribute to the execution. I had mistakenly thought that only the worker
      process would do any work.  Since it's not deterministic as to which
      of the two processes will get a chance to work on the plan, it seems just
      better to disable force_parallel_mode for these tests.  At least doing
      this seems better than changing to EXPLAIN only rather than EXPLAIN
      ANALYZE.
      
      Additionally, I overlooked the fact that the number of executions of the
      sub-plan below a Result Cache will execute a varying number of times
      depending on cache eviction.  32-bit machines will use less memory and
      evict fewer tuples from the cache.  That results in the subnode being
      executed fewer times on 32-bit machines.  Let's just blank out the number
      of loops in each node.
      a4fac4ff
    • Bruce Momjian's avatar
      doc: mention that intervening major releases can be skipped · 2bda93f8
      Bruce Momjian authored
      Also mention that you should read the intervening major releases notes.
      This change was also applied to the website.
      
      Discussion: https://postgr.es/m/20210330144949.GA8259@momjian.us
      
      Backpatch-through: 9.6
      2bda93f8
    • David Rowley's avatar
      Add Result Cache executor node (take 2) · 9eacee2e
      David Rowley authored
      Here we add a new executor node type named "Result Cache".  The planner
      can include this node type in the plan to have the executor cache the
      results from the inner side of parameterized nested loop joins.  This
      allows caching of tuples for sets of parameters so that in the event that
      the node sees the same parameter values again, it can just return the
      cached tuples instead of rescanning the inner side of the join all over
      again.  Internally, result cache uses a hash table in order to quickly
      find tuples that have been previously cached.
      
      For certain data sets, this can significantly improve the performance of
      joins.  The best cases for using this new node type are for join problems
      where a large portion of the tuples from the inner side of the join have
      no join partner on the outer side of the join.  In such cases, hash join
      would have to hash values that are never looked up, thus bloating the hash
      table and possibly causing it to multi-batch.  Merge joins would have to
      skip over all of the unmatched rows.  If we use a nested loop join with a
      result cache, then we only cache tuples that have at least one join
      partner on the outer side of the join.  The benefits of using a
      parameterized nested loop with a result cache increase when there are
      fewer distinct values being looked up and the number of lookups of each
      value is large.  Also, hash probes to lookup the cache can be much faster
      than the hash probe in a hash join as it's common that the result cache's
      hash table is much smaller than the hash join's due to result cache only
      caching useful tuples rather than all tuples from the inner side of the
      join.  This variation in hash probe performance is more significant when
      the hash join's hash table no longer fits into the CPU's L3 cache, but the
      result cache's hash table does.  The apparent "random" access of hash
      buckets with each hash probe can cause a poor L3 cache hit ratio for large
      hash tables.  Smaller hash tables generally perform better.
      
      The hash table used for the cache limits itself to not exceeding work_mem
      * hash_mem_multiplier in size.  We maintain a dlist of keys for this cache
      and when we're adding new tuples and realize we've exceeded the memory
      budget, we evict cache entries starting with the least recently used ones
      until we have enough memory to add the new tuples to the cache.
      
      For parameterized nested loop joins, we now consider using one of these
      result cache nodes in between the nested loop node and its inner node.  We
      determine when this might be useful based on cost, which is primarily
      driven off of what the expected cache hit ratio will be.  Estimating the
      cache hit ratio relies on having good distinct estimates on the nested
      loop's parameters.
      
      For now, the planner will only consider using a result cache for
      parameterized nested loop joins.  This works for both normal joins and
      also for LATERAL type joins to subqueries.  It is possible to use this new
      node for other uses in the future.  For example, to cache results from
      correlated subqueries.  However, that's not done here due to some
      difficulties obtaining a distinct estimation on the outer plan to
      calculate the estimated cache hit ratio.  Currently we plan the inner plan
      before planning the outer plan so there is no good way to know if a result
      cache would be useful or not since we can't estimate the number of times
      the subplan will be called until the outer plan is generated.
      
      The functionality being added here is newly introducing a dependency on
      the return value of estimate_num_groups() during the join search.
      Previously, during the join search, we only ever needed to perform
      selectivity estimations.  With this commit, we need to use
      estimate_num_groups() in order to estimate what the hit ratio on the
      result cache will be.   In simple terms, if we expect 10 distinct values
      and we expect 1000 outer rows, then we'll estimate the hit ratio to be
      99%.  Since cache hits are very cheap compared to scanning the underlying
      nodes on the inner side of the nested loop join, then this will
      significantly reduce the planner's cost for the join.   However, it's
      fairly easy to see here that things will go bad when estimate_num_groups()
      incorrectly returns a value that's significantly lower than the actual
      number of distinct values.  If this happens then that may cause us to make
      use of a nested loop join with a result cache instead of some other join
      type, such as a merge or hash join.  Our distinct estimations have been
      known to be a source of trouble in the past, so the extra reliance on them
      here could cause the planner to choose slower plans than it did previous
      to having this feature.  Distinct estimations are also fairly hard to
      estimate accurately when several tables have been joined already or when a
      WHERE clause filters out a set of values that are correlated to the
      expressions we're estimating the number of distinct value for.
      
      For now, the costing we perform during query planning for result caches
      does put quite a bit of faith in the distinct estimations being accurate.
      When these are accurate then we should generally see faster execution
      times for plans containing a result cache.  However, in the real world, we
      may find that we need to either change the costings to put less trust in
      the distinct estimations being accurate or perhaps even disable this
      feature by default.  There's always an element of risk when we teach the
      query planner to do new tricks that it decides to use that new trick at
      the wrong time and causes a regression.  Users may opt to get the old
      behavior by turning the feature off using the enable_resultcache GUC.
      Currently, this is enabled by default.  It remains to be seen if we'll
      maintain that setting for the release.
      
      Additionally, the name "Result Cache" is the best name I could think of
      for this new node at the time I started writing the patch.  Nobody seems
      to strongly dislike the name. A few people did suggest other names but no
      other name seemed to dominate in the brief discussion that there was about
      names. Let's allow the beta period to see if the current name pleases
      enough people.  If there's some consensus on a better name, then we can
      change it before the release.  Please see the 2nd discussion link below
      for the discussion on the "Result Cache" name.
      
      Author: David Rowley
      Reviewed-by: Andy Fan, Justin Pryzby, Zhihong Yu, Hou Zhijie
      Tested-By: Konstantin Knizhnik
      Discussion: https://postgr.es/m/CAApHDvrPcQyQdWERGYWx8J%2B2DLUNgXu%2BfOSbQ1UscxrunyXyrQ%40mail.gmail.com
      Discussion: https://postgr.es/m/CAApHDvq=yQXr5kqhRviT2RhNKwToaWr9JAN5t+5_PzhuRJ3wvg@mail.gmail.com
      9eacee2e
    • Michael Paquier's avatar
      Improve stability of test with vacuum_truncate in reloptions.sql · fe246d1c
      Michael Paquier authored
      This test has been using a simple VACUUM with pg_relation_size() to
      check if a relation gets physically truncated or not, but forgot the
      fact that some concurrent activity, like checkpoint buffer writes, could
      cause some pages to be skipped.  The second test enabling
      vacuum_truncate could fail, seeing a non-empty relation.  The first test
      would not have failed, but could finish by testing a behavior different
      than the one aimed for.  Both tests gain a FREEZE option, to make the
      vacuums more aggressive and prevent page skips.
      
      This is similar to the issues fixed in c2dc1a79.
      
      Author: Arseny Sher
      Reviewed-by: Masahiko Sawada
      Discussion: https://postgr.es/m/87tuotr2hh.fsf@ars-thinkpad
      backpatch-through: 12
      fe246d1c
  2. 01 Apr, 2021 14 commits
    • Tom Lane's avatar
      Rethink handling of pass-by-value leaf datums in SP-GiST. · 1ebdec8c
      Tom Lane authored
      The existing convention in SP-GiST is that any pass-by-value datatype
      is stored in Datum representation, i.e. it's of width sizeof(Datum)
      even when typlen is less than that.  This is okay, or at least it's
      too late to change it, for prefix datums and node-label datums in inner
      (upper) tuples.  But it's problematic for leaf datums, because we'd
      prefer those to be stored in Postgres' standard on-disk representation
      so that we can easily extend leaf tuples to carry additional "included"
      columns.
      
      I believe, however, that we can get away with just up and changing that.
      This would be an unacceptable on-disk-format break, but there are two
      big mitigating factors:
      
      1. It seems quite unlikely that there are any SP-GiST opclasses out
      there that use pass-by-value leaf datatypes.  Certainly none of the
      ones in core do, nor has codesearch.debian.net heard of any.  Given
      what SP-GiST is good for, it's hard to conceive of a use-case where
      the leaf-level values would be both small and fixed-width.  (As an
      example, if you wanted to index text values with the leaf level being
      just a byte, then every text string would have to be represented
      with one level of inner tuple per preceding byte, which would be
      horrendously space-inefficient and slow to access.  You always want
      to use as few inner-tuple levels as possible, leaving as much as
      possible in the leaf values.)
      
      2. Even granting that you have such an index, this change only
      breaks things on big-endian machines.  On little-endian, the high
      order bytes of the Datum format will now just appear to be alignment
      padding space.
      
      So, change the code to store pass-by-value leaf datums in their
      usual on-disk form.  Inner-tuple datums are not touched.
      
      This is extracted from a larger patch that intends to add support for
      "included" columns.  I'm committing it separately for visibility in
      our commit logs.
      
      Pavel Borisov and Tom Lane, reviewed by Andrey Borodin
      
      Discussion: https://postgr.es/m/CALT9ZEFi-vMp4faht9f9Junb1nO3NOSjhpxTmbm1UGLMsLqiEQ@mail.gmail.com
      1ebdec8c
    • Stephen Frost's avatar
      Rename Default Roles to Predefined Roles · c9c41c7a
      Stephen Frost authored
      The term 'default roles' wasn't quite apt as these roles aren't able to
      be modified or removed after installation, so rename them to be
      'Predefined Roles' instead, adding an entry into the newly added
      Obsolete Appendix to help users of current releases find the new
      documentation.
      
      Bruce Momjian and Stephen Frost
      
      Discussion: https://postgr.es/m/157742545062.1149.11052653770497832538%40wrigleys.postgresql.org
      and https://www.postgresql.org/message-id/20201120211304.GG16415@tamriel.snowman.net
      c9c41c7a
    • Alvaro Herrera's avatar
      Fix setvbuf()-induced crash in libpq_pipeline · a68a894f
      Alvaro Herrera authored
      Windows doesn't like setvbuf(..., _IOLBF) and crashes if you use it,
      which has been causing the libpq_pipeline failures all along ... and our
      own port.h has known about it for a long time: it offers PG_IOLBF that's
      defined to _IONBF on that platform.  Follow its advice.
      
      While at it, get rid of a bogus bitshift that used a constant of the
      wrong size.  Decorate the constant as LL to fix.  While at it, remove a
      pointless addition that only confused matters.
      
      All as diagnosed by Tom Lane.
      
      Discussion: https://postgr.es/m/3458958.1617302154@sss.pgh.pa.us
      a68a894f
    • Robert Haas's avatar
      amcheck: Fix verify_heapam's tuple visibility checking rules. · 3b6c1259
      Robert Haas authored
      We now follow the order of checks from HeapTupleSatisfies* more
      closely to avoid coming to erroneous conclusions.
      
      Mark Dilger and Robert Haas
      
      Discussion: http://postgr.es/m/CA+Tgmob6sii0yTvULYJ0Vq4w6ZBmj7zUhddL3b+SKDi9z9NA7Q@mail.gmail.com
      3b6c1259
    • Tom Lane's avatar
      Fix pg_restore's misdesigned code for detecting archive file format. · ec03f2df
      Tom Lane authored
      Despite the clear comments pointing out that the duplicative code
      segments in ReadHead() and _discoverArchiveFormat() needed to be
      in sync, they were not: the latter did not bother to apply any of
      the sanity checks in the former.  We'd missed noticing this partly
      because none of those checks would fail in scenarios we customarily
      test, and partly because the oversight would be masked if both
      segments execute, which they would in cases other than needing to
      autodetect the format of a non-seekable stdin source.  However,
      in a case meeting all these requirements --- for example, trying
      to read a newer-than-supported archive format from non-seekable
      stdin --- pg_restore missed applying the version check and would
      likely dump core or otherwise misbehave.
      
      The whole thing is silly anyway, because there seems little reason
      to duplicate the logic beyond the one-line verification that the
      file starts with "PGDMP".  There seems to have been an undocumented
      assumption that multiple major formats (major enough to require
      separate reader modules) would nonetheless share the first half-dozen
      fields of the custom-format header.  This seems unlikely, so let's
      fix it by just nuking the duplicate logic in _discoverArchiveFormat().
      
      Also get rid of the pointless attempt to seek back to the start of
      the file after successful autodetection.  That wastes cycles and
      it means we have four behaviors to verify not two.
      
      Per bug #16951 from Sergey Koposov.  This has been broken for
      decades, so back-patch to all supported versions.
      
      Discussion: https://postgr.es/m/16951-a4dd68cf0de23048@postgresql.org
      ec03f2df
    • Peter Eisentraut's avatar
      Fix internal extract(timezone_minute) formulas · 91e7c903
      Peter Eisentraut authored
      Through various refactorings over time, the extract(timezone_minute
      from time with time zone) and extract(timezone_minute from timestamp
      with time zone) implementations ended up with two different but
      equally nonsensical formulas by using SECS_PER_MINUTE and
      MINS_PER_HOUR interchangeably.  Since those two are of course both the
      same number, the formulas do work, but for readability, fix them to be
      semantically correct.
      91e7c903
    • Alvaro Herrera's avatar
      libpq_pipeline: Must strdup(optarg) to avoid crash · dde1a35a
      Alvaro Herrera authored
      I forgot to strdup() when processing argv[].  Apparently many platforms
      hide this mistake from users, but in those that don't you may get a
      program crash.  Repair.
      
      Per buildfarm member drongo, which is the only one in all the buildfarm
      manifesting a problem here.
      
      While at it, move "numrows" processing out of the line of special cases,
      and make it getopt's -r instead.  (A similar thing could be done to
      'conninfo', but current use of the program doesn't warrant spending time
      on that -- nowhere else we use conninfo in so simplistic a manner.)
      
      Discussion: https://postgr.es/m/20210401124850.GA19247@alvherre.pgsql
      dde1a35a
    • Heikki Linnakangas's avatar
      Do COPY FROM encoding conversion/verification in larger chunks. · f82de5c4
      Heikki Linnakangas authored
      This gives a small performance gain, by reducing the number of calls
      to the conversion/verification function, and letting it work with
      larger inputs. Also, reorganizing the input pipeline makes it easier
      to parallelize the input parsing: after the input has been converted
      to the database encoding, the next stage of finding the newlines can
      be done in parallel, because there cannot be any newline chars
      "embedded" in multi-byte characters in the encodings that we support
      as server encodings.
      
      This changes behavior in one corner case: if client and server
      encodings are the same single-byte encoding (e.g. latin1), previously
      the input would not be checked for zero bytes ('\0'). Any fields
      containing zero bytes would be truncated at the zero. But if encoding
      conversion was needed, the conversion routine would throw an error on
      the zero. After this commit, the input is always checked for zeros.
      
      Reviewed-by: John Naylor
      Discussion: https://www.postgresql.org/message-id/e7861509-3960-538a-9025-b75a61188e01%40iki.fi
      f82de5c4
    • Heikki Linnakangas's avatar
      Add 'noError' argument to encoding conversion functions. · ea1b99a6
      Heikki Linnakangas authored
      With the 'noError' argument, you can try to convert a buffer without
      knowing the character boundaries beforehand. The functions now need to
      return the number of input bytes successfully converted.
      
      This is is a backwards-incompatible change, if you have created a custom
      encoding conversion with CREATE CONVERSION. This adds a check to
      pg_upgrade for that, refusing the upgrade if there are any user-defined
      encoding conversions. Custom conversions are very rare, there are no
      commonly used extensions that I know of that uses that feature. No other
      objects can depend on conversions, so if you do have one, you can fairly
      easily drop it before upgrading, and recreate it after the upgrade with
      an updated version.
      
      Add regression tests for built-in encoding conversions. This doesn't cover
      every conversion, but it covers all the internal functions in conv.c that
      are used to implement the conversions.
      
      Reviewed-by: John Naylor
      Discussion: https://www.postgresql.org/message-id/e7861509-3960-538a-9025-b75a61188e01%40iki.fi
      ea1b99a6
    • Peter Eisentraut's avatar
      Make extract(timetz) tests a bit more interesting · e2639a76
      Peter Eisentraut authored
      Use a time zone offset with nonzero minutes to make the
      timezone_minute test meaningful.
      e2639a76
    • Michael Paquier's avatar
      doc: Clarify use of ACCESS EXCLUSIVE lock in various sections · ffd3391e
      Michael Paquier authored
      Some sections of the documentation used "exclusive lock" to describe
      that an ACCESS EXCLUSIVE lock is taken during a given operation.  This
      can be confusing to the reader as ACCESS SHARE is allowed with an
      EXCLUSIVE lock is used, but that would not be the case with what is
      described on those parts of the documentation.
      
      Author: Greg Rychlewski
      Discussion: https://postgr.es/m/CAKemG7VptD=7fNWckFMsMVZL_zzvgDO6v2yVmQ+ZiBfc_06kCQ@mail.gmail.com
      Backpatch-through: 9.6
      ffd3391e
    • Amit Kapila's avatar
      Ensure to send a prepare after we detect concurrent abort during decoding. · 47788265
      Amit Kapila authored
      It is possible that while decoding a prepared transaction, it gets aborted
      concurrently via a ROLLBACK PREPARED command. In that case, we were
      skipping all the changes and directly sending Rollback Prepared when we
      find the same in WAL. However, the downstream has no idea of the GID of
      such a transaction. So, ensure to send prepare even when a concurrent
      abort is detected.
      
      Author: Ajin Cherian
      Reviewed-by: Markus Wanner, Amit Kapila
      Discussion: https://postgr.es/m/f82133c6-6055-b400-7922-97dae9f2b50b@enterprisedb.com
      47788265
    • Michael Paquier's avatar
      Move some client-specific routines from SSLServer to PostgresNode · 0d1a3343
      Michael Paquier authored
      test_connect_ok() and test_connect_fails() have always been part of the
      SSL tests, and check if a connection to the backend should work or not,
      and there are sanity checks done on specific error patterns dropped by
      libpq if the connection fails.
      
      This was fundamentally wrong on two aspects.  First, SSLServer.pm works
      mostly on setting up and changing the SSL configuration of a
      PostgresNode, and has really nothing to do with the client.  Second,
      the situation became worse in light of b34ca595, where the SSL tests
      would finish by using a psql command that may not come from the same
      installation as the node set up.
      
      This commit moves those client routines into PostgresNode, making easier
      the refactoring of SSLServer to become more SSL-implementation aware.
      This can also be reused by the ldap, kerberos and authentication test
      suites for connection checks, and a follow-up patch should extend those
      interfaces to match with backend log patterns.
      
      Author: Michael Paquier
      Reviewed-by: Andrew Dunstan, Daniel Gustafsson, Álvaro Herrera
      Discussion: https://postgr.es/m/YGLKNBf9zyh6+WSt@paquier.xyz
      0d1a3343
    • David Rowley's avatar
      Revert b6002a79 · 28b3e390
      David Rowley authored
      This removes "Add Result Cache executor node".  It seems that something
      weird is going on with the tracking of cache hits and misses as
      highlighted by many buildfarm animals.  It's not yet clear what the
      problem is as other parts of the plan indicate that the cache did work
      correctly, it's just the hits and misses that were being reported as 0.
      
      This is especially a bad time to have the buildfarm so broken, so
      reverting before too many more animals go red.
      
      Discussion: https://postgr.es/m/CAApHDvq_hydhfovm4=izgWs+C5HqEeRScjMbOgbpC-jRAeK3Yw@mail.gmail.com
      28b3e390
  3. 31 Mar, 2021 17 commits
    • David Rowley's avatar
      Add Result Cache executor node · b6002a79
      David Rowley authored
      Here we add a new executor node type named "Result Cache".  The planner
      can include this node type in the plan to have the executor cache the
      results from the inner side of parameterized nested loop joins.  This
      allows caching of tuples for sets of parameters so that in the event that
      the node sees the same parameter values again, it can just return the
      cached tuples instead of rescanning the inner side of the join all over
      again.  Internally, result cache uses a hash table in order to quickly
      find tuples that have been previously cached.
      
      For certain data sets, this can significantly improve the performance of
      joins.  The best cases for using this new node type are for join problems
      where a large portion of the tuples from the inner side of the join have
      no join partner on the outer side of the join.  In such cases, hash join
      would have to hash values that are never looked up, thus bloating the hash
      table and possibly causing it to multi-batch.  Merge joins would have to
      skip over all of the unmatched rows.  If we use a nested loop join with a
      result cache, then we only cache tuples that have at least one join
      partner on the outer side of the join.  The benefits of using a
      parameterized nested loop with a result cache increase when there are
      fewer distinct values being looked up and the number of lookups of each
      value is large.  Also, hash probes to lookup the cache can be much faster
      than the hash probe in a hash join as it's common that the result cache's
      hash table is much smaller than the hash join's due to result cache only
      caching useful tuples rather than all tuples from the inner side of the
      join.  This variation in hash probe performance is more significant when
      the hash join's hash table no longer fits into the CPU's L3 cache, but the
      result cache's hash table does.  The apparent "random" access of hash
      buckets with each hash probe can cause a poor L3 cache hit ratio for large
      hash tables.  Smaller hash tables generally perform better.
      
      The hash table used for the cache limits itself to not exceeding work_mem
      * hash_mem_multiplier in size.  We maintain a dlist of keys for this cache
      and when we're adding new tuples and realize we've exceeded the memory
      budget, we evict cache entries starting with the least recently used ones
      until we have enough memory to add the new tuples to the cache.
      
      For parameterized nested loop joins, we now consider using one of these
      result cache nodes in between the nested loop node and its inner node.  We
      determine when this might be useful based on cost, which is primarily
      driven off of what the expected cache hit ratio will be.  Estimating the
      cache hit ratio relies on having good distinct estimates on the nested
      loop's parameters.
      
      For now, the planner will only consider using a result cache for
      parameterized nested loop joins.  This works for both normal joins and
      also for LATERAL type joins to subqueries.  It is possible to use this new
      node for other uses in the future.  For example, to cache results from
      correlated subqueries.  However, that's not done here due to some
      difficulties obtaining a distinct estimation on the outer plan to
      calculate the estimated cache hit ratio.  Currently we plan the inner plan
      before planning the outer plan so there is no good way to know if a result
      cache would be useful or not since we can't estimate the number of times
      the subplan will be called until the outer plan is generated.
      
      The functionality being added here is newly introducing a dependency on
      the return value of estimate_num_groups() during the join search.
      Previously, during the join search, we only ever needed to perform
      selectivity estimations.  With this commit, we need to use
      estimate_num_groups() in order to estimate what the hit ratio on the
      result cache will be.   In simple terms, if we expect 10 distinct values
      and we expect 1000 outer rows, then we'll estimate the hit ratio to be
      99%.  Since cache hits are very cheap compared to scanning the underlying
      nodes on the inner side of the nested loop join, then this will
      significantly reduce the planner's cost for the join.   However, it's
      fairly easy to see here that things will go bad when estimate_num_groups()
      incorrectly returns a value that's significantly lower than the actual
      number of distinct values.  If this happens then that may cause us to make
      use of a nested loop join with a result cache instead of some other join
      type, such as a merge or hash join.  Our distinct estimations have been
      known to be a source of trouble in the past, so the extra reliance on them
      here could cause the planner to choose slower plans than it did previous
      to having this feature.  Distinct estimations are also fairly hard to
      estimate accurately when several tables have been joined already or when a
      WHERE clause filters out a set of values that are correlated to the
      expressions we're estimating the number of distinct value for.
      
      For now, the costing we perform during query planning for result caches
      does put quite a bit of faith in the distinct estimations being accurate.
      When these are accurate then we should generally see faster execution
      times for plans containing a result cache.  However, in the real world, we
      may find that we need to either change the costings to put less trust in
      the distinct estimations being accurate or perhaps even disable this
      feature by default.  There's always an element of risk when we teach the
      query planner to do new tricks that it decides to use that new trick at
      the wrong time and causes a regression.  Users may opt to get the old
      behavior by turning the feature off using the enable_resultcache GUC.
      Currently, this is enabled by default.  It remains to be seen if we'll
      maintain that setting for the release.
      
      Additionally, the name "Result Cache" is the best name I could think of
      for this new node at the time I started writing the patch.  Nobody seems
      to strongly dislike the name. A few people did suggest other names but no
      other name seemed to dominate in the brief discussion that there was about
      names. Let's allow the beta period to see if the current name pleases
      enough people.  If there's some consensus on a better name, then we can
      change it before the release.  Please see the 2nd discussion link below
      for the discussion on the "Result Cache" name.
      
      Author: David Rowley
      Reviewed-by: Andy Fan, Justin Pryzby, Zhihong Yu
      Tested-By: Konstantin Knizhnik
      Discussion: https://postgr.es/m/CAApHDvrPcQyQdWERGYWx8J%2B2DLUNgXu%2BfOSbQ1UscxrunyXyrQ%40mail.gmail.com
      Discussion: https://postgr.es/m/CAApHDvq=yQXr5kqhRviT2RhNKwToaWr9JAN5t+5_PzhuRJ3wvg@mail.gmail.com
      b6002a79
    • Alvaro Herrera's avatar
      Remove setvbuf() call from PQtrace() · 6ec578e6
      Alvaro Herrera authored
      It's misplaced there -- it's not libpq's output stream to tweak in that
      way.  In particular, POSIX says that it has to be called before any
      other operation on the file, so if a stream previously used by the
      calling application, bad things may happen.
      
      Put setvbuf() in libpq_pipeline for good measure.
      
      Also, reduce fopen(..., "w+") to just fopen(..., "w") in
      libpq_pipeline.c.  It's not clear that this fixes anything, but we don't
      use w+ anywhere.
      
      Per complaints from Tom Lane.
      
      Discussion: https://postgr.es/m/3337422.1617229905@sss.pgh.pa.us
      6ec578e6
    • Alvaro Herrera's avatar
      Initialize conn->Pfdebug to NULL when creating a connection · aba24b51
      Alvaro Herrera authored
      Failing to do this can cause a crash, and I suspect is what has happened
      with a buildfarm member reporting mysterious failures.
      
      This is an ancient bug, but I'm not backpatching since evidently nobody
      cares about PQtrace in older releases.
      
      Discussion: https://postgr.es/m/3333908.1617227066@sss.pgh.pa.us
      aba24b51
    • Alvaro Herrera's avatar
      Disable force_parallel_mode in libpq_pipeline · a6d3dea8
      Alvaro Herrera authored
      Some buildfarm animals with force_parallel_mode=regress were failing
      this test because the error is reported in a parallel worker quicker
      than the rows that succeed.
      
      Take the opportunity to move the SET of lc_messages out of the traced
      section, because it's not very interesting.
      Diagnosed-by: default avatarTom Lane <tgl@sss.pgh.pa.us>
      Discussion: https://postgr.es/m/3304521.1617221724@sss.pgh.pa.us
      a6d3dea8
    • Tom Lane's avatar
      Fix unportable use of isprint(). · 9e20406d
      Tom Lane authored
      We must cast the arguments of <ctype.h> functions to unsigned
      char to avoid problems where char is signed.
      
      Speaking of which, considering that this *is* a <ctype.h> function,
      it's rather remarkable that we aren't seeing more complaints about
      not having included that header.
      
      Per buildfarm.
      9e20406d
    • Tom Lane's avatar
      Fix portability and safety issues in pqTraceFormatTimestamp. · f1be740a
      Tom Lane authored
      Remove confusion between time_t and pg_time_t; neither
      gettimeofday() nor localtime() deal in the latter.
      libpq indeed has no business using <pgtime.h> at all.
      
      Use snprintf not sprintf, to ensure we can't overrun the
      supplied buffer.  (Unlikely, but let's be safe.)
      
      Per buildfarm.
      f1be740a
    • Tom Lane's avatar
      Silence compiler warning in non-assert builds. · 8998e3ca
      Tom Lane authored
      Per buildfarm.
      8998e3ca
    • Tom Lane's avatar
      Don't prematurely cram a value into a short int. · c545e952
      Tom Lane authored
      Since a4d75c86, some buildfarm members have been warning that
      		Assert(attnum <= MaxAttrNumber);
      is useless if attnum is an AttrNumber.  I'm not certain how plausible
      it is that the value coming out of the bitmap could actually exceed
      MaxAttrNumber, but we seem to have thought that that was possible back
      in 7300a699.  Revert the intermediate variable to int so that we have
      the same overflow protection as before.
      c545e952
    • Stephen Frost's avatar
      Add a docs section for obsoleted and renamed functions and settings · 3b0c647b
      Stephen Frost authored
      The new appendix groups information on renamed or removed settings,
      commands, etc into an out-of-the-way part of the docs.
      
      The original id elements are retained in each subsection to ensure that
      the same filenames are produced for HTML docs. This prevents /current/
      links on the web from breaking, and allows users of the web docs
      to follow links from old version pages to info on the changes in the
      new version. Prior to this change, a link to /current/ for renamed
      sections like the recovery.conf docs would just 404. Similarly if
      someone searched for recovery.conf they would find the pg11 docs,
      but there would be no /12/ or /current/ link, so they couldn't easily
      find out that it was removed in pg12 or how to adapt.
      
      Index entries are also added so that there's a breadcrumb trail for
      users to follow when they know the old name, but not what we changed it
      to. So a user who is trying to find out how to set standby_mode in
      PostgreSQL 12+, or where pg_resetxlog went, now has more chance of
      finding that information.
      
      Craig Ringer and Stephen Frost
      Reviewed-by: Euler Taveira
      Discussion: https://postgr.es/m/CAGRY4nzPNOyYQ_1-pWYToUVqQ0ThqP5jdURnJMZPm539fdizOg%40mail.gmail.com
      Backpatch-through: 10
      3b0c647b
    • Tom Lane's avatar
      Suppress compiler warning in libpq_pipeline.c. · 522d1a89
      Tom Lane authored
      Some compilers seem to be concerned about the possibility that
      recv_step is not any of the defined enum values.  Silence
      warnings about uninitialized cmdtag in a different way than
      I did in 9fb9691a.
      522d1a89
    • Tom Lane's avatar
      Improve style of some replication-related error messages. · 6197db53
      Tom Lane authored
      Put the remote end's error message into the primary error string,
      instead of relegating it to errdetail().  Although this could end up
      being awkward if the remote sends us a really long error message,
      it seems more in keeping with our message style guidelines, and more
      helpful in situations where the errdetail could get dropped.
      
      Peter Smith
      
      Discussion: https://postgr.es/m/CAHut+Ps-Qv2yQceCwobQDP0aJOkfDzRFrOaR6+2Op2K=WHGeWg@mail.gmail.com
      6197db53
    • Alvaro Herrera's avatar
      Fix some libpq_pipeline test problems · db973ffb
      Alvaro Herrera authored
      Test pipeline_abort was not checking that it got the rows it expected in
      one mode; make it do so.  This doesn't fix the actual problem (no idea
      what that is, yet) but at least it should make it more obvious rather
      than being visible only as a difference in the trace output.
      
      While at it, fix other infelicities in the test:
      
      * I reversed the order of result vs. expected in like().
      
      * The output traces from -t are being put in the log dir, which means
      the buildfarm script uselessly captures them.  Put them in a separate
      dir tmp_check/traces instead, to avoid cluttering the buildfarm results.
      
      * Test pipelined_insert was using too large a row count.  Reduce that a
      tad and add a filler column to make each insert a little bulkier, while
      still keeping enough that a buffer is filled and we have to switch mode.
      db973ffb
    • Joe Conway's avatar
      Fix has_column_privilege function corner case · b12bd486
      Joe Conway authored
      According to the comments, when an invalid or dropped column oid is passed
      to has_column_privilege(), the intention has always been to return NULL.
      However, when the caller had table level privilege the invalid/missing
      column was never discovered, because table permissions were checked first.
      
      Fix that by introducing extended versions of pg_attribute_acl(check|mask)
      and pg_class_acl(check|mask) which take a new argument, is_missing. When
      is_missing is NULL, the old behavior is preserved. But when is_missing is
      passed by the caller, no ERROR is thrown for dropped or missing
      columns/relations, and is_missing is flipped to true. This in turn allows
      has_column_privilege to check for column privileges first, providing the
      desired semantics.
      
      Not backpatched since it is a user visible behavioral change with no previous
      complaints, and the fix is a bit on the invasive side.
      
      Author: Joe Conway
      Reviewed-By: Tom Lane
      Reported by: Ian Barwick
      Discussion: https://postgr.es/m/flat/9b5f4311-157b-4164-7fe7-077b4fe8ed84%40joeconway.com
      b12bd486
    • Tom Lane's avatar
      Rework planning and execution of UPDATE and DELETE. · 86dc9005
      Tom Lane authored
      This patch makes two closely related sets of changes:
      
      1. For UPDATE, the subplan of the ModifyTable node now only delivers
      the new values of the changed columns (i.e., the expressions computed
      in the query's SET clause) plus row identity information such as CTID.
      ModifyTable must re-fetch the original tuple to merge in the old
      values of any unchanged columns.  The core advantage of this is that
      the changed columns are uniform across all tables of an inherited or
      partitioned target relation, whereas the other columns might not be.
      A secondary advantage, when the UPDATE involves joins, is that less
      data needs to pass through the plan tree.  The disadvantage of course
      is an extra fetch of each tuple to be updated.  However, that seems to
      be very nearly free in context; even worst-case tests don't show it to
      add more than a couple percent to the total query cost.  At some point
      it might be interesting to combine the re-fetch with the tuple access
      that ModifyTable must do anyway to mark the old tuple dead; but that
      would require a good deal of refactoring and it seems it wouldn't buy
      all that much, so this patch doesn't attempt it.
      
      2. For inherited UPDATE/DELETE, instead of generating a separate
      subplan for each target relation, we now generate a single subplan
      that is just exactly like a SELECT's plan, then stick ModifyTable
      on top of that.  To let ModifyTable know which target relation a
      given incoming row refers to, a tableoid junk column is added to
      the row identity information.  This gets rid of the horrid hack
      that was inheritance_planner(), eliminating O(N^2) planning cost
      and memory consumption in cases where there were many unprunable
      target relations.
      
      Point 2 of course requires point 1, so that there is a uniform
      definition of the non-junk columns to be returned by the subplan.
      We can't insist on uniform definition of the row identity junk
      columns however, if we want to keep the ability to have both
      plain and foreign tables in a partitioning hierarchy.  Since
      it wouldn't scale very far to have every child table have its
      own row identity column, this patch includes provisions to merge
      similar row identity columns into one column of the subplan result.
      In particular, we can merge the whole-row Vars typically used as
      row identity by FDWs into one column by pretending they are type
      RECORD.  (It's still okay for the actual composite Datums to be
      labeled with the table's rowtype OID, though.)
      
      There is more that can be done to file down residual inefficiencies
      in this patch, but it seems to be committable now.
      
      FDW authors should note several API changes:
      
      * The argument list for AddForeignUpdateTargets() has changed, and so
      has the method it must use for adding junk columns to the query.  Call
      add_row_identity_var() instead of manipulating the parse tree directly.
      You might want to reconsider exactly what you're adding, too.
      
      * PlanDirectModify() must now work a little harder to find the
      ForeignScan plan node; if the foreign table is part of a partitioning
      hierarchy then the ForeignScan might not be the direct child of
      ModifyTable.  See postgres_fdw for sample code.
      
      * To check whether a relation is a target relation, it's no
      longer sufficient to compare its relid to root->parse->resultRelation.
      Instead, check it against all_result_relids or leaf_result_relids,
      as appropriate.
      
      Amit Langote and Tom Lane
      
      Discussion: https://postgr.es/m/CA+HiwqHpHdqdDn48yCEhynnniahH78rwcrv1rEX65-fsZGBOLQ@mail.gmail.com
      86dc9005
    • Peter Eisentraut's avatar
      Allow an alias to be attached to a JOIN ... USING · 055fee7e
      Peter Eisentraut authored
      This allows something like
      
          SELECT ... FROM t1 JOIN t2 USING (a, b, c) AS x
      
      where x has the columns a, b, c and unlike a regular alias it does not
      hide the range variables of the tables being joined t1 and t2.
      
      Per SQL:2016 feature F404 "Range variable for common column names".
      Reviewed-by: default avatarVik Fearing <vik.fearing@2ndquadrant.com>
      Reviewed-by: default avatarTom Lane <tgl@sss.pgh.pa.us>
      Discussion: https://www.postgresql.org/message-id/flat/454638cf-d563-ab76-a585-2564428062af@2ndquadrant.com
      055fee7e
    • Etsuro Fujita's avatar
      Add support for asynchronous execution. · 27e1f145
      Etsuro Fujita authored
      This implements asynchronous execution, which runs multiple parts of a
      non-parallel-aware Append concurrently rather than serially to improve
      performance when possible.  Currently, the only node type that can be
      run concurrently is a ForeignScan that is an immediate child of such an
      Append.  In the case where such ForeignScans access data on different
      remote servers, this would run those ForeignScans concurrently, and
      overlap the remote operations to be performed simultaneously, so it'll
      improve the performance especially when the operations involve
      time-consuming ones such as remote join and remote aggregation.
      
      We may extend this to other node types such as joins or aggregates over
      ForeignScans in the future.
      
      This also adds the support for postgres_fdw, which is enabled by the
      table-level/server-level option "async_capable".  The default is false.
      
      Robert Haas, Kyotaro Horiguchi, Thomas Munro, and myself.  This commit
      is mostly based on the patch proposed by Robert Haas, but also uses
      stuff from the patch proposed by Kyotaro Horiguchi and from the patch
      proposed by Thomas Munro.  Reviewed by Kyotaro Horiguchi, Konstantin
      Knizhnik, Andrey Lepikhov, Movead Li, Thomas Munro, Justin Pryzby, and
      others.
      
      Discussion: https://postgr.es/m/CA%2BTgmoaXQEt4tZ03FtQhnzeDEMzBck%2BLrni0UWHVVgOTnA6C1w%40mail.gmail.com
      Discussion: https://postgr.es/m/CA%2BhUKGLBRyu0rHrDCMC4%3DRn3252gogyp1SjOgG8SEKKZv%3DFwfQ%40mail.gmail.com
      Discussion: https://postgr.es/m/20200228.170650.667613673625155850.horikyota.ntt%40gmail.com
      27e1f145
    • Peter Eisentraut's avatar
      Add p_names field to ParseNamespaceItem · 66392d39
      Peter Eisentraut authored
      ParseNamespaceItem had a wired-in assumption that p_rte->eref
      describes the table and column aliases exposed by the nsitem.  This
      relaxes this by creating a separate p_names field in an nsitem.  This
      is mainly preparation for a patch for JOIN USING aliases, but it saves
      one indirection in common code paths, so it's possibly a win on its
      own.
      
      Author: Tom Lane <tgl@sss.pgh.pa.us>
      Discussion: https://www.postgresql.org/message-id/785329.1616455091@sss.pgh.pa.us
      66392d39