1. 07 Apr, 2020 1 commit
  2. 06 Apr, 2020 1 commit
    • Implement Incremental Sort · d2d8a229
      Tomas Vondra authored
      Incremental Sort is an optimized variant of multikey sort for cases when
      the input is already sorted by a prefix of the requested sort keys. For
      example when the relation is already sorted by (key1, key2) and we need
      to sort it by (key1, key2, key3) we can simply split the input rows into
      groups having equal values in (key1, key2), and only sort/compare the
      remaining column key3.
      
      This has a number of benefits:
      
      - Reduced memory consumption, because only a single group (determined by
        values in the sorted prefix) needs to be kept in memory. This may also
        eliminate the need to spill to disk.
      
      - Lower startup cost, because Incremental Sort produces results after each
        prefix group, which is beneficial for plans where startup cost matters
        (for example, queries with a LIMIT clause).
      
      We consider both Sort and Incremental Sort, and decide based on costing.
      
      The implemented algorithm operates in two different modes:
      
      - Fetching a minimum number of tuples without checking equality of the
        prefix keys, and sorting on all columns when that is safe.
      
      - Fetching all tuples for a single prefix group and then sorting by
        comparing only the remaining (non-prefix) keys.
      
      We always start in the first mode, and employ a heuristic to switch into
      the second mode if we believe it's beneficial - the goal is to minimize
      the number of unnecessary comparisons while keeping memory consumption
      below work_mem.
      
      This is a very old patch series. The idea was originally proposed by
      Alexander Korotkov back in 2013, and then revived in 2017. In 2018 the
      patch was taken over by James Coleman, who wrote and rewrote most of the
      current code.
      
      There have been many reviewers/contributors since 2013 - I've done my
      best to pick the most active ones and list them in this commit message.
      
      Author: James Coleman, Alexander Korotkov
      Reviewed-by: Tomas Vondra, Andreas Karlsson, Marti Raudsepp, Peter Geoghegan, Robert Haas, Thomas Munro, Antonin Houska, Andres Freund, Alexander Kuzmenkov
      Discussion: https://postgr.es/m/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=goN-gfA@mail.gmail.com
      Discussion: https://postgr.es/m/CAPpHfds1waRZ=NOmueYq0sx1ZSCnt+5QJvizT8ndT2=etZEeAQ@mail.gmail.com
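
      As an illustrative sketch (hypothetical table and index names; the
      planner's choice and the exact EXPLAIN output may differ), a query that
      extends an indexed ordering can produce a plan like:

          -- (key1, key2) ordering comes from the index; only key3 needs sorting
          CREATE TABLE t (key1 int, key2 int, key3 int);
          CREATE INDEX t_key1_key2_idx ON t (key1, key2);

          EXPLAIN (COSTS OFF) SELECT * FROM t ORDER BY key1, key2, key3;
          --  Incremental Sort
          --    Sort Key: key1, key2, key3
          --    Presorted Key: key1, key2
          --    ->  Index Scan using t_key1_key2_idx on t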
  3. 04 Apr, 2020 1 commit
    • Skip WAL for new relfilenodes, under wal_level=minimal. · c6b92041
      Noah Misch authored
      Until now, only selected bulk operations (e.g. COPY) did this.  If a
      given relfilenode received both a WAL-skipping COPY and a WAL-logged
      operation (e.g. INSERT), recovery could lose tuples from the COPY.  See
      src/backend/access/transam/README section "Skipping WAL for New
      RelFileNode" for the new coding rules.  Maintainers of table access
      methods should examine that section.
      
      To maintain data durability, just before commit, we choose between an
      fsync of the relfilenode and copying its contents to WAL.  A new GUC,
      wal_skip_threshold, guides that choice.  If this change slows a workload
      that creates small, permanent relfilenodes under wal_level=minimal, try
      adjusting wal_skip_threshold.  Users setting a timeout on COMMIT may
      need to adjust that timeout, and log_min_duration_statement analysis
      will reflect time consumption moving to COMMIT from commands like COPY.
      
      Internally, this requires a reliable determination of whether
      RollbackAndReleaseCurrentSubTransaction() would unlink a relation's
      current relfilenode.  Introduce rd_firstRelfilenodeSubid.  Amend the
      specification of rd_createSubid such that the field is zero when a new
      rel has an old rd_node.  Make relcache.c retain entries for certain
      dropped relations until end of transaction.
      
      Bump XLOG_PAGE_MAGIC, since this introduces XLOG_GIST_ASSIGN_LSN.
      Future servers accept older WAL, so this bump is discretionary.
      
      Kyotaro Horiguchi, reviewed (in earlier, similar versions) by Robert
      Haas.  Heikki Linnakangas and Michael Paquier implemented earlier
      designs that materially clarified the problem.  Reviewed, in earlier
      designs, by Andrew Dunstan, Andres Freund, Alvaro Herrera, Tom Lane,
      Fujii Masao, and Simon Riggs.  Reported by Martijn van Oosterhout.
      
      Discussion: https://postgr.es/m/20150702220524.GA9392@svana.org
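
      A minimal sketch of the knob involved (the value is illustrative, not a
      recommendation; wal_level = minimal itself requires max_wal_senders = 0
      and a server restart):

          -- below the threshold, a new relfilenode's contents are copied to
          -- WAL at commit; at or above it, the files are fsync'd instead
          ALTER SYSTEM SET wal_skip_threshold = '8MB';
          SELECT pg_reload_conf();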
  4. 02 Apr, 2020 2 commits
    • Improve user control over truncation of logged bind-parameter values. · 0b34e7d3
      Tom Lane authored
      This patch replaces the boolean GUC log_parameters_on_error introduced
      by commit ba79cb5d with an integer log_parameter_max_length_on_error,
      adding the ability to specify how many bytes to trim each logged
      parameter value to.  (The previous coding hard-wired that choice at
      64 bytes.)
      
      In addition, add a new parameter log_parameter_max_length that provides
      similar control over truncation of query parameters that are logged in
      response to statement-logging options, as opposed to errors.  Previous
      releases always logged such parameters in full, possibly causing log
      bloat.
      
      For backwards compatibility with prior releases,
      log_parameter_max_length defaults to -1 (log in full), while
      log_parameter_max_length_on_error defaults to 0 (no logging).
      
      Per discussion, log_parameter_max_length is SUSET since the DBA should
      control routine logging behavior, but log_parameter_max_length_on_error
      is USERSET because it also affects errcontext data sent back to the
      client.
      
      Alexey Bashtanov, editorialized a little by me
      
      Discussion: https://postgr.es/m/b10493cc-a399-a03a-67c7-068f2791ee50@imap.cc
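
      Illustrative settings (log_parameter_max_length is SUSET, so SET
      requires suitable privileges; the on-error variant is USERSET):

          -- trim parameter values in statement logging to 1kB each
          SET log_parameter_max_length = '1kB';
          -- log up to 64 bytes of each parameter value with error messages
          SET log_parameter_max_length_on_error = 64;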
    • doc: Update for Unix-domain sockets on Windows · 580a446c
      Peter Eisentraut authored
      Update the documentation to reflect that Unix-domain sockets are now
      usable on Windows.
  5. 28 Mar, 2020 1 commit
    • Trigger autovacuum based on number of INSERTs · b07642db
      David Rowley authored
      Traditionally autovacuum has only ever invoked a worker based on the
      estimated number of dead tuples in a table and for anti-wraparound
      purposes. For the latter, with certain classes of tables such as
      insert-only tables, anti-wraparound vacuums could be the first vacuum that
      the table ever receives. This could often lead to autovacuum workers being
      busy for extended periods of time due to having to potentially freeze
      every page in the table. This could be particularly bad for very large
      tables. New clusters, or recently pg_restored clusters, could suffer even
      more as many large tables may have the same relfrozenxid, which could
      result in large numbers of tables requiring an anti-wraparound vacuum all
      at once.
      
      Here we aim to reduce the work required by anti-wraparound and aggressive
      vacuums in general, by triggering autovacuum when the table has received
      enough INSERTs. This is controlled by adding two new GUCs and reloptions:
      autovacuum_vacuum_insert_threshold and
      autovacuum_vacuum_insert_scale_factor. These work exactly the same way as
      the existing scale factor and threshold controls, except that they base
      themselves on the number of inserts since the last vacuum rather than the
      number of dead tuples. New controls were added rather than reusing the
      existing ones, to allow these new vacuums to be tuned independently or
      even disabled entirely, which can be done by setting
      autovacuum_vacuum_insert_threshold to -1.
      
      We make no attempt to skip index cleanup operations on these vacuums,
      as they may trigger for an insert-mostly table that never accumulates
      enough dead tuples to trigger an autovacuum aimed at removing them. If
      we were to skip cleaning the indexes in this case, the index(es) could
      become bloated over time.
      
      There are additional benefits to triggering autovacuums based on inserts,
      as tables which never contain enough dead tuples to trigger an autovacuum
      are now more likely to receive a vacuum, which can mark more of the table
      as "allvisible" and encourage the query planner to make use of Index Only
      Scans.
      
      Currently, we still obey vacuum_freeze_min_age when triggering these new
      autovacuums based on INSERTs. For large insert-only tables, it may be
      beneficial to lower the table's autovacuum_freeze_min_age so that tuples
      are eligible to be frozen sooner. Here we've opted not to zero that for
      these types of vacuums, since the table may just be insert-mostly and we
      may otherwise freeze tuples that are still destined to be updated or
      removed in the near future.
      
      There was some debate about what exactly the new scale factor and
      threshold should default to. For now, they are set to 0.2 and 1000,
      respectively; there may be some motivation to adjust them before the
      release.
      
      Author: Laurenz Albe, Darafei Praliaskouski
      Reviewed-by: Alvaro Herrera, Masahiko Sawada, Chris Travers, Andres Freund, Justin Pryzby
      Discussion: https://postgr.es/m/CAC8Q8t%2Bj36G_bLF%3D%2B0iMo6jGNWnLnWb1tujXuJr-%2Bx8ZCCTqoQ%40mail.gmail.com
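
      A per-table tuning sketch (hypothetical table name; values are
      illustrative only):

          -- vacuum once inserts exceed 100000 plus 5% of the table size,
          -- i.e. roughly: inserts_since_vacuum > threshold + scale_factor * reltuples
          ALTER TABLE big_insert_only SET
            (autovacuum_vacuum_insert_threshold = 100000,
             autovacuum_vacuum_insert_scale_factor = 0.05);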
  6. 27 Mar, 2020 2 commits
  7. 24 Mar, 2020 1 commit
    • Prefer standby promotion over recovery pause. · 496ee647
      Fujii Masao authored
      Previously, if a promotion was triggered while recovery was paused,
      the paused state continued. Recovery could also be paused by executing
      pg_wal_replay_pause() even while a promotion was ongoing. That is,
      recovery pause took priority over a standby promotion. This behavior
      was not desirable, because when users request a promotion they
      generally want recovery to complete as soon as possible and the server
      to become the master.
      
      This commit changes recovery so that it prefers a promotion over
      recovery pause. That is, if a promotion is triggered while recovery
      is paused, the paused state ends and the promotion continues. This
      commit also makes recovery pause functions like pg_wal_replay_pause()
      throw an error if they are executed while a promotion is ongoing.
      
      Internally, this commit adds a new internal function, PromoteIsTriggered(),
      that returns true if a promotion has been triggered. Since its name
      and that of the existing function IsPromoteTriggered() are confusingly
      similar, the commit renames IsPromoteTriggered() to IsPromoteSignaled(),
      which is a more appropriate name.
      
      Author: Fujii Masao
      Reviewed-by: Atsushi Torikoshi, Sergei Kornilov
      Discussion: https://postgr.es/m/00c194b2-dbbb-2e8a-5b39-13f14048ef0a@oss.nttdata.com
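
      An illustrative sequence on a standby after this change:

          SELECT pg_wal_replay_pause();   -- pause WAL replay
          -- requesting a promotion (e.g. SELECT pg_promote()) now ends the
          -- paused state and lets the promotion complete; calling
          -- pg_wal_replay_pause() again while the promotion is ongoing
          -- raises an error instead of pausing.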
  8. 22 Mar, 2020 1 commit
  9. 21 Mar, 2020 1 commit
    • Skip WAL for new relfilenodes, under wal_level=minimal. · cb2fd7ea
      Noah Misch authored
      Until now, only selected bulk operations (e.g. COPY) did this.  If a
      given relfilenode received both a WAL-skipping COPY and a WAL-logged
      operation (e.g. INSERT), recovery could lose tuples from the COPY.  See
      src/backend/access/transam/README section "Skipping WAL for New
      RelFileNode" for the new coding rules.  Maintainers of table access
      methods should examine that section.
      
      To maintain data durability, just before commit, we choose between an
      fsync of the relfilenode and copying its contents to WAL.  A new GUC,
      wal_skip_threshold, guides that choice.  If this change slows a workload
      that creates small, permanent relfilenodes under wal_level=minimal, try
      adjusting wal_skip_threshold.  Users setting a timeout on COMMIT may
      need to adjust that timeout, and log_min_duration_statement analysis
      will reflect time consumption moving to COMMIT from commands like COPY.
      
      Internally, this requires a reliable determination of whether
      RollbackAndReleaseCurrentSubTransaction() would unlink a relation's
      current relfilenode.  Introduce rd_firstRelfilenodeSubid.  Amend the
      specification of rd_createSubid such that the field is zero when a new
      rel has an old rd_node.  Make relcache.c retain entries for certain
      dropped relations until end of transaction.
      
      Back-patch to 9.5 (all supported versions).  This introduces a new WAL
      record type, XLOG_GIST_ASSIGN_LSN, without bumping XLOG_PAGE_MAGIC.  As
      always, update standby systems before master systems.  This changes
      sizeof(RelationData) and sizeof(IndexStmt), breaking binary
      compatibility for affected extensions.  (The most recent commit to
      affect the same class of extensions was
      089e4d405d0f3b94c74a2c6a54357a84a681754b.)
      
      Kyotaro Horiguchi, reviewed (in earlier, similar versions) by Robert
      Haas.  Heikki Linnakangas and Michael Paquier implemented earlier
      designs that materially clarified the problem.  Reviewed, in earlier
      designs, by Andrew Dunstan, Andres Freund, Alvaro Herrera, Tom Lane,
      Fujii Masao, and Simon Riggs.  Reported by Martijn van Oosterhout.
      
      Discussion: https://postgr.es/m/20150702220524.GA9392@svana.org
  10. 18 Mar, 2020 1 commit
    • Disk-based Hash Aggregation. · 1f39bce0
      Jeff Davis authored
      While performing hash aggregation, track memory usage when adding new
      groups to a hash table. If the memory usage exceeds work_mem, enter
      "spill mode".
      
      In spill mode, new groups are not created in the hash table(s), but
      existing groups continue to be advanced if input tuples match. Tuples
      that would cause a new group to be created are instead spilled to a
      logical tape to be processed later.
      
      The tuples are spilled in a partitioned fashion. When all tuples from
      the outer plan are processed (either by advancing the group or
      spilling the tuple), finalize and emit the groups from the hash
      table. Then, create new batches of work from the spilled partitions,
      and select one of the saved batches and process it (possibly spilling
      recursively).
      
      Author: Jeff Davis
      Reviewed-by: Tomas Vondra, Adam Lee, Justin Pryzby, Taylor Vesely, Melanie Plageman
      Discussion: https://postgr.es/m/507ac540ec7c20136364b5272acbcd4574aa76ef.camel@j-davis.com
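
      A sketch of what a spilling aggregate can look like under EXPLAIN
      ANALYZE (hypothetical table; the numbers are illustrative):

          SET work_mem = '4MB';
          EXPLAIN (ANALYZE, COSTS OFF)
          SELECT key, count(*) FROM big_table GROUP BY key;
          --  HashAggregate
          --    Group Key: key
          --    Batches: 5  Memory Usage: 4097kB  Disk Usage: 65544kB
          --    ->  Seq Scan on big_table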
  11. 16 Mar, 2020 1 commit
  12. 15 Mar, 2020 1 commit
  13. 10 Mar, 2020 1 commit
  14. 29 Jan, 2020 1 commit
  15. 22 Jan, 2020 1 commit
    • Add GUC ignore_invalid_pages. · 41c184bc
      Fujii Masao authored
      Detection of WAL records having references to invalid pages
      during recovery causes PostgreSQL to raise a PANIC-level error,
      aborting the recovery. Setting ignore_invalid_pages to on causes
      the system to ignore those WAL records (but still report a warning)
      and continue recovery. This behavior may cause crashes or data loss,
      propagate or hide corruption, or lead to other serious problems.
      However, it may allow you to get past the PANIC-level error, finish
      the recovery, and let the server start up.
      
      Author: Fujii Masao
      Reviewed-by: Michael Paquier
      Discussion: https://www.postgresql.org/message-id/CAHGQGwHCK6f77yeZD4MHOnN+PaTf6XiJfEB+Ce7SksSHjeAWtg@mail.gmail.com
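
      A last-resort sketch for a server that cannot finish recovery (this is
      a dangerous, restart-only developer setting; turn it back off once the
      server is up):

          -- add to postgresql.conf before restarting:
          --     ignore_invalid_pages = on
          -- after startup, confirm the setting and plan to disable it again:
          SHOW ignore_invalid_pages;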
  16. 20 Jan, 2020 1 commit
    • Allow vacuum command to process indexes in parallel. · 40d964ec
      Amit Kapila authored
      This feature allows VACUUM to leverage multiple CPUs in order to
      process indexes, enabling index vacuuming and index cleanup to be
      performed with background workers.  It adds a PARALLEL option to the
      VACUUM command, with which the user can specify the number of workers
      to use; that number is limited by the number of indexes on the table.
      Specifying zero workers disables parallelism.  This option can't be
      used with the FULL option.
      
      Each index is processed by at most one vacuum process.  Therefore parallel
      vacuum can be used when the table has at least two indexes.
      
      The parallel degree is either specified by the user or determined based on
      the number of indexes that the table has, and is further limited by
      max_parallel_maintenance_workers.  An index can participate in a parallel
      vacuum only if its size is greater than min_parallel_index_scan_size.
      
      Author: Masahiko Sawada and Amit Kapila
      Reviewed-by: Dilip Kumar, Amit Kapila, Robert Haas, Tomas Vondra,
      Mahendra Singh and Sergei Kornilov
      Tested-by: Mahendra Singh and Prabhat Sahu
      Discussion:
      https://postgr.es/m/CAD21AoDTPMgzSkV4E3SFo1CH_x50bf5PqZFQf4jmqjk-C03BWg@mail.gmail.com
      https://postgr.es/m/CAA4eK1J-VoR9gzS5E75pcD-OH0mEyCdp8RihcwKrcuw7J-Q0+w@mail.gmail.com
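
      Illustrative usage (hypothetical table name):

          -- request up to 4 workers; the effective number is capped by the
          -- number of the table's indexes and max_parallel_maintenance_workers
          VACUUM (PARALLEL 4) my_table;
          -- disable parallelism for this command
          VACUUM (PARALLEL 0) my_table;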
  17. 14 Jan, 2020 1 commit
  18. 11 Dec, 2019 1 commit
  19. 04 Dec, 2019 1 commit
  20. 29 Nov, 2019 1 commit
  21. 19 Nov, 2019 1 commit
    • Add logical_decoding_work_mem to limit ReorderBuffer memory usage. · cec2edfa
      Amit Kapila authored
      Instead of deciding to serialize a transaction merely based on the
      number of changes in that xact (toplevel or subxact), this makes
      the decision based on the amount of memory consumed by the changes.
      
      The memory limit is defined by a new logical_decoding_work_mem GUC,
      so for example we can do this
      
          SET logical_decoding_work_mem = '128kB'
      
      to reduce the memory usage of walsenders, or set a higher value to
      reduce disk writes. The minimum value is 64kB.
      
      When adding a change to a transaction, we account for the size in
      two places. Firstly, in the ReorderBuffer, which is then used to
      decide if we reached the total memory limit. And secondly in the
      transaction the change belongs to, so that we can pick the largest
      transaction to evict (and serialize to disk).
      
      We still use max_changes_in_memory when loading changes serialized
      to disk. The trouble is that we can't use the memory limit directly,
      as there might be multiple serialized subxacts; we need to read all
      of them, but we don't know in advance how many there are (or which
      subxact to read first).
      
      We do not serialize the ReorderBufferTXN entries, so if there is a
      transaction with many subxacts, most of the memory may be consumed by
      this type of object. Those records are not included in the memory
      accounting.
      
      We also do not account for INTERNAL_TUPLECID changes, which are
      kept in a separate list and not evicted from memory. Transactions
      with many CTID changes may consume significant amounts of memory,
      but we can't really do much about that.
      
      The current eviction algorithm is very simple - the transaction is
      picked merely by size, while it might be useful to also consider age
      (LSN) of the changes for example. With the new Generational memory
      allocator, evicting the oldest changes would make it more likely
      the memory gets actually pfreed.
      
      logical_decoding_work_mem can be set in postgresql.conf, in which
      case it serves as the default for all publishers on that instance.
      
      Author: Tomas Vondra, with changes by Dilip Kumar and Amit Kapila
      Reviewed-by: Dilip Kumar and Amit Kapila
      Tested-By: Vignesh C
      Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
  22. 09 Nov, 2019 1 commit
  23. 08 Nov, 2019 1 commit
  24. 06 Nov, 2019 2 commits
    • Allow sampling of statements depending on duration · 6e3e6cc0
      Tomas Vondra authored
      This allows logging a sample of statements, without incurring excessive
      log traffic (which may impact performance).  This can be useful when
      analyzing workloads with lots of short queries.
      
      The sampling is configured using two new GUC parameters:
      
       * log_min_duration_sample - minimum required statement duration
      
       * log_statement_sample_rate - sample rate (0.0 - 1.0)
      
      Only statements with duration exceeding log_min_duration_sample are
      considered for sampling. To enable sampling, both of those GUCs have
      to be set correctly.
      
      The existing log_min_duration_statement GUC has higher priority, i.e.
      statements with duration exceeding log_min_duration_statement will
      always be logged, irrespective of how the sampling is configured. This
      means only configurations with
      
        log_min_duration_sample < log_min_duration_statement
      
      actually sample the statements, instead of logging everything.
      
      Author: Adrien Nayrat
      Reviewed-by: David Rowley, Vik Fearing, Tomas Vondra
      Discussion: https://postgr.es/m/bbe0a1a8-a8f7-3be2-155a-888e661cc06c@anayrat.info
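
      An illustrative configuration (both sampling GUCs are
      superuser-settable; the values are examples only):

          SET log_min_duration_sample = '100ms';   -- consider statements over 100ms
          SET log_statement_sample_rate = 0.10;    -- log 10% of those
          SET log_min_duration_statement = '1s';   -- over 1s: always logged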
    • Document log_transaction_sample_rate as superuser-only · 11d9ac28
      Tomas Vondra authored
      The docs do say which GUCs can be changed only by superusers, but we
      forgot to mention this for the new log_transaction_sample_rate. This
      GUC was introduced in PostgreSQL 12, so backpatch accordingly.
      
      Author: Adrien Nayrat
      Backpatch-through: 12
  25. 26 Oct, 2019 1 commit
    • Doc: improve documentation of configuration settings that have units. · cfb75590
      Tom Lane authored
      When we added the GUC units feature, we didn't make any great effort
      to adjust the documentation of individual GUCs; they tended to still
      say things like "this is the number of milliseconds that ...", even
      though users might prefer to write some other units, and SHOW might
      even show the value in other units.  Commit 6c9fb69f made an effort
      to improve this situation, but I thought it made things less readable
      by injecting units information in mid-sentence.  It also wasn't very
      consistent, and did not touch all the GUCs that have units.
      
      To improve matters, standardize on the phrasing "If this value is
      specified without units, it is taken as <units>".  Also, try to
      standardize where this is mentioned, right before the specification
      of the default.  (In a couple of places, doing that would've required
      more rewriting than seemed justified, so I wasn't 100% consistent
      about that.)  I also tried to use the phrases "amount of time",
      "amount of memory", etc rather than describing the contents of GUCs
      in other ways, as those were the majority usage in places that weren't
      overcommitting to a particular unit.  (I left "length of time" alone
      in a couple of places, though.)
      
      I failed to resist the temptation to copy-edit some awkward text, too.
      
      Backpatch to v12, like 6c9fb69f, mainly because v12 hasn't diverged
      much from HEAD yet.
      
      Discussion: https://postgr.es/m/15882.1571942223@sss.pgh.pa.us
  26. 25 Oct, 2019 1 commit
    • Reset statement_timeout between queries of a multi-query string. · 2b2bacdc
      Tom Lane authored
      Historically, we started the timer (if StatementTimeout > 0) at the
      beginning of a simple-Query message and usually let it run until the
      end, so that the timeout limit applied to the entire query string,
      and intra-string changes of the statement_timeout GUC had no effect.
      But, confusingly, a COMMIT within the string would reset the state
      and allow a fresh timeout cycle to start with the current setting.
      
      Commit f8e5f156 changed the behavior of statement_timeout for extended
      query protocol, and as an apparently-unintended side effect, a change in
      the statement_timeout GUC during a multi-statement simple-Query message
      might take effect immediately --- but only if it was going from
      "disabled" to "enabled".
      
      This is all pretty confusing, not to mention completely undocumented.
      Let's change things so that the timeout is always reset between queries
      of a multi-query string, whether they're transaction control commands
      or not.  Thus the active timeout setting is applied to each query in
      the string, separately.  This costs a few more cycles if statement_timeout
      is active, but it provides much more intuitive behavior, especially if one
      changes statement_timeout in one of the queries of the string.
      
      Also, add something to the documentation to explain all this.
      
      Per bug #16035 from Raj Mohite.  Although this is a bug fix, I'm hesitant
      to back-patch it; conceivably somebody has worked out the old behavior
      and is depending on it.  (But note that this change should make the
      behavior less restrictive in most cases, since the timeout will now
      be applied to shorter segments of code.)
      
      Discussion: https://postgr.es/m/16035-456e6e69ebfd4374@postgresql.org
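
      An illustrative multi-statement string (e.g. sent with psql -c, which
      uses a single simple-Query message) under the new behavior:

          SET statement_timeout = '1s';
          SELECT pg_sleep(2);   -- now cancelled: the new 1s setting applies
                                -- here, since the timer restarts per query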
  27. 18 Oct, 2019 1 commit
  28. 09 Oct, 2019 1 commit
  29. 07 Oct, 2019 1 commit
  30. 30 Sep, 2019 1 commit
  31. 29 Sep, 2019 1 commit
  32. 27 Sep, 2019 1 commit
  33. 13 Sep, 2019 1 commit
  34. 08 Sep, 2019 1 commit
  35. 20 Aug, 2019 1 commit
  36. 15 Aug, 2019 1 commit
  37. 05 Aug, 2019 1 commit
    • Require the schema qualification in pg_temp.type_name(arg). · ffa2d37e
      Noah Misch authored
      Commit aa27977f introduced this
      restriction for pg_temp.function_name(arg); do likewise for types
      created in temporary schemas.  Programs that this breaks should add
      "pg_temp." schema qualification or switch to arg::type_name syntax.
      Back-patch to 9.4 (all supported versions).
      
      Reviewed by Tom Lane.  Reported by Tom Lane.
      
      Security: CVE-2019-10208
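
      An illustrative adjustment, using the placeholder names from the
      commit title:

          -- before (now rejected when type_name lives in a temporary schema):
          --     SELECT type_name(arg);
          -- after: qualify the schema explicitly, or use cast syntax
          SELECT pg_temp.type_name(arg);
          SELECT arg::type_name;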