1. 12 Feb, 2021 2 commits
    • Amit Kapila's avatar
      Allow multiple xacts during table sync in logical replication. · ce0fdbfe
      Amit Kapila authored
      For the initial table data synchronization in logical replication, we use
      a single transaction to copy the entire table and then synchronize the
      position in the stream with the main apply worker.
      
      There are multiple downsides of this approach: (a) We have to perform the
      entire copy operation again if there is any error (network breakdown,
      error in the database operation, etc.) while we synchronize the WAL
      position between tablesync worker and apply worker; this will be onerous
      especially for large copies, (b) Using a single transaction in the
      synchronization-phase (where we can receive WAL from multiple
      transactions) will have the risk of exceeding the CID limit, (c) The slot
      will hold the WAL till the entire sync is complete because we never commit
      till the end.
      
      This patch solves all the above downsides by allowing multiple
      transactions during the tablesync phase. The initial copy is done in a
      single transaction and after that, we commit each transaction as we
      receive. To allow recovery after any error or crash, we use a permanent
      slot and origin to track the progress. The slot and origin will be removed
      once we finish the synchronization of the table. We also remove slot and
      origin of tablesync workers if the user performs DROP SUBSCRIPTION .. or
      ALTER SUBSCRIPTION .. REFERESH and some of the table syncs are still not
      finished.
      
      The commands ALTER SUBSCRIPTION ... REFRESH PUBLICATION and
      ALTER SUBSCRIPTION ... SET PUBLICATION ... with refresh option as true
      cannot be executed inside a transaction block because they can now drop
      the slots for which we have no provision to rollback.
      
      This will also open up the path for logical replication of 2PC
      transactions on the subscriber side. Previously, we can't do that because
      of the requirement of maintaining a single transaction in tablesync
      workers.
      
      Bump catalog version due to change of state in the catalog
      (pg_subscription_rel).
      
      Author: Peter Smith, Amit Kapila, and Takamichi Osumi
      Reviewed-by: Ajin Cherian, Petr Jelinek, Hou Zhijie and Amit Kapila
      Discussion: https://postgr.es/m/CAA4eK1KHJxaZS-fod-0fey=0tq3=Gkn4ho=8N4-5HWiCfu0H1A@mail.gmail.com
      ce0fdbfe
    • Peter Geoghegan's avatar
      Remove obsolete IndexBulkDeleteResult stats field. · 3063eb17
      Peter Geoghegan authored
      The pages_removed field is no longer used for anything.  It hasn't been
      possible for an index to physically shrink since old-style VACUUM FULL
      was removed by commit 0a469c87.
      3063eb17
  2. 11 Feb, 2021 5 commits
  3. 10 Feb, 2021 7 commits
  4. 09 Feb, 2021 4 commits
    • Peter Geoghegan's avatar
      Fix obsolete FSM remarks in nbtree README. · 31c7fb41
      Peter Geoghegan authored
      The free space map has used a dedicated relation fork rather than shared
      memory segments for over a decade.
      31c7fb41
    • Fujii Masao's avatar
      Revert "Display the time when the process started waiting for the lock, in pg_locks." · 890d2182
      Fujii Masao authored
      This reverts commit 3b733fcd.
      
      Per buildfarm members prion and rorqual.
      890d2182
    • Fujii Masao's avatar
      Display the time when the process started waiting for the lock, in pg_locks. · 3b733fcd
      Fujii Masao authored
      This commit adds new column "waitstart" into pg_locks view. This column
      reports the time when the server process started waiting for the lock
      if the lock is not held. This information is useful, for example, when
      examining the amount of time to wait on a lock by subtracting
      "waitstart" in pg_locks from the current time, and identify the lock
      that the processes are waiting for very long.
      
      This feature uses the current time obtained for the deadlock timeout
      timer as "waitstart" (i.e., the time when this process started waiting
      for the lock). Since getting the current time newly can cause overhead,
      we reuse the already-obtained time to avoid that overhead.
      
      Note that "waitstart" is updated without holding the lock table's
      partition lock, to avoid the overhead by additional lock acquisition.
      This can cause "waitstart" in pg_locks to become NULL for a very short
      period of time after the wait started even though "granted" is false.
      This is OK in practice because we can assume that users are likely to
      look at "waitstart" when waiting for the lock for a long time.
      
      Bump catalog version.
      
      Author: Atsushi Torikoshi
      Reviewed-by: Ian Lawrence Barwick, Robert Haas, Justin Pryzby, Fujii Masao
      Discussion: https://postgr.es/m/a96013dc51cdc56b2a2b84fa8a16a993@oss.nttdata.com
      3b733fcd
    • Michael Paquier's avatar
      Add option PROCESS_TOAST to VACUUM · 7cb3048f
      Michael Paquier authored
      This option controls if toast tables associated with a relation are
      vacuumed or not when running a manual VACUUM.  It was already possible
      to trigger a manual VACUUM on a toast relation without processing its
      main relation, but a manual vacuum on a main relation always forced a
      vacuum on its toast table.  This is useful in scenarios where the level
      of bloat or transaction age of the main and toast relations differs a
      lot.
      
      This option is an extension of the existing VACOPT_SKIPTOAST that was
      used by autovacuum to control if toast relations should be skipped or
      not.  This internal flag is renamed to VACOPT_PROCESS_TOAST for
      consistency with the new option.
      
      A new option switch, called --no-process-toast, is added to vacuumdb.
      
      Author: Nathan Bossart
      Reviewed-by: Kirk Jamison, Michael Paquier, Justin Pryzby
      Discussion: https://postgr.es/m/BA8951E9-1524-48C5-94AF-73B1F0D7857F@amazon.com
      7cb3048f
  5. 08 Feb, 2021 3 commits
    • Peter Geoghegan's avatar
      5fd59002
    • Tom Lane's avatar
      Fix mishandling of column-level SELECT privileges for join aliases. · c028faf2
      Tom Lane authored
      scanNSItemForColumn, expandNSItemAttrs, and ExpandSingleTable would
      pass the wrong RTE to markVarForSelectPriv when dealing with a join
      ParseNamespaceItem: they'd pass the join RTE, when what we need to
      mark is the base table that the join column came from.  The end
      result was to not fill the base table's selectedCols bitmap correctly,
      resulting in an understatement of the set of columns that are read
      by the query.  The executor would still insist on there being at
      least one selectable column; but with a correctly crafted query,
      a user having SELECT privilege on just one column of a table would
      nonetheless be allowed to read all its columns.
      
      To fix, make markRTEForSelectPriv fetch the correct RTE for itself,
      ignoring the possibly-mismatched RTE passed by the caller.  Later,
      we'll get rid of some now-unused RTE arguments, but that risks
      API breaks so we won't do it in released branches.
      
      This problem was introduced by commit 9ce77d75, so back-patch
      to v13 where that came in.  Thanks to Sven Klemm for reporting
      the problem.
      
      Security: CVE-2021-20229
      c028faf2
    • Heikki Linnakangas's avatar
      Fix permission checks on constraint violation errors on partitions. · 6214e2b2
      Heikki Linnakangas authored
      If a cross-partition UPDATE violates a constraint on the target partition,
      and the columns in the new partition are in different physical order than
      in the parent, the error message can reveal columns that the user does not
      have SELECT permission on. A similar bug was fixed earlier in commit
      804b6b6d.
      
      The cause of the bug is that the callers of the
      ExecBuildSlotValueDescription() function got confused when constructing
      the list of modified columns. If the tuple was routed from a parent, we
      converted the tuple to the parent's format, but the list of modified
      columns was grabbed directly from the child's RTE entry.
      
      ExecUpdateLockMode() had a similar issue. That lead to confusion on which
      columns are key columns, leading to wrong tuple lock being taken on tables
      referenced by foreign keys, when a row is updated with INSERT ON CONFLICT
      UPDATE. A new isolation test is added for that corner case.
      
      With this patch, the ri_RangeTableIndex field is no longer set for
      partitions that don't have an entry in the range table. Previously, it was
      set to the RTE entry of the parent relation, but that was confusing.
      
      NOTE: This modifies the ResultRelInfo struct, replacing the
      ri_PartitionRoot field with ri_RootResultRelInfo. That's a bit risky to
      backpatch, because it breaks any extensions accessing the field. The
      change that ri_RangeTableIndex is not set for partitions could potentially
      break extensions, too. The ResultRelInfos are visible to FDWs at least,
      and this patch required small changes to postgres_fdw. Nevertheless, this
      seem like the least bad option. I don't think these fields widely used in
      extensions; I don't think there are FDWs out there that uses the FDW
      "direct update" API, other than postgres_fdw. If there is, you will get a
      compilation error, so hopefully it is caught quickly.
      
      Backpatch to 11, where support for both cross-partition UPDATEs, and unique
      indexes on partitioned tables, were added.
      
      Reviewed-by: Amit Langote
      Security: CVE-2021-3393
      6214e2b2
  6. 07 Feb, 2021 4 commits
  7. 06 Feb, 2021 2 commits
    • Tom Lane's avatar
      Disallow converting an inheritance child table to a view. · dd705a03
      Tom Lane authored
      Generally, members of inheritance trees must be plain tables (or,
      in more recent versions, foreign tables).  ALTER TABLE INHERIT
      rejects creating an inheritance relationship that has a view at
      either end.  When DefineQueryRewrite attempts to convert a relation
      to a view, it already had checks prohibiting doing so for partitioning
      parents or children as well as traditional-inheritance parents ...
      but it neglected to check that a traditional-inheritance child wasn't
      being converted.  Since the planner assumes that any inheritance
      child is a table, this led to making plans that tried to do a physical
      scan on a view, causing failures (or even crashes, in recent versions).
      
      One could imagine trying to support such a case by expanding the view
      normally, but since the rewriter runs before the planner does
      inheritance expansion, it would take some very fundamental refactoring
      to make that possible.  There are probably a lot of other parts of the
      system that don't cope well with such a situation, too.  For now,
      just forbid it.
      
      Per bug #16856 from Yang Lin.  Back-patch to all supported branches.
      (In versions before v10, this includes back-patching the portion of
      commit 501ed02c that added has_superclass().  Perhaps the lack of
      that infrastructure partially explains the missing check.)
      
      Discussion: https://postgr.es/m/16856-0363e05c6e1612fd@postgresql.org
      dd705a03
    • Michael Paquier's avatar
      Clarify some comments around SharedRecoveryState in xlog.c · f7400823
      Michael Paquier authored
      SharedRecoveryState has been switched from a boolean to an enum as of
      commit 4e87c483, but some comments still referred to it as a boolean.
      
      Author: Amul Sul
      Reviewed-by: Dilip Kumar, Kyotaro Horiguchi
      Discussion: https://postgr.es/m/CAAJ_b97Hf+1SXnm8jySpO+Fhm+-VKFAAce1T_cupUYtnE3Nxig
      f7400823
  8. 05 Feb, 2021 7 commits
    • Robert Haas's avatar
      Generalize parallel slot result handling. · 418611c8
      Robert Haas authored
      Instead of having a hard-coded behavior that we ignore missing
      tables and report all other errors, let the caller decide what
      to do by setting a callback.
      
      Mark Dilger, reviewed and somewhat revised by me. The larger patch
      series of which this is a part has also had review from Peter
      Geoghegan, Andres Freund, Álvaro Herrera, Michael Paquier, and Amul
      Sul, but I don't know whether any of them have reviewed this bit
      specifically.
      
      Discussion: http://postgr.es/m/12ED3DA8-25F0-4B68-937D-D907CFBF08E7@enterprisedb.com
      Discussion: http://postgr.es/m/5F743835-3399-419C-8324-2D424237E999@enterprisedb.com
      Discussion: http://postgr.es/m/70655DF3-33CE-4527-9A4D-DDEB582B6BA0@enterprisedb.com
      418611c8
    • Robert Haas's avatar
      Move some code from src/bin/scripts to src/fe_utils to permit reuse. · e955bd4b
      Robert Haas authored
      The parallel slots infrastructure (which implements client-side
      multiplexing of server connections doing similar things, not
      threading or multiple processes or anything like that) are moved from
      src/bin/scripts/scripts_parallel.c to src/fe_utils/parallel_slot.c.
      
      The functions consumeQueryResult() and processQueryResult() which were
      previously part of src/bin/scripts/common.c are now moved into that
      file as well, becoming static helper functions. This might need to be
      changed in the future, but currently they're not used for anything
      else.
      
      Some other functions from src/bin/scripts/common.c are moved to to
      src/fe_utils and are split up among several files.  connectDatabase(),
      connectMaintenanceDatabase(), and disconnectDatabase() are moved to
      connect_utils.c.  executeQuery(), executeCommand(), and
      executeMaintenanceCommand() are move to query_utils.c.
      handle_help_version_opts() is moved to option_utils.c.
      
      Mark Dilger, reviewed by me. The larger patch series of which this is
      a part has also had review from Peter Geoghegan, Andres Freund, Álvaro
      Herrera, Michael Paquier, and Amul Sul, but I don't know whether any
      of them have reviewed this bit specifically.
      
      Discussion: http://postgr.es/m/12ED3DA8-25F0-4B68-937D-D907CFBF08E7@enterprisedb.com
      Discussion: http://postgr.es/m/5F743835-3399-419C-8324-2D424237E999@enterprisedb.com
      Discussion: http://postgr.es/m/70655DF3-33CE-4527-9A4D-DDEB582B6BA0@enterprisedb.com
      e955bd4b
    • Heikki Linnakangas's avatar
      Fix backslash-escaping multibyte chars in COPY FROM. · c444472a
      Heikki Linnakangas authored
      If a multi-byte character is escaped with a backslash in TEXT mode input,
      and the encoding is one of the client-only encodings where the bytes after
      the first one can have an ASCII byte "embedded" in the char, we didn't
      skip the character correctly. After a backslash, we only skipped the first
      byte of the next character, so if it was a multi-byte character, we would
      try to process its second byte as if it was a separate character. If it
      was one of the characters with special meaning, like '\n', '\r', or
      another '\\', that would cause trouble.
      
      One such exmple is the byte sequence '\x5ca45c2e666f6f' in Big5 encoding.
      That's supposed to be [backslash][two-byte character][.][f][o][o], but
      because the second byte of the two-byte character is 0x5c, we incorrectly
      treat it as another backslash. And because the next character is a dot, we
      parse it as end-of-copy marker, and throw an "end-of-copy marker corrupt"
      error.
      
      Backpatch to all supported versions.
      
      Reviewed-by: John Naylor, Kyotaro Horiguchi
      Discussion: https://www.postgresql.org/message-id/a897f84f-8dca-8798-3139-07da5bb38728%40iki.fi
      c444472a
    • Etsuro Fujita's avatar
      postgres_fdw: Fix assertion in estimate_path_cost_size(). · 5e7fa189
      Etsuro Fujita authored
      Commit 08d2d58a added an assertion assuming that the retrieved_rows
      estimate for a foreign relation, which is re-used to cost pre-sorted
      foreign paths with local stats, is set to at least one row in
      estimate_path_cost_size(), which isn't correct because if the relation
      is a foreign table with tuples=0, the estimate would be set to 0 there
      when not using remote estimates.
      
      Per bug #16807 from Alexander Lakhin.  Back-patch to v13 where the
      aforementioned commit went in.
      
      Author: Etsuro Fujita
      Reviewed-by: Kyotaro Horiguchi
      Discussion: https://postgr.es/m/16807-9fe4e08fbaa5c7ce%40postgresql.org
      5e7fa189
    • Tom Lane's avatar
      Fix bug in HashAgg's selective-column-spilling logic. · 0ff865fb
      Tom Lane authored
      Commit 23023022 taught nodeAgg.c that, when spilling tuples from
      memory in an oversized hash aggregation, it only needed to spill
      input columns referenced in the node's tlist and quals.  Unfortunately,
      that's wrong: we also have to save the grouping columns.  The error
      is masked in common cases because the grouping columns also appear
      in the tlist, but that's not necessarily true.  The main category
      of plans where it's not true seem to come from semijoins ("WHERE
      outercol IN (SELECT innercol FROM innertable)") where the innercol
      needs an implicit promotion to make it comparable to the outercol.
      The grouping column will be "innercol::promotedtype", but that
      expression appears nowhere in the Agg node's own tlist and quals;
      only the bare "innercol" is found in the tlist.
      
      I spent quite a bit of time looking for a suitable regression test
      case for this, without much success.  If the number of distinct
      values of the innercol is large enough to make spilling happen,
      the planner tends to prefer a non-HashAgg plan, at least for
      problem sizes that are reasonable to use in the regression tests.
      So, no new regression test.  However, this patch does demonstrably
      fix the originally-reported test case.
      
      Per report from s.p.e (at) gmx-topmail.de.  Backpatch to v13
      where the troublesome code came in.
      
      Discussion: https://postgr.es/m/trinity-1c565d44-159f-488b-a518-caf13883134f-1611835701633@3c-app-gmx-bap78
      0ff865fb
    • Thomas Munro's avatar
      Tab-complete CREATE DATABASE ... LOCALE. · e1c02d92
      Thomas Munro authored
      Author: Ian Lawrence Barwick <barwick@gmail.com>
      Discussion: https://postgr.es/m/CAB8KJ%3Dh0XO2CB4QbLBc1Tm9Bg5wzSGQtT-eunaCmrghJp4nqdA%40mail.gmail.com
      e1c02d92
    • Tom Lane's avatar
      Fix YA incremental sort bug. · 82e0e293
      Tom Lane authored
      switchToPresortedPrefixMode() did the wrong thing if it detected
      a batch boundary just at the last tuple of a fullsort group.
      
      The initially-reported symptom was a "retrieved too many tuples in a
      bounded sort" error, but the test case added here just silently gives
      the wrong answer without this patch.
      
      I (tgl) am not really happy about committing this patch without review
      from the incremental-sort authors, but they seem AWOL and we are hard
      against a release deadline.  This does demonstrably make some cases
      better, anyway.
      
      Per bug #16846 from Yoran Heling.  Back-patch to v13 where incremental
      sort was introduced.
      
      Neil Chen
      
      Discussion: https://postgr.es/m/16846-ae49f51ac379a4cb@postgresql.org
      82e0e293
  9. 04 Feb, 2021 6 commits
    • Peter Geoghegan's avatar
      Harden nbtree page deletion. · c34787f9
      Peter Geoghegan authored
      Add some additional defensive checks in the second phase of index
      deletion to detect and report index corruption during VACUUM, and to
      avoid having VACUUM become stuck in more cases.  The code is still not
      robust in the presence of a circular chain of sibling links, though it's
      not clear whether that really matters.  This is follow-up work to commit
      3a01f68e.
      
      The new defensive checks rely on the assumption that there can be no
      more than one VACUUM operation running for an index at any given time.
      Remove an old comment suggesting that multiple concurrent VACUUMs need
      to be considered here.  This concern now seems highly unlikely to have
      any real validity, since we clearly rely on the same assumption in
      several other places.  For example, there are much more recent comments
      that appear in the same function (added by commit efada2b8) that make
      the same assumption.
      
      Also add a CHECK_FOR_INTERRUPTS() to the relevant code path.  Contrary
      to comments added by commit 3a01f68e, it is actually possible to handle
      interrupts here, at least in the common case where processing takes
      place at the leaf level.  We only hold a pin on leafbuf/target page when
      stepping right at the leaf level.
      
      No backpatch due to the lack of complaints following hardening added to
      the same area by commit 3a01f68e.
      c34787f9
    • Heikki Linnakangas's avatar
      Fix small error in COPY FROM progress reporting. · 2f86ab30
      Heikki Linnakangas authored
      The # of bytes processed was accumulated slightly incorrectly. After
      loading more data to the input buffer, we added the number of bytes in
      the buffer to the sum. But in case of multi-byte characters or escapes,
      there can be a few unprocessed bytes left over from previous load in the
      buffer. Those bytes got counted twice.
      2f86ab30
    • Peter Eisentraut's avatar
      Refactor Windows error message for easier translation · 3c78e056
      Peter Eisentraut authored
      In the error messages referring to the user right "Lock pages in
      memory", this is a term from the Windows OS, so it should be
      translated in accordance with the OS localization.  Refactor the error
      messages so this is easier and clearer.  Also fix the capitalization
      to match the existing capitalization in the OS.
      3c78e056
    • Michael Paquier's avatar
      Ensure unlinking of old index file with REINDEX (TABLESPACE) · 5128483d
      Michael Paquier authored
      The original versions of the patch included this part, but a mismerge
      from my side has made this piece go missing.  Oversight in c5b28604.
      5128483d
    • Michael Paquier's avatar
      Clarify comment in tablesync.c · fc749bc7
      Michael Paquier authored
      Author: Peter Smith
      Reviewed-by: Amit Kapila, Michael Paquier, Euler Taveira
      Discussion: https://postgr.es/m/CAHut+Pt9_T6pWar0FLtPsygNmme8HPWPdGUyZ_8mE1Yvjdf0ZA@mail.gmail.com
      fc749bc7
    • Michael Paquier's avatar
      Add TABLESPACE option to REINDEX · c5b28604
      Michael Paquier authored
      This patch adds the possibility to move indexes to a new tablespace
      while rebuilding them.  Both the concurrent and the non-concurrent cases
      are supported, and the following set of restrictions apply:
      - When using TABLESPACE with a REINDEX command that targets a
      partitioned table or index, all the indexes of the leaf partitions are
      moved to the new tablespace.  The tablespace references of the non-leaf,
      partitioned tables in pg_class.reltablespace are not changed. This
      requires an extra ALTER TABLE SET TABLESPACE.
      - Any index on a toast table rebuilt as part of a parent table is kept
      in its original tablespace.
      - The operation is forbidden on system catalogs, including trying to
      directly move a toast relation with REINDEX.  This results in an error
      if doing REINDEX on a single object.  REINDEX SCHEMA, DATABASE and
      SYSTEM skip system relations when TABLESPACE is used.
      
      Author: Alexey Kondratov, Michael Paquier, Justin Pryzby
      Reviewed-by: Álvaro Herrera, Michael Paquier
      Discussion: https://postgr.es/m/8a8f5f73-00d3-55f8-7583-1375ca8f6a91@postgrespro.ru
      c5b28604