1. 15 Aug, 2020 1 commit
    • Be more careful about the shape of hashable subplan clauses. · 1e7629d2
      Tom Lane authored
      nodeSubplan.c expects that the testexpr for a hashable ANY SubPlan
      has the form of one or more OpExprs whose LHS is an expression of the
      outer query's, while the RHS is an expression over Params representing
      output columns of the subquery.  However, the planner only went as far
      as verifying that the clauses were all binary OpExprs.  This works
      99.99% of the time, because the clauses have the right shape when
      emitted by the parser --- but it's possible for function inlining to
      break that, as reported by PegoraroF10.  To fix, teach the planner
      to check that the LHS and RHS contain the right things, or more
      accurately don't contain the wrong things.  Given that this has been
      broken for years without anyone noticing, it seems sufficient to just
      give up hashing when it happens, rather than go to the trouble of
      commuting the clauses back again (which wouldn't necessarily work
      anyway).
      
      While poking at that, I also noticed that nodeSubplan.c had a baked-in
      assumption that the number of hash clauses is identical to the number
      of subquery output columns.  Again, that's fine as far as parser output
      goes, but it's not hard to break it via function inlining.  There seems
      little reason for that assumption though --- AFAICS, the only thing
      it's buying us is not having to store the number of hash clauses
      explicitly.  Adding code to the planner to reject such cases would take
      more code than getting nodeSubplan.c to cope, so I fixed it that way.
      
      This has been broken for as long as we've had hashable SubPlans,
      so back-patch to all supported branches.
      
      Discussion: https://postgr.es/m/1549209182255-0.post@n3.nabble.com
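      As a rough illustration of the shape check described in commit
      1e7629d2, the sketch below uses a toy expression tree.  The node kinds
      and the names contains_kind() and clause_is_hashable() are hypothetical
      and are not the planner's actual data structures; the point is only
      that a clause is rejected when its LHS or RHS contains the wrong thing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified expression node; not PostgreSQL's Node types. */
typedef enum
{
    EXPR_OUTER_VAR,         /* column of the outer query */
    EXPR_SUBQUERY_PARAM,    /* Param standing for a subquery output column */
    EXPR_CONST,
    EXPR_OP                 /* binary operator */
} ExprKind;

typedef struct Expr
{
    ExprKind    kind;
    const struct Expr *left;    /* used only by EXPR_OP */
    const struct Expr *right;   /* used only by EXPR_OP */
} Expr;

/* Does the tree rooted at e contain any node of the given kind? */
static bool
contains_kind(const Expr *e, ExprKind kind)
{
    if (e == NULL)
        return false;
    if (e->kind == kind)
        return true;
    return contains_kind(e->left, kind) || contains_kind(e->right, kind);
}

/*
 * Treat a clause as hashable only if it is a binary operator whose LHS has
 * no subquery-output Params and whose RHS has no outer-query Vars.
 */
static bool
clause_is_hashable(const Expr *clause)
{
    assert(clause != NULL);
    if (clause->kind != EXPR_OP ||
        clause->left == NULL || clause->right == NULL)
        return false;
    if (contains_kind(clause->left, EXPR_SUBQUERY_PARAM))
        return false;
    if (contains_kind(clause->right, EXPR_OUTER_VAR))
        return false;
    return true;
}
```

      In this sketch a clause whose sides were swapped (for example by
      function inlining) is simply reported as not hashable, mirroring the
      commit's choice to give up hashing rather than commute the clause.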
  2. 14 Aug, 2020 9 commits
    • snapshot scalability: Move subxact info to ProcGlobal, remove PGXACT. · 73487a60
      Andres Freund authored
      Similar to the previous changes this increases the chance that data
      frequently needed by GetSnapshotData() stays in l2 cache. In many
      workloads subtransactions are very rare, and this makes the check for
      that considerably cheaper.
      
      As this removes the last member of PGXACT, there is no need to keep it
      around anymore.
      
      On a larger 2 socket machine this and the two preceding commits result
      in a ~1.07x performance increase in read-only pgbench. For read-heavy
      mixed r/w workloads without row level contention, I see about 1.1x.
      
      Author: Andres Freund <andres@anarazel.de>
      Reviewed-By: Robert Haas <robertmhaas@gmail.com>
      Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
      Reviewed-By: David Rowley <dgrowleyml@gmail.com>
      Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
    • snapshot scalability: Move PGXACT->vacuumFlags to ProcGlobal->vacuumFlags. · 5788e258
      Andres Freund authored
      Similar to the previous commit this increases the chance that data
      frequently needed by GetSnapshotData() stays in l2 cache. As we now
      take care to not unnecessarily write to ProcGlobal->vacuumFlags, there
      should be very few modifications to the ProcGlobal->vacuumFlags array.
      
      Author: Andres Freund <andres@anarazel.de>
      Reviewed-By: Robert Haas <robertmhaas@gmail.com>
      Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
      Reviewed-By: David Rowley <dgrowleyml@gmail.com>
      Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
    • snapshot scalability: Introduce dense array of in-progress xids. · 941697c3
      Andres Freund authored
      The new array contains the xids for all connected backends / in-use
      PGPROC entries in a dense manner (in contrast to the PGPROC/PGXACT
      arrays which can have unused entries interspersed).
      
      This improves performance because GetSnapshotData() always needs to
      scan the xids of all live procarray entries and now there's no need to
      go through the procArray->pgprocnos indirection anymore.
      
      As the set of running top-level xids changes rarely, compared to the
      number of snapshots taken, this substantially increases the likelihood
      of most data required for a snapshot being in l2 cache.  In
      read-mostly workloads scanning the xids[] array will be sufficient to
      build a snapshot, as most backends will not have an xid assigned.
      
      To keep the xid array dense ProcArrayRemove() needs to move entries
      behind the to-be-removed proc's one further up in the array. Obviously
      moving array entries cannot happen while a backend sets its
      xid, i.e. locking needs to prevent array entries from being moved while
      a backend modifies its xid.
      
      To avoid locking ProcArrayLock in GetNewTransactionId() - a fairly hot
      spot already - ProcArrayAdd() / ProcArrayRemove() now needs to hold
      XidGenLock in addition to ProcArrayLock. Adding / Removing a procarray
      entry is not a very frequent operation, even taking 2PC into account.
      
      Due to the above, the dense array entries can only be read or modified
      while holding ProcArrayLock and/or XidGenLock. This prevents a
      concurrent ProcArrayRemove() from shifting the dense array while it is
      accessed concurrently.
      
      While the new dense array is very good when needing to look at all
      xids it is less suitable when accessing a single backend's xid. In
      particular it would be problematic to have to acquire a lock to access
      a backend's own xid. Therefore a backend's xid is not just stored in
      the dense array, but also in PGPROC. This also allows a backend to
      only access the shared xid value when the backend had acquired an
      xid.
      
      The infrastructure added in this commit will be used for the remaining
      PGXACT fields in subsequent commits. They are kept separate to make
      review easier.
      
      Author: Andres Freund <andres@anarazel.de>
      Reviewed-By: Robert Haas <robertmhaas@gmail.com>
      Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
      Reviewed-By: David Rowley <dgrowleyml@gmail.com>
      Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
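      The dense-array idea from commit 941697c3 can be sketched as follows.
      This is a toy model under stated assumptions: the DenseXids type, the
      MAX_PROCS limit and the dense_*() names are hypothetical, an xid of 0
      stands for "no xid assigned", and all of the locking the commit
      message discusses is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_PROCS 8

/* Hypothetical stand-in for the dense xid array; locking is omitted. */
typedef struct DenseXids
{
    uint32_t    xids[MAX_PROCS];    /* 0 means "no xid assigned" */
    int         nprocs;             /* live entries, stored contiguously */
} DenseXids;

/* Append an entry for a new backend and return its index. */
static int
dense_add(DenseXids *d, uint32_t xid)
{
    assert(d->nprocs < MAX_PROCS);
    d->xids[d->nprocs] = xid;
    return d->nprocs++;
}

/* Remove the entry at idx, shifting later entries to keep the array dense. */
static void
dense_remove(DenseXids *d, int idx)
{
    assert(idx >= 0 && idx < d->nprocs);
    memmove(&d->xids[idx], &d->xids[idx + 1],
            (size_t) (d->nprocs - idx - 1) * sizeof(uint32_t));
    d->nprocs--;
}

/* Collect assigned xids with one linear scan, as a snapshot builder would. */
static int
dense_collect(const DenseXids *d, uint32_t *out)
{
    int         n = 0;

    for (int i = 0; i < d->nprocs; i++)
    {
        if (d->xids[i] != 0)
            out[n++] = d->xids[i];
    }
    return n;
}
```

      Because dense_remove() shifts entries, an index into the array is only
      stable while removals are excluded, which is why the commit requires a
      lock to be held whenever the dense entries are read or modified.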
    • pg_dump: fix dependencies on FKs to partitioned tables · 2ba5b2db
      Alvaro Herrera authored
      Parallel-restoring a foreign key that references a partitioned table
      with several levels of partitions can fail:
      
      pg_restore: while PROCESSING TOC:
      pg_restore: from TOC entry 6684; 2606 29166 FK CONSTRAINT fk fk_a_fkey postgres
      pg_restore: error: could not execute query: ERROR:  there is no unique constraint matching given keys for referenced table "pk"
      Command was: ALTER TABLE fkpart3.fk
          ADD CONSTRAINT fk_a_fkey FOREIGN KEY (a) REFERENCES fkpart3.pk(a);
      
      This happens in parallel restore mode because some index partitions
      aren't yet attached to the topmost partitioned index that the FK uses,
      and so the index is still invalid.  The current code marks the FK as
      dependent on the first level of index-attach dump objects; the bug is
      fixed by recursively marking the FK on their children.
      
      Backpatch to 12, where FKs to partitioned tables were introduced.
      Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
      Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
      Discussion: https://postgr.es/m/3170626.1594842723@sss.pgh.pa.us
      Backpatch: 12-master
    • Fix obsolete comment in xlogutils.c. · 914140e8
      Peter Geoghegan authored
      Oversight in commit 2c03216d.
    • Fix postmaster's behavior during smart shutdown. · 0038f943
      Tom Lane authored
      Up to now, upon receipt of a SIGTERM ("smart shutdown" command), the
      postmaster has immediately killed all "optional" background processes,
      and subsequently refused to launch new ones while it's waiting for
      foreground client processes to exit.  No doubt this seemed like an OK
      policy at some point; but it's a pretty bad one now, because it makes
      for a seriously degraded environment for the remaining clients:
      
      * Parallel queries are killed, and new ones fail to launch. (And our
      parallel-query infrastructure utterly fails to deal with the case
      in a reasonable way --- it just hangs waiting for workers that are
      not going to arrive.  There is more work needed in that area IMO.)
      
      * Autovacuum ceases to function.  We can tolerate that for a while,
      but if bulk-update queries continue to run in the surviving client
      sessions, there's eventually going to be a mess.  In the worst case
      the system could reach a forced shutdown to prevent XID wraparound.
      
      * The bgwriter and walwriter are also stopped immediately, likely
      resulting in performance degradation.
      
      Hence, let's rearrange things so that the only immediate change in
      behavior is refusing to let in new normal connections.  Once the last
      normal connection is gone, shut everything down as though we'd received
      a "fast" shutdown.  To implement this, remove the PM_WAIT_BACKUP and
      PM_WAIT_READONLY states, instead staying in PM_RUN or PM_HOT_STANDBY
      while normal connections remain.  A subsidiary state variable tracks
      whether or not we're letting in new connections in those states.
      
      This also allows having just one copy of the logic for killing child
      processes in smart and fast shutdown modes.  I moved that logic into
      PostmasterStateMachine() by inventing a new state PM_STOP_BACKENDS.
      
      Back-patch to 9.6 where parallel query was added.  In principle
      this'd be a good idea in 9.5 as well, but the risk/reward ratio
      is not as good there, since lack of autovacuum is not a problem
      during typical uses of smart shutdown.
      
      Per report from Bharath Rupireddy.
      
      Patch by me, reviewed by Thomas Munro
      
      Discussion: https://postgr.es/m/CALj2ACXAZ5vKxT9P7P89D87i3MDO9bfS+_bjMHgnWJs8uwUOOw@mail.gmail.com
    • Fix typo in test comment. · 5bdf6945
      Heikki Linnakangas authored
    • Fix compilation warnings with libselinux 3.1 in contrib/sepgsql/ · 1f32136a
      Michael Paquier authored
      Upstream SELinux has recently marked security_context_t as officially
      deprecated, causing warnings with -Wdeprecated-declarations.  Upstream
      has treated it as legacy for some time, as security_context_t was
      removed from most of the code tree during the development of 2.3 back
      in 2014.
      
      This removes all the references to security_context_t in sepgsql/ to be
      consistent with SELinux, fixing the warnings.  Note that this does not
      impact the minimum version of libselinux supported.
      
      Reviewed-by: Tom Lane
      Discussion: https://postgr.es/m/20200813012735.GC11663@paquier.xyz
    • Doc: improve examples for json_populate_record() and related functions. · a9306f10
      Tom Lane authored
      Make these examples self-contained by providing declarations of the
      user-defined row types they rely on.  There wasn't room to do this
      in the old doc format, but now there is, and I think it makes the
      examples a good bit less confusing.
  3. 13 Aug, 2020 3 commits
  4. 12 Aug, 2020 4 commits
    • snapshot scalability: Don't compute global horizons while building snapshots. · dc7420c2
      Andres Freund authored
      To make GetSnapshotData() more scalable, it cannot look at each proc's
      xmin: While snapshot contents do not need to change whenever a read-only
      transaction commits or a snapshot is released, a proc's xmin is modified in
      those cases. The frequency of xmin modifications leads to, particularly on
      higher core count systems, many cache misses inside GetSnapshotData(), despite
      the data underlying a snapshot not changing. That is the most
      significant source of GetSnapshotData() scaling poorly on larger systems.
      
      Without accessing xmins, GetSnapshotData() cannot calculate accurate horizons /
      thresholds as it has so far. But we don't really have to: The horizons don't
      actually change that much between GetSnapshotData() calls. Nor are the horizons
      actually used every time a snapshot is built.
      
      The trick this commit introduces is to delay computation of accurate horizons
      until their use, and to use horizon boundaries to determine whether accurate
      horizons need to be computed.
      
      The use of RecentGlobal[Data]Xmin to decide whether a row version could be
      removed has been replaced with new GlobalVisTest* functions.  These use two
      thresholds to determine whether a row can be pruned:
      1) definitely_needed, indicating that rows deleted by XIDs >= definitely_needed
         are definitely still visible.
      2) maybe_needed, indicating that rows deleted by XIDs < maybe_needed can
         definitely be removed
      GetSnapshotData() updates definitely_needed to be the xmin of the computed
      snapshot.
      
      When testing whether a row can be removed (with GlobalVisTestIsRemovableXid())
      and the tested XID falls in between the two (i.e. XID >= maybe_needed && XID <
      definitely_needed) the boundaries can be recomputed to be more accurate. As it
      is not cheap to compute accurate boundaries, we limit the number of times that
      happens in short succession.  As the boundaries used by
      GlobalVisTestIsRemovableXid() are never reset (with maybe_needed updated by
      GetSnapshotData()), it is likely that further tests can benefit from an earlier
      computation of accurate horizons.
      
      To avoid regressing performance when old_snapshot_threshold is set (as that
      requires an accurate horizon to be computed), heap_page_prune_opt() doesn't
      unconditionally call TransactionIdLimitedForOldSnapshots() anymore. Both the
      computation of the limited horizon, and the triggering of errors (with
      SetOldSnapshotThresholdTimestamp()) is now only done when necessary to remove
      tuples.
      
      This commit just removes the accesses to PGXACT->xmin from
      GetSnapshotData(), but other members of PGXACT residing in the same
      cache line are accessed. Therefore this in itself does not result in a
      significant improvement. Subsequent commits will take advantage of the
      fact that GetSnapshotData() now does not need to access xmins anymore.
      
      Note: This contains a workaround in heap_page_prune_opt() to keep the
      snapshot_too_old tests working. While that workaround is ugly, the tests
      currently are not meaningful, and it seems best to address them separately.
      
      Author: Andres Freund <andres@anarazel.de>
      Reviewed-By: Robert Haas <robertmhaas@gmail.com>
      Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
      Reviewed-By: David Rowley <dgrowleyml@gmail.com>
      Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
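      The two-threshold test described in commit dc7420c2 might be sketched
      like this.  The VisState type and vis_is_removable() are hypothetical
      names, xids are modelled as plain 64-bit integers, and the expensive
      horizon computation is replaced by a caller-supplied accurate_horizon
      value.  The rate limiting of recomputations mentioned in the commit
      message is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two-threshold visibility state. */
typedef struct VisState
{
    uint64_t    maybe_needed;       /* xids below this are removable */
    uint64_t    definitely_needed;  /* xids at or above this are needed */
    int         recomputations;     /* how often the slow path ran */
} VisState;

/*
 * Decide whether rows deleted by xid can be removed.  accurate_horizon is
 * an assumption of this sketch: it stands for the result of the expensive
 * accurate horizon computation, which is only consulted when xid falls
 * between the two thresholds.
 */
static bool
vis_is_removable(VisState *s, uint64_t xid, uint64_t accurate_horizon)
{
    assert(s->maybe_needed <= s->definitely_needed);

    if (xid < s->maybe_needed)
        return true;            /* cheap answer: definitely removable */
    if (xid >= s->definitely_needed)
        return false;           /* cheap answer: definitely still needed */

    /* Uncertain range: pay for an accurate horizon, then retest. */
    s->recomputations++;
    if (accurate_horizon > s->maybe_needed)
        s->maybe_needed = accurate_horizon;
    if (s->maybe_needed > s->definitely_needed)
        s->definitely_needed = s->maybe_needed;
    return xid < s->maybe_needed;
}
```

      Since the raised maybe_needed value is kept, a later test with a
      nearby xid can be answered on the cheap path without recomputing.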
    • BRIN: Handle concurrent desummarization properly · 1f42d35a
      Alvaro Herrera authored
      If a page range is desummarized at just the right time concurrently with
      an index walk, BRIN would raise an error indicating index corruption.
      This is scary and unhelpful; silently returning that the page range is
      not summarized is sufficient reaction.
      
      This bug was introduced by commit 975ad4e6 as additional protection
      against a bug whose actual fix was elsewhere.  Backpatch equally.
      Reported-By: Anastasia Lubennikova <a.lubennikova@postgrespro.ru>
      Diagnosed-By: Alexander Lakhin <exclusion@gmail.com>
      Discussion: https://postgr.es/m/2588667e-d07d-7e10-74e2-7e1e46194491@postgrespro.ru
      Backpatch: 9.5 - master
    • Improve comments for postmaster.c's BackendList. · 3546cf8a
      Tom Lane authored
      This had gotten a little disjointed over time, and some of the grammar
      was sloppy.  Rewrite for more clarity.
      
      In passing, re-pgindent some recently added comments.
      
      No code changes.
    • Track latest completed xid as a FullTransactionId. · 3bd7f996
      Andres Freund authored
      The reason for doing so is that a subsequent commit will need that to
      avoid wraparound issues. As the subsequent change is large this was
      split out for easier review.
      
      The reason this is not a perfectly straightforward change is that we do
      not want to track 64-bit xids in the procarray or the WAL. Therefore we
      need to advance latestCompletedXid in relation to 32-bit xids. The
      code for that is now centralized in MaintainLatestCompletedXid*.
      
      Author: Andres Freund
      Reviewed-By: Thomas Munro, Robert Haas, David Rowley
      Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
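      One way to advance a 64-bit value from 32-bit xids, as commit 3bd7f996
      describes, is to interpret each 32-bit xid relative to the current
      64-bit value.  The sketch below assumes the 32-bit xid lies within 2^31
      of the reference; the function names are hypothetical, and special
      xids below 3 are excluded by assumption.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Widen a 32-bit xid to 64 bits by taking its signed distance from the low
 * 32 bits of a 64-bit reference value.  The unsigned-to-signed conversion
 * relies on the usual two's-complement behavior of common compilers.
 */
static uint64_t
full_xid_relative_to(uint64_t rel, uint32_t xid)
{
    uint32_t    rel_xid = (uint32_t) rel;
    int32_t     delta = (int32_t) (xid - rel_xid);

    return rel + (uint64_t) (int64_t) delta;
}

/* Advance the latest completed 64-bit xid if the given xid is newer. */
static uint64_t
advance_latest_completed(uint64_t latest_completed, uint32_t xid)
{
    uint64_t    candidate;

    assert(xid >= 3);           /* sketch assumption: normal xids only */
    candidate = full_xid_relative_to(latest_completed, xid);
    return candidate > latest_completed ? candidate : latest_completed;
}
```

      With a reference of epoch 1, xid 5, the 32-bit xid 4294967290 is read
      as belonging to the previous epoch rather than as a far-future value,
      so it does not advance the latest completed xid.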
  5. 11 Aug, 2020 2 commits
  6. 10 Aug, 2020 5 commits
    • Replace remaining StrNCpy() by strlcpy() · 1784f278
      Peter Eisentraut authored
      They are equivalent, except that StrNCpy() zero-fills the entire
      destination buffer instead of providing just one trailing zero.  For
      all but a tiny number of callers, that's just overhead rather than
      being desirable.
      
      Remove StrNCpy() as it is now unused.
      
      In some cases, namestrcpy() is the more appropriate function to use.
      While we're here, simplify the API of namestrcpy(): Remove the return
      value, don't check for NULL input.  Nothing was using that anyway.
      Also, remove a few unused name-related functions.
      Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
      Discussion: https://www.postgresql.org/message-id/flat/44f5e198-36f6-6cdb-7fa9-60e34784daae%402ndquadrant.com
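      The difference that commit 1784f278 points out can be seen in a small
      sketch.  Both helpers are hypothetical stand-ins written for
      illustration: copy_zero_fill() behaves like a strncpy()-based copy that
      pads the whole buffer, and copy_one_terminator() mimics strlcpy()
      semantics portably, since strlcpy() is not part of ISO C.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* StrNCpy-like: strncpy() pads the rest of dst with zero bytes. */
static void
copy_zero_fill(char *dst, const char *src, size_t len)
{
    assert(dst != NULL && src != NULL);
    if (len > 0)
    {
        strncpy(dst, src, len);
        dst[len - 1] = '\0';
    }
}

/* strlcpy-like: copy what fits, write one terminator, return strlen(src). */
static size_t
copy_one_terminator(char *dst, const char *src, size_t len)
{
    size_t      srclen = strlen(src);

    assert(dst != NULL);
    if (len > 0)
    {
        size_t      n = (srclen < len - 1) ? srclen : len - 1;

        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return srclen;
}
```

      Both produce the same C string; only the bytes after the terminator
      differ, which matters solely to callers that rely on a fully zeroed
      buffer.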
    • Document clashes between logical replication and untrusted users. · cec57b1a
      Noah Misch authored
      Back-patch to v10, which introduced logical replication.
      
      Security: CVE-2020-14349
    • Empty search_path in logical replication apply worker and walsender. · 11da9702
      Noah Misch authored
      This is like CVE-2018-1058 commit 582edc36.  Today, a malicious user
      of a publisher or subscriber database can invoke arbitrary SQL functions
      under an identity running replication, often a superuser.  This fix may
      cause "does not exist" or "no schema has been selected to create in"
      errors in a replication process.  After upgrading, consider watching
      server logs for these errors.  Objects accruing schema qualification in
      the wake of the earlier commit are unlikely to need further correction.
      Back-patch to v10, which introduced logical replication.
      
      Security: CVE-2020-14349
    • Move connect.h from fe_utils to src/include/common. · e078fb5d
      Noah Misch authored
      Any libpq client can use the header.  Clients include backend components
      postgres_fdw, dblink, and logical replication apply worker.  Back-patch
      to v10, because another fix needs this.  In released branches, just copy
      the header and keep the original.
    • Make contrib modules' installation scripts more secure. · 7eeb1d98
      Tom Lane authored
      Hostile objects located within the installation-time search_path could
      capture references in an extension's installation or upgrade script.
      If the extension is being installed with superuser privileges, this
      opens the door to privilege escalation.  While such hazards have existed
      all along, their urgency increases with the v13 "trusted extensions"
      feature, because that lets a non-superuser control the installation path
      for a superuser-privileged script.  Therefore, make a number of changes
      to make such situations more secure:
      
      * Tweak the construction of the installation-time search_path to ensure
      that references to objects in pg_catalog can't be subverted; and
      explicitly add pg_temp to the end of the path to prevent attacks using
      temporary objects.
      
      * Disable check_function_bodies within installation/upgrade scripts,
      so that any security gaps in SQL-language or PL-language function bodies
      cannot create a risk of unwanted installation-time code execution.
      
      * Adjust lookup of type input/receive functions and join estimator
      functions to complain if there are multiple candidate functions.  This
      prevents capture of references to functions whose signature is not the
      first one checked; and it's arguably more user-friendly anyway.
      
      * Modify various contrib upgrade scripts to ensure that catalog
      modification queries are executed with secure search paths.  (These
      are in-place modifications with no extension version changes, since
      it is the update process itself that is at issue, not the end result.)
      
      Extensions that depend on other extensions cannot be made fully secure
      by these methods alone; therefore, revert the "trusted" marking that
      commit eb67623c applied to earthdistance and hstore_plperl, pending
      some better solution to that set of issues.
      
      Also add documentation around these issues, to help extension authors
      write secure installation scripts.
      
      Patch by me, following an observation by Andres Freund; thanks
      to Noah Misch for review.
      
      Security: CVE-2020-14350
  7. 09 Aug, 2020 3 commits
    • Correct nbtree page split lock coupling comment. · d129c074
      Peter Geoghegan authored
      There is no reason to distinguish between readers and writers here.
    • Check for fseeko() failure in pg_dump's _tarAddFile(). · 1b9cde51
      Tom Lane authored
      Coverity pointed out, not unreasonably, that we checked fseeko's
      result at every other call site but these.  Failure to seek in the
      temp file (note this is NOT pg_dump's output file) seems quite
      unlikely, and even if it did happen the file length cross-check
      further down would probably detect the problem.  Still, that's a
      poor excuse for not checking the result of a system call.
    • Remove useless Assert. · 1c164ef3
      Tom Lane authored
      Testing that an unsigned variable is >= 0 is pretty pointless,
      as noted by Coverity and numerous buildfarm members.
      
      In passing, add comment about new uses of "volatile" --- Coverity
      doesn't much like that either, but it seems probably necessary.
  8. 08 Aug, 2020 6 commits
    • Remove <@ from contrib/intarray's GiST operator classes. · 20e7e1fe
      Tom Lane authored
      Since commit efc77cf5, an indexed query using <@ has required a
      full-index scan, so that it actually performs worse than a plain seqscan
      would do.  As I noted at the time, we'd be better off to not treat <@ as
      being indexable by such indexes at all; and that's what this patch does.
      
      It would have been difficult to remove these opclass members without
      dropping the whole opclass before commit 9f968278 fixed GiST opclass
      member dependency rules, but now it's quite simple, so let's do it.
      
      I left the existing support code in place for the time being, with
      comments noting it's now unreachable.  At some point, perhaps we should
      remove that code in favor of throwing an error telling people to upgrade
      the extension version.
      
      Discussion: https://postgr.es/m/2176979.1596389859@sss.pgh.pa.us
      Discussion: https://postgr.es/m/458.1565114141@sss.pgh.pa.us
    • Teach amcheck to verify sibling links in all cases. · 39132b78
      Peter Geoghegan authored
      Teach contrib/amcheck's bt_index_check() function to check agreement
      between sibling links.  The left sibling's right link should point to a
      right sibling page whose left link points back to the same original left
      sibling.  This extends a check that bt_index_parent_check() always
      performed to bt_index_check().
      
      This is the first time amcheck has been taught to perform buffer lock
      coupling, which we have explicitly avoided up until now.  The sibling
      link check tends to catch a lot of real world index corruption with
      little overhead, so it seems worth accepting the complexity.  Note that
      the new lock coupling logic would not work correctly on replica servers
      without the changes made by commits 0a7d771f and 9a9db08a (there could
      be false positives without those changes).
      
      Author: Andrey Borodin, Peter Geoghegan
      Discussion: https://postgr.es/m/0EB0CFA8-CBD8-4296-8049-A2C0F28FAE8C@yandex-team.ru
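      The sibling-link invariant checked by commit 39132b78 can be stated in
      a few lines.  This sketch works on a hypothetical in-memory array of
      ToyPage entries rather than real B-tree buffers, and it leaves out the
      buffer lock coupling that the commit adds to make the check safe
      against concurrent page splits.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_PAGE (-1)

/* Hypothetical in-memory stand-in for one level of B-tree pages. */
typedef struct ToyPage
{
    int         left;       /* block number of left sibling, or NO_PAGE */
    int         right;      /* block number of right sibling, or NO_PAGE */
} ToyPage;

/* True if the right sibling of page blkno (if any) points back to blkno. */
static bool
sibling_links_agree(const ToyPage *pages, int npages, int blkno)
{
    int         right;

    assert(blkno >= 0 && blkno < npages);
    right = pages[blkno].right;
    if (right == NO_PAGE)
        return true;            /* rightmost page: nothing to check */
    if (right < 0 || right >= npages)
        return false;           /* link points outside the level */
    return pages[right].left == blkno;
}
```

      A level passes when the function returns true for every page on it; a
      single false result indicates that a left and right link disagree.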
    • walsnd: Don't set waiting_for_ping_response spuriously · 470687b4
      Alvaro Herrera authored
      Ashutosh Bapat noticed that when logical walsender needs to wait for
      WAL, and it realizes that it must send a keepalive message to
      walreceiver to update the sent-LSN, which *does not* request a reply
      from walreceiver, it wrongly sets the flag that it's going to wait for
      that reply.  That means that any future would-be sender of feedback
      messages ends up not sending a feedback message, because they all
      believe that a reply is expected.
      
      With built-in logical replication there's not much harm in this, because
      WalReceiverMain will send a ping-back every wal_receiver_timeout/2
      anyway; but with other logical replication systems (e.g. pglogical) it
      can cause significant pain.
      
      This problem was introduced in commit 41d5f8ad, where the
      request-reply flag passed to WalSndKeepalive was changed from true to
      false without at the same time removing the line that sets
      waiting_for_ping_response.
      
      Just removing that line would be a sufficient fix, but it seems better
      to shift the responsibility of setting the flag to WalSndKeepalive
      itself instead of requiring the caller to do it; this is clearly less
      error-prone.
      
      Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
      Reported-by: Ashutosh Bapat <ashutosh.bapat@2ndquadrant.com>
      Backpatch: 9.5 and up
      Discussion: https://postgr.es/m/20200806225558.GA22401@alvherre.pgsql
    • Fix the logical streaming test. · 82a0ba77
      Amit Kapila authored
      Commit 7259736a added the capability to stream changes in ReorderBuffer,
      along with some tests of the streaming mode. It is quite possible that
      while this test is running a parallel transaction could be logged by
      autovacuum. Such a transaction won't perform any insert/update/delete to
      non-catalog tables so will be shown as an empty transaction. Fix it by
      skipping the empty transactions during this test.
      
      Per report by buildfarm.
    • Add some const decorations · a13421c9
      Peter Eisentraut authored
    • Implement streaming mode in ReorderBuffer. · 7259736a
      Amit Kapila authored
      Instead of serializing the transaction to disk after reaching the
      logical_decoding_work_mem limit in memory, we consume the changes we have
      in memory and invoke stream API methods added by commit 45fdc973.
      However, sometimes if we have incomplete toast or speculative insert we
      spill to the disk because we can't generate the complete tuple and stream.
      And, as soon as we get the complete tuple we stream the transaction
      including the serialized changes.
      
      We can do this incremental processing thanks to having assignments
      (associating subxact with toplevel xacts) in WAL right away, and
      thanks to logging the invalidation messages at each command end. These
      features are added by commits 0bead9af and c55040cc respectively.
      
      Now that we can stream in-progress transactions, the concurrent aborts
      may cause failures when the output plugin consults catalogs (both system
      and user-defined).
      
      We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
      sqlerrcode from system table scan APIs to the backend or WALSender
      decoding a specific uncommitted transaction. The decoding logic on the
      receipt of such a sqlerrcode aborts the decoding of the current
      transaction and continues with the decoding of other transactions.
      
      We have a ReorderBufferTXN pointer in each ReorderBufferChange by which we
      know which xact it belongs to.  The output plugin can use this to decide
      which changes to discard in case of stream_abort_cb (e.g. when a subxact
      gets discarded).
      
      We also provide a new option via SQL APIs to fetch the changes being
      streamed.
      
      Author: Dilip Kumar, Tomas Vondra, Amit Kapila, Nikhil Sontakke
      Reviewed-by: Amit Kapila, Kuntal Ghosh, Ajin Cherian
      Tested-by: Neha Sharma, Mahendra Singh Thalor and Ajin Cherian
      Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
  9. 07 Aug, 2020 5 commits
    • Make nbtree split REDO locking match original execution. · 0a7d771f
      Peter Geoghegan authored
      Make the nbtree page split REDO routine consistent with original
      execution in its approach to acquiring and releasing buffer locks (at
      least for pages on the tree level of the page being split).  This brings
      btree_xlog_split() in line with btree_xlog_unlink_page(), which was
      taught to couple buffer locks by commit 9a9db08a.
      
      Note that the precise order in which we both acquire and release sibling
      buffer locks in btree_xlog_split() now matches original execution
      exactly (the precise order in which the locks are released probably
      doesn't matter much, but we might as well be consistent about it).
      
      The rule for nbtree REDO routines from here on is that same-level locks
      should be acquired in an order that's consistent with original
      execution.  It's not practical to have a similar rule for cross-level
      page locks, since for the most part original execution holds those locks
      for a period that spans multiple atomic actions/WAL records.  It's also
      not necessary, because cross-level lock coupling is only truly needed
      during original execution, where concurrent inserters are present.
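
The same-level rule above can be pictured with a toy model that records the order in which page locks are taken and compares the REDO sequence with the original one. Everything here is hypothetical: `DemoLockLog`, `demo_original_split`, and `demo_redo_split` are invented names, and the particular order shown (original page, new right page, old right sibling) is an assumption made for the example rather than a statement about the nbtree code.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_LOCKS 8

/* Hypothetical log of block numbers, in lock-acquisition order. */
typedef struct DemoLockLog
{
    int    blknos[DEMO_MAX_LOCKS];
    size_t nlocks;
} DemoLockLog;

/* Record that a lock on blkno was acquired. */
static void
demo_acquire(DemoLockLog *log, int blkno)
{
    assert(log->nlocks < DEMO_MAX_LOCKS);
    log->blknos[log->nlocks++] = blkno;
}

/* Assumed order during original execution of a split. */
static void
demo_original_split(DemoLockLog *log, int origpage, int newright, int oldright)
{
    demo_acquire(log, origpage);
    demo_acquire(log, newright);
    demo_acquire(log, oldright);
}

/* REDO deliberately mirrors the original sequence for same-level pages. */
static void
demo_redo_split(DemoLockLog *log, int origpage, int newright, int oldright)
{
    demo_acquire(log, origpage);
    demo_acquire(log, newright);
    demo_acquire(log, oldright);
}

/* Return 1 if both logs acquired the same locks in the same order. */
static int
demo_same_lock_order(const DemoLockLog *a, const DemoLockLog *b)
{
    if (a->nlocks != b->nlocks)
        return 0;
    for (size_t i = 0; i < a->nlocks; i++)
    {
        if (a->blknos[i] != b->blknos[i])
            return 0;
    }
    return 1;
}
```

A check of this kind is only meaningful for pages on the same tree level, since the message above explains why no comparable rule applies to cross-level locks.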
      
      This is not a bug fix (unlike the similar aforementioned commit, commit
      9a9db08a).  The immediate reason to tighten things up in this area is to
      enable an upcoming enhancement to contrib/amcheck that allows it to
      verify that sibling links are in agreement with only an AccessShareLock
      (this check produced false positives when run on a replica server on
      account of the inconsistency fixed by this commit).  But that's not the
      only reason to be stricter here.
      
      It is generally useful to make locking on replicas be as close to what
      happens during original execution as practically possible.  It makes it
      less likely that hard-to-catch bugs will slip in later.  The
      previous state of affairs seems to be a holdover from before the
      introduction of Hot Standby, when buffer lock acquisitions during
      recovery were totally unnecessary.  See also: commit 3bbf668d, which
      tightened things up in this area a few years after the introduction of
      Hot Standby.
      
      Discussion: https://postgr.es/m/CAH2-Wz=465cJj11YXD9RKH8z=nhQa2dofOZ_23h67EXUGOJ00Q@mail.gmail.com
      0a7d771f
    • Remove PROC_IN_ANALYZE and derived flags · cea3d558
      Alvaro Herrera authored
      These flags are unused and always have been.
      
      Discussion: https://postgr.es/m/20200805235549.GA8118@alvherre.pgsql
      cea3d558
    • Support testing of cases where table schemas change after planning. · 6f0b632f
      Tom Lane authored
      We have various cases where we allow DDL on tables to be performed with
      less than full AccessExclusiveLock.  This requires concurrent queries
      to be able to cope with the DDL change mid-flight, but up to now we had
      no repeatable way to test such cases.  To improve that, invent a test
      module that allows halting a backend after planning and then resuming
      execution once we've done desired actions in another session.  (The same
      approach could be used to inject delays in other places, if there's a
      suitable hook available.)
      
      This commit includes a single test case, which is meant to exercise the
      previously-untestable ExecCreatePartitionPruneState code repaired by
      commit 7a980dfc.  We'd probably not bother with this if that were the
      only foreseen benefit, but I expect additional test cases will use this
      infrastructure in the future.
      
      Test module by Andy Fan, partition-addition test case by me.
      
      Discussion: https://postgr.es/m/20200802181131.GA27754@telsasoft.com
      6f0b632f
    • Rename nbtree split REDO routine variables. · 3df92bbd
      Peter Geoghegan authored
      Make the nbtree page split REDO routine variable names consistent with
      _bt_split() (which handles the original execution of page splits).
      These names make the code easier to follow by making the distinction
      between the original page and the left half of the split clear.  (The
      left half of the split page is a temp page that REDO creates to replace
      the origpage contents.)
      
      Also reduce the elevel used when adding a new high key to the temp page
      from PANIC to ERROR, for consistency.  We already raise only an ERROR
      when the PageAddItem() calls that add data items to the temp page fail.
      3df92bbd
    • Fix yet another issue with step generation in partition pruning. · 199cec97
      Etsuro Fujita authored
      Commit 13838740 fixed some issues with step generation in partition
      pruning, but there was yet another one: get_steps_using_prefix() assumes
      that clauses in the passed-in prefix list are sorted in ascending order
      of their partition key numbers, but the caller failed to ensure this for
      range partitioning, which led to an assertion failure in debug builds.
      Adjust the caller function to arrange the clauses in the prefix list in
      the required order for range partitioning.
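
The precondition described above (prefix clauses in ascending order of partition key number) can be sketched as follows. This is only one way to establish such an ordering, using a plain sort, and is not claimed to be how the commit adjusts the caller. `DemoPartClause`, `demo_sort_prefix`, and `demo_prefix_is_sorted` are hypothetical names introduced for the example.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for a partition clause. */
typedef struct DemoPartClause
{
    int keyno;      /* partition key number the clause refers to */
    int clause_id;  /* placeholder identifying the clause */
} DemoPartClause;

/*
 * Order by keyno; break ties on clause_id so the result is
 * deterministic even though qsort() is not a stable sort.
 */
static int
demo_cmp_keyno(const void *a, const void *b)
{
    const DemoPartClause *ca = a;
    const DemoPartClause *cb = b;

    if (ca->keyno != cb->keyno)
        return (ca->keyno > cb->keyno) - (ca->keyno < cb->keyno);
    return (ca->clause_id > cb->clause_id) - (ca->clause_id < cb->clause_id);
}

/* Arrange the prefix clauses in ascending order of key number. */
static void
demo_sort_prefix(DemoPartClause *clauses, size_t n)
{
    qsort(clauses, n, sizeof(DemoPartClause), demo_cmp_keyno);
}

/* Return 1 if the key numbers are in ascending (non-decreasing) order. */
static int
demo_prefix_is_sorted(const DemoPartClause *clauses, size_t n)
{
    for (size_t i = 1; i < n; i++)
    {
        if (clauses[i - 1].keyno > clauses[i].keyno)
            return 0;
    }
    return 1;
}
```

A checker like `demo_prefix_is_sorted` plays the role that the assertion mentioned above plays in debug builds: it makes the callee's assumption explicit so that a caller violating it is caught early.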
      
      Back-patch to v11, like the previous commit.
      
      Patch by me, reviewed by Amit Langote.
      
      Discussion: https://postgr.es/m/CAPmGK16jkXiFG0YqMbU66wte-oJTfW6D1HaNvQf%3D%2B5o9%3Dm55wQ%40mail.gmail.com
      199cec97
  10. 06 Aug, 2020 2 commits