  1. 22 Feb, 2022 1 commit
    • Add compute_query_id = regress · 627c79a1
      Michael Paquier authored
      "regress" is a new mode added to compute_query_id aimed at facilitating
      regression testing when a module computing query IDs is loaded into the
      backend, like pg_stat_statements.  It works the same way as "auto",
      meaning that query IDs are computed if a module enables it, except that
      query IDs are hidden in EXPLAIN outputs to ensure regression output
      stability.
      
      Like any GUCs of the kind (force_parallel_mode, etc.), this new
      configuration can be added to an instance's postgresql.conf, or just
      passed down with PGOPTIONS at command level.  compute_query_id uses an
      enum for its set of option values, meaning that this addition ensures
      ABI compatibility.
      
      Using this new configuration mode allows installcheck-world to pass when
      running the tests on an instance with pg_stat_statements enabled,
      stabilizing the test output while checking the paths doing query ID
      computations.
      
      Reported-by: Anton Melnikov
      Reviewed-by: Julien Rouhaud
      Discussion: https://postgr.es/m/1634283396.372373993@f75.i.mail.ru
      Discussion: https://postgr.es/m/YgHlxgc/OimuPYhH@paquier.xyz
      Backpatch-through: 14
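      A minimal sketch of the PGOPTIONS route described in this commit
      message: the setting is injected into the environment of a test run.
      The helper name regress_env is hypothetical; only compute_query_id,
      "regress" and PGOPTIONS come from the message itself.

```python
import os


def regress_env(base_env=None):
    """Return a copy of the environment with compute_query_id=regress in PGOPTIONS.

    regress_env is a hypothetical helper: it only edits a dict and does not
    start or contact any PostgreSQL process.
    """
    env = dict(os.environ if base_env is None else base_env)
    option = "-c compute_query_id=regress"
    existing = env.get("PGOPTIONS", "").strip()
    # Append rather than overwrite, so options already set by the caller survive.
    env["PGOPTIONS"] = f"{existing} {option}".strip()
    return env


if __name__ == "__main__":
    print(regress_env({})["PGOPTIONS"])
```

      The returned dict could then be passed as env= to subprocess.run()
      when launching a test target such as installcheck-world.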
  2. 21 Feb, 2022 1 commit
    • Fix temporary object cleanup failing due to toast access without snapshot. · 7bbfe599
      Andres Freund authored
      When cleaning up temporary objects during process exit the cleanup could fail
      with:
        FATAL: cannot fetch toast data without an active snapshot
      
      The bug is caused by RemoveTempRelationsCallback() not setting up a
      snapshot. If an object with toasted catalog data needs to be cleaned up,
      init_toast_snapshot() could fail with the above error.
      
      Most of the time, however, the problem is masked due to cached catalog
      snapshots being returned by GetOldestSnapshot(). But dropping an object can
      cause catalog invalidations to be emitted. If no further catalog accesses are
      necessary between the invalidation processing and the next toast datum
      deletion, the bug becomes visible.
      
      It's easy to miss this bug because it typically happens after clients
      disconnect and the FATAL error just ends up in the log.
      
      Luckily temporary table cleanup at the next use of the same temporary schema
      or during DISCARD ALL does not have the same problem.
      
      Fix the bug by pushing a snapshot in RemoveTempRelationsCallback(). Also add
      isolation tests for temporary object cleanup, including objects with toasted
      catalog data.
      
      A future HEAD only commit will add more assertions.
      
      Reported-By: Miles Delahunty
      Author: Andres Freund
      Discussion: https://postgr.es/m/CAOFAq3BU5Mf2TTvu8D9n_ZOoFAeQswuzk7yziAb7xuw_qyw5gw@mail.gmail.com
      Backpatch: 10-
  3. 20 Feb, 2022 2 commits
  4. 18 Feb, 2022 1 commit
    • Suppress warning about stack_base_ptr with late-model GCC. · 2e30d77a
      Tom Lane authored
      GCC 12 complains that set_stack_base is storing the address of
      a local variable in a long-lived pointer.  This is an entirely
      reasonable warning (indeed, it just helped us find a bug);
      but that behavior is intentional here.  We can work around it
      by using __builtin_frame_address(0) instead of a specific local
      variable; that produces an address a dozen or so bytes different,
      in my testing, but we don't care about such a small difference.
      Maybe someday a compiler lacking that function will start to issue
      a similar warning, but we'll worry about that when it happens.
      
      Patch by me, per a suggestion from Andres Freund.  Back-patch to
      v12, which is as far back as the patch will go without some pain.
      (Recently-established project policy would permit a back-patch as
      far as 9.2, but I'm disinclined to expend the work until GCC 12
      is much more widespread.)
      
      Discussion: https://postgr.es/m/3773792.1645141467@sss.pgh.pa.us
  5. 16 Feb, 2022 1 commit
  6. 14 Feb, 2022 2 commits
  7. 12 Feb, 2022 1 commit
    • Fix thinko in PQisBusy(). · ae27b1ac
      Tom Lane authored
      In commit 1f39a1c0 I made PQisBusy consider conn->write_failed, but
      that is now looking like complete brain fade.  In the first place, the
      logic is quite wrong: it ought to be like "and not" rather than "or".
      This meant that once we'd gotten into a write_failed state, PQisBusy
      would always return true, probably causing the calling application to
      iterate its loop until PQconsumeInput returns a hard failure thanks
      to connection loss.  That's not what we want: the intended behavior
      is to return an error PGresult, which the application probably has
      much cleaner support for.
      
      But in the second place, checking write_failed here seems like the
      wrong thing anyway.  The idea of the write_failed mechanism is to
      postpone handling of a write failure until we've read all we can from
      the server; so that flag should not interfere with input-processing
      behavior.  (Compare 7247e243.)  What we *should* check for is
      status = CONNECTION_BAD, ie, socket already closed.  (Most places that
      close the socket don't touch asyncStatus, but they do reset status.)
      This primarily ensures that if PQisBusy() returns true then there is
      an open socket, which is assumed by several call sites in our own
      code, and probably other applications too.
      
      While at it, fix a nearby thinko in libpq's my_sock_write: we should
      only consult errno for res < 0, not res == 0.  This is harmless since
      pqsecure_raw_write would force errno to zero in such a case, but it
      still could confuse readers.
      
      Noted by Andres Freund.  Backpatch to v12 where 1f39a1c0 came in.
      
      Discussion: https://postgr.es/m/20220211011025.ek7exh6owpzjyudn@alap3.anarazel.de
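      The difference between the "or" logic and the intended "and not"
      logic can be sketched with a small model. This is a simplified
      illustration, not libpq source: the Conn class and both function
      names are hypothetical, and input parsing is left out.

```python
from dataclasses import dataclass


@dataclass
class Conn:
    """Hypothetical stand-in for the three pieces of connection state discussed."""
    async_busy: bool      # models asyncStatus indicating a query in progress
    write_failed: bool    # models conn->write_failed
    connection_bad: bool  # models status == CONNECTION_BAD (socket closed)


def is_busy_or(conn):
    # The criticized logic: once write_failed is set, this is always true.
    return conn.async_busy or conn.write_failed


def is_busy_and_not(conn):
    # The described fix: busy only while a query is in progress and the
    # socket is still open; write_failed no longer takes part.
    return conn.async_busy and not conn.connection_bad


if __name__ == "__main__":
    stuck = Conn(async_busy=False, write_failed=True, connection_bad=False)
    print(is_busy_or(stuck), is_busy_and_not(stuck))
```

      In this model, a connection with write_failed set but no query in
      progress reports busy forever under the "or" form and not busy under
      the "and not" form, which also guarantees an open socket whenever it
      returns true.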
  8. 11 Feb, 2022 1 commit
    • Don't use_physical_tlist for an IOS with non-returnable columns. · 277e744a
      Tom Lane authored
      createplan.c tries to save a runtime projection step by specifying
      a scan plan node's output as being exactly the table's columns, or
      index's columns in the case of an index-only scan, if there is not a
      reason to do otherwise.  This logic did not previously pay attention
      to whether an index's columns are returnable.  That worked, sort of
      accidentally, until commit 9a3ddeb51 taught setrefs.c to reject plans
      that try to read a non-returnable column.  I have no desire to loosen
      setrefs.c's new check, so instead adjust use_physical_tlist() to not
      try to optimize this way when there are non-returnable column(s).
      
      Per report from Ryan Kelly.  Like the previous patch, back-patch
      to all supported branches.
      
      Discussion: https://postgr.es/m/CAHUie24ddN+pDNw7fkhNrjrwAX=fXXfGZZEHhRuofV_N_ftaSg@mail.gmail.com
  9. 10 Feb, 2022 5 commits
    • Make pg_ctl stop/restart/promote recheck postmaster aliveness. · 1e8c5cf7
      Tom Lane authored
      "pg_ctl stop/restart" checked that the postmaster PID is valid just
      once, as a side-effect of sending the stop signal, and then would
      wait-till-timeout for the postmaster.pid file to go away.  This
      neglects the case wherein the postmaster dies uncleanly after we
      signal it.  Similarly, once "pg_ctl promote" has sent the signal,
      it'd wait for the corresponding on-disk state change to occur
      even if the postmaster dies.
      
      I'm not sure how we've managed not to notice this problem, but it
      seems to explain slow execution of the 017_shm.pl test script on AIX
      since commit 4fdbf9af5, which added a speculative "pg_ctl stop" with
      the idea of making real sure that the postmaster isn't there.  In the
      test steps that kill-9 and then restart the postmaster, it's possible
      to get past the initial signal attempt before kill() stops working
      for the doomed postmaster.  If that happens, pg_ctl waited till
      PGCTLTIMEOUT before giving up ... and the buildfarm's AIX members
      have that set very high.
      
      To fix, include a "kill(pid, 0)" test (similar to what
      postmaster_is_alive uses) in these wait loops, so that we'll
      give up immediately if the postmaster PID disappears.
      
      While here, I chose to refactor those loops out of where they were.
      do_stop() and do_restart() can perfectly well share one copy of the
      wait-for-stop loop, and it seems desirable to put a similar function
      beside that for wait-for-promote.
      
      Back-patch to all supported versions, since pg_ctl's wait logic
      is substantially identical in all, and we're seeing the slow test
      behavior in all branches.
      
      Discussion: https://postgr.es/m/20220210023537.GA3222837@rfd.leadboat.com
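      The wait loop with a "kill(pid, 0)" probe described in this commit
      message might look roughly like the sketch below. It is a POSIX-only
      illustration, not pg_ctl's code: the function names, the return
      labels and the injected pid_file_exists callable are all assumptions
      made for the example.

```python
import os
import time


def process_alive(pid):
    """POSIX-only probe: signal 0 checks for existence without delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def wait_for_stop(pid, pid_file_exists, alive=process_alive,
                  timeout_s=60.0, poll_s=0.1):
    """Wait for shutdown; return 'stopped', 'died' or 'timeout' (hypothetical labels)."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if not pid_file_exists():
            return "stopped"   # clean shutdown: the PID file went away
        if not alive(pid):
            return "died"      # give up at once instead of waiting for the timeout
        time.sleep(poll_s)
    return "timeout"
```

      Passing the aliveness and PID-file checks in as callables keeps the
      loop testable without a running server.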
    • Use gendef instead of pexports for building windows .def files · 92f60f53
      Andrew Dunstan authored
      Modern msys systems lack pexports but have gendef instead, so use that.
      
      Discussion: https://postgr.es/m/3ccde7a9-e4f9-e194-30e0-0936e6ad68ba@dunslane.net
      
      Backpatch to release 9.4 to enable building with perl on older branches.
      Before that pexports is not used for plperl.
    • Make timeout.c more robust against missed timer interrupts. · 2e211c16
      Tom Lane authored
      Commit 09cf1d52 taught schedule_alarm() to not do anything if
      the next requested event is after when we expect the next interrupt
      to fire.  However, if somehow an interrupt gets lost, we'll continue
      to not do anything indefinitely, even after the "next interrupt" time
      is obviously in the past.  Thus, one missed interrupt can break
      timeout scheduling for the life of the session.  Michael Harris
      reported a scenario where a bug in a user-defined function caused this
      to happen, so you don't even need to assume kernel bugs exist to think
      this is worth fixing.  We can make things more robust at little cost
      by detecting the case where signal_due_at is before "now" and forcing
      a new setitimer call to occur.  This isn't a completely bulletproof
      fix of course; but in our typical usage pattern where we frequently set
      timeouts and clear them before they are reached, the interrupt will
      get re-enabled after at most one timeout interval, which with a little
      luck will be before we really need it.
      
      While here, let's mark signal_due_at as volatile, since the signal
      handler can both examine and set it.  I'm not sure there's any
      actual risk given that signal_pending is already volatile, but
      it's surely questionable.
      
      Backpatch to v14 where this logic came in.
      
      Michael Harris and Tom Lane
      
      Discussion: https://postgr.es/m/CADofcAWbMrvgwSMqO4iG_iD3E2v8ZUrC-_crB41my=VMM02-CA@mail.gmail.com
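      The scheduling decision described in this commit message can be
      modelled as a pure function. This is a sketch under assumptions, not
      timeout.c itself: needs_setitimer and its parameters are hypothetical
      names, times are plain numbers, and the robust flag exists only to
      contrast the old and new behavior.

```python
def needs_setitimer(now, next_event_at, signal_pending, signal_due_at, robust=True):
    """Return True if the timer should be (re)armed for next_event_at."""
    if signal_pending and next_event_at >= signal_due_at:
        # An interrupt is already expected no later than the requested event.
        if robust and signal_due_at < now:
            # Its due time is already in the past, so it was probably lost:
            # force a new timer request instead of waiting indefinitely.
            return True
        return False
    return True


if __name__ == "__main__":
    # A lost interrupt: due at t=50, but it is now t=100 and nothing fired.
    print(needs_setitimer(100, 200, True, 50, robust=False))
    print(needs_setitimer(100, 200, True, 50, robust=True))
```

      Without the extra check, the lost-interrupt case keeps answering
      "no timer call needed" for the rest of the session; with it, the next
      scheduling attempt re-arms the timer.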
    • Set SNI ClientHello extension to localhost in tests · 5f00ef06
      Daniel Gustafsson authored
      The connection strings in the SSL client tests were using the host
      set up from Cluster.pm which is a temporary pathname. When SNI is
      enabled we pass the host to OpenSSL in order to set the server name
      indication ClientHello extension via SSL_set_tlsext_host_name.
      
      OpenSSL doesn't validate the hostname apart from checking the max
      length, but LibreSSL checks for RFC 5890 conformance which results
      in errors during testing as the pathname from Cluster.pm is not a
      valid hostname.
      
      Fix by setting the host explicitly to localhost, as that's closer
      to the intent of the test.
      
      Backpatch through 14 where SNI support came in.
      Reported-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
      Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
      Discussion: https://postgr.es/m/17391-304f81bcf724b58b@postgresql.org
      Backpatch-through: 14
    • Use Test::Builder::todo_start(), replacing $::TODO. · 1a83297d
      Noah Misch authored
      Some pre-2017 Test::More versions need perfect $Test::Builder::Level
      maintenance to find the variable.  Buildfarm member snapper reported an
      overall failure that the file intended to hide via the TODO construct.
      That trouble was reachable in v11 and v10.  For later branches, this
      serves as defense in depth.  Back-patch to v10 (all supported versions).
      
      Discussion: https://postgr.es/m/20220202055556.GB2745933@rfd.leadboat.com
  10. 09 Feb, 2022 2 commits
  11. 07 Feb, 2022 2 commits
  12. 06 Feb, 2022 1 commit
  13. 05 Feb, 2022 2 commits
    • Doc: be clearer that foreign-table partitions need user-added constraints. · d0cd7b77
      Tom Lane authored
      A very well-informed user might deduce this from what we said already,
      but I'd bet against it.  Lay it out explicitly.
      
      While here, rewrite the comment about tuple routing to be more
      intelligible to an average SQL user.
      
      Per bug #17395 from Alexander Lakhin.  Back-patch to v11.  (The text
      in this area is different in v10 and I'm not sufficiently excited
      about this point to adapt the patch.)
      
      Discussion: https://postgr.es/m/17395-8c326292078d1a57@postgresql.org
    • Test, don't just Assert, that mergejoin's inputs are in order. · d13a838e
      Tom Lane authored
      There are two Asserts in nodeMergejoin.c that are reachable if
      the input data is not in the expected order.  This seems way too
      fragile.  Alexander Lakhin reported a case where the assertions
      could be triggered with misconfigured foreign-table partitions,
      and bitter experience with unstable operating system collation
      definitions suggests another easy route to hitting them.  Neither
      Assert is in a place where we can't afford one more test-and-branch,
      so replace 'em with plain test-and-elog logic.
      
      Per bug #17395.  While the reported symptom is relatively recent,
      collation changes could happen anytime, so back-patch to all
      supported branches.
      
      Discussion: https://postgr.es/m/17395-8c326292078d1a57@postgresql.org
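      The idea of replacing an assertion with an always-on check can be
      shown with a toy merge join. This is an analogy in Python, not
      nodeMergejoin.c: the function names and the error wording are
      assumptions, and keys are assumed to be non-None comparable values.
      Python's assert statement is stripped under "python -O", much as an
      Assert vanishes from a non-assert build, so the check is written as
      an explicit test-and-raise.

```python
def checked_sorted(rows, label):
    """Yield rows unchanged, raising if a key is smaller than its predecessor."""
    previous = None
    for row in rows:
        if previous is not None and row < previous:
            # Explicit test-and-raise: unlike `assert`, this survives `python -O`.
            raise ValueError(f"mergejoin input data is out of order ({label})")
        previous = row
        yield row


def merge_join_keys(outer, inner):
    """Return the keys present in both ascending inputs (duplicates collapsed)."""
    matches = []
    outer_iter = checked_sorted(outer, "outer")
    inner_iter = checked_sorted(inner, "inner")
    o = next(outer_iter, None)
    i = next(inner_iter, None)
    while o is not None and i is not None:
        if o < i:
            o = next(outer_iter, None)
        elif o > i:
            i = next(inner_iter, None)
        else:
            if not matches or matches[-1] != o:
                matches.append(o)
            o = next(outer_iter, None)
            i = next(inner_iter, None)
    return matches
```

      With unsorted input the join fails with a clear error rather than
      silently producing wrong matches, at the cost of one comparison per
      row.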
  14. 04 Feb, 2022 1 commit
    • First-draft release notes for 14.2. · ab22eea8
      Tom Lane authored
      As usual, the release notes for older branches will be made by cutting
      these down, but put them up for community review first.
  15. 03 Feb, 2022 3 commits
  16. 02 Feb, 2022 2 commits
    • doc: Fix mistake in PL/Python documentation · ee57467c
      Peter Eisentraut authored
      Small thinko introduced by 94aceed3
      
      Reported-by: nassehk@gmail.com
    • Replace use of deprecated Python module distutils.sysconfig, take 2. · 803f0b17
      Tom Lane authored
      With Python 3.10, configure spits out warnings about the module
      distutils.sysconfig being deprecated and scheduled for removal in
      Python 3.12.  Change the uses in configure to use the module sysconfig
      instead.  The logic stays largely the same, although we have to
      rely on INCLUDEPY instead of the deprecated get_python_inc function.
      
      Note that sysconfig exists since Python 2.7, so this moves the
      minimum required version up from Python 2.6 (or 2.4, before v13).
      Also, sysconfig didn't exist in Python 3.1, so the minimum 3.x
      version is now 3.2.
      
      Back-patch of commit bd233bdd8 into all supported branches.
      
      In v10, this also includes back-patching v11's beff4bb9, primarily
      because this opinion is clearly out-of-date:
      
          While at it, get rid of the code's assumption that both the major and
          minor numbers contain exactly one digit.  That will foreseeably be
          broken by Python 3.10 in perhaps four or five years.  That's far enough
          out that we probably don't need to back-patch this.
      
      Peter Eisentraut, Tom Lane, Andres Freund
      
      Discussion: https://postgr.es/m/c74add3c-09c4-a9dd-1a03-a846e5b2fc52@enterprisedb.com
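      For illustration, the INCLUDEPY lookup mentioned in this commit
      message can be done with the standard sysconfig module as follows.
      The wrapper name python_include_dir is hypothetical, and this is not
      the text of the configure probe itself.

```python
import sysconfig


def python_include_dir():
    """Return the C header directory via sysconfig, not distutils.sysconfig."""
    # INCLUDEPY is the config variable the commit message says configure
    # now relies on in place of the deprecated get_python_inc function.
    return sysconfig.get_config_var("INCLUDEPY")


if __name__ == "__main__":
    print(python_include_dir())
```

      The value depends on the local Python installation, so no particular
      output is shown here.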
  17. 31 Jan, 2022 4 commits
  18. 29 Jan, 2022 2 commits
    • Fix failure to validate the result of select_common_type(). · c025067f
      Tom Lane authored
      Although select_common_type() has a failure-return convention, an
      apparent successful return just provides a type OID that *might* work
      as a common supertype; we've not validated that the required casts
      actually exist.  In the mainstream use-cases that doesn't matter,
      because we'll proceed to invoke coerce_to_common_type() on each input,
      which will fail appropriately if the proposed common type doesn't
      actually work.  However, a few callers didn't read the (nonexistent)
      fine print, and thought that if they got back a nonzero OID then the
      coercions were sure to work.
      
      This affects in particular the recently-added "anycompatible"
      polymorphic types; we might think that a function/operator using
      such types matches cases it really doesn't.  A likely end result
      of that is unexpected "ambiguous operator" errors, as for example
      in bug #17387 from James Inform.  Another, much older, case is that
      the parser might try to transform an "x IN (list)" construct to
      a ScalarArrayOpExpr even when the list elements don't actually have
      a common supertype.
      
      It doesn't seem desirable to add more checking to select_common_type
      itself, as that'd just slow down the mainstream use-cases.  Instead,
      write a separate function verify_common_type that performs the
      missing checks, and add a call to that where necessary.  Likewise add
      verify_common_type_from_oids to go with select_common_type_from_oids.
      
      Back-patch to v13 where the "anycompatible" types came in.  (The
      symptom complained of in bug #17387 doesn't appear till v14, but
      that's just because we didn't get around to converting || to use
      anycompatible till then.)  In principle the "x IN (list)" fix could
      go back all the way, but I'm not currently convinced that it makes
      much difference in real-world cases, so I won't bother for now.
      
      Discussion: https://postgr.es/m/17387-5dfe54b988444963@postgresql.org
    • Fix incorrect memory context switch in COPY TO execution · b30282fc
      Michael Paquier authored
      c532d15d has split the logic of COPY commands into multiple files, one
      change being to move the internals of BeginCopy() to BeginCopyTo().
      Originally the code was written so as we'd switch back-and-forth between
      the current execution memory context and the dedicated memory context
      for the COPY command, and this refactoring has introduced an extra
      switch to the current memory context from the COPY context once
      BeginCopyTo() is done with the past logic coming from BeginCopy().
      
      The code was correctly doing the analyze, rewrite and planning phases in
      the COPY context, but it was not assigning "copy_file" (FILE* used when
      copying to a source file) and "filename" in the COPY context, making the
      COPY status data inconsistent.
      
      Author: Bharath Rupireddy
      Reviewed-by: Japin Li
      Discussion: https://postgr.es/m/CALj2ACWvVa69foi9jhHFY=2BuHxAoYboyE+vXQTARwxZcJnVrQ@mail.gmail.com
      Backpatch-through: 14
  19. 28 Jan, 2022 2 commits
  20. 27 Jan, 2022 4 commits
    • Fix ordering of XIDs in ProcArrayApplyRecoveryInfo · fb2f8e53
      Tomas Vondra authored
      Commit 8431e296 reworked ProcArrayApplyRecoveryInfo to sort XIDs
      before adding them to KnownAssignedXids. But the XIDs are sorted using
      xidComparator, which compares the XIDs simply as uint32 values, not
      logically. KnownAssignedXidsAdd() however expects XIDs in logical order,
      and calls TransactionIdFollowsOrEquals() to enforce that. If there are
      XIDs for which the two orderings disagree, an error is raised and the
      recovery fails/restarts.
      
      Hitting this issue is fairly easy - you just need two transactions, one
      started before the 4B limit (e.g. XID 4294967290), the other sometime
      after it (e.g. XID 1000). Logically (4294967290 <= 1000) but when
      compared using xidComparator we try to add them in the opposite order.
      Which makes KnownAssignedXidsAdd() fail with an error like this:
      
        ERROR: out-of-order XID insertion in KnownAssignedXids
      
      This only happens during replica startup, while processing RUNNING_XACTS
      records to build the snapshot. Once we reach STANDBY_SNAPSHOT_READY, we
      skip these records. So this does not affect already running replicas,
      but if you restart (or create) a replica while there are transactions
      with XIDs for which the two orderings disagree, you may hit this.
      
      Long-running transactions and frequent replica restarts increase the
      likelihood of hitting this issue. Once the replica gets into this state,
      it can't be started (even if the old transactions are terminated).
      
      Fixed by sorting the XIDs logically - this is fine because we're dealing
      with normal XIDs (because it's XIDs assigned to backends) and from the
      same wraparound epoch (otherwise the backends could not be running at
      the same time on the primary node). So there are no problems with the
      triangle inequality, which is why xidComparator compares raw values.
      
      Investigation and root cause analysis by Abhijit Menon-Sen. Patch by me.
      
      This issue is present in all releases since 9.4, however releases up to
      9.6 are EOL already so backpatch to 10 only.
      
      Reviewed-by: Abhijit Menon-Sen
      Reviewed-by: Alvaro Herrera
      Backpatch-through: 10
      Discussion: https://postgr.es/m/36b8a501-5d73-277c-4972-f58a4dce088a%40enterprisedb.com
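      The disagreement between raw uint32 ordering and logical XID ordering
      can be reproduced with the two example XIDs from this commit message.
      The sketch below models the logical comparison as a signed 32-bit
      difference and applies to normal XIDs only; the function names are
      hypothetical and special XIDs are ignored.

```python
import functools


def xid_precedes(a, b):
    """Logical comparison of two normal XIDs: is a before b, modulo 2**32?"""
    diff = (a - b) & 0xFFFFFFFF
    # Read the 32-bit difference as signed: a "negative" value means a precedes b.
    return diff >= 0x80000000


def logical_cmp(a, b):
    """Three-way comparator built on xid_precedes, for use with sorted()."""
    if a == b:
        return 0
    return -1 if xid_precedes(a, b) else 1


if __name__ == "__main__":
    xids = [1000, 4294967290]
    print(sorted(xids))                                          # raw uint32 order
    print(sorted(xids, key=functools.cmp_to_key(logical_cmp)))   # logical order
```

      Raw sorting puts 1000 first, while the logical order puts 4294967290
      first, which is the mismatch that made the insertion check fail.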
    • Improve msys2 detection for TAP tests · 999dc1d2
      Andrew Dunstan authored
      Perl instances on some msys toolchains (e.g. UCRT64) have their
      configured osname set to 'MSWin32' rather than 'msys'.  The test for
      the msys2 platform is adjusted accordingly.
      
      Backpatch to release 14.
    • postgres_fdw: Fix handling of a pending asynchronous request in postgresReScanForeignScan(). · d1cca944
      Etsuro Fujita authored
      Commit 27e1f145 failed to process a pending asynchronous request made
      for a given ForeignScan node in postgresReScanForeignScan() (if any) in
      cases where we would only reset the next_tuple counter in that function,
      contradicting the assumption that there should be no pending
      asynchronous requests that have been made for async-capable subplans for
      the parent Append node after ReScan.  This led to an assert failure in
      an assert-enabled build.  I think this would also lead to mis-rewinding
      the cursor in that function in the case where we have already fetched
      one batch for the ForeignScan node and the asynchronous request has been
      made for the second batch, because even in that case we would just reset
      the counter when called from that function, so we would fail to execute
      MOVE BACKWARD ALL.
      
      To fix, modify that function to process the asynchronous request before
      restarting the scan.
      
      While at it, add a comment to a function to match other places.
      
      Per bug #17344 from Alexander Lakhin.  Back-patch to v14 where the
      aforesaid commit came in.
      
      Patch by me.  Test case by Alexander Lakhin, adjusted by me.  Reviewed
      and tested by Alexander Lakhin and Dmitry Dolgov.
      
      Discussion: https://postgr.es/m/17344-226b78b00de73a7e@postgresql.org
    • On sparc64+ext4, suppress test failures from known WAL read failure. · d94a95cc
      Noah Misch authored
      Buildfarm members kittiwake, tadarida and snapper began to fail
      frequently when commits 3cd9c3b921977272e6650a5efbeade4203c4bca2 and
      f47ed79cc8a0cfa154dc7f01faaf59822552363f added tests of concurrency, but
      the problem was reachable before those commits.  Back-patch to v10 (all
      supported versions).
      
      Discussion: https://postgr.es/m/20220116210241.GC756210@rfd.leadboat.com