1. 14 Aug, 2017 5 commits
    • Handle elog(FATAL) during ROLLBACK more robustly. · 5b6289c1
      Tom Lane authored
      Stress testing by Andreas Seltenreich disclosed longstanding problems that
      occur if a FATAL exit (e.g. due to receipt of SIGTERM) occurs while we are
      trying to execute a ROLLBACK of an already-failed transaction.  In such a
      case, xact.c is in TBLOCK_ABORT state, so that AbortOutOfAnyTransaction
      would skip AbortTransaction and go straight to CleanupTransaction.  This
      led to an assert failure in an assert-enabled build (due to the ROLLBACK's
      portal still having a cleanup hook) or, without assertions, to a FATAL exit
      complaining about "cannot drop active portal".  The latter's not
      disastrous, perhaps, but it's messy enough to want to improve it.
      
      We don't really want to run all of AbortTransaction in this code path.
      The minimum required to clean up the open portal safely is to do
      AtAbort_Memory and AtAbort_Portals.  It seems like a good idea to
      do AtAbort_Memory unconditionally, to be entirely sure that we are
      starting with a safe CurrentMemoryContext.  That means that if the
      main loop in AbortOutOfAnyTransaction does nothing, we need an extra
      step at the bottom to restore CurrentMemoryContext = TopMemoryContext,
      which I chose to do by invoking AtCleanup_Memory.  This'll result in
      calling AtCleanup_Memory twice in many of the paths through this function,
      but that seems harmless and reasonably inexpensive.
      
      The original motivation for the assertion in AtCleanup_Portals was that
      we wanted to be sure that any user-defined code executed as a consequence
      of the cleanup hook runs during AbortTransaction, not CleanupTransaction.
      That still seems like a valid concern, and now that we've seen one case
      of the assertion firing --- which means that exactly that would have
      happened in a production build --- let's replace the Assert with a runtime
      check.  If we see the cleanup hook still set, we'll emit a WARNING and
      just drop the hook unexecuted.
      
      This has been like this a long time, so back-patch to all supported
      branches.
      
      Discussion: https://postgr.es/m/877ey7bmun.fsf@ansel.ydns.eu
    • Fix typo · 7f1bb1d7
      Peter Eisentraut authored
      Author: Masahiko Sawada <sawada.mshk@gmail.com>
    • doc: Fix logical replication protocol doc detail · 79e5de69
      Peter Eisentraut authored
      Author: Masahiko Sawada <sawada.mshk@gmail.com>
      Reported-by: Kyle Conroy <kyle@kyleconroy.com>
      Bug: #14775
    • Absorb -D_USE_32BIT_TIME_T switch from Perl, if relevant. · 5a5c2fec
      Tom Lane authored
      Commit 3c163a7f's original choice to ignore all #define symbols whose
      names begin with underscore turns out to be too simplistic.  On Windows,
      some Perl installations are built with -D_USE_32BIT_TIME_T, and we must
      absorb that or we get the wrong result for sizeof(PerlInterpreter).
      
      This effectively re-reverts commit ef58b87d, which injected that symbol
      in a hacky way, making it apply to all of Postgres not just PL/Perl.
      More significantly, it did so on *all* 32-bit Windows builds, even when
      the Perl build to be used did not select this option; so that it fails
      to work properly with some newer Perl builds.
      
      By making this change, we would be introducing an ABI break in 32-bit
      Windows builds; but fortunately we have not used type time_t in any
      exported Postgres APIs in a long time.  So it should be OK, both for
      PL/Perl itself and for third-party extensions, if an extension library
      is built with a different _USE_32BIT_TIME_T setting than the core code.
      
      Patch by me, based on research by Ashutosh Sharma and Robert Haas.
      Back-patch to all supported branches, as commit 3c163a7f was.
      
      Discussion: https://postgr.es/m/CANFyU97OVQ3+Mzfmt3MhuUm5NwPU=-FtbNH5Eb7nZL9ua8=rcA@mail.gmail.com
    • Michael Meskes's avatar
  2. 13 Aug, 2017 3 commits
    • Remove AtEOXact_CatCache(). · 004a9702
      Tom Lane authored
      The sole useful effect of this function, to check that no catcache
      entries have positive refcounts at transaction end, has really been
      obsolete since we introduced ResourceOwners in PG 8.1.  We reduced the
      checks to assertions years ago, so that the function was a complete
      no-op in production builds.  There have been previous discussions about
      removing it entirely, but consensus up to now was that it had some small
      value as a cross-check for bugs in the ResourceOwner logic.
      
      However, it now emerges that it's possible to trigger these assertions
      if you hit an assert-enabled backend with SIGTERM during a call to
      SearchCatCacheList, because that function temporarily increases the
      refcounts of entries it's intending to add to a catcache list construct.
      In a normal ERROR scenario, the extra refcounts are cleaned up by
      SearchCatCacheList's PG_CATCH block; but in a FATAL exit we do a
      transaction abort and exit without ever executing PG_CATCH handlers.
      
      There's a case to be made that this is a generic hazard and we should
      consider restructuring elog(FATAL) handling so that pending PG_CATCH
      handlers do get run.  That's pretty scary though: it could easily create
      more problems than it solves.  Preliminary stress testing by Andreas
      Seltenreich suggests that there are not many live problems of this ilk,
      so we rejected that idea.
      
      There are more-localized ways to fix the problem; the most principled
      one would be to use PG_ENSURE_ERROR_CLEANUP instead of plain PG_TRY.
      But adding cycles to SearchCatCacheList isn't very appealing.  We could
      also weaken the assertions in AtEOXact_CatCache in some more or less
      ad-hoc way, but that just makes its raison d'etre even less compelling.
      In the end, the most reasonable solution seems to be to just remove
      AtEOXact_CatCache altogether, on the grounds that it's not worth trying
      to fix it.  It hasn't found any bugs for us in many years.
      
      Per report from Jeevan Chalke.  Back-patch to all supported branches.
      
      Discussion: https://postgr.es/m/CAM2+6=VEE30YtRQCZX7_sCFsEpoUkFBV1gZazL70fqLn8rcvBA@mail.gmail.com
    • Alvaro Herrera's avatar
      2336f842
    • Noah Misch's avatar
  3. 12 Aug, 2017 1 commit
  4. 11 Aug, 2017 11 commits
  5. 10 Aug, 2017 5 commits
  6. 09 Aug, 2017 2 commits
    • Fix handling of container types in find_composite_type_dependencies. · 749c7c41
      Tom Lane authored
      find_composite_type_dependencies correctly found columns that are of
      the specified type, and columns that are of arrays of that type, but
      not columns that are domains or ranges over the given type, its array
      type, etc.  The most general way to handle this seems to be to assume
      that any type that is directly dependent on the specified type can be
      treated as a container type, and processed recursively (allowing us
      to handle nested cases such as ranges over domains over arrays ...).
      Since a type's array type already has such a dependency, we can drop
      the existing special case for the array type.
      
      The very similar logic in get_rels_with_domain was likewise a few
      bricks shy of a load, as it supposed that a directly dependent type
      could *only* be a sub-domain.  This is already wrong for ranges over
      domains, and it'll someday be wrong for arrays over domains.
      
      Add test cases illustrating the problems, and back-patch to all
      supported branches.
      
      Discussion: https://postgr.es/m/15268.1502309024@sss.pgh.pa.us
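
The recursive treatment of container types described in this commit can be pictured with a small sketch. It is not the PostgreSQL C code: the function name, the two dictionaries, and the type names are hypothetical stand-ins for catalog lookups, and the sketch only illustrates the idea that every type directly dependent on the target type is handled as a container and searched recursively.

```python
# Illustrative sketch only; names and data are assumptions, not PostgreSQL APIs.

def find_dependent_columns(type_name, columns_by_type, dependent_types, _seen=None):
    """Return (table, column) pairs whose type is type_name or any
    container type (array, domain, range, ...) built on top of it."""
    seen = set() if _seen is None else _seen
    if type_name in seen:          # guard against revisiting a type
        return []
    seen.add(type_name)

    # Columns declared directly with this type.
    found = list(columns_by_type.get(type_name, []))

    # Any directly dependent type is treated as a container and recursed into,
    # which covers nested cases such as ranges over domains over arrays.
    for container in dependent_types.get(type_name, []):
        found.extend(
            find_dependent_columns(container, columns_by_type, dependent_types, seen)
        )
    return found


# Hypothetical catalog contents.
dependent_types = {
    "comp": ["comp[]", "comp_domain"],
    "comp_domain": ["comp_domain_range"],
}
columns_by_type = {
    "comp": [("t1", "a")],
    "comp[]": [("t2", "b")],
    "comp_domain_range": [("t3", "c")],
}

result = find_dependent_columns("comp", columns_by_type, dependent_types)
```

Because the array type is just another directly dependent type in this model, it needs no special case, which mirrors the simplification the commit message mentions.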
    • Prevent passing down MAKELEVEL/MAKEFLAGS from non-GNU make to GNU make. · a76200de
      Tom Lane authored
      FreeBSD's make, for one, sets the MAKELEVEL environment variable when
      invoking commands.  In the special Makefile we provide to hand off control
      from a non-GNU make to GNU make, this causes GNU make to think it is a
      child make invocation rather than top-level.  That interferes with the hack
      added in commit dcae5fac to cause the temp-install tree to be made only by
      the top-level invocation of gmake.  Unset the variable to prevent that.
      
      Likewise unset MAKEFLAGS, which FreeBSD's make also sets, and which could
      easily confuse gmake.  There are no reports of actual trouble from that,
      but it seems better to be proactive.
      
      Back-patch to 9.5 where dcae5fac came in.
      
      Thomas Munro, hacked a bit more by me
      
      Discussion: https://postgr.es/m/CAEepm=1ueww35AXTkt1A3gyzZUqv5XCzh8RUNvJZAQAW=eOhVw@mail.gmail.com
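
The actual fix lives in the hand-off Makefile, but the effect it aims for can be sketched in a few lines: start GNU make with an environment from which MAKELEVEL and MAKEFLAGS have been removed, so that it sees itself as the top-level invocation. The helper name below is hypothetical and the sketch is only an illustration of that idea.

```python
# Illustrative sketch only; env_for_gmake is a hypothetical helper.

def env_for_gmake(parent_env):
    """Return a copy of parent_env without the variables a non-GNU make
    may have exported, so GNU make starts as a top-level invocation."""
    child = dict(parent_env)
    for name in ("MAKELEVEL", "MAKEFLAGS"):
        child.pop(name, None)      # ignore variables that are not set
    return child


# Example: an environment as a non-GNU make might leave it.
inherited = {"MAKELEVEL": "1", "MAKEFLAGS": "-j", "PATH": "/usr/bin"}
cleaned = env_for_gmake(inherited)

# A caller could then run, for instance:
#   subprocess.run(["gmake", "all"], env=env_for_gmake(os.environ))
```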
  7. 08 Aug, 2017 8 commits
    • Peter Eisentraut's avatar
    • Fix datumSerialize infrastructure to not crash on non-varlena data. · 9bf4068c
      Tom Lane authored
      Commit 1efc7e53 did a poor job of emulating existing logic for touching
      Datums that might be expanded-object pointers.  It didn't check for typlen
      being -1 first, which meant it could crash on fixed-length pass-by-ref
      values, and probably on cstring values as well.  It also didn't use
      DatumGetPointer before VARATT_IS_EXTERNAL_EXPANDED, which while currently
      harmless is not according to documentation nor prevailing style.
      
      I also think the lack of any explanation as to why datumSerialize makes
      these particular nonobvious choices is pretty awful, so fix that.
      
      Per report from Jarred Ward.  Back-patch to 9.6 where this code came in.
      
      Discussion: https://postgr.es/m/6F61E6D2-2F5E-4794-9479-A429BE1CEA4B@simple.com
    • Reword some unclear comments · 77d2c00a
      Alvaro Herrera authored
    • Fix typo in comment · f5d54ef9
      Alvaro Herrera authored
    • Fix yet another race condition in recovery/t/001_stream_rep.pl. · 4576a693
      Tom Lane authored
      In commit 5c77690f, we added polling in front of most of the
      get_slot_xmins calls in 001_stream_rep.pl, but today's results from
      buildfarm member nightjar show that at least one more poll loop
      is needed.
      
      Proactively add a poll loop before the next-to-last get_slot_xmins call
      as well.  It may be that there is no race condition there because the
      standby_2 server is shut down at that point, but I'm quite tired of
      fighting with this test script.  The empirical evidence that it's safe,
      from the buildfarm, is no stronger than the evidence for the other
      call that nightjar just proved unsafe.
      
      The only remaining get_slot_xmins calls without wait_slot_xmins
      protection are the first two, which should be OK since nothing has
      happened at that point.  It's tempting to ignore that special case
      and merge get_slot_xmins and wait_slot_xmins into a single function.
      I didn't go that far though.
      
      Discussion: https://postgr.es/m/18436.1502228036@sss.pgh.pa.us
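
The general pattern behind these fixes, polling until a condition holds before reading a value that another process updates asynchronously, can be sketched as follows. The test script itself is Perl; this Python version, including the name poll_until and its parameters, is a hypothetical illustration and not the script's wait_slot_xmins.

```python
# Illustrative sketch only; poll_until is a hypothetical helper.
import time


def poll_until(check, timeout=5.0, interval=0.01):
    """Call check() repeatedly until it returns a true value or the
    timeout (in seconds) expires. Returns True on success, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Example: a condition that only becomes true on the third check,
# standing in for a value that another process updates asynchronously.
answers = iter([False, False, True])
became_true = poll_until(lambda: next(answers))
timed_out = poll_until(lambda: False, timeout=0.05)
```

Reading the value only after the poll succeeds removes the dependence on timing that made the unguarded calls flaky.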
    • Fix replication origin-related race conditions · b2c95a37
      Alvaro Herrera authored
      Similar to what was fixed in commit 9915de6c for replication slots,
      but this time it's related to replication origins: DROP SUBSCRIPTION
      attempts to drop the replication origin, but that fails if the
      replication worker process hasn't yet marked it unused.  This causes
      failures in the buildfarm:
      ERROR:  could not drop replication origin with OID 1, in use by PID 34069
      
      Like the aforementioned commit, fix by having the process running DROP
      SUBSCRIPTION sleep until the worker marks the replication origin
      struct as free.  This uses a condition variable on each replication
      origin shmem state struct, so that the session trying to drop can sleep
      and expect to be awakened by the process keeping the origin open.
      
      Also fix an SGML markup error in the previous commit.
      
      Discussion: https://postgr.es/m/20170808001433.rozlseaf4m2wkw3n@alvherre.pgsql
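
The sleep-until-free approach can be sketched with an ordinary condition variable. This is a minimal illustration using Python threads, not the shared-memory implementation in PostgreSQL; the class OriginSlot and its methods are hypothetical names chosen for the example.

```python
# Illustrative sketch only; OriginSlot is a hypothetical stand-in for the
# per-origin shared-memory state described in the commit message.
import threading


class OriginSlot:
    def __init__(self):
        self.acquired_by = None            # PID of the holder, or None if free
        self.cv = threading.Condition()    # one condition variable per origin

    def acquire(self, pid):
        with self.cv:
            self.acquired_by = pid

    def release(self):
        # The worker marks the origin unused and wakes any waiting dropper.
        with self.cv:
            self.acquired_by = None
            self.cv.notify_all()

    def drop(self, timeout=None):
        # The dropping session sleeps until the origin is free instead of
        # failing immediately with "in use by PID ...".
        with self.cv:
            return self.cv.wait_for(lambda: self.acquired_by is None, timeout=timeout)


slot = OriginSlot()
slot.acquire(34069)
releaser = threading.Timer(0.05, slot.release)   # the worker lets go a bit later
releaser.start()
dropped = slot.drop(timeout=5)
releaser.join()

busy = OriginSlot()
busy.acquire(1)
gave_up = busy.drop(timeout=0.01)                # nobody releases: wait times out
```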
    • Fix inadequacies in recently added wait events · 030273b7
      Alvaro Herrera authored
      In commit 9915de6c, we introduced a new wait point for replication
      slots and incorrectly labelled it as wait event PG_WAIT_LOCK.  That's
      wrong, so invent an appropriate new wait event instead, and document it
      properly.
      
      While at it, fix numerous other problems in the vicinity:
      - two different walreceiver wait events were being mixed up in a single
        wait event (which wasn't documented either); split it out so that they
        can be distinguished, and document the new events properly.
      
      - ParallelBitmapPopulate was documented but didn't exist.
      
      - ParallelBitmapScan was not documented.  (I think this should be called
        "ParallelBitmapScanInit" instead.)
      
      - Logical replication wait events weren't documented.
      
      - various symbols had been added in dartboard order in various places.
        Put them in alphabetical order instead, as was originally intended.
      
      Discussion: https://postgr.es/m/20170808181131.mu4fjepuh5m75cyq@alvherre.pgsql
    • Disclaim xmltable() support for non-UTF8 databases. · b4a2eea0
      Noah Misch authored
      The xmltable() implementation mirrors xpath(), including its lack of
      character encoding awareness.
  8. 07 Aug, 2017 5 commits