1. 06 Oct, 2017 2 commits
    • Fix traversal of half-frozen update chains · a5736bf7
      Alvaro Herrera authored
      When some tuple versions in an update chain are frozen due to them being
      older than freeze_min_age, the xmax/xmin trail can become broken.  This
      breaks HOT (and probably other things).  A subsequent VACUUM can break
      things in more serious ways, such as leaving orphan heap-only tuples
      whose root HOT redirect items were removed.  This can be seen when
      index creation (or REINDEX) complains with an error like
        ERROR:  XX000: failed to find parent tuple for heap-only tuple at (0,7) in table "t"
      
      Because of relfrozenxid constraints, we cannot avoid the freezing of the
      early tuples, so we must cope with the results: whenever we see an Xmin
      of FrozenTransactionId, consider it a match for whatever the previous
      Xmax value was.
      
      This problem seems to have appeared in 9.3 with multixact changes,
      though strictly speaking it seems unrelated.
      
      Since 9.4 we have commit 37484ad2 "Change the way we mark tuples as
      frozen", so the fix is simple: just compare the raw Xmin (still stored
      in the tuple header, since freezing merely set an infomask bit) to the
      Xmax.  But in 9.3 we rewrite the Xmin value to FrozenTransactionId, so
      the original value is lost and we have nothing to compare the Xmax with.
      To cope with that case we need to compare the Xmin with FrozenXid,
      assume it's a match, and hope for the best.  Sadly, since you can
      pg_upgrade a 9.3 instance containing half-frozen pages to newer
      releases, we need to keep the old check in newer versions too, which
      seems a bit brittle; I hope we can somehow get rid of that.
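
      As a minimal sketch of that check, assuming PostgreSQL's usual
      tuple-header accessors (the function name and exact coding here are
      illustrative, not the committed code):

        #include "postgres.h"
        #include "access/htup_details.h"
        #include "access/transam.h"

        /*
         * Does the tuple 'htup' continue an update chain whose previous
         * version had the given 'xmax'?  (Hypothetical name.)
         */
        static bool
        update_chain_xmax_matches_xmin(TransactionId xmax, HeapTupleHeader htup)
        {
            /* since commit 37484ad2 (9.4+), freezing leaves the raw Xmin intact */
            TransactionId xmin = HeapTupleHeaderGetRawXmin(htup);

            if (TransactionIdEquals(xmax, xmin))
                return true;

            /*
             * Pages frozen the 9.3 way (Xmin overwritten with
             * FrozenTransactionId) can still reach us via pg_upgrade, so a
             * frozen Xmin counts as a match for any previous Xmax.
             */
            if (TransactionIdEquals(xmin, FrozenTransactionId))
                return true;

            return false;
        }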
      
      I didn't optimize the new function for performance.  The new coding is
      probably a bit slower than before, since there is a function call rather
      than a straight comparison, but I'd rather have it work correctly than
      be fast but wrong.
      
      This is a follow-up to commit 20b65522, which fixed a few related
      problems.  Apparently, in 9.6 and up there are more ways to get into
      trouble, but with this patch applied I can no longer reproduce a problem
      in 9.3 - 9.5, so the remaining breakage must be a separate bug.
      
      Reported-by: Peter Geoghegan
      Diagnosed-by: Peter Geoghegan, Michael Paquier, Daniel Wood,
      	Yi Wen Wong, Álvaro
      Discussion: https://postgr.es/m/CAH2-Wznm4rCrhFAiwKPWTpEw2bXDtgROZK7jWWGucXeH3D1fmA@mail.gmail.com
    • Basic partition-wise join functionality. · f49842d1
      Robert Haas authored
      Instead of joining two partitioned tables in their entirety we can, if
      it is an equi-join on the partition keys, join the matching partitions
      individually.  This involves teaching the planner about "other join"
      rels, which are related to regular join rels in the same way that
      other member rels are related to baserels.  This can use significantly
      more CPU time and memory than regular join planning, because there may
      now be a set of "other" rels not only for every base relation but also
      for every join relation.  In most practical cases, this probably
      shouldn't be a problem, because (1) it's probably unusual to join many
      tables each with many partitions using the partition keys for all
      joins, and (2) if you are in that scenario, you probably have a big
      enough machine to handle the increased memory cost of planning and (3)
      the resulting plan is highly likely to be better, so what you spend in
      planning you'll make up on the execution side.  All the same, for now,
      turn this feature off by default.
      
      Currently, we can only perform joins between two tables whose
      partitioning schemes are absolutely identical.  It would be nice to
      cope with other scenarios, such as extra partitions on one side or the
      other with no match on the other side, but that will have to wait for
      a future patch.
      
      Ashutosh Bapat, reviewed and tested by Rajkumar Raghuwanshi, Amit
      Langote, Rafia Sabih, Thomas Munro, Dilip Kumar, Antonin Houska, Amit
      Khandekar, and by me.  A few final adjustments by me.
      
      Discussion: http://postgr.es/m/CAFjFpRfQ8GrQvzp3jA2wnLqrHmaXna-urjm_UY9BqXj=EaDTSA@mail.gmail.com
      Discussion: http://postgr.es/m/CAFjFpRcitjfrULr5jfuKWRPsGUX0LQ0k8-yG0Qw2+1LBGNpMdw@mail.gmail.com
  2. 05 Oct, 2017 10 commits
  3. 04 Oct, 2017 5 commits
  4. 03 Oct, 2017 3 commits
    • Allow multiple tables to be specified in one VACUUM or ANALYZE command. · 11d8d72c
      Tom Lane authored
      Not much to say about this; does what it says on the tin.
      
      However, formerly, if there was a column list then the ANALYZE action was
      implied; now it must be specified, or you get an error.  This is because
      it would otherwise be a bit unclear what the user meant if some tables
      have column lists and some don't.
      
      Nathan Bossart, reviewed by Michael Paquier and Masahiko Sawada, with some
      editorialization by me
      
      Discussion: https://postgr.es/m/E061A8E3-5E3D-494D-94F0-E8A9B312BBFC@amazon.com
    • Fix race condition with unprotected use of a latch pointer variable. · 45f9d086
      Tom Lane authored
      Commit 597a87cc introduced a latch pointer variable to replace use
      of a long-lived shared latch in the shared WalRcvData structure.
      This was not well thought out, because there are now hazards of the
      pointer variable changing while it's being inspected by another
      process.  This could obviously lead to a core dump in code like
      
      	if (WalRcv->latch)
      		SetLatch(WalRcv->latch);
      
      and there's a more remote risk of a torn read, if we have any
      platforms where reading/writing a pointer is not atomic.
      
      An actual problem would occur only if the walreceiver process
      exits (gracefully) while the startup process is trying to
      signal it, but that seems well within the realm of possibility.
      
      To fix, treat the pointer variable (not the referenced latch)
      as being protected by the WalRcv->mutex spinlock.  There
      remains a race condition that we could apply SetLatch to a
      process latch that no longer belongs to the walreceiver, but
      I believe that's harmless: at worst it'd cause an extra wakeup
      of the next process to use that PGPROC structure.
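
      As a minimal sketch of the resulting access pattern (the helper
      function is hypothetical, not the committed code):

        #include "postgres.h"
        #include "replication/walreceiver.h"
        #include "storage/latch.h"
        #include "storage/spin.h"

        static void
        wake_walreceiver_safely(void)
        {
            Latch  *latch;

            /* read the pointer under WalRcv->mutex so it cannot change mid-read */
            SpinLockAcquire(&WalRcv->mutex);
            latch = WalRcv->latch;
            SpinLockRelease(&WalRcv->mutex);

            /* set the latch only after the spinlock is released */
            if (latch)
                SetLatch(latch);
        }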
      
      Back-patch to v10 where the faulty code was added.
      
      Discussion: https://postgr.es/m/22735.1507048202@sss.pgh.pa.us
    • Fix coding rules violations in walreceiver.c · 89e434b5
      Alvaro Herrera authored
      1. Since commit b1a9bad9 we had pstrdup() inside a
      spinlock-protected critical section; reported by Andreas Seltenreich.
      Turn those into strlcpy() to stack-allocated variables instead, as
      sketched after this list.
      Backpatch to 9.6.
      
      2. Since commit 9ed551e0 we had a pfree() uselessly inside a
      spinlock-protected critical section.  Tom Lane noticed in code review.
      Move down.  Backpatch to 9.6.
      
      3. Since commit 64233902 we had GetCurrentTimestamp() (a kernel
      call) inside a spinlock-protected critical section.  Tom Lane noticed in
      code review.  Move it up.  Backpatch to 9.2.
      
      4. Since commit 1bb25580 we did elog(PANIC) while holding a spinlock.
      Tom Lane noticed in code review.  Release spinlock before dying.
      Backpatch to 9.2.
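
      As a before/after sketch of item 1 (fragments with illustrative
      variable names; the rule being enforced is that nothing which can
      allocate memory, and therefore elog(), may run under a spinlock):

        /* BAD: pstrdup() can palloc and elog() inside the critical section */
        SpinLockAcquire(&walrcv->mutex);
        conninfo = pstrdup((char *) walrcv->conninfo);
        SpinLockRelease(&walrcv->mutex);

        /* BETTER: strlcpy() into a stack buffer neither allocates nor errors */
        char        conninfo[MAXCONNINFO];

        SpinLockAcquire(&walrcv->mutex);
        strlcpy(conninfo, (char *) walrcv->conninfo, MAXCONNINFO);
        SpinLockRelease(&walrcv->mutex);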
      
      Discussion: https://postgr.es/m/87h8vhtgj2.fsf@ansel.ydns.eu
  5. 02 Oct, 2017 4 commits
  6. 01 Oct, 2017 8 commits
  7. 30 Sep, 2017 5 commits
    • Fix pg_dump to assign domain array type OIDs during pg_upgrade. · 2632bcce
      Tom Lane authored
      During a binary upgrade, all type OIDs are supposed to be assigned by
      pg_dump based on their values in the old cluster.  But now that domains
      have arrays, there's nothing to base the arrays' type OIDs on, if we're
      upgrading from a pre-v11 cluster.  Make pg_dump search for an unused type
      OID to use for this purpose.  Per buildfarm.
      
      Discussion: https://postgr.es/m/E1dyLlE-0002gT-H5@gemulon.postgresql.org
    • Support arrays over domains. · c12d570f
      Tom Lane authored
      Allowing arrays with a domain type as their element type was left undone
      in the original domain patch, but not for any very good reason.  This
      omission leads to such surprising results as array_agg() not working on
      a domain column, because the parser can't identify a suitable output type
      for the polymorphic aggregate.
      
      In order to fix this, first clean up the APIs of coerce_to_domain() and
      some internal functions in parse_coerce.c so that we consistently pass
      around a CoercionContext along with CoercionForm.  Previously, we sometimes
      passed an "isExplicit" boolean flag instead, which is strictly less
      information; and coerce_to_domain() didn't even get that, but instead had
      to reverse-engineer isExplicit from CoercionForm.  That's contrary to the
      documentation in primnodes.h that says that CoercionForm only affects
      display and not semantics.  I don't think this change fixes any live bugs,
      but it makes things more consistent.  The main reason for doing it though
      is that now build_coercion_expression() receives ccontext, which it needs
      in order to be able to recursively invoke coerce_to_target_type().
      
      Next, reimplement ArrayCoerceExpr so that the node does not directly know
      any details of what has to be done to the individual array elements while
      performing the array coercion.  Instead, the per-element processing is
      represented by a sub-expression whose input is a source array element and
      whose output is a target array element.  This simplifies life in
      parse_coerce.c, because it can build that sub-expression by a recursive
      invocation of coerce_to_target_type().  The executor now handles the
      per-element processing as a compiled expression instead of hard-wired code.
      The main advantage of this is that we can use a single ArrayCoerceExpr to
      handle as many as three successive steps per element: base type conversion,
      typmod coercion, and domain constraint checking.  The old code used two
      stacked ArrayCoerceExprs to handle type + typmod coercion, which was pretty
      inefficient, and adding yet another array deconstruction to do domain
      constraint checking seemed very unappetizing.
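
      An abridged sketch of the reworked node (see primnodes.h for the
      real definition; collation, coercion-form, and location fields are
      omitted here):

        typedef struct ArrayCoerceExpr
        {
            Expr        xpr;
            Expr       *arg;            /* input array value */
            Expr       *elemexpr;       /* per-element expression: source
                                         * element in, target element out;
                                         * built by a recursive
                                         * coerce_to_target_type() call */
            Oid         resulttype;     /* target array type OID */
            int32       resulttypmod;   /* output typmod (usually -1) */
        } ArrayCoerceExpr;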
      
      In the case where we just need a single, very simple coercion function,
      doing this straightforwardly leads to a noticeable increase in the
      per-array-element runtime cost.  Hence, add an additional shortcut evalfunc
      in execExprInterp.c that skips unnecessary overhead for that specific form
      of expression.  The runtime speed of simple cases is within 1% or so of
      where it was before, while cases that previously required two levels of
      array processing are significantly faster.
      
      Finally, create an implicit array type for every domain type, as we do for
      base types, enums, etc.  Everything except the array-coercion case seems
      to just work without further effort.
      
      Tom Lane, reviewed by Andrew Dunstan
      
      Discussion: https://postgr.es/m/9852.1499791473@sss.pgh.pa.us
    • Fix copy & pasto in 510b8cbf. · 248e3375
      Andres Freund authored
      Reported-By: Peter Geoghegan
    • Fix typo. · f1424123
      Andres Freund authored
      Reported-By: Thomas Munro and Jesper Pedersen
    • Extend & revamp pg_bswap.h infrastructure. · 510b8cbf
      Andres Freund authored
      Upcoming patches are going to address performance issues involving the
      slow system-provided ntohs/htons etc.  To prepare for that, expand
      pg_bswap.h to provide pg_ntoh{16,32,64} and pg_hton{16,32,64}, and
      optimize their implementations by using compiler intrinsics for
      gcc-compatible compilers and MSVC, falling back to manual shift-based
      implementations otherwise.
      
      Additionally, remove multiple-evaluation hazards from the existing
      BSWAP32/64 macros by replacing them with inline functions where
      necessary.  In the course of that, the naming scheme is changed to
      pg_bswap16/32/64.
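
      A simplified sketch of the pattern (the real pg_bswap.h covers 16,
      32, and 64 bits and more cases):

        #include <stdint.h>
        #ifdef _MSC_VER
        #include <stdlib.h>                     /* for _byteswap_ulong */
        #endif

        /* unlike a macro, an inline function evaluates its argument once */
        static inline uint32_t
        bswap32_sketch(uint32_t x)
        {
        #if defined(__GNUC__) || defined(__clang__)
            return __builtin_bswap32(x);        /* gcc/clang intrinsic */
        #elif defined(_MSC_VER)
            return _byteswap_ulong(x);          /* MSVC intrinsic */
        #else
            return ((x << 24) & 0xff000000) |   /* manual shift-based fallback */
                   ((x <<  8) & 0x00ff0000) |
                   ((x >>  8) & 0x0000ff00) |
                   ((x >> 24) & 0x000000ff);
        #endif
        }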
      
      Author: Andres Freund
      Discussion: https://postgr.es/m/20170927172019.gheidqy6xvlxb325@alap3.anarazel.de
  8. 29 Sep, 2017 3 commits
    • Use Py_RETURN_NONE where suitable · 0008a106
      Peter Eisentraut authored
      This is more idiomatic style and available as of Python 2.4, which is
      our minimum.
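
      The change amounts to this (PL/Python is C code against the Python
      C API; the function here is a hypothetical example):

        #include <Python.h>

        static PyObject *
        plpy_noop(PyObject *self, PyObject *args)
        {
            /*
             * Old style:
             *     Py_INCREF(Py_None);
             *     return Py_None;
             *
             * Py_RETURN_NONE (Python >= 2.4) expands to exactly that pair.
             */
            Py_RETURN_NONE;
        }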
    • Fix inadequate locking during get_rel_oids(). · 19de0ab2
      Tom Lane authored
      get_rel_oids used to not take any relation locks at all, but that stopped
      being a good idea with commit 3c3bb993, which inserted a syscache lookup
      into the function.  A concurrent DROP TABLE could now produce "cache lookup
      failed", which we don't want to have happen in normal operation.  The best
      solution seems to be to transiently take a lock on the relation named by
      the RangeVar (which also makes the result of RangeVarGetRelid a lot less
      spongy).  But we shouldn't hold the lock beyond this function, because we
      don't want VACUUM to lock more than one table at a time.  (That would not
      be a big problem right now, but it will become one after the pending
      feature patch to allow multiple tables to be named in VACUUM.)
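
      As a sketch of the pattern, with illustrative naming (not the
      committed code):

        #include "postgres.h"
        #include "catalog/namespace.h"
        #include "storage/lmgr.h"

        /* hypothetical helper: resolve a VACUUM target under a transient lock */
        static Oid
        resolve_vacuum_target(RangeVar *rv)
        {
            /* the lock keeps a concurrent DROP from invalidating the lookup */
            Oid     relid = RangeVarGetRelid(rv, AccessShareLock, false);

            /* ... syscache lookups on relid are safe while the lock is held ... */

            /* release before returning: VACUUM locks one table at a time */
            UnlockRelationOid(relid, AccessShareLock);

            return relid;
        }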
      
      In passing, adjust vacuum_rel and analyze_rel to document that we don't
      trust the passed RangeVar to be accurate, and allow the RangeVar to
      possibly be NULL --- which it is anyway for a whole-database VACUUM,
      though we accidentally didn't crash for that case.
      
      The passed RangeVar is in fact inaccurate when dealing with a child
      partition, as of v10, and it has been wrong for a whole long time in the
      case of vacuum_rel() recursing to a TOAST table.  None of these things
      has presented a visible bug up to now, because the passed RangeVar is in fact
      only consulted for autovacuum logging, and in that particular context it's
      always accurate because autovacuum doesn't let vacuum.c expand partitions
      nor recurse to toast tables.  Still, this seems like trouble waiting to
      happen, so let's nail the door at least partly shut.  (Further cleanup
      is planned, in HEAD only, as part of the pending feature patch.)
      
      Fix some sadly inaccurate/obsolete comments too.  Back-patch to v10.
      
      Michael Paquier and Tom Lane
      
      Discussion: https://postgr.es/m/25023.1506107590@sss.pgh.pa.us
    • psql: Don't try to print a partition constraint we didn't fetch. · 69c16983
      Robert Haas authored
      If \d rather than \d+ is used, then verbose is false and we don't ask
      the server for the partition constraint; so we shouldn't print it in
      that case either.
      
      Maksim Milyutin, per a report from Jesper Pedersen.  Reviewed by
      Jesper Pedersen and Amit Langote.
      
      Discussion: http://postgr.es/m/2af5fc4d-7bcc-daa8-4fe6-86274bea363c@redhat.com