  1. 11 Dec, 2010 2 commits
    • Robert Haas's avatar
      Minor documentation cleanup. · 1490946c
      Robert Haas authored
      Fujii Masao
      1490946c
    • Tom Lane's avatar
      Move a couple of initdb's subroutines into src/port/. · 67119992
      Tom Lane authored
      mkdir_p and check_data_dir will be useful in CREATE TABLESPACE, since we
      have agreed that that command should handle subdirectory creation just like
      initdb creates the PGDATA directory.  Push them into src/port/ so that they
      are available to both initdb and the backend.  Rename to pg_mkdir_p and
      pg_check_dir, just to be on the safe side.  Add FreeBSD's copyright notice
      to pgmkdirp.c, since that's where the code came from originally (this
      really should have been in initdb.c).  Very marginal code/comment cleanup.
      67119992
  2. 10 Dec, 2010 2 commits
    • Tom Lane's avatar
      Use symbolic names not octal constants for file permission flags. · 04f4e10c
      Tom Lane authored
      Purely cosmetic patch to make our coding standards more consistent ---
      we were doing symbolic some places and octal other places.  This patch
      fixes all C-coded uses of mkdir, chmod, and umask.  There might be some
      other calls I missed.  Inconsistency noted while researching tablespace
      directory permissions issue.
      04f4e10c
    • Tom Lane's avatar
      Fix efficiency problems in tuplestore_trim(). · 244407a7
      Tom Lane authored
      The original coding in tuplestore_trim() was only meant to work efficiently
      in cases where each trim call deleted most of the tuples in the store.
      Which, in fact, was the pattern of the original usage with a Material node
      supporting mark/restore operations underneath a MergeJoin.  However,
      WindowAgg now uses tuplestores and it has considerably less friendly
      trimming behavior.  In particular it can attempt to trim one tuple at a
      time off a large tuplestore.  tuplestore_trim() had O(N^2) runtime in this
      situation because of repeatedly shifting its tuple pointer array.  Fix by
      avoiding shifting the array until a reasonably large number of tuples have
      been deleted.  This can waste some pointer space, but we do still reclaim
      the tuples themselves, so the percentage wastage should be pretty small.
      
      Per Jie Li's report of slow percent_rank() evaluation.  cume_dist() and
      ntile() would certainly be affected as well, along with any other window
      function that has a moving frame start and requires reading substantially
      ahead of the current row.
      
      Back-patch to 8.4, where window functions were introduced.  There's no
      need to tweak it before that.
      244407a7
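The deferred-shifting idea described above can be sketched roughly as follows. This is not the tuplestore code: the `LazyStore` struct, the function names, and the compaction threshold are all assumptions chosen for illustration. Items are freed immediately, but the pointer array is only shifted once the dead prefix is large enough to make the `memmove` worthwhile.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical store: slots [0, dead) are already freed,
 * slots [dead, count) still hold live items. */
typedef struct LazyStore
{
    void  **items;   /* pointer array */
    size_t  count;   /* slots in use, including the dead prefix */
    size_t  dead;    /* freed slots at the front of the array */
    size_t  shifts;  /* number of compactions, kept only for illustration */
} LazyStore;

/* Assumed policy: compact once the dead prefix has at least this many
 * slots and also covers at least half of the slots in use. */
#define MIN_DEAD_BEFORE_SHIFT 8

/* Fill a store with n small heap-allocated dummy items (n > 0). */
void
lazy_store_fill(LazyStore *st, size_t n)
{
    st->items = malloc(n * sizeof(void *));
    assert(st->items != NULL);
    st->count = n;
    st->dead = 0;
    st->shifts = 0;
    for (size_t i = 0; i < n; i++)
        st->items[i] = malloc(1);
}

/* Free the n oldest live items; shift the array only when it pays off. */
void
lazy_store_trim(LazyStore *st, size_t n)
{
    assert(n <= st->count - st->dead);

    /* The items themselves are reclaimed right away. */
    for (size_t i = 0; i < n; i++)
    {
        free(st->items[st->dead + i]);
        st->items[st->dead + i] = NULL;
    }
    st->dead += n;

    /* Only the pointer slots are wasted until the next compaction. */
    if (st->dead >= MIN_DEAD_BEFORE_SHIFT && st->dead * 2 >= st->count)
    {
        memmove(st->items, st->items + st->dead,
                (st->count - st->dead) * sizeof(void *));
        st->count -= st->dead;
        st->dead = 0;
        st->shifts++;
    }
}
```

With this policy each compaction moves no more pointers than were trimmed since the previous one, so trimming one item at a time costs amortized constant work per item instead of a full array shift per call.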
  3. 09 Dec, 2010 4 commits
    • Tom Lane's avatar
      Eliminate O(N^2) behavior in parallel restore with many blobs. · 663fc32e
      Tom Lane authored
      With hundreds of thousands of TOC entries, the repeated searches in
      reduce_dependencies() become the dominant cost.  Get rid of that searching
      by constructing reverse-dependency lists, which we can do in O(N) time
      during the fix_dependencies() preprocessing.  I chose to store the reverse
      dependencies as DumpId arrays for consistency with the forward-dependency
      representation, and keep the previously-transient tocsByDumpId[] array
      around to locate actual TOC entry structs quickly from dump IDs.
      
      While this fixes the slow case reported by Vlad Arkhipov, there is still
      a potential for O(N^2) behavior with sufficiently many tables:
      fix_dependencies itself, as well as mark_create_done and
      inhibit_data_for_failed_table, are doing repeated searches to deal with
      table-to-table-data dependencies.  Possibly this work could be extended
      to deal with that, although the latter two functions are also used in
      non-parallel restore where we currently don't run fix_dependencies.
      
      Another TODO is that we fail to parallelize restore of multiple blobs
      at all.  This appears to require changes in the archive format to fix.
      
      Back-patch to 9.0 where the problem was reported.  8.4 has potential issues
      as well; but since it doesn't create a separate TOC entry for each blob,
      it's at much less risk of having enough TOC entries to cause real problems.
      663fc32e
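A rough sketch of the reverse-dependency idea described above, under simplifying assumptions: entries are identified by their index in a plain array, and the `Entry` struct and both function names are hypothetical rather than pg_restore's own. The reverse lists are built in time linear in the number of entries plus dependencies, after which finishing an entry touches only the entries that depend on it.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical entry; ids are simply indexes into the entries array. */
typedef struct Entry
{
    int   nDeps;     /* number of forward dependencies */
    int  *deps;      /* ids this entry depends on */
    int   depCount;  /* dependencies not yet finished */
    int   nRevDeps;  /* number of entries that depend on this one */
    int  *revDeps;   /* ids of those entries */
} Entry;

/* Build reverse-dependency arrays in O(N + total dependencies). */
void
build_reverse_deps(Entry *entries, int n)
{
    /* Pass 1: reset, then count how many entries point at each entry. */
    for (int i = 0; i < n; i++)
    {
        entries[i].nRevDeps = 0;
        entries[i].revDeps = NULL;
        entries[i].depCount = entries[i].nDeps;
    }
    for (int i = 0; i < n; i++)
        for (int j = 0; j < entries[i].nDeps; j++)
            entries[entries[i].deps[j]].nRevDeps++;

    /* Pass 2: allocate exact-size arrays, reusing nRevDeps as a cursor. */
    for (int i = 0; i < n; i++)
    {
        if (entries[i].nRevDeps > 0)
        {
            entries[i].revDeps = malloc(entries[i].nRevDeps * sizeof(int));
            assert(entries[i].revDeps != NULL);
        }
        entries[i].nRevDeps = 0;
    }

    /* Pass 3: fill the reverse lists. */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < entries[i].nDeps; j++)
        {
            Entry *dep = &entries[entries[i].deps[j]];

            dep->revDeps[dep->nRevDeps++] = i;
        }
}

/* Mark entry `done` as finished; return how many entries became ready.
 * Only the dependents of `done` are visited, with no search over all entries. */
int
finish_entry(Entry *entries, int done)
{
    int ready = 0;

    for (int k = 0; k < entries[done].nRevDeps; k++)
    {
        Entry *e = &entries[entries[done].revDeps[k]];

        if (--e->depCount == 0)
            ready++;
    }
    return ready;
}
```

Storing the reverse lists as id arrays mirrors the forward representation, at the cost of keeping an id-to-entry lookup available; here that lookup is just array indexing.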
    • Simon Riggs's avatar
      Reduce spurious Hot Standby conflicts from never-visible records. · b9075a6d
      Simon Riggs authored
      Hot Standby conflicts only with tuples that were visible at
      some point. So ignore tuples from aborted transactions, and
      tuples updated/deleted during the inserting transaction, when
      generating the conflict transaction ids.
      
      Following detailed analysis and test case by Noah Misch.
      The original report covered btree delete records; Heikki Linnakangas
      correctly observed that this applies to other cases as well.
      The fix covers all sources of cleanup records via common code.
      b9075a6d
    • Tom Lane's avatar
      Force default wal_sync_method to be fdatasync on Linux. · 576477e7
      Tom Lane authored
      Recent versions of the Linux system header files cause xlogdefs.h to
      believe that open_datasync should be the default sync method, whereas
      formerly fdatasync was the default on Linux.  open_datasync is a bad
      choice, first because it doesn't actually outperform fdatasync (in fact
      the reverse), and second because we try to use O_DIRECT with it, causing
      failures on certain filesystems (e.g., ext4 with data=journal option).
      This part of the patch is largely per a proposal from Marti Raudsepp.
      More extensive changes are likely to follow in HEAD, but this is as much
      change as we want to back-patch.
      
      Also clean up confusing code and incorrect documentation surrounding the
      fsync_writethrough option.  Those changes shouldn't result in any actual
      behavioral change, but I chose to back-patch them anyway to keep the
      branches looking similar in this area.
      
      In 9.0 and HEAD, also do some copy-editing on the WAL Reliability
      documentation section.
      
      Back-patch to all supported branches, since any of them might get used
      on modern Linux versions.
      576477e7
  4. 08 Dec, 2010 1 commit
  5. 07 Dec, 2010 2 commits
    • Heikki Linnakangas's avatar
      Fix bugs in the hot standby known-assigned-xids tracking logic. If there's · 5a031a55
      Heikki Linnakangas authored
      an old transaction running in the master, and a lot of transactions have
      started and finished since, and a WAL-record is written in the gap between
      creating the running-xacts snapshot and WAL-logging it, recovery will fail
      with a "too many KnownAssignedXids" error. This bug was reported by
      Joachim Wieland on Nov 19th.
      
      In the same scenario, when fewer transactions have started so that all the
      xids fit in KnownAssignedXids despite the first bug, a more serious bug
      arises. We incorrectly initialize the clog code with the oldest still running
      transaction, and when we see the WAL record belonging to a transaction with
      an XID larger than one that committed already before the checkpoint we're
      recovering from, we zero the clog page containing the already committed
      transaction, leading to data loss.
      
      In hindsight, trying to track xids in the known-assigned-xids array before
      seeing the running-xacts record was too complicated. To fix that, hold
      XidGenLock while the running-xacts snapshot is taken and WAL-logged. That
      ensures that no transaction can begin or end in that gap, so that in recovery
      we know that the snapshot contains all transactions running at that point in
      WAL.
      5a031a55
    • Tom Lane's avatar
      Add a stack overflow check to copyObject(). · 8b569280
      Tom Lane authored
      There are some code paths, such as SPI_execute(), where we invoke
      copyObject() on raw parse trees before doing parse analysis on them.  Since
      the bison grammar is capable of building heavily nested parsetrees while
      itself using only minimal stack depth, this means that copyObject() can be
      the front-line function that hits stack overflow before anything else does.
      Accordingly, it had better have a check_stack_depth() call.  I did a bit of
      performance testing and found that this slows down copyObject() by only a
      few percent, so the hit ought to be negligible in the context of complete
      processing of a query.
      
      Per off-list report from Toshihide Katayama.  Back-patch to all supported
      branches.
      8b569280
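To illustrate why a recursive copy routine wants its own guard, here is a minimal sketch. It substitutes a simple recursion-depth counter for a real stack-usage check, and the `Node` type, the `MAX_COPY_DEPTH` limit, and all function names are hypothetical. The point is only that the copier itself refuses to recurse past a limit instead of relying on some later processing step to notice the nesting.

```c
#include <stdlib.h>

/* Hypothetical tree node: a singly nested chain is enough to show the idea. */
typedef struct Node
{
    int          value;
    struct Node *child;
} Node;

/* Assumed limit on nesting depth; a real check would look at stack usage. */
#define MAX_COPY_DEPTH 1000

/* Free a chain iteratively, so freeing never recurses deeply itself. */
void
free_tree(Node *node)
{
    while (node != NULL)
    {
        Node *next = node->child;

        free(node);
        node = next;
    }
}

/* Recursive worker: sets *error and stops descending when the limit is hit. */
static Node *
copy_node_guarded(const Node *src, int depth, int *error)
{
    Node *dst;

    if (src == NULL)
        return NULL;
    if (depth > MAX_COPY_DEPTH)
    {
        *error = 1;             /* the guard, checked before any more recursion */
        return NULL;
    }
    dst = malloc(sizeof(Node));
    if (dst == NULL)
    {
        *error = 1;
        return NULL;
    }
    dst->value = src->value;
    dst->child = copy_node_guarded(src->child, depth + 1, error);
    return dst;
}

/* Copy a tree; on failure, release the partial copy and return NULL. */
Node *
copy_tree(const Node *src, int *error)
{
    Node *result;

    *error = 0;
    result = copy_node_guarded(src, 0, error);
    if (*error)
    {
        free_tree(result);
        return NULL;
    }
    return result;
}
```

The guard is a single comparison per node, which is consistent with the small overhead reported in the commit message, although the numbers there refer to the real check rather than to this sketch.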
  6. 06 Dec, 2010 3 commits
  7. 05 Dec, 2010 1 commit
    • Tom Lane's avatar
      Reduce memory consumption inside inheritance_planner(). · d1001a78
      Tom Lane authored
      Avoid eating quite so much memory for large inheritance trees, by
      reclaiming the space used by temporary copies of the original parsetree and
      range table, as well as the workspace needed during planning.  The cost is
      needing to copy the finished plan trees out of the child memory context.
      Although this looks like it ought to slow things down, my testing shows
      it actually is faster, apparently because fewer interactions with malloc()
      are needed and/or we can do the work within a more readily cacheable amount
      of memory.  That result might be platform-dependent, but I'll take it.
      
      Per a gripe from John Papandriopoulos, in which it was pointed out that the
      memory consumption actually grew as O(N^2) for sufficiently many child
      tables, since we were creating N copies of the N-element range table.
      d1001a78
  8. 04 Dec, 2010 7 commits
    • Tom Lane's avatar
      Fix two small bugs in new gistget.c logic. · d1f5a92e
      Tom Lane authored
      1. Complain, rather than silently doing nothing, if an "invalid" tuple
      is found on a leaf page.  Per off-list discussion with Heikki.
      
      2. Fix oversight in code that removes a GISTSearchItem from the search
      queue: we have to reset lastHeap if this was the last heap item in the
      parent GISTSearchTreeItem.  Otherwise subsequent additions will do the
      wrong thing.  This was probably masked in early testing because in typical
      cases the parent item would now be completely empty and would be deleted on
      next call.  You'd need a queued non-leaf page at exactly the same distance
      as a heap tuple to expose the bug.
      d1f5a92e
    • Peter Eisentraut's avatar
      Make output width consistent for all ways of invoking a regression test · 387e468b
      Peter Eisentraut authored
      run_schedule() and run_single_test() were using different output widths, which
      would show up in bigcheck/bigtest, for example.
      387e468b
    • Tom Lane's avatar
      Update comment to match later code changes. · e194a942
      Tom Lane authored
      e194a942
    • Tom Lane's avatar
      Add KNNGIST support to contrib/pg_trgm. · b525bf77
      Tom Lane authored
      Teodor Sigaev, with some revision by Tom
      b525bf77
    • Tom Lane's avatar
      Add external documentation for KNNGIST. · b576757d
      Tom Lane authored
      b576757d
    • Tom Lane's avatar
      Put back gistgettuple's check for backwards scan request. · 04910a3a
      Tom Lane authored
      On reflection it's a bad idea for the KNNGIST patch to have removed that.
      We don't want it silently returning incorrect answers.
      04910a3a
    • Tom Lane's avatar
      KNNGIST, otherwise known as order-by-operator support for GIST. · 55450687
      Tom Lane authored
      This commit represents a rather heavily editorialized version of
      Teodor's builtin_knngist_itself-0.8.2 and builtin_knngist_proc-0.8.1
      patches.  I redid the opclass API to add a separate Distance method
      instead of turning the Consistent method into an illogical mess,
      fixed some bit-rot in the rbtree interfaces, and generally worked over
      the code style and comments.
      
      There's still no non-code documentation to speak of, but I'll work on
      that separately.  Some contrib-module changes are also yet to come
      (right now, point <-> point is the only KNN-ified operator).
      
      Teodor Sigaev and Tom Lane
      55450687
  9. 03 Dec, 2010 6 commits
  10. 02 Dec, 2010 5 commits
  11. 01 Dec, 2010 1 commit
    • Tom Lane's avatar
      Prevent inlining a SQL function with multiple OUT parameters. · 225f0aa3
      Tom Lane authored
      There were corner cases in which the planner would attempt to inline such
      a function, which would result in a failure at runtime due to loss of
      information about exactly what the result record type is.  Fix by disabling
      inlining when the function's recorded result type is RECORD.  There might
      be some sub-cases where inlining could still be allowed, but this is a
      simple and backpatchable fix, so leave refinements for another day.
      Per bug #5777 from Nate Carson.
      
      Back-patch to all supported branches.  8.1 happens to avoid a core-dump
      here, but it still does the wrong thing.
      225f0aa3
  12. 29 Nov, 2010 4 commits
    • Tom Lane's avatar
      Simplify and speed up mapping of index opfamilies to pathkeys. · c0b5fac7
      Tom Lane authored
      Formerly we looked up the operators associated with each index (caching
      them in relcache) and then the planner looked up the btree opfamily
      containing such operators in order to build the btree-centric pathkey
      representation that describes the index's sort order.  This is quite
      pointless for btree indexes: we might as well just use the index's opfamily
      information directly.  That saves syscache lookup cycles during planning,
      and furthermore allows us to eliminate the relcache's caching of operators
      altogether, which may help in reducing backend startup time.
      
      I added code to plancat.c to perform the same type of double lookup
      on-the-fly if it's ever faced with a non-btree amcanorder index AM.
      If such a thing actually becomes interesting for production, we should
      replace that logic with some more-direct method for identifying the
      corresponding btree opfamily; but it's not worth spending effort on now.
      
      There is considerably more to do pursuant to my recent proposal to get rid
      of sort-operator-based representations of sort orderings, but this patch
      grabs some of the low-hanging fruit.  I'll look at the remainder of that
      work after the current commitfest.
      c0b5fac7
    • Heikki Linnakangas's avatar
      3c42efce
    • Robert Haas's avatar
      Fix typo. · fab7fdb9
      Robert Haas authored
      Fujii Masao
      fab7fdb9
    • Simon Riggs's avatar
      Move call to GetTopTransactionId() earlier in LockAcquire(), · ed78384a
      Simon Riggs authored
      removing an infrequently occurring race condition in Hot Standby.
      An xid must be assigned before a lock appears in shared memory,
      rather than immediately after, else GetRunningTransactionLocks()
      may see InvalidTransactionId, causing assertion failures during
      lock processing on standby.
      
      Bug report and diagnosis by Fujii Masao, fix by me.
      ed78384a
  13. 27 Nov, 2010 2 commits