1. 23 Jan, 2013 9 commits
    • Improve concurrency of foreign key locking · 0ac5ad51
      Alvaro Herrera authored
      This patch introduces two additional lock modes for tuples: "SELECT FOR
      KEY SHARE" and "SELECT FOR NO KEY UPDATE".  These don't block each
      other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
      FOR UPDATE".  UPDATE commands that do not modify the values stored in
      the columns that are part of the key of the tuple now grab a SELECT FOR
      NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
      with tuple locks of the FOR KEY SHARE variety.
      
      Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
      means the concurrency improvement applies to them, which is the whole
      point of this patch.
      
      The added tuple lock semantics require some rejiggering of the multixact
      module, so that the locking level that each transaction is holding can
      be stored alongside its Xid.  Also, multixacts now need to persist
      across server restarts and crashes, because they can now represent not
      only tuple locks, but also tuple updates.  This means we need more
      careful tracking of the lifetime of pg_multixact SLRU files; since they now
      persist longer, we require more infrastructure to figure out when they
      can be removed.  pg_upgrade also needs to be careful to copy
      pg_multixact files over from the old server to the new, or at least part
      of multixact.c state, depending on the versions of the old and new
      servers.
      
      Tuple time qualification rules (HeapTupleSatisfies routines) need to be
      careful not to consider tuples with the "is multi" infomask bit set as
      being only locked; they might need to look up MultiXact values (i.e.
      possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
      whereas they previously were assured to only use information readily
      available from the tuple header.  This is considered acceptable, because
      the extra I/O would involve cases that would previously cause some
      commands to block waiting for concurrent transactions to finish.
      
      Another important change is the fact that locking tuples that have
      previously been updated causes the future versions to be marked as
      locked, too; this is essential for correctness of foreign key checks.
      This also causes additional WAL-logging (there was previously a single
      WAL record for a locked tuple; now there is one for each updated copy
      of the tuple).
      
      With all this in place, contention related to tuples being checked by
      foreign key rules should be much reduced.
      
      As a bonus, this fixes the old behavior in which a subtransaction that
      grabbed a stronger tuple lock than the parent (sub)transaction held on a
      given tuple, and later aborted, caused the weaker lock to be lost.
      
      Many new spec files were added for the isolation tester framework, to
      ensure overall behavior is sane.  There's probably room for several more
      tests.
      
      There were several reviewers of this patch; in particular, Noah Misch
      and Andres Freund spent considerable time on it.  The original idea for
      the patch came from Simon Riggs, after a problem report by Joel Jacobson.
      Most code is from me, with contributions from Marti Raudsepp, Alexander
      Shulgin, Noah Misch and Andres Freund.
      
      This patch was discussed in several pgsql-hackers threads; the most
      important start at the following message-ids:
      	AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com
      	1290721684-sup-3951@alvh.no-ip.org
      	1294953201-sup-2099@alvh.no-ip.org
      	1320343602-sup-2290@alvh.no-ip.org
      	1339690386-sup-8927@alvh.no-ip.org
      	4FE5FF020200002500048A3D@gw.wicourts.gov
      	4FEAB90A0200002500048B7D@gw.wicourts.gov
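As a rough illustration of the four tuple lock modes discussed in this commit, the sketch below encodes which modes block each other. The conflict sets follow PostgreSQL's documented row-level lock table as best understood here; the Python names (`CONFLICTS`, `blocks`, `lock_mode_for_update`) are hypothetical and are not taken from the patch.

```python
# Tuple lock modes, weakest to strongest.
KEY_SHARE = "FOR KEY SHARE"
SHARE = "FOR SHARE"
NO_KEY_UPDATE = "FOR NO KEY UPDATE"
UPDATE = "FOR UPDATE"

# For each requested mode, the modes held by another transaction that block it.
CONFLICTS = {
    KEY_SHARE: {UPDATE},
    SHARE: {NO_KEY_UPDATE, UPDATE},
    NO_KEY_UPDATE: {SHARE, NO_KEY_UPDATE, UPDATE},
    UPDATE: {KEY_SHARE, SHARE, NO_KEY_UPDATE, UPDATE},
}


def blocks(held, requested):
    """Return True if a lock already held in mode `held` blocks `requested`."""
    return held in CONFLICTS[requested]


def lock_mode_for_update(modifies_key_columns):
    """Mode an UPDATE takes on the tuple (illustrative helper)."""
    return UPDATE if modifies_key_columns else NO_KEY_UPDATE


# A foreign key check (FOR KEY SHARE) no longer blocks a non-key UPDATE.
fk_check_blocks_plain_update = blocks(KEY_SHARE, lock_mode_for_update(False))
```

The point of the patch shows up in the last line: the foreign key trigger's lock and an UPDATE that leaves key columns alone can coexist, whereas the old FOR SHARE lock would have blocked that UPDATE.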
    • Further documentation tweaks for event triggers. · f925c79b
      Robert Haas authored
      Per discussion between Dimitri Fontaine, myself, and others.
    • Implement pg_unreachable() on MSVC. · 52906f17
      Heikki Linnakangas authored
    • Gitignore vcxproj files. · eaf76484
      Andrew Dunstan authored
      Per request from Craig Ringer.
    • Fix more issues with cascading replication and timeline switches. · 990fe3c4
      Heikki Linnakangas authored
      When a standby server follows the master using WAL archive, and it chooses
      a new timeline (recovery_target_timeline='latest'), it only fetches the
      timeline history file for the chosen target timeline, not any other history
      files that might be missing from pg_xlog. For example, if the current
      timeline is 2, and we choose 4 as the new recovery target timeline, the
      history file for timeline 3 is not fetched, even if it's part of this
      server's history. That's enough for the standby itself - the history file
      for timeline 4 includes timeline 3 as well - but if a cascading standby
      server wants to recover to timeline 3, it needs the history file. To fix,
      when a new recovery target timeline is chosen, try to copy any missing
      history files from the archive to pg_xlog between the old and new target
      timeline.
      
      A second similar issue was with the WAL files. When a standby recovers from
      archive, and it reaches a segment that contains a switch to a new timeline,
      recovery fetches only the WAL file labelled with the new timeline's ID. The
      file from the new timeline contains a copy of the WAL from the old timeline
      up to the point where the switch happened, and recovery recovers it from the
      new file. But in streaming replication, walsender only tries to read it
      from the old timeline's file. To fix, change walsender to read it from the
      new file, so that it behaves the same as recovery in that sense, and doesn't
      try to open the possibly nonexistent file with the old timeline's ID.
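The first fix in this commit (fetching the history files that lie between the old and new target timelines) can be sketched as follows. This is a minimal illustration, not the server's code: the function name and the `present` set are hypothetical, the `%08X.history` naming is assumed from PostgreSQL's usual timeline history file names, and every timeline ID strictly between the two targets is treated as a candidate.

```python
def missing_history_files(old_tli, new_tli, present):
    """List history files for timelines strictly between old_tli and new_tli
    that are not yet in `present` (file names already in pg_xlog)."""
    candidates = ["%08X.history" % tli for tli in range(old_tli + 1, new_tli)]
    return [name for name in candidates if name not in present]


# The example from the commit message: current timeline 2, new target 4.
to_fetch = missing_history_files(2, 4, {"00000004.history"})
```

A real implementation would only attempt each copy from the archive and carry on if a file is not there, since the commit message says to "try to copy" the missing files.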
    • pg_upgrade: remove --single-transaction usage · 861ad67b
      Bruce Momjian authored
      With AtEOXact applied, --single-transaction makes pg_restore slower, and
      has the potential to require lock table configuration, so remove the
      argument.
      
      Per suggestion from Tom.
    • doc: Fix declared number of columns in table · 21c87a0d
      Peter Eisentraut authored
      This was broken in 841a5150.
    • Fix a few small bugs in yesterday's event trigger patch. · ddef9a00
      Robert Haas authored
      Dimitri Fontaine
  2. 22 Jan, 2013 3 commits
    • Fix CREATE EVENT TRIGGER syntax synopsis in documentation. · 4c977319
      Robert Haas authored
      Dimitri Fontaine, per a report from Thom Brown
    • Typo fixes. · 9917a491
      Robert Haas authored
      Noted by Thom Brown.
    • Add infrastructure for storing a VARIADIC ANY function's VARIADIC flag. · 75b39e79
      Tom Lane authored
      Originally we didn't bother to mark FuncExprs with any indication whether
      VARIADIC had been given in the source text, because there didn't seem to be
      any need for it at runtime.  However, because we cannot fold a VARIADIC ANY
      function's arguments into an array (since they're not necessarily all the
      same type), we do actually need that information at runtime if VARIADIC ANY
      functions are to respond unsurprisingly to use of the VARIADIC keyword.
      Add the missing field, and also fix ruleutils.c so that VARIADIC ANY
      function calls are dumped properly.
      
      Extracted from a larger patch that also fixes concat() and format() (the
      only two extant VARIADIC ANY functions) to behave properly when VARIADIC is
      specified.  This portion seems appropriate to review and commit separately.
      
      Pavel Stehule
  3. 21 Jan, 2013 5 commits
  4. 20 Jan, 2013 2 commits
    • Fix an O(N^2) performance issue for sessions modifying many relations. · d5b31cc3
      Tom Lane authored
      AtEOXact_RelationCache() scanned the entire relation cache at the end of
      any transaction that created a new relation or assigned a new relfilenode.
      Thus, clients such as pg_restore had an O(N^2) performance problem that
      would start to be noticeable after creating 10000 or so tables.  Since
      typically only a small number of relcache entries need any cleanup, we
      can fix this by keeping a small list of their OIDs and doing hash_searches
      for them.  We fall back to the full-table scan if the list overflows.
      
      Ideally, the maximum list length would be set at the point where N
      hash_searches would cost just less than the full-table scan.  Some quick
      experimentation says that point might be around 50-100; I (tgl)
      conservatively set MAX_EOXACT_LIST = 32.  For the case that we're worried
      about here, which is short single-statement transactions, it's unlikely
      there would ever be more than about a dozen list entries anyway; so it's
      probably not worth being too tense about the value.
      
      We could avoid the hash_searches by instead keeping the target relcache
      entries linked into a list, but that would be noticeably more complicated
      and bug-prone because of the need to maintain such a list in the face of
      relcache entry drops.  Since a relcache entry can only need such cleanup
      after a somewhat-heavyweight filesystem operation, trying to save a
      hash_search per cleanup doesn't seem very useful anyway --- it's the scan
      over all the not-needing-cleanup entries that we wish to avoid here.
      
      Jeff Janes, reviewed and tweaked a bit by Tom Lane
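The bounded-list idea described in this commit can be sketched roughly as below. Only `MAX_EOXACT_LIST = 32` comes from the commit message; the class, its method names, and the dict standing in for the relation cache hash table are hypothetical simplifications.

```python
MAX_EOXACT_LIST = 32  # value stated in the commit message


class EOXactList:
    """Remember which relcache entries need end-of-transaction cleanup."""

    def __init__(self):
        self.oids = []
        self.overflowed = False

    def remember(self, oid):
        # Once the list is full, stop tracking and fall back to a full scan.
        if self.overflowed:
            return
        if len(self.oids) < MAX_EOXACT_LIST:
            self.oids.append(oid)
        else:
            self.overflowed = True

    def entries_needing_cleanup(self, relcache):
        """relcache maps OID -> entry (stand-in for the relcache hash table)."""
        if self.overflowed:
            return list(relcache.values())  # full-table scan
        # One hash lookup per remembered OID; skip duplicates and dropped entries.
        return [relcache[oid] for oid in dict.fromkeys(self.oids) if oid in relcache]
```

With only a handful of remembered OIDs, cleanup cost no longer depends on the size of the relation cache, which is what removes the O(N^2) behavior for clients that create many tables in separate transactions.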
    • Clarify that streaming replication can be both async and sync · 0a2da528
      Magnus Hagander authored
      Josh Kupershmidt
  5. 19 Jan, 2013 4 commits
    • Use SET TRANSACTION READ ONLY in pg_dump, if server supports it. · 26d905a1
      Tom Lane authored
      This currently does little except serve as documentation.  (The one case
      where it has a performance benefit, SERIALIZABLE mode in 9.1 and up, was
      already using READ ONLY mode.)  However, it's possible that it might have
      performance benefits in future, and in any case it seems like good
      practice since it would catch any accidentally non-read-only operations.
      
      Pavan Deolasee
    • Modernize string literal syntax in tutorial example. · 4b94cfb5
      Tom Lane authored
      Un-double the backslashes in the LIKE patterns, since
      standard_conforming_strings is now the default.  Just to be sure, include
      a command to set standard_conforming_strings to ON in the example.
      
      Back-patch to 9.1, where standard_conforming_strings became the default.
      
      Josh Kupershmidt, reviewed by Jeff Janes
    • Make pgxs build executables with the right suffix. · 9f10f7dc
      Andrew Dunstan authored
      Complaint and patch from Zoltán Böszörményi.
      
      When cross-compiling, the native make doesn't know
      about the Windows .exe suffix, so it only builds with
      it when explicitly told to do so.
      
      The native make will not see the link between the target
      name and the built executable, and might thus do unnecessary
      work, but that's a bigger problem than this one, if in fact
      we consider it a problem at all.
      
      Back-patch to all live branches.
    • libpq doc: Clarify what commands return PGRES_TUPLES_OK · fb197290
      Peter Eisentraut authored
      The old text claimed that INSERT and UPDATE always return
      PGRES_COMMAND_OK, but INSERT/UPDATE with RETURNING return
      PGRES_TUPLES_OK.
      
      Josh Kupershmidt
  6. 18 Jan, 2013 8 commits
    • Protect against SnapshotNow race conditions in pg_tablespace scans. · c2a14bc7
      Tom Lane authored
      Use of SnapshotNow is known to expose us to race conditions if the tuple(s)
      being sought could be updated by concurrently-committing transactions.
      CREATE DATABASE and DROP DATABASE are particularly exposed because they do
      heavyweight filesystem operations during their scans of pg_tablespace,
      so that the scans run for a very long time compared to most.  Furthermore,
      the potential consequences of a missed or twice-visited row are nastier
      than average:
      
      * createdb() could fail with a bogus "file already exists" error, or
        silently fail to copy one or more tablespaces' worth of files into the
        new database.
      
      * remove_dbtablespaces() could miss one or more tablespaces, thus failing
        to free filesystem space for the dropped database.
      
      * check_db_file_conflict() could likewise miss a tablespace, leading to an
        OID conflict that could result in data loss either immediately or in
        future operations.  (This seems of very low probability, though, since a
        duplicate database OID would be unlikely to start with.)
      
      Hence, it seems worth fixing these three places to use MVCC snapshots, even
      though this will someday be superseded by a generic solution to SnapshotNow
      race conditions.
      
      Back-patch to all active branches.
      
      Stephen Frost and Tom Lane
    • Unbreak lock conflict detection for Hot Standby. · d8c38966
      Robert Haas authored
      This got broken in the original fast-path locking patch, because
      I failed to account for the fact that Hot Standby startup process
      might take a strong relation lock on a relation in a database to
      which it is not bound, and confused MyDatabaseId with the database
      ID of the relation being locked.
      
      Report and diagnosis by Andres Freund.  Final form of patch by me.
    • Improve pg_upgrade error report · 600250d0
      Bruce Momjian authored
      If the cluster alignments don't match, output this suggestion:
      
      	Likely one cluster is a 32-bit install, the other 64-bit
    • Fix off-by-one bug in xlog reading logic · 8c17144c
      Alvaro Herrera authored
      Bug reported by Michael Paquier
      
      Author: Andres Freund
    • psql latex fixes · 74a82baf
      Bruce Momjian authored
      Remove extra line at bottom of table for new 'latex' mode border=3.
      Also update 'latex'-longtable 'tableattr' docs to say
      'whitespace-separated' instead of 'space'.
    • Now that START_REPLICATION returns the next timeline's ID after reaching end · 6f7cddc7
      Heikki Linnakangas authored
      of timeline, take advantage of that in walreceiver.
      
      The startup process is still in control of choosing the target timeline, by
      scanning the timeline history files present in pg_xlog, but walreceiver now
      uses the next timeline's ID to fetch its history file immediately after it
      has finished streaming the old timeline. Before, the standby would first try
      to restart streaming on the old timeline, which fetches the missing timeline
      history file as a side-effect, and only then restart from the new timeline.
      This patch eliminates the extra iteration, which speeds up the timeline
      switch and reduces the noise in the log caused by the extra restart on the
      old timeline.
    • Use the right timeline when beginning to stream from master. · 2ff65553
      Heikki Linnakangas authored
      The xlogreader refactoring broke the logic to decide which timeline to start
      streaming from. XLogPageRead() uses the timeline history to check which
      timeline the requested WAL position falls into. However, after the
      refactoring, XLogPageRead() is always first called with the first page in
      the segment, to verify the segment header, and only then with the actual WAL
      position we're interested in. That first read of the segment's header made
      XLogPageRead() always start streaming from the old timeline containing
      the segment header, not the timeline containing the actual record, if there
      was a timeline switch within the segment.
      
      I thought I fixed this yesterday, but that fix was too narrow and only fixed
      this for the corner-case that the timeline switch happened in the first page
      of the segment. To fix this more robustly, explicitly pass the position of
      the record we're actually interested in to XLogPageRead, and use that to
      decide which timeline to read from, rather than deduce it from the page and
      offset.
      
      Per report from Fujii Masao.
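The difference between choosing the timeline from the segment header position and from the record position can be shown with a small sketch. The history representation (a list of timeline IDs with the position where each one ended) and all names here are hypothetical; the real code works from the parsed timeline history file.

```python
def timeline_of(position, history):
    """Return the timeline that contains `position`.

    history: list of (tli, end_position) in ascending order, where
    end_position is the switch point at which that timeline ended,
    or None for the latest timeline.
    """
    for tli, end_position in history:
        if end_position is None or position < end_position:
            return tli
    raise ValueError("position is beyond the known history")


SEGMENT_START = 0x3000000             # first page of the segment (its header)
SWITCH_POINT = SEGMENT_START + 0x2000  # timeline 1 ends inside this segment
RECORD = SEGMENT_START + 0x5000        # the record we actually want

history = [(1, SWITCH_POINT), (2, None)]

tli_from_header = timeline_of(SEGMENT_START, history)  # what the bug used
tli_from_record = timeline_of(RECORD, history)         # what the fix uses
```

Deciding from the header position picks the old timeline even though the wanted record lives on the new one, which is the situation the commit describes for a timeline switch within the segment.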
  7. 17 Jan, 2013 9 commits
    • When xlogreader asks the callback function to read a page, make sure we · 88228e6f
      Heikki Linnakangas authored
      get a large enough part of the page to include the beginning of the next
      record we're interested in. The XLogPageRead callback uses the requested
      length to decide which timeline to stream WAL from, and if the first call
      is short, and the page contains a timeline switch, we'll repeatedly try
      to stream that page from the old timeline, and never get across the
      timeline switch.
    • I added a result set to START_STREAMING command, but neglected walreceiver. · 3684a534
      Heikki Linnakangas authored
      The patch to allow pg_receivexlog to switch timeline added a result set
      after copy has ended in START_STREAMING command, to return the next
      timeline's ID to the client. But walreceiver didn't get the memo, and threw
      an error on the unexpected result set. Fix.
    • Accelerate end-of-transaction dropping of relations · 279628a0
      Alvaro Herrera authored
      When relations are dropped, at end of transaction we need to remove the
      files and clean the buffer pool of buffers containing pages of those
      relations.  Previously we would scan the buffer pool once per relation
      to clean up buffers.  When there are many relations to drop, the
      repeated scans make this process slow; so we now instead pass a list of
      relations to drop and scan the pool once, checking each buffer against
      the passed list.  When the number of relations is larger than a
      threshold (which as of this patch is being set to 20 relations) we sort
      the array before starting, and bsearch the array; when it's smaller, we
      simply scan the array linearly each time, because that's faster.  The
      exact optimal threshold value depends on many factors, but the
      difference is not likely to be significant enough to justify making it
      user-settable.
      
      This has been measured to be a significant win (a 15x win when dropping
      100,000 relations; an extreme case, but reportedly a real one).
      
      Author: Tomas Vondra, some tweaks by me
      Reviewed by: Robert Haas, Shigeru Hanada, Andres Freund, Álvaro Herrera
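The single-pass approach with a size threshold might look like the sketch below. The threshold of 20 is from the commit message; the function, the constant name, and the use of plain integers to stand in for relation identifiers and buffer tags are hypothetical simplifications.

```python
from bisect import bisect_left

BSEARCH_THRESHOLD = 20  # threshold stated in the commit message


def buffers_to_drop(buffer_rels, dropped_rels):
    """Scan the buffer pool once and return the indexes of buffers that
    belong to any dropped relation.

    buffer_rels: for each buffer, the relation whose page it holds.
    dropped_rels: the relations being dropped in this transaction.
    """
    rels = list(dropped_rels)
    use_bsearch = len(rels) > BSEARCH_THRESHOLD
    if use_bsearch:
        rels.sort()  # sort once so each buffer can be checked by binary search

    hits = []
    for buf_id, rel in enumerate(buffer_rels):
        if use_bsearch:
            i = bisect_left(rels, rel)
            found = i < len(rels) and rels[i] == rel
        else:
            found = rel in rels  # short list: a linear scan is cheaper
        if found:
            hits.append(buf_id)
    return hits
```

Either branch visits each buffer exactly once, so the cost is one pool scan regardless of how many relations are dropped, instead of one scan per relation.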
    • Make pg_receivexlog and pg_basebackup -X stream work across timeline switches. · 0b632913
      Heikki Linnakangas authored
      This mirrors the changes done earlier to the server in standby mode. When
      receivelog reaches the end of a timeline, as reported by the server, it
      fetches the timeline history file of the next timeline, and restarts
      streaming from the new timeline by issuing a new START_STREAMING command.
      
      When pg_receivexlog crosses a timeline, it leaves the .partial suffix on the
      last segment on the old timeline. This helps you to tell apart a partial
      segment left in the directory because of a timeline switch, and a completed
      segment. If you just follow a single server, it won't make a difference, but
      it can be significant in more complicated scenarios where new WAL is still
      generated on the old timeline.
      
      This includes two small changes to the streaming replication protocol:
      First, when you reach the end of timeline while streaming, the server now
      sends the TLI of the next timeline in the server's history to the client.
      pg_receivexlog uses that as the next timeline, so that it doesn't need to
      parse the timeline history file like a standby server does. Second, when
      BASE_BACKUP command sends the begin and end WAL positions, it now also sends
      the timeline IDs corresponding to the positions.
    • Improve memory space management in tuplesort and tuplestore. · 8ae35e91
      Tom Lane authored
      The code originally just doubled the size of the tuple-pointer array so
      long as that would fit in allowedMem.  This could result in failing to use
      as much as half of allowedMem, if (as is typical) the last doubling attempt
      didn't quite fit.  Worse, we might double the array size but be unable to
      use most of the added slots, because there was no room left within the
      allowedMem limit for tuples the slots should point to.  To fix, double only
      so long as we've used less than half of allowedMem in total.  Then do one
      more array enlargement, but scale it based on total memory consumption so
      far.  This will work nicely as long as the average tuple size is reasonably
      stable, and in any case should be better than the old method.
      
      This change will result in large sort operations consuming a larger
      fraction of work_mem than they typically did in the past.  The release
      notes should mention that users may want to revisit their work_mem
      settings, if they'd tuned those settings based on the old behavior of
      sorting.
      
      Jeff Janes, reviewed by Peter Geoghegan and Robert Haas
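The growth rule described in this commit can be sketched as below. The doubling condition follows the commit message; the exact scaling used for the final enlargement is an assumption made for illustration (grow the array in proportion to how much of the budget remains relative to what has been used), and the function and parameter names are hypothetical.

```python
def next_array_size(cur_slots, mem_used, allowed_mem):
    """Return (new_slots, may_grow_again) for the tuple-pointer array.

    Assumes allowed_mem > 0 and that mem_used counts all memory consumed
    so far (array plus tuples).
    """
    if mem_used * 2 < allowed_mem:
        # Less than half the budget used: plain doubling is still safe.
        return cur_slots * 2, True
    # Final enlargement: assume the average tuple size stays stable, so the
    # slot count that fits scales with allowed_mem / mem_used (assumed formula).
    new_slots = int(cur_slots * allowed_mem / mem_used)
    return max(new_slots, cur_slots), False


early = next_array_size(1024, 100, 1000)
late = next_array_size(1024, 800, 1000)
```

Under this rule the last step sizes the array to what the remaining budget can actually fill, rather than doubling into slots that have no memory left for the tuples they would point to.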
    • Fix a couple of error-handling bugs in the xlogreader patch. · 1296d5c5
      Heikki Linnakangas authored
      XLogReadRecord should reset its state on every error, to make sure it
      re-reads the page on next call. It was inconsistent in that some errors did
      that, but some did not.
      
      In ReadRecord(), don't give up on an error if we're in standby mode. The
      loop was set up to retry, but the checks within the loop broke out of the
      loop on any error.
      
      Andres Freund, with some tweaking by me.
    • Add a latex-longtable output format to psql · b14f81bc
      Bruce Momjian authored
      latex longtable is more powerful than the 'tabular' output format
      'latex' uses.  Also add border=3 support to 'latex'.
    • Silence compiler warnings · 8ef69616
      Magnus Hagander authored
    • Make GiST indexes on-disk compatible with 9.2 again. · 9ee4d06f
      Heikki Linnakangas authored
      The patch that turned XLogRecPtr into a uint64 inadvertently changed the
      on-disk format of GiST indexes, because the NSN field in the GiST page
      opaque is an XLogRecPtr. That breaks pg_upgrade. Revert the format of that
      field back to the two-field struct that XLogRecPtr was before. This is the
      same thing we did to LSNs in the page header to avoid changing the on-disk format.
      
      Bump catversion, as this invalidates any existing GiST indexes built on
      9.3devel.
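Why switching from a two-field struct to a single uint64 changes the bytes on disk can be seen in a small sketch. It assumes a little-endian machine and assumes the old struct stored the high 32 bits first and the low 32 bits second (recalled here as `xlogid` and `xrecoff`); the helper names are hypothetical.

```python
import struct


def pack_as_uint64(lsn):
    """Layout of a 64-bit LSN stored as one little-endian integer."""
    return struct.pack("<Q", lsn)


def pack_as_two_fields(lsn):
    """Layout of the same LSN as two 32-bit fields, high half first."""
    return struct.pack("<II", lsn >> 32, lsn & 0xFFFFFFFF)


lsn = 0x0000000100000002
same_layout = pack_as_uint64(lsn) == pack_as_two_fields(lsn)
```

On a little-endian machine the two layouts put the halves in opposite order, so an index written with one cannot be read with the other. On a big-endian machine the layouts would coincide, which is one reason this kind of change is easy to miss.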