- 17 Jan, 2013 13 commits
-
-
Heikki Linnakangas authored
The patch to allow pg_receivexlog to switch timelines added a result set, sent after the COPY has ended in the START_STREAMING command, to return the next timeline's ID to the client. But walreceiver didn't get the memo, and threw an error on the unexpected result set. Fix.
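Roughly what a streaming client now has to cope with, sketched with plain libpq (simplified from what walreceiver and pg_receivexlog actually do; the single-column result layout here is an assumption):

    #include <stdio.h>
    #include <libpq-fe.h>

    /* After PQgetCopyData() reports end of COPY, drain the remaining results:
     * in 9.3 the server may send a one-row result set (next-timeline info)
     * before the final command-complete, and the client must not choke on it. */
    static void
    drain_after_copy(PGconn *conn)
    {
        PGresult *res;

        while ((res = PQgetResult(conn)) != NULL)
        {
            if (PQresultStatus(res) == PGRES_TUPLES_OK)
                printf("next timeline: %s\n", PQgetvalue(res, 0, 0));
            else if (PQresultStatus(res) != PGRES_COMMAND_OK)
                fprintf(stderr, "unexpected result: %s", PQerrorMessage(conn));
            PQclear(res);
        }
    }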
-
Alvaro Herrera authored
When relations are dropped, at end of transaction we need to remove the files and clean the buffer pool of buffers containing pages of those relations. Previously we would scan the buffer pool once per relation to clean up buffers. When there are many relations to drop, the repeated scans make this process slow; so we now instead pass a list of relations to drop and scan the pool once, checking each buffer against the passed list.

When the number of relations is larger than a threshold (which as of this patch is being set to 20 relations) we sort the array before starting, and bsearch the array; when it's smaller, we simply scan the array linearly each time, because that's faster. The exact optimal threshold value depends on many factors, but the difference is not likely to be significant enough to justify making it user-settable.

This has been measured to be a significant win (a 15x win when dropping 100,000 relations; an extreme case, but reportedly a real one).

Author: Tomas Vondra, some tweaks by me
Reviewed by: Robert Haas, Shigeru Hanada, Andres Freund, Álvaro Herrera
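A minimal sketch of the idea (illustrative only, not the actual bufmgr.c code; names and the exact threshold handling are simplified):

    #include <stdlib.h>

    #define USE_BSEARCH_THRESHOLD 20       /* roughly the cutoff described above */

    typedef struct RelFileId { unsigned int relnode; } RelFileId;

    static int
    relid_cmp(const void *a, const void *b)
    {
        const RelFileId *ra = a, *rb = b;
        return (ra->relnode > rb->relnode) - (ra->relnode < rb->relnode);
    }

    /* One pass over all buffers, checking each against the list of relations
     * being dropped; sort + bsearch only pays off for longer drop lists. */
    static void
    drop_relation_buffers(RelFileId *drop, int ndrop,
                          const RelFileId *buffers, int nbuffers)
    {
        int use_bsearch = (ndrop > USE_BSEARCH_THRESHOLD);
        int i, j;

        if (use_bsearch)
            qsort(drop, ndrop, sizeof(RelFileId), relid_cmp);

        for (i = 0; i < nbuffers; i++)
        {
            int match = 0;

            if (use_bsearch)
                match = bsearch(&buffers[i], drop, ndrop,
                                sizeof(RelFileId), relid_cmp) != NULL;
            else
                for (j = 0; j < ndrop; j++)
                    if (drop[j].relnode == buffers[i].relnode)
                        { match = 1; break; }

            if (match)
                { /* invalidate this buffer */ }
        }
    }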
-
Heikki Linnakangas authored
This mirrors the changes done earlier to the server in standby mode. When receivelog reaches the end of a timeline, as reported by the server, it fetches the timeline history file of the next timeline, and restarts streaming from the new timeline by issuing a new START_STREAMING command.

When pg_receivexlog crosses a timeline switch, it leaves the .partial suffix on the last segment on the old timeline. This helps you tell apart a partial segment left in the directory because of a timeline switch from a completed segment. If you just follow a single server, it won't make a difference, but it can be significant in more complicated scenarios where new WAL is still generated on the old timeline.

This includes two small changes to the streaming replication protocol: First, when you reach the end of a timeline while streaming, the server now sends the TLI of the next timeline in the server's history to the client. pg_receivexlog uses that as the next timeline, so that it doesn't need to parse the timeline history file like a standby server does. Second, when the BASE_BACKUP command sends the begin and end WAL positions, it now also sends the timeline IDs corresponding to those positions.
-
Tom Lane authored
The code originally just doubled the size of the tuple-pointer array so long as that would fit in allowedMem. This could result in failing to use as much as half of allowedMem, if (as is typical) the last doubling attempt didn't quite fit. Worse, we might double the array size but be unable to use most of the added slots, because there was no room left within the allowedMem limit for tuples the slots should point to.

To fix, double only so long as we've used less than half of allowedMem in total. Then do one more array enlargement, but scale it based on total memory consumption so far. This will work nicely as long as the average tuple size is reasonably stable, and in any case should be better than the old method.

This change will result in large sort operations consuming a larger fraction of work_mem than they typically did in the past. The release notes should mention that users may want to revisit their work_mem settings, if they'd tuned those settings based on the old behavior of sorting.

Jeff Janes, reviewed by Peter Geoghegan and Robert Haas
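A sketch of that growth rule under the stated assumptions (illustrative, not the actual tuplesort.c code; memUsed here is taken to include both the slot array and the tuples it points to):

    /* Decide the next size of the tuple-pointer array. */
    static int
    new_memtuples_size(int memtupsize, long memUsed, long allowedMem)
    {
        /* While total memory use is under half the budget, keep doubling. */
        if (memUsed <= allowedMem / 2)
            return memtupsize * 2;

        /*
         * One final enlargement: assume tuples keep arriving at the average
         * size seen so far, and grow the array so that slots plus tuples
         * together just about fill allowedMem.
         */
        return (int) (memtupsize * ((double) allowedMem / (double) memUsed));
    }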
-
Heikki Linnakangas authored
XLogReadRecord should reset its state on every error, to make sure it re-reads the page on the next call. It was inconsistent in that some errors did that, but some did not. In ReadRecord(), don't give up on an error if we're in standby mode. The loop was set up to retry, but the checks within the loop broke out of the loop on any error. Andres Freund, with some tweaking by me.
-
Bruce Momjian authored
LaTeX's longtable environment is more powerful than the 'tabular' environment that the 'latex' output format uses. Also add border=3 support to 'latex'.
-
Magnus Hagander authored
-
Heikki Linnakangas authored
The patch that turned XLogRecPtr into a uint64 inadvertently changed the on-disk format of GiST indexes, because the NSN field in the GiST page opaque is an XLogRecPtr. That breaks pg_upgrade. Revert the format of that field back to the two-field struct that XLogRecPtr was before. This is the same thing we did to LSNs in the page header to avoid changing the on-disk format. Bump catversion, as this invalidates any existing GiST indexes built on 9.3devel.
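A sketch of the two representations involved (illustrative names, not the actual page-header macros):

    #include <stdint.h>

    typedef uint64_t XLogRecPtr64;              /* the new in-memory form */

    typedef struct PageXLogRecPtrSketch         /* the old two-field on-disk form */
    {
        uint32_t xlogid;                        /* high 32 bits */
        uint32_t xrecoff;                       /* low 32 bits */
    } PageXLogRecPtrSketch;

    static XLogRecPtr64
    nsn_get(PageXLogRecPtrSketch p)
    {
        return ((XLogRecPtr64) p.xlogid << 32) | p.xrecoff;
    }

    static void
    nsn_set(PageXLogRecPtrSketch *p, XLogRecPtr64 lsn)
    {
        p->xlogid = (uint32_t) (lsn >> 32);
        p->xrecoff = (uint32_t) lsn;
    }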
-
Magnus Hagander authored
It's better to start from what the OpenSSL people consider a good default and then remove insecure things (low encryption, exportable encryption and MD5 at this point) from that, instead of starting from everything that exists and removing things from that. We trust the OpenSSL people to make good choices about what the default is.
-
Magnus Hagander authored
This way the line doesn't shift right as the amount of data processed increases.
-
Magnus Hagander authored
When truncating at the end, as before, the output would often end up showing just the path instead of the filename. Also increase the length of the filename field by 5, which still keeps us at less than 80 characters in most output.
-
Magnus Hagander authored
On top of the previous support in pg_dump, add support to specify multiple tables (by using the -t option multiple times) to pg_restore, clusterdb, reindexdb and vacuumdb. Josh Kupershmidt, reviewed by Karl O. Pinc
-
Peter Eisentraut authored
It was largely full of outdated and incorrect information. Move the few notes which were still relevant into header comments of pg_backup_tar.c and pg_dumpall.c. Josh Kupershmidt
-
- 16 Jan, 2013 1 commit
-
-
Alvaro Herrera authored
This new facility can be used not only by xlog.c to carry out crash recovery, but also by external programs. By supplying a function to read XLog pages from somewhere, all the WAL reading code can be reused for completely different purposes.

For the standard backend use, the behavior should be pretty much the same as previously. As for non-backend programs, a hypothetical pg_xlogdump program is now closer to reality, but some more backend support is still necessary.

This patch was originally submitted by Andres Freund in a different form, but Heikki Linnakangas opted for and authored another design of the concept. Andres has advanced the patch since Heikki's initial version. Review and some (mostly cosmetic) changes by me.
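The general shape of such a callback-driven reader, as a hedged sketch (made-up names, not the actual xlogreader declarations):

    #include <stdint.h>
    #include <stddef.h>

    typedef uint64_t WalRecPtr;

    typedef struct WalReader WalReader;

    /* Caller-supplied callback: fill buf with reqLen bytes of WAL starting at
     * targetPagePtr, from wherever the caller keeps its WAL (pg_xlog, an
     * archive, a plain file...).  Return bytes read, or -1 on failure. */
    typedef int (*WalPageReadCB)(WalReader *reader, WalRecPtr targetPagePtr,
                                 int reqLen, char *buf, void *private_data);

    struct WalReader
    {
        WalPageReadCB read_page;
        void         *private_data;
        /* decoding state for the record currently being assembled follows */
    };

    /* The record-decoding layer only ever sees pages through read_page, so the
     * same code can serve crash recovery and a standalone dump tool alike. */
    extern WalReader *wal_reader_create(WalPageReadCB read_page, void *private_data);
    extern const char *wal_read_record(WalReader *reader, WalRecPtr start, size_t *len);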
-
- 15 Jan, 2013 5 commits
-
-
Heikki Linnakangas authored
On second thought, "none" could mislead one into thinking that you're connected to a database with that name. Duplicate the whole string, so that it can be more easily translated. In back-branches, though, just use an empty string in place of the database name, to avoid adding a translatable string.
-
Heikki Linnakangas authored
Backpatch all the way to 8.3. Fixes bug #7811, per report and diagnosis by Meng Qingzhong.
-
Alvaro Herrera authored
When attempting to move an object into the schema in which it already was, for most object classes we were correctly complaining about exactly that ("object is already in schema"); but for some other object classes, such as functions, we were instead complaining of a name collision ("object already exists in schema"). The latter is wrong and misleading, per complaint from Robert Haas in CA+TgmoZ0+gNf7RDKRc3u5rHXffP=QjqPZKGxb4BsPz65k7qnHQ@mail.gmail.com

To fix, refactor the way these checks are done. As a bonus, the resulting code is smaller and can also share some code with Rename cases.

While at it, remove use of getObjectDescriptionOids() in error messages. These are normally disallowed because of translatability considerations, but this one had slipped through since 9.1. (Not sure that this is worth backpatching, though, as it would create some untranslated messages in back branches.)

This is loosely based on a patch by KaiGai Kohei, heavily reworked by me.
-
Heikki Linnakangas authored
The WAL streaming message format changed in 9.3, so 9.3 pg_basebackup or pg_receivexlog won't work against older servers.
-
Tom Lane authored
Original coding would corrupt the hashtable if the item being updated was at the end of its bucket chain and the new hash key hashed to that same bucket. Diagnosis and fix by Heikki Linnakangas.
-
- 14 Jan, 2013 7 commits
-
-
Heikki Linnakangas authored
Because the return value of lseek() was assigned to an unsigned size_t variable, we'd fail to notice an error return code of -1. The compiler gave a warning about this. Andres Freund
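The hazard in miniature (a standalone sketch, not the code that was fixed):

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    int
    main(void)
    {
        int   fd = open("/etc/hosts", O_RDONLY);
        off_t pos;              /* off_t is signed, so the error check works */

        /*
         * Buggy pattern: "size_t len = lseek(fd, 0, SEEK_END); if (len < 0)".
         * len is unsigned, the comparison is always false (which is what the
         * compiler warned about), and a -1 error silently becomes a huge length.
         */
        pos = lseek(fd, 0, SEEK_END);
        if (pos < 0)
        {
            perror("lseek");
            return 1;
        }
        printf("size: %lld bytes\n", (long long) pos);
        close(fd);
        return 0;
    }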
-
Tom Lane authored
This was legal back in the days of add_missing_from, though perhaps never good style. It's not legal anymore ... Jan Urbański
-
Tom Lane authored
Dates outside the supported range could be entered, but would not print reasonably, and operations such as conversion to timestamp wouldn't behave sanely either. Since this has the potential to result in undumpable table data, it seems worth back-patching. Hitoshi Harada
-
Tom Lane authored
This seems to have been invented in 2011 to represent GMT+3 with no daylight-savings rules, as now used in Europe/Kaliningrad and Europe/Minsk. There are no conflicts, so we might as well add it to the Default list. Per bug #7804 from Ruslan Izmaylov.
-
Alvaro Herrera authored
Andres Freund
-
Tom Lane authored
The code in PostPrepare_Locks supposed that it could reassign locks to the prepared transaction's dummy PGPROC by deleting the PROCLOCK table entries and immediately creating new ones. This was safe when that code was written, but since we invented partitioning of the shared lock table, it's not safe --- another process could steal away the PROCLOCK entry in the short interval when it's on the freelist. Then, if we were otherwise out of shared memory, PostPrepare_Locks would have to PANIC, since it's too late to back out of the PREPARE at that point.

Fix by inventing a dynahash.c function to atomically update a hashtable entry's key. (This might possibly have other uses in future.)

This is an ancient bug that in principle we ought to back-patch, but the odds of someone hitting it in the field seem really tiny, because (a) the risk window is small, and (b) nobody runs servers with maxed-out lock tables for long, because they'll be getting non-PANIC out-of-memory errors anyway. So fixing it in HEAD seems sufficient, at least until the new code has gotten some testing.
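A toy illustration of the relink-in-place idea (this is not dynahash.c, which also has to deal with partition locking and with the same-bucket case addressed by the Jan 15 fix above):

    #include <stdio.h>

    #define NBUCKETS 8

    typedef struct Entry
    {
        int           key;
        struct Entry *next;
        /* lock payload would follow */
    } Entry;

    static Entry *buckets[NBUCKETS];

    static int hash(int key) { return key & (NBUCKETS - 1); }

    /*
     * Change an existing entry's key by unlinking it from its old bucket chain
     * and relinking it under the new key.  The entry itself is never freed, so
     * there is no window in which another process could claim its memory,
     * unlike a delete-then-insert, which can fail partway if the table is full.
     */
    static void
    update_key(Entry *e, int newkey)
    {
        Entry **p = &buckets[hash(e->key)];

        while (*p != e)             /* find the link pointing at e */
            p = &(*p)->next;
        *p = e->next;               /* unlink from the old chain */

        e->key = newkey;            /* relink at the head of the new chain */
        e->next = buckets[hash(newkey)];
        buckets[hash(newkey)] = e;
    }

    int
    main(void)
    {
        Entry a = { 3, NULL };

        buckets[hash(a.key)] = &a;
        update_key(&a, 11);
        printf("entry now in bucket %d\n", hash(a.key));
        return 0;
    }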
-
Peter Eisentraut authored
-
- 13 Jan, 2013 2 commits
-
-
Tom Lane authored
Forgot I was going to do this as part of the previous patch ...
-
Tom Lane authored
In commit 71450d7f, we added code to inform suitably-intelligent compilers that ereport() doesn't return if the elevel is ERROR or higher. This patch extends that to elog(), and also fixes a double-evaluation hazard that the previous commit created in ereport(), as well as reducing the emitted code size.

The elog() improvement requires the compiler to support __VA_ARGS__, which should be available in just about anything nowadays since it's required by C99. But our minimum language baseline is still C89, so add a configure test for that.

The previous commit assumed that ereport's elevel could be evaluated twice, which isn't terribly safe --- there are already counterexamples in xlog.c. On compilers that have __builtin_constant_p, we can use that to protect the second test, since there's no possible optimization gain if the compiler doesn't know the value of elevel. Otherwise, use a local variable inside the macros to prevent double evaluation. The local-variable solution is inferior because (a) it leads to useless code being emitted when elevel isn't constant, and (b) it increases the optimization level needed for the compiler to recognize that subsequent code is unreachable. But it seems better than not teaching non-gcc compilers about unreachability at all.

Lastly, if the compiler has __builtin_unreachable(), we can use that instead of abort(), resulting in a noticeable code savings since no function call is actually emitted. However, it seems wise to do this only in non-assert builds. In an assert build, continue to use abort(), so that the behavior will be predictable and debuggable if the "impossible" happens.

These changes involve making the ereport and elog macros emit do-while statement blocks not just expressions, which forces small changes in a few call sites.

Andres Freund, Tom Lane, Heikki Linnakangas
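An illustrative stand-in for the technique (made-up names; the real macros are in PostgreSQL's elog.h and are considerably more involved):

    #include <stdarg.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MY_ERROR 20

    static void
    my_elog_emit(int elevel, const char *fmt, ...)
    {
        va_list ap;

        va_start(ap, fmt);
        vfprintf(stderr, fmt, ap);
        va_end(ap);
        fputc('\n', stderr);
        if (elevel >= MY_ERROR)
            exit(1);        /* stands in for the longjmp back to the error handler */
    }

    /*
     * Statement-like (do/while) so it can be used anywhere a statement can.
     * When elevel is a compile-time constant >= ERROR, tell the compiler that
     * control does not return; guarding the test with __builtin_constant_p
     * means elevel is never evaluated a second time at run time, so side
     * effects in the argument stay safe.
     */
    #ifdef __GNUC__
    #define my_elog(elevel, ...) \
        do { \
            my_elog_emit(elevel, __VA_ARGS__); \
            if (__builtin_constant_p(elevel) && (elevel) >= MY_ERROR) \
                __builtin_unreachable(); \
        } while (0)
    #else
    #define my_elog(elevel, ...) \
        do { my_elog_emit(elevel, __VA_ARGS__); } while (0)
    #endif

    int
    main(void)
    {
        my_elog(MY_ERROR, "giving up: %s", "the impossible happened");
        return 0;           /* the compiler can see this line is unreachable */
    }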
-
- 12 Jan, 2013 1 commit
-
-
Andrew Dunstan authored
This is now used by ecpg tests, and not clobbered by pg_upgrade tests. This change won't affect anything that doesn't set this environment variable, but will enable the buildfarm to control exactly what port regression test installs will be running on, and thus to detect possible rogue postmasters more easily. Backpatch to release 9.2 where EXTRA_REGRESS_OPTS was first used.
-
- 11 Jan, 2013 2 commits
-
-
Tom Lane authored
Historically we've used a couple of very ad-hoc fudge factors to try to get the right results when indexes of different sizes would satisfy a query with the same number of index leaf tuples being visited. In commit 21a39de5 I tweaked one of these fudge factors, with results that proved disastrous for larger indexes. Commit bf01e34b fudged it some more, but still with not a lot of principle behind it.

What seems like a better way to address these issues is to explicitly model index-descent costs, since that's what's really at stake when considering different indexes with similar leaf-page-level costs. We tried that once long ago, and found that charging random_page_cost per page descended through was way too much, because upper btree levels tend to stay in cache in real-world workloads. However, there's still CPU costs to think about, and the previous fudge factors can be seen as a crude attempt to account for those costs.

So this patch replaces those fudge factors with explicit charges for the number of tuple comparisons needed to descend the index tree, plus a small charge per page touched in the descent. The cost multipliers are chosen so that the resulting charges are in the vicinity of the historical (pre-9.2) fudge factors for indexes of up to about a million tuples, while not ballooning unreasonably beyond that, as the old fudge factor did (even more so in 9.2).

To make this work accurately for btree indexes, add some code that allows extraction of the known root-page height from a btree. There's no equivalent number readily available for other index types, but we can use the log of the number of index pages as an approximate substitute.

This seems like too much of a behavioral change to risk back-patching, but it should improve matters going forward. In 9.2 I'll just revert the fudge-factor change.
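In rough terms the new descent charge has this shape (a sketch with illustrative constants, not the exact numbers used in the costing code):

    #include <math.h>

    /*
     * Approximate cost to descend from the root of a btree to the right leaf:
     * one tuple comparison per step of a binary search over all index entries,
     * plus a small CPU charge for each upper page touched on the way down.
     */
    static double
    btree_descent_cost(double index_tuples, int tree_height,
                       double cpu_operator_cost)
    {
        double cmp_cost  = ceil(log(index_tuples) / log(2.0)) * cpu_operator_cost;
        double page_cost = (tree_height + 1) * 50.0 * cpu_operator_cost;

        return cmp_cost + page_cost;
    }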
-
Tom Lane authored
I notice that plperl's makefile adds the -I for $perl_archlibexp/CORE at the end of CPPFLAGS not the beginning. It seems somewhat unlikely that the include search order has anything to do with why buildfarm member okapi is failing, but I'm about out of other ideas.
-
- 10 Jan, 2013 2 commits
-
-
Tom Lane authored
It appears that perl_embed_ldflags should already mention all the libraries that are required by libperl.so itself. So let's try the test link with just those and not the other LIBS we've found up to now. This should more nearly reproduce what will happen when plperl is linked, and perhaps will fix buildfarm member okapi's problem.
-
Tom Lane authored
Although most platforms seem to package Perl in such a way that these files are present even in basic Perl installations, Debian does not. Hence, make an effort to fail during configure rather than build if --with-perl was given and these files are lacking. Per gripe from Josh Berkus.
-
- 09 Jan, 2013 4 commits
-
-
Andrew Dunstan authored
This means we can now construct a configure test for the library's presence. Previously these parameters were only figured out at build time in plperl's GNUmakefile.
-
Magnus Hagander authored
JiangGuiqing
-
Magnus Hagander authored
Fixes segmentation fault during regular use. Fujii Masao
-
Bruce Momjian authored
This patch implements parallel copying/linking of files by tablespace using the --jobs option in pg_upgrade.
-
- 08 Jan, 2013 2 commits
-
-
Tom Lane authored
If VirtualXactLock() has to wait for a transaction that holds its VXID lock as a fast-path lock, it must first convert the fast-path lock to a regular lock. It failed to take the required "partition" lock on the main shared-memory lock table while doing so. This is the direct cause of the assert failure in GetLockStatusData() recently observed in the buildfarm, but more worryingly it could result in arbitrary corruption of the shared lock table if some other process were concurrently engaged in modifying the same partition of the lock table.

Fortunately, VirtualXactLock() is only used by CREATE INDEX CONCURRENTLY and DROP INDEX CONCURRENTLY, so the opportunities for failure are fewer than they might have been.

In passing, improve some comments and be a bit more consistent about order of operations.
-
Peter Eisentraut authored
-
- 07 Jan, 2013 1 commit
-
-
Andrew Dunstan authored
-