Commits · 709e461befa8a4999c4ccdbfc7260ef8092e805c · Abuhujair Javed / Postgres FD Implementation

20 Oct, 2016 6 commits

Fix EXPLAIN so that it doesn't emit invalid XML in corner cases. · 709e461b

Tom Lane authored Oct 20, 2016

With track_io_timing = on, EXPLAIN (ANALYZE, BUFFERS) will emit fields
named like "I/O Read Time". The slash makes that invalid as an XML
element name, so that adding FORMAT XML would produce invalid XML.

We already have code in there to translate spaces to dashes, so let's
generalize that to convert anything that isn't a valid XML name character,
viz letters, digits, hyphens, underscores, and periods. We could just
reject slashes, which would run a bit faster. But the fact that this went
unnoticed for so long doesn't give me a warm feeling that we'd notice the
next creative violation, so let's make it a permanent fix.

Reported by Markus Winand, though this isn't his initial patch proposal.

Back-patch to 9.2 where track_io_timing was added. The problem is only
latent in 9.1, so I don't feel a need to fix it there.

Discussion: <E0BF6A45-68E8-45E6-918F-741FB332C6BB@winand.at>

709e461b

Sync our copy of the timezone library with IANA release tzcode2016h. · 5e21b681

Tom Lane authored Oct 20, 2016

This absorbs a fix for a symlink-manipulation bug in zic that was
introduced in 2016g. It probably isn't interesting for our use-case,
but I'm not quite sure, so let's update while we're at it.

5e21b681

Update time zone data files to tzdata release 2016h. · d8fc45bd

Tom Lane authored Oct 20, 2016

(Didn't I just do this?  Oh well.)

DST law changes in Palestine.  Historical corrections for Turkey.
Switch to numeric abbreviations for Asia/Colombo.

d8fc45bd

Rename "pg_xlog" directory to "pg_wal". · f82ec32a

Robert Haas authored Oct 20, 2016

"xlog" is not a particularly clear abbreviation for "write-ahead log",
and it sometimes confuses users into believe that the contents of the
"pg_xlog" directory are not critical data, leading to unpleasant
consequences.  So, rename the directory to "pg_wal".

This patch modifies pg_upgrade and pg_basebackup to understand both
the old and new directory layouts; the former is necessary given the
purpose of the tool, while the latter merely avoids an unnecessary
backward-compatibility break.

We may wish to consider renaming other programs, switches, and
functions which still use the old "xlog" naming to also refer to
"wal".  However, that's still under discussion, so let's do just this
much for now.

Discussion: CAB7nPqTeC-8+zux8_-4ZD46V7YPwooeFxgndfsq5Rg8ibLVm1A@mail.gmail.com

Michael Paquier

f82ec32a

Remove a comment which is now incorrect. · ec7db2b4

Robert Haas authored Oct 20, 2016

Before 5d305d86, this comment was
correct, but now it says we do something which we don't actually do.
Accordingly, remove the comment.

ec7db2b4

Another portability fix for tzcode2016g update. · 23ed2ba8

Tom Lane authored Oct 19, 2016

clang points out that SIZE_MAX wouldn't fit into an int, which means
this comparison is pretty useless.  Per report from Thomas Munro.

23ed2ba8

19 Oct, 2016 11 commits

Windows portability fix. · ad90ac4d
Tom Lane authored Oct 19, 2016
```
Per buildfarm.
```
ad90ac4d

Sync our copy of the timezone library with IANA release tzcode2016g. · f3094920

Tom Lane authored Oct 19, 2016

This is mostly to absorb some corner-case fixes in zic for year-2037
timestamps. The other changes that have been made are unlikely to affect
our usage, but nonetheless we may as well take 'em.

f3094920

Suppress "Factory" zone in pg_timezone_names view for tzdata >= 2016g. · a3215431

Tom Lane authored Oct 19, 2016

IANA got rid of the really silly "abbreviation" and replaced it with one
that's only moderately silly.  But it's still pointless, so keep on not
showing it.

a3215431

Update time zone data files to tzdata release 2016g. · ecbac3e6

Tom Lane authored Oct 19, 2016

DST law changes in Turkey. Historical corrections for America/Los_Angeles,
Europe/Kirov, Europe/Moscow, Europe/Samara, and Europe/Ulyanovsk.
Rename Asia/Rangoon to Asia/Yangon, with a backward compatibility link.

The IANA crew continue their campaign to replace invented time zone
abbrevations with numeric GMT offsets. This update changes numerous zones
in Antarctica and the former Soviet Union, for instance Antarctica/Casey
now reports "+08" not "AWST" in the pg_timezone_names view. I kept these
abbreviations in the tznames/ data files, however, so that we will still
accept them for input. (We may want to start trimming those files someday,
but today is not that day.)

An exception is that since IANA no longer claims that "AMT" is in use
in Armenia for GMT+4, I replaced it in the Default file with GMT-4,
corresponding to Amazon Time which is in use in South America. It may be
that that meaning is also invented and IANA will drop it in a future
update; but for now, it seems silly to give pride of place to a meaning
not traceable to IANA over one that is.

ecbac3e6

Make getrusage() output a little more readable · 9ffe4a8b
Peter Eisentraut authored Oct 19, 2016
```
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Peter Geoghegan <pg@heroku.com>
```
9ffe4a8b

Use pg_ctl promote -w in TAP tests · e5a9bcb5

Peter Eisentraut authored Oct 19, 2016

Switch TAP tests to use the new wait mode of pg_ctl promote.  This
allows avoiding extra logic with poll_query_until() to be sure that a
promoted standby is ready for read-write queries.

From: Michael Paquier <michael.paquier@gmail.com>

e5a9bcb5

initdb pg_basebackup: Rename --noxxx options to --no-xxx · 5d58c07a

Peter Eisentraut authored Oct 19, 2016

--noclean and --nosync were the only options spelled without a hyphen,
so change this for consistency with other options.  The options in
pg_basebackup have not been in a release, so we just rename them.  For
initdb, we retain the old variants.

Vik Fearing and me

5d58c07a

pg_ctl: Add long option for -o · caf936b0
Peter Eisentraut authored Oct 19, 2016
```
Now all normally used options are covered by long options as well.
```
caf936b0
doc: Consistently use = sign in long options synopses · c709c607
Peter Eisentraut authored Oct 19, 2016
```
This was already the predominant form in man pages and help output.
```
c709c607
pg_ctl: Add long options for -w and -W · 0be22457
Peter Eisentraut authored Oct 19, 2016
```
From: Vik Fearing <vik@2ndquadrant.fr>
```
0be22457

Fix WAL-logging of FSM and VM truncation. · 917dc7d2

Heikki Linnakangas authored Oct 19, 2016

When a relation is truncated, it is important that the FSM is truncated as
well. Otherwise, after recovery, the FSM can return a page that has been
truncated away, leading to errors like:

ERROR:  could not read block 28991 in file "base/16390/572026": read only 0
of 8192 bytes

We were using MarkBufferDirtyHint() to dirty the buffer holding the last
remaining page of the FSM, but during recovery, that might in fact not
dirty the page, and the FSM update might be lost.

To fix, use the stronger MarkBufferDirty() function. MarkBufferDirty()
requires us to do WAL-logging ourselves, to protect from a torn page, if
checksumming is enabled.

Also fix an oversight in visibilitymap_truncate: it also needs to WAL-log
when checksumming is enabled.

Analysis by Pavan Deolasee.

Discussion: <CABOikdNr5vKucqyZH9s1Mh0XebLs_jRhKv6eJfNnD2wxTn=_9A@mail.gmail.com>

917dc7d2

18 Oct, 2016 5 commits

Improve regression test coverage for hash indexes. · b801e120

Robert Haas authored Oct 18, 2016

On my system, this improves coverage for src/backend/access/hash from
61.3% of lines to 88.2% of lines, and from 83.5% of functions to 97.5%
of functions, which is pretty good for 36 lines of tests.

Mithun Cy, reviewing by Amit Kapila and Álvaro Herrera

b801e120

Fix a few typos in simplehash.h. · 90d3da11
Andres Freund authored Oct 18, 2016
```
Author: Erik Rijkers
Discussion: <274e4c8ac545d6622735f97c1f6c354b@xs4all.nl>
```
90d3da11
Fix typo in comment. · fca41acb
Robert Haas authored Oct 18, 2016
```
Amit Langote
```
fca41acb

Fix cidin() to handle values above 2^31 platform-independently. · 6f13a682

Tom Lane authored Oct 18, 2016

CommandId is declared as uint32, and values up to 4G are indeed legal.
cidout() handles them properly by treating the value as unsigned int.
But cidin() was just using atoi(), which has platform-dependent behavior
for values outside the range of signed int, as reported by Bart Lengkeek
in bug #14379.  Use strtoul() instead, as xidin() does.

In passing, make some purely cosmetic changes to make xidin/xidout
look more like cidin/cidout; the former didn't have a monopoly on
best practice IMO.

Neither xidin nor cidin make any attempt to throw error for invalid input.
I didn't change that here, and am not sure it's worth worrying about
since neither is really a user-facing type.  The point is just to ensure
that indubitably-valid inputs work as expected.

It's been like this for a long time, so back-patch to all supported
branches.

Report: <20161018152550.1413.6439@wrigleys.postgresql.org>

6f13a682

Revert "Replace PostmasterRandom() with a stronger way of generating randomness." · faae1c91

Heikki Linnakangas authored Oct 18, 2016

This reverts commit 9e083fd4. That was a
few bricks shy of a load:

* Query cancel stopped working
* Buildfarm member pademelon stopped working, because the box doesn't have
  /dev/urandom nor /dev/random.

This clearly needs some more discussion, and a quite different patch, so
revert for now.

faae1c91

17 Oct, 2016 4 commits

By default, set log_line_prefix = '%m [%p] '. · 7d3235ba

Robert Haas authored Oct 17, 2016

This value might not be to everyone's taste; in particular, some
people might prefer %t to %m, and others may want %u, %d, or other
fields.  However, it's a vast improvement on the old default of ''.

Christoph Berg

7d3235ba

Use OpenSSL EVP API for symmetric encryption in pgcrypto. · 5ff4a67f

Heikki Linnakangas authored Oct 17, 2016

The old "low-level" API is deprecated, and doesn't support hardware
acceleration. And this makes the code simpler, too.

Discussion: <561274F1.1030000@iki.fi>

5ff4a67f

Fix use-after-free around DISTINCT transition function calls. · d8589946

Heikki Linnakangas authored Oct 17, 2016

Have tuplesort_gettupleslot() copy the contents of its current table slot
as needed. This is based on an approach taken by tuplestore_gettupleslot().
In the future, tuplesort_gettupleslot() may also be taught to avoid copying
the tuple where caller can determine that that is safe (the
tuplestore_gettupleslot() interface already offers this option to callers).

Patch by Peter Geoghegan. Fixes bug #14344, reported by Regina Obe.

Report: <20160929035538.20224.39628@wrigleys.postgresql.org>

Backpatch-through: 9.6

d8589946

Replace PostmasterRandom() with a stronger way of generating randomness. · 9e083fd4

Heikki Linnakangas authored Oct 17, 2016

This adds a new routine, pg_strong_random() for generating random bytes,
for use in both frontend and backend. At the moment, it's only used in
the backend, but the upcoming SCRAM authentication patches need strong
random numbers in libpq as well.

pg_strong_random() is based on, and replaces, the existing implementation
in pgcrypto. It can acquire strong random numbers from a number of sources,
depending on what's available:
- OpenSSL RAND_bytes(), if built with OpenSSL
- On Windows, the native cryptographic functions are used
- /dev/urandom
- /dev/random

Original patch by Magnus Hagander, with further work by Michael Paquier
and me.

Discussion: <CAB7nPqRy3krN8quR9XujMVVHYtXJ0_60nqgVc6oUk8ygyVkZsA@mail.gmail.com>

9e083fd4

15 Oct, 2016 1 commit

Use more efficient hashtable for execGrouping.c to speed up hash aggregation. · 5dfc1981

Andres Freund authored Oct 14, 2016

The more efficient hashtable speeds up hash-aggregations with more than
a few hundred groups significantly. Improvements of over 120% have been
measured.

Due to the the different hash table queries that not fully
determined (e.g. GROUP BY without ORDER BY) may change their result
order.

The conversion is largely straight-forward, except that, due to the
static element types of simplehash.h type hashes, the additional data
some users store in elements (e.g. the per-group working data for hash
aggregaters) is now stored in TupleHashEntryData->additional.  The
meaning of BuildTupleHashTable's entrysize (renamed to additionalsize)
has been changed to only be about the additionally stored size.  That
size is only used for the initial sizing of the hash-table.

Reviewed-By: Tomas Vondra
Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>

5dfc1981

14 Oct, 2016 5 commits

Use more efficient hashtable for tidbitmap.c to speed up bitmap scans. · 75ae538b

Andres Freund authored Oct 14, 2016

Use the new simplehash.h to speed up tidbitmap.c uses. For bitmap scan
heavy queries speedups of over 100% have been measured. Both lossy and
exact scans benefit, but the wins are bigger for mostly exact scans.

The conversion is mostly trivial, except that tbm_lossify() now restarts
lossifying at the point it previously stopped. Otherwise the hash table
becomes unbalanced because the scan in done in hash-order, leaving the
end of the hashtable more densely filled then the beginning. That caused
performance issues with dynahash as well, but due to the open chaining
they were less pronounced than with the linear adressing from
simplehash.h.

Reviewed-By: Tomas Vondra
Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>

75ae538b

Add a macro templatized hashtable. · b30d3ea8

Andres Freund authored Oct 14, 2016

dynahash.c hash tables aren't quite fast enough for some
use-cases. There are several reasons for lacking performance:
- the use of chaining for collision handling makes them cache
  inefficient, that's especially an issue when the tables get bigger.
- as the element sizes for dynahash are only determined at runtime,
  offset computations are somewhat expensive
- hash and element comparisons are indirect function calls, causing
  unnecessary pipeline stalls
- it's two level structure has some benefits (somewhat natural
  partitioning), but increases the number of indirections
to fix several of these the hash tables have to be adjusted to the
individual use-case at compile-time. C unfortunately doesn't provide a
good way to do compile code generation (like e.g. c++'s templates for
all their weaknesses do).  Thus the somewhat ugly approach taken here is
to allow for code generation using a macro-templatized header file,
which generates functions and types based on a prefix and other
parameters.

Later patches use this infrastructure to use such hash tables for
tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation,
...). In queries where these use up a large fraction of the time, this
has been measured to lead to performance improvements of over 100%.

There are other cases where this could be useful (e.g. catcache.c).

The hash table design chosen is a variant of linear open-addressing. The
biggest disadvantage of simple linear addressing schemes are highly
variable lookup times due to clustering, and deletions leaving a lot of
tombstones around.  To address these issues a variant of "robin hood"
hashing is employed.  Robin hood hashing optimizes chaining lengths by
moving elements close to their optimal bucket ("rich" elements), out of
the way if a to-be-inserted element is further away from its optimal
position (i.e. it's "poor").  While that can make insertions slower, the
average lookup performance is a lot better, and higher fill factors can
be used in a still performant manner.  To avoid tombstones - which
normally solve the issue that a deleted node's presence is relevant to
determine whether a lookup needs to continue looking or is done -
buckets following a deleted element are shifted backwards, unless
they're empty or already at their optimal position.

There's further possible improvements that can be made to this
implementation. Amongst others:
- Use distance as a termination criteria during searches. This is
  generally a good idea, but I've been able to see the overhead of
  distance calculations in some cases.
- Consider combining the 'empty' status into the hashvalue, and enforce
  storing the hashvalue. That could, in some cases, increase memory
  density and remove a few instructions.
- Experiment further with the, very conservatively choosen, fillfactor.
- Make maximum size of hashtable configurable, to allow storing very
  very large tables. That'd require 64bit hash values to be more common
  than now, though.
- some smaller memcpy calls could be optimized to copy larger chunks
But since the new implementation is already considerably faster than
dynahash it seem sensible to start using it.

Reviewed-By: Tomas Vondra
Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>

b30d3ea8

Add likely/unlikely() branch hint macros. · aa3ca5e3

Andres Freund authored Oct 14, 2016

These are useful for very hot code paths. Because it's easy to guess
wrongly about likelihood, and because such likelihoods change over time,
they should be used sparingly.

Past tests have shown it'd be a good idea to use them in some places,
e.g. in error checks around ereports that ERROR out, but that's work for
later.

Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>

aa3ca5e3

Fix assorted integer-overflow hazards in varbit.c. · 32fdf42c

Tom Lane authored Oct 14, 2016

bitshiftright() and bitshiftleft() would recursively call each other
infinitely if the user passed INT_MIN for the shift amount, due to integer
overflow in negating the shift amount.  To fix, clamp to -VARBITMAXLEN.
That doesn't change the results since any shift distance larger than the
input bit string's length produces an all-zeroes result.

Also fix some places that seemed inadequately paranoid about input typmods
exceeding VARBITMAXLEN.  While a typmod accepted by anybit_typmodin() will
certainly be much less than that, at least some of these spots are
reachable with user-chosen integer values.

Andreas Seltenreich and Tom Lane

Discussion: <87d1j2zqtz.fsf@credativ.de>

32fdf42c

Fix typo. · 13d3180f
Tatsuo Ishii authored Oct 14, 2016
```
Confirmed by Michael Paquier.
```
13d3180f

13 Oct, 2016 8 commits

Fix handling of pgstat counters for TRUNCATE in a prepared transaction. · 81e82a2b

Tom Lane authored Oct 13, 2016

pgstat_twophase_postcommit is supposed to duplicate the math in
AtEOXact_PgStat, but it had missed out the bit about clearing
t_delta_live_tuples/t_delta_dead_tuples for a TRUNCATE.

It's harder than you might think to replicate the issue here, because
those counters would only be nonzero when a previous transaction in
the same backend had added/deleted tuples in the truncated table,
and those counts hadn't been sent to the stats collector yet.

Evident oversight in commit d42358ef.  I've not added a regression
test for this; we tried to add one in d42358ef, and had to revert it
because it was too timing-sensitive for the buildfarm.

Back-patch to 9.5 where d42358ef came in.

Stas Kelvich

Discussion: <EB57BF68-C06D-4737-BDDC-4BA778F4E62B@postgrespro.ru>

81e82a2b

Fix typo. · b1ee762a
Tatsuo Ishii authored Oct 14, 2016
```
Confirmed by Tom Lane.
```
b1ee762a

Fix another bug in merging of inherited CHECK constraints. · 3cca13cb

Tom Lane authored Oct 13, 2016

It's not good for an inherited child constraint to be marked connoinherit;
that would result in the constraint not propagating to grandchild tables,
if any are created later. The code mostly prevented this from happening
but there was one case that was missed.

This is somewhat related to commit e55a946a, which also tightened checks
on constraint merging. Hence, back-patch to 9.2 like that one. This isn't
so much because there's a concrete feature-related reason to stop there,
as to avoid having more distinct behaviors than we have to in this area.

Amit Langote

Discussion: <b28ee774-7009-313d-dd55-5bdd81242c41@lab.ntt.co.jp>

3cca13cb

Remove dead code in pg_dump. · c08521eb

Tom Lane authored Oct 13, 2016

I'm not sure if this provision for "pg_backup" behaving a bit differently
from "pg_dump" ever did anything useful in a released version.  But it's
definitely dead code now.

Michael Paquier

c08521eb

Try to find out the actual hugepage size when making a MAP_HUGETLB request. · cb775768

Tom Lane authored Oct 13, 2016

Even if Linux's mmap() is okay with a partial-hugepage request, munmap()
is not, as reported by Chris Richards.  Therefore it behooves us to try
a bit harder to find out the actual hugepage size, instead of assuming
that we can skate by with a guess.

For the moment, just look into /proc/meminfo to find out the default
hugepage size, and use that.  Later, on kernels that support requests
for nondefault sizes, we might try to consider other alternatives.
But that smells more like a new feature than a bug fix, especially if
we want to provide any way for the DBA to control it, so leave it for
another day.

I set this up to allow easy addition of platform-specific code for
non-Linux platforms, if needed; but right now there are no reports
suggesting that we need to work harder on other platforms.

Back-patch to 9.4 where hugepage support was introduced.

Discussion: <31056.1476303954@sss.pgh.pa.us>

cb775768

Clean up handling of anonymous mmap'd shared-memory segment. · 15fc5e15

Tom Lane authored Oct 13, 2016

Fix detaching of the mmap'd segment to have its own on_shmem_exit callback,
rather than piggybacking on the one for detaching from the SysV segment.
That was confusing, and given the distance between the two attach calls,
it was trouble waiting to happen.

Make the detaching calls idempotent by clearing AnonymousShmem to show
we've already unmapped. I spent quite a bit of time yesterday trying
to find a path that would allow the munmap()'s to be done twice, and
while I did not succeed, it seems silly that there's even a question.

Make the #ifdef logic less confusing by separating "do we want to use
anonymous shmem" from EXEC_BACKEND. Even though there's no current
scenario where those conditions are different, it is not helpful for
different places in the same file to be testing EXEC_BACKEND for what
are fundamentally different reasons.

Don't do on_exit_reset() in StartBackgroundWorker(). At best that's
useless (InitPostmasterChild would have done it already) and at worst
it could zap some callback that's unrelated to shared memory.

Improve comments, and simplify the huge_pages enablement logic slightly.

Back-patch to 9.4 where hugepage support was introduced.
Arguably this should go into 9.3 as well, but the code looks
significantly different there, and I doubt it's worth the
trouble of adapting the patch given I can't show a live bug.

15fc5e15

Fix pg_dumpall regression test to be locale-independent. · 0a4bf6b1

Tom Lane authored Oct 13, 2016

The expected results in commit b4fc6457 seem to have been generated
in a non-C locale, which just points up the fact that the ORDER BY
clause was locale-sensitive.

Per buildfarm.

0a4bf6b1

Fix broken jsonb_set() logic for replacing array elements. · 9c4cc9e2

Tom Lane authored Oct 13, 2016

Commit 0b62fd03 did a fairly sloppy job of refactoring setPath()
to support jsonb_insert() along with jsonb_set(). In its defense,
though, there was no regression test case exercising the case of
replacing an existing element in a jsonb array.

Per bug #14366 from Peng Sun. Back-patch to 9.6 where bug was introduced.

Report: <20161012065349.1412.47858@wrigleys.postgresql.org>

9c4cc9e2