Commits · 6b93fcd149329d4ee7319561b30fc15a573c6307 · Abuhujair Javed / Postgres FD Implementation

13 Apr, 2016 5 commits

Avoid atomic operation in MarkLocalBufferDirty(). · 6b93fcd1

Andres Freund authored Apr 13, 2016

The recent patch to make Pin/UnpinBuffer lockfree in the hot
path (48354581), accidentally used pg_atomic_fetch_or_u32() in
MarkLocalBufferDirty(). Other code operating on local buffers was
careful to only use pg_atomic_read/write_u32 which just read/write from
memory; to avoid unnecessary overhead.

On its own that'd just make MarkLocalBufferDirty() slightly less
efficient, but in addition InitLocalBuffers() doesn't call
pg_atomic_init_u32() - thus the spinlock fallback for the atomic
operations isn't initialized. That in turn caused, as reported by Tom,
buildfarm animal gaur to fail.  As those errors are actually useful
against this type of error, continue to omit - intentionally this time -
initialization of the atomic variable.

In addition, add an explicit note about only using pg_atomic_read/write
on local buffers's state to BufferDesc's description.

Reported-By: Tom Lane
Discussion: 1881.1460431476@sss.pgh.pa.us

6b93fcd1

Widen amount-to-flush arguments of FileWriteback and callers. · 95ef43c4

Tom Lane authored Apr 13, 2016

It's silly to define these counts as narrower than they might someday
need to be. Also, I believe that the BLCKSZ * nflush calculation in
mdwriteback was capable of overflowing an int.

95ef43c4

Fix assorted portability issues with using msync() for data flushing. · fa11a09f

Tom Lane authored Apr 13, 2016

Commit 428b1d6b introduced the use of
msync() for flushing dirty data from the kernel's file buffers.  Several
portability issues were overlooked, though:

* Not all implementations of mmap() think that nbytes == 0 means "map
the whole file".  To fix, use lseek() to find out the true length.
Fix callers of pg_flush_data to be aware that nbytes == 0 may result
in trashing the file's seek position.

* Not all implementations of mmap() will accept partial-page mmap
requests.  To fix, round down the length request to whatever sysconf()
says the page size is.  (I think this is OK from a portability standpoint,
because sysconf() is required by SUS v2, and we aren't trying to compile
this part on Windows anyway.  Buildfarm should let us know if not.)

* On 32-bit machines, the file size might exceed the available free
address space, or even exceed what will fit in size_t.  Check for
the latter explicitly to avoid passing a false request size to mmap().
If mmap fails, silently fall through to the next implementation method,
rather than bleating to the postmaster log and giving up.

* mmap'ing directories fails on some platforms, and even if it works,
msync'ing the directory is quite unlikely to help, as for that matter are
the other flush implementations.  In pre_sync_fname(), just skip flush
attempts on directories.

In passing, copy-edit the comments a bit.

Stas Kelvich and myself

fa11a09f

Improve documentation for \crosstabview. · 85e00470

Tom Lane authored Apr 13, 2016

Fix misleading syntax summary (there cannot be a space between colH and
scolH).  Provide a link from the existing crosstab() function's
documentation to \crosstabview.  Copy-edit the command's description.

Christoph Berg and Tom Lane

85e00470

Use PG_INT32_MIN instead of reiterating the constant. · cbb2a812
Robert Haas authored Apr 13, 2016
```
Makes no difference, but it's cleaner this way.

Michael Paquier
```
cbb2a812

12 Apr, 2016 13 commits

Provide errno-translation wrappers around bind() and listen() on Windows. · d1b7d487

Tom Lane authored Apr 12, 2016

I've seen one too many "could not bind IPv4 socket: No error" log entries
from the Windows buildfarm members.  Per previous discussion, this is
likely caused by the fact that we're doing nothing to translate
WSAGetLastError() to errno.  Put in a wrapper layer to do that.

If this works as expected, it should get back-patched, but let's see what
happens in the buildfarm first.

Discussion: <4065.1452450340@sss.pgh.pa.us>

d1b7d487

Fix costing for parallel aggregation. · deb71fa9

Robert Haas authored Apr 12, 2016

The original patch kind of ignored the fact that we were doing something
different from a costing point of view, but nobody noticed.  This patch
fixes that oversight.

David Rowley

deb71fa9

Remove unused function GetOldestWALSendPointer from walsender code. · 46d73e0d

Fujii Masao authored Apr 13, 2016

That unused function was introduced as a sample because synchronous
replication or replication monitoring tools might need it in the future.
Recently commit 989be081 added the function SyncRepGetOldestSyncRecPtr
which provides almost the same functionality for multiple synchronous
standbys feature. So it's time to remove that unused sample function.
This commit does that.

46d73e0d

Redefine create_upper_paths_hook as being invoked once per upper relation. · f1f01de1

Tom Lane authored Apr 12, 2016

Per discussion, this gives potential users of the hook more flexibility,
because they can build custom Paths that implement only one stage of
upper processing atop core-provided Paths for earlier stages.

f1f01de1

Improve coding of column-name parsing in psql's new crosstabview.c. · 7a5f8b5c

Tom Lane authored Apr 12, 2016

Coverity complained about this code, not without reason because it was
rather messy.  Adjust it to not scribble on the passed string; that adds
one malloc/free cycle per column name, which is going to be insignificant
in context.  We can actually const-ify both the string argument and the
PGresult.

Daniel Verité, with some further cleanup by me

7a5f8b5c

Avoid extra locks in GetSnapshotData if old_snapshot_threshold < 0 · 2201d801

Kevin Grittner authored Apr 12, 2016

On a big NUMA machine with 1000 connections in saturation load
there was a performance regression due to spinlock contention, for
acquiring values which were never used.  Just fill with dummy
values if we're not going to use them.

This patch has not been benchmarked yet on a big NUMA machine, but
it seems like a good idea on general principle, and it seemed to
prevent an apparent 2.2% regression on a single-socket i7 box
running 200 connections at saturation load.

2201d801

Improve API of GenericXLogRegister(). · 5713f039

Tom Lane authored Apr 12, 2016

Rename this function to GenericXLogRegisterBuffer() to make it clearer
what it does, and leave room for other sorts of "register" actions in
future.  Also, replace its "bool isNew" argument with an integer flags
argument, so as to allow adding more flags in future without an API
break.

Alexander Korotkov, adjusted slightly by me

5713f039

In generic WAL application and replay, ensure page "hole" is always zero. · bdf7db81

Tom Lane authored Apr 12, 2016

The previous coding could allow the contents of the "hole" between pd_lower
and pd_upper to diverge during replay from what it had been when the update
was originally applied. This would pose a problem if checksums were in
use, and in any case would complicate forensic comparisons between master
and slave servers. So force the "hole" to contain zeroes, both at initial
application of a generically-logged action, and at replay.

Alexander Korotkov, adjusted slightly by me

bdf7db81

Add page id to bloom index · 813b456e

Teodor Sigaev authored Apr 12, 2016

Added to ensure that bloom index pages can be distinguished from other pages
by pg_filedump. Because there wasn't any public/production versions before,
it doesn't pay attention to any compatibility issues.

Per notice from Tom Lane

813b456e

Remove unnecessary definition of _WIN64 in libpq/win32.mak. · e7bcde8c

Tom Lane authored Apr 12, 2016

In commit b0e40d18, I should have just
removed the /D switch defining WIN64.  The reason the code worked before
is that all Windows64 compilers automatically predefine _WIN64.  Perhaps
at one time we had code that depended on WIN64 being defined, but it's
long gone, and we should not encourage any reappearance.  Per discussion
with Christian Ullrich.

e7bcde8c

Correct copyright for newly added genericdesc.c · cd13471f
Stephen Frost authored Apr 12, 2016
```
It's 2016 these days (no, not entirely sure how we got here either).

Pointed out by Amit Langote
```
cd13471f
Fix whitespace · 70715e6a
Peter Eisentraut authored Apr 11, 2016

70715e6a

Fix _SPI_execute_plan() for CREATE TABLE IF NOT EXISTS foo AS ... · 39c283e4

Tom Lane authored Apr 11, 2016

When IF NOT EXISTS was added to CREATE TABLE AS, this logic didn't get
the memo, possibly resulting in an Assert failure.  It looks like there
would have been no ill effects in a non-Assert build, though.  Back-patch
to 9.5 where the IF NOT EXISTS option was added.

Stas Kelvich

39c283e4

11 Apr, 2016 18 commits

Fix two places that thought Windows64 is indicated by WIN64 macro. · b0e40d18

Tom Lane authored Apr 11, 2016

Everyplace else thinks it's _WIN64, so make these places fall in line.

The pg_regress.c usage is not going to result in any change in behavior,
only suppressing (or not) a compiler warning about downcasting HANDLEs.
So there seems no need for back-patching there.

The libpq/win32.mak usage might represent an actual bug, if anyone were
using this script to build for Windows64, which perhaps nobody is.
Given the lack of field complaints, no back-patch here either.

pg_regress.c problem found by Christian Ullrich, the other by me.

b0e40d18

Fix freshly-introduced PL/Python portability bug. · 1d2f9de3

Tom Lane authored Apr 11, 2016

It turns out that those PyErr_Clear() calls I removed from plpy_elog.c
in 7e3bb080 et al were not quite as random as they appeared: they
mask a Python 2.3.x bug. (Specifically, it turns out that PyType_Ready()
can fail if the error indicator is set on entry, and PLy_traceback's fetch
of frame.f_code may be the first operation in a session that requires the
"frame" type to be readied. Ick.) Put back the clear call, but in a more
centralized place closer to what it's protecting, and this time with a
comment warning what it's really for.

Per buildfarm member prairiedog. Although prairiedog was only failing
on HEAD, it seems clearly possible for this to occur in older branches
as well, so back-patch to 9.2 the same as the previous patch.

1d2f9de3

Use static inline function for BufferGetPage() · a6f6b781

Kevin Grittner authored Apr 11, 2016

I was initially concerned that the some of the hundreds of
references to BufferGetPage() where the literal
BGP_NO_SNAPSHOT_TEST were passed might not optimize as well as a
macro, leading to some hard-to-find performance regressions in
corner cases.  Inspection of disassembled code has shown identical
code at all inspected locations, and the size difference doesn't
amount to even one byte per such call.  So make it readable.

Per gripes from Álvaro Herrera and Tom Lane

a6f6b781

Make oldSnapshotControl a pointer to a volatile structure · 80647bf6

Kevin Grittner authored Apr 11, 2016

It was incorrectly declared as a volatile pointer to a non-volatile
structure.  Eliminate the OldSnapshotControl struct definition; it
is really not needed.  Pointed out by Tom Lane.

While at it, add OldSnapshotControlData to pgindent's list of
structures.

80647bf6

Fix whitespace · d8ed83cd
Peter Eisentraut authored Apr 11, 2016

d8ed83cd

Prefix RLS regression test roles with 'regress_' · 6c7b0388

Stephen Frost authored Apr 11, 2016

To avoid any possible overlap with existing roles on a system when
doing a 'make installcheck', use role names which start with
'regress_'.

Pointed out by Tom.

6c7b0388

Add directory created during build to gitignore · 29ca231b
Peter Eisentraut authored Apr 11, 2016

29ca231b

Fix missing "volatile" in PLy_output(). · 81ba9348

Tom Lane authored Apr 11, 2016

Commit 5c3c3cd0 plastered "volatile" on a bunch of variables
in PLy_output(), but removed the one that actually mattered, ie the
one on "oldcontext". This allows some versions of clang to generate
code in which "oldcontext" has been trashed when control reaches the
PG_CATCH block. Per buildfarm member tick.

81ba9348

cpluspluscheck: Update include path · ee5dbc81

Peter Eisentraut authored Apr 11, 2016

Some things in src/include/fe_utils require libpq headers, so add
libpq's include path to the command line used here.

ee5dbc81

Fix documented return type of pg_logical_emit_message() in func.sgml. · cfe96ae2
Fujii Masao authored Apr 11, 2016

cfe96ae2

Use ereport(ERROR) instead of Assert() to emit syncrep_parser error. · 0038c1e2

Fujii Masao authored Apr 11, 2016

The existing code would either Assert or generate an invalid
SyncRepConfig variable, neither of which is desirable. A regular
error should be thrown instead.

This commit silences compiler warning in non assertion-enabled builds.

Per report from Jeff Janes.
Suggested fix by Tom Lane.

0038c1e2

Fix poorly thought-through code from commit . · f73b2bbb

Tom Lane authored Apr 11, 2016

It's not entirely clear to me whether PyString_AsString can return
null (looks like the answer might vary between Python 2 and 3).
But in any case, this code's attempt to cope with the possibility
was quite broken, because pstrdup() neither allows a null argument
nor ever returns a null.

Moreover, the code below this point assumes that "message" is a
palloc'd string, which would not be the case for a dgettext result.

Fix both problems by doing the pstrdup step separately.

f73b2bbb

pg_dump: add missing "destroyPQExpBuffer(query)" in dumpForeignServer(). · 074050f1

Tom Lane authored Apr 11, 2016

Coverity complained about this resource leak (why now, I don't know,
since it's been like that a long time). Our general policy in pg_dump
is that PQExpBuffers are worth cleaning up, so do it here too. But
don't bother with a back-patch, because it seems unlikely that very
many databases contain enough FOREIGN SERVER objects to notice.

074050f1

Add comment about intentional fallthrough in switch. · 1630f5b9

Tom Lane authored Apr 10, 2016

Coverity complained about an apparent missing "break" in a switch
added by bb140506.  The human-readable comments are pretty
clear that this is intentional, but add a standard /* FALL THRU */
comment to make it clear to tools too.

1630f5b9

Clean up foreign-key caching code in planner. · 5306df28

Tom Lane authored Apr 10, 2016

Coverity complained that the code added by 015e8894 lacked an
error check for SearchSysCache1 failures, which it should have. But
the code was pretty duff in other ways too, including failure to think
about whether it could really cope with arrays of different lengths.

5306df28

Fix access-to-already-freed-memory issue in plpython's error handling. · 7e3bb080

Tom Lane authored Apr 10, 2016

PLy_elog() could attempt to access strings that Python had already freed,
because the strings that PLy_get_spi_error_data() returns are simply
pointers into storage associated with the error "val" PyObject. That's
fine at the instant PLy_get_spi_error_data() returns them, but just after
that PLy_traceback() intentionally releases the only refcount on that
object, allowing it to be freed --- so that the strings we pass to
ereport() are dangling pointers.

In principle this could result in garbage output or a coredump. In
practice, I think the risk is pretty low, because there are no Python
operations between where we decrement that refcount and where we use the
strings (and copy them into PG storage), and thus no reason for Python
to recycle the storage. Still, it's clearly hazardous, and it leads to
Valgrind complaints when running under a Valgrind that hasn't been
lobotomized to ignore Python memory allocations.

The code was a mess anyway: we fetched the error data out of Python
(clearing Python's error indicator) with PyErr_Fetch, examined it, pushed
it back into Python with PyErr_Restore (re-setting the error indicator),
then immediately pulled it back out with another PyErr_Fetch. Just to
confuse matters even more, there were some gratuitous-and-yet-hazardous
PyErr_Clear calls in the "examine" step, and we didn't get around to doing
PyErr_NormalizeException until after the second PyErr_Fetch, making it even
less clear which object was being manipulated where and whether we still
had a refcount on it. (If PyErr_NormalizeException did substitute a
different "val" object, it's possible that the problem could manifest for
real, because then we'd be doing assorted Python stuff with no refcount
on the object we have string pointers into.)

So, rearrange all that into some semblance of sanity, and don't decrement
the refcount on the Python error objects until the end of PLy_elog().
In HEAD, I failed to resist the temptation to reformat some messy bits
from 5c3c3cd0 along the way.

Back-patch as far as 9.2, because the code is substantially the same
that far back. I believe that 9.1 has the bug as well; but the code
around it is rather different and I don't want to take a chance on
breaking something for what seems a low-probability problem.

7e3bb080

Avoid the use of a separate spinlock to protect a LWLock's wait queue. · 008608b9

Andres Freund authored Apr 10, 2016

Previously we used a spinlock, in adition to the atomically manipulated
->state field, to protect the wait queue. But it's pretty simple to
instead perform the locking using a flag in state.

Due to 6150a1b0 BufferDescs, on platforms (like PPC) with > 1 byte
spinlocks, increased their size above 64byte. As 64 bytes are the size
we pad allocated BufferDescs to, this can increase false sharing;
causing performance problems in turn. Together with the previous commit
this reduces the size to <= 64 bytes on all common platforms.

Author: Andres Freund
Discussion: CAA4eK1+ZeB8PMwwktf+3bRS0Pt4Ux6Rs6Aom0uip8c6shJWmyg@mail.gmail.com
    20160327121858.zrmrjegmji2ymnvr@alap3.anarazel.de

008608b9

Allow Pin/UnpinBuffer to operate in a lockfree manner. · 48354581

Andres Freund authored Apr 10, 2016

Pinning/Unpinning a buffer is a very frequent operation; especially in
read-mostly cache resident workloads. Benchmarking shows that in various
scenarios the spinlock protecting a buffer header's state becomes a
significant bottleneck. The problem can be reproduced with pgbench -S on
larger machines, but can be considerably worse for queries which touch
the same buffers over and over at a high frequency (e.g. nested loops
over a small inner table).

To allow atomic operations to be used, cram BufferDesc's flags,
usage_count, buf_hdr_lock, refcount into a single 32bit atomic variable;
that allows to manipulate them together using 32bit compare-and-swap
operations. This requires reducing MAX_BACKENDS to 2^18-1 (which could
be lifted by using a 64bit field, but it's not a realistic configuration
atm).

As not all operations can easily implemented in a lockfree manner,
implement the previous buf_hdr_lock via a flag bit in the atomic
variable. That way we can continue to lock the header in places where
it's needed, but can get away without acquiring it in the more frequent
hot-paths.  There's some additional operations which can be done without
the lock, but aren't in this patch; but the most important places are
covered.

As bufmgr.c now essentially re-implements spinlocks, abstract the delay
logic from s_lock.c into something more generic. It now has already two
users, and more are coming up; there's a follupw patch for lwlock.c at
least.

This patch is based on a proof-of-concept written by me, which Alexander
Korotkov made into a fully working patch; the committed version is again
revised by me.  Benchmarking and testing has, amongst others, been
provided by Dilip Kumar, Alexander Korotkov, Robert Haas.

On a large x86 system improvements for readonly pgbench, with a high
client count, of a factor of 8 have been observed.

Author: Alexander Korotkov and Andres Freund
Discussion: 2400449.GjM57CE0Yg@dinodell

48354581

10 Apr, 2016 3 commits

Improve contrib/bloom regression test using code coverage info. · cf223c3b

Tom Lane authored Apr 10, 2016

Originally, this test created a 100000-row test table, which made it
run rather slowly compared to other contrib tests. Investigation with
gcov showed that we got no further improvement in code coverage after
the first 700 or so rows, making the large table 99% a waste of time.
Cut it back to 2000 rows to fix the runtime problem and still leave
some headroom for testing behaviors that may appear later.

A closer look at the gcov results showed that the main coverage
omissions in contrib/bloom occurred because the test never filled more
than one entry in the notFullPage array; which is unsurprising because
it exercised index cleanup only in the scenario of complete table
deletion, allowing every page in the index to become deleted rather
than not-full. Add testing that allows the not-full path to be
exercised as well.

Also, test the amvalidate function, because blvalidate.c had zero
coverage without that, and besides it's a good idea to check for
mistakes in the bloom opclass definitions.

cf223c3b

Fix possible NULL dereference in ExecAlterObjectDependsStmt · bd905a0d

Alvaro Herrera authored Apr 10, 2016

I used the wrong variable here.  Doesn't make a difference today because
the only plausible caller passes a non-NULL variable, but someday it
will be wrong, and even today's correctness is subtle: the caller that
does pass a NULL is never invoked because of object type constraints.
Surely not a condition to rely on.

Noted by Coverity

bd905a0d

Further minor improvement in generic_xlog.c: always say REGBUF_STANDARD. · 660d5fb8

Tom Lane authored Apr 10, 2016

Since we're requiring pages handled by generic_xlog.c to be standard
format, specify REGBUF_STANDARD when doing a full-page image, so that
xloginsert.c can compress out the "hole" between pd_lower and pd_upper.
Given the current API in which this path will be taken only for a newly
initialized page, the hole is likely to be particularly large in such
cases, so that this oversight could easily be performance-significant.
I don't notice any particular change in the runtime of contrib/bloom's
regression test, though.

660d5fb8

09 Apr, 2016 1 commit

Micro-optimize GenericXLogFinish(). · 68689c66

Tom Lane authored Apr 09, 2016

Make the inner comparison loops of computeDelta() as tight as possible by
pulling considerations of valid and invalid ranges out of the inner loops,
and extending a match or non-match detection as far as possible before
deciding what to do next.  To keep this tractable, give up the possibility
of merging fragments across the pd_lower to pd_upper gap.  The fraction of
pages where that could happen (ie, there are 4 or fewer bytes in the gap,
*and* data changes immediately adjacent to it on both sides) is too small
to be worth spending cycles on.

Also, avoid two BLCKSZ-length memcpy()s by computing the delta before
moving data into the target buffer, instead of after.  This doesn't save
nearly as many cycles as being tenser about computeDelta(), but it still
seems worth doing.

On my machine, this patch cuts a full 40% off the runtime of
contrib/bloom's regression test.

68689c66