Commits · 8e617e29aaccfdd1b85af7f50dc83aa6dd7ef550 · Abuhujair Javed / Postgres FD Implementation

20 Jul, 2012 2 commits

Fix whole-row Var evaluation to cope with resjunk columns (again). · 8e617e29

Tom Lane authored Jul 20, 2012

When a whole-row Var is reading the result of a subquery, we need it to
ignore any "resjunk" columns that the subquery might have evaluated for
GROUP BY or ORDER BY purposes. We've hacked this area before, in commit
68e40998, but that fix only covered
whole-row Vars of named composite types, not those of RECORD type; and it
was mighty klugy anyway, since it just assumed without checking that any
extra columns in the result must be resjunk. A proper fix requires getting
hold of the subquery's targetlist so we can actually see which columns are
resjunk (whereupon we can use a JunkFilter to get rid of them). So bite
the bullet and add some infrastructure to make that possible.

Per report from Andrew Dunstan and additional testing by Merlin Moncure.
Back-patch to all supported branches. In 8.3, also back-patch commit
292176a1, which for some reason I had
not done at the time, but it's a prerequisite for this change.

8e617e29

Make new event trigger facility actually do something. · 3a0e4d36

Robert Haas authored Jul 20, 2012

Commit 3855968f added syntax, pg_dump,
psql support, and documentation, but the triggers didn't actually fire.
With this commit, they now do.  This is still a pretty basic facility
overall because event triggers do not get a whole lot of information
about what the user is trying to do unless you write them in C; and
there's still no option to fire them anywhere except at the very
beginning of the execution sequence, but it's better than nothing,
and a good building block for future work.

Along the way, add a regression test for ALTER LARGE OBJECT, since
testing of event triggers reveals that we haven't got one.

Dimitri Fontaine and Robert Haas

3a0e4d36

19 Jul, 2012 2 commits

Rethink checkpointer's fsync-request table representation. · be86e3dd

Tom Lane authored Jul 19, 2012

Instead of having one hash table entry per relation/fork/segment, just have
one per relation, and use bitmapsets to represent which specific segments
need to be fsync'd.  This eliminates the need to scan the whole hash table
to implement FORGET_RELATION_FSYNC, which fixes the O(N^2) behavior
recently demonstrated by Jeff Janes for cases involving lots of TRUNCATE or
DROP TABLE operations during a single checkpoint cycle.  Per an idea from
Robert Haas.

(FORGET_DATABASE_FSYNC still sucks, but since dropping a database is a
pretty expensive operation anyway, we'll live with that.)

In passing, improve the delayed-unlink code: remove the pass over the list
in mdpreckpt, since it wasn't doing anything for us except supporting a
useless Assert in mdpostckpt, and fix mdpostckpt so that it will absorb
fsync requests every so often when clearing a large backlog of deletion
requests.

be86e3dd

Send only one FORGET_RELATION_FSYNC request when dropping a relation. · 3072b7ba

Tom Lane authored Jul 19, 2012

We were sending one per fork, but a little bit of refactoring allows us
to send just one request with forknum == InvalidForkNumber. This not only
reduces pressure on the shared-memory request queue, but saves repeated
traversals of the checkpointer's hash table.

3072b7ba

18 Jul, 2012 6 commits

Refactor the way code is shared between some range type functions. · a7a4add6

Heikki Linnakangas authored Jul 18, 2012

Functions like range_eq, range_before etc. are exposed at the SQL-level, but
they're also used internally by the GiST consistent support function. The
code sharing was done by a hack, TrickFunctionCall2, which relied on the
knowledge that all the functions used fn_extra the same way. This commit
splits the functions into internal versions that take a TypeCacheEntry as
argument, and thin wrappers to expose the functions at the SQL-level. The
internal versions can then be called directly and in a less hacky way from
the GiST consistent function.

This is just cosmetic, but backpatch to 9.2 anyway, to avoid having a
different version of this code in the 9.2 branch. That would make
backpatching fixes in this area more difficult.

Alexander Korotkov

a7a4add6

Fix statistics breakage from bgwriter/checkpointer process split. · 80e373c3

Tom Lane authored Jul 18, 2012

ForwardFsyncRequest() supposed that it could only be called in regular
backends, which used to be true; but since the splitup of bgwriter and
checkpointer, it is also called in the bgwriter.  We do not want to count
such calls in pg_stat_bgwriter.buffers_backend statistics, so fix things
so that they aren't.

(It's worth noting here that this implies an alarmingly large increase in
the expected amount of cross-process fsync request traffic, which may well
mean that the process splitup was not such a hot idea.)

80e373c3

Fix management of pendingOpsTable in auxiliary processes. · 4a9c30a8

Tom Lane authored Jul 18, 2012

mdinit() was misusing IsBootstrapProcessingMode() to decide whether to
create an fsync pending-operations table in the current process. This led
to creating a table not only in the startup and checkpointer processes as
intended, but also in the bgwriter process, not to mention other auxiliary
processes such as walwriter and walreceiver. Creation of the table in the
bgwriter is fatal, because it absorbs fsync requests that should have gone
to the checkpointer; instead they just sit in bgwriter local memory and are
never acted on. So writes performed by the bgwriter were not being fsync'd
which could result in data loss after an OS crash. I think there is no
live bug with respect to walwriter and walreceiver because those never
perform any writes of shared buffers; but the potential is there for
future breakage in those processes too.

To fix, make AuxiliaryProcessMain() export the current process's
AuxProcType as a global variable, and then make mdinit() test directly for
the types of aux process that should have a pendingOpsTable. Having done
that, we might as well also get rid of the random bool flags such as
am_walreceiver that some of the aux processes had grown. (Note that we
could not have fixed the bug by examining those variables in mdinit(),
because it's called from BaseInit() which is run by AuxiliaryProcessMain()
before entering any of the process-type-specific code.)

Back-patch to 9.2, where the problem was introduced by the split-up of
bgwriter and checkpointer processes. The bogus pendingOpsTable exists
in walwriter and walreceiver processes in earlier branches, but absent
any evidence that it causes actual problems there, I'll leave the older
branches alone.

4a9c30a8

Syntax support and documentation for event triggers. · 3855968f

Robert Haas authored Jul 18, 2012

They don't actually do anything yet; that will get fixed in a
follow-on commit.  But this gets the basic infrastructure in place,
including CREATE/ALTER/DROP EVENT TRIGGER; support for COMMENT,
SECURITY LABEL, and ALTER EXTENSION .. ADD/DROP EVENT TRIGGER;
pg_dump and psql support; and documentation for the anticipated
initial feature set.

Dimitri Fontaine, with review and a bunch of additional hacking by me.
Thom Brown extensively reviewed earlier versions of this patch set,
but there's not a whole lot of that code left in this commit, as it
turns out.

3855968f

Get rid of useless global variable in pg_upgrade. · faf26bf1

Tom Lane authored Jul 18, 2012

Since the scandir() emulation was taken out of pg_upgrade, there's
no longer any need for scandir_file_pattern to exist as a global
variable.  Replace it with a local in the one remaining function
that was making use of it.

faf26bf1

Improve pg_upgrade's load_directory() function. · 3d6ec663

Tom Lane authored Jul 18, 2012

Error out on out-of-memory, rather than returning -1, which the sole
existing caller wasn't checking for anyway.  There doesn't seem to be
any use-case for making the caller check for failure here.

Detect failure return from readdir().

Use a less platform-dependent method of calculating the entrysize.
It's possible, but not yet confirmed, that this explains bug #6733,
in which Mike Wilson reports a pg_upgrade crash that did not occur
in 9.1.  (Note that load_directory is effectively new code in 9.2,
at least on platforms that have scandir().)

Fix up comments, avoid uselessly using two counters, reduce the number
of realloc calls to something sane.

3d6ec663

17 Jul, 2012 6 commits

Improve coding around the fsync request queue. · 73b796a5

Tom Lane authored Jul 17, 2012

In all branches back to 8.3, this patch fixes a questionable assumption in
CompactCheckpointerRequestQueue/CompactBgwriterRequestQueue that there are
no uninitialized pad bytes in the request queue structs. This would only
cause trouble if (a) there were such pad bytes, which could happen in 8.4
and up if the compiler makes enum ForkNumber narrower than 32 bits, but
otherwise would require not-currently-planned changes in the widths of
other typedefs; and (b) the kernel has not uniformly initialized the
contents of shared memory to zeroes. Still, it seems a tad risky, and we
can easily remove any risk by pre-zeroing the request array for ourselves.
In addition to that, we need to establish a coding rule that struct
RelFileNode can't contain any padding bytes, since such structs are copied
into the request array verbatim. (There are other places that are assuming
this anyway, it turns out.)

In 9.1 and up, the risk was a bit larger because we were also effectively
assuming that struct RelFileNodeBackend contained no pad bytes, and with
fields of different types in there, that would be much easier to break.
However, there is no good reason to ever transmit fsync or delete requests
for temp files to the bgwriter/checkpointer, so we can revert the request
structs to plain RelFileNode, getting rid of the padding risk and saving
some marginal number of bytes and cycles in fsync queue manipulation while
we are at it. The savings might be more than marginal during deletion of
a temp relation, because the old code transmitted an entirely useless but
nonetheless expensive-to-process ForgetRelationFsync request to the
background process, and also had the background process perform the file
deletion even though that can safely be done immediately.

In addition, make some cleanup of nearby comments and small improvements to
the code in CompactCheckpointerRequestQueue/CompactBgwriterRequestQueue.

73b796a5

PL/Python: Remove PLy_result_ass_item · 71f2dd23

Peter Eisentraut authored Jul 17, 2012

It is apparently no longer used after the new slicing support was
implemented (a97207b6), so let's
remove the dead code and see if anything cares.

71f2dd23

Show step titles in the pg_upgrade man page · d6ce58c0
Peter Eisentraut authored Jul 17, 2012
```
The upstream XSLT stylesheets missed that case.

found by Álvaro Herrera
```
d6ce58c0

Remove recently added PL/Perl encoding tests · 65558995

Alvaro Herrera authored Jul 17, 2012

These only pass cleanly on UTF8 and SQL_ASCII encodings, besides the
Japanese encoding in which they were originally written, which is clearly
not good enough.  Since the functionality they test has not ever been
tested from PL/Perl, the best answer seems to be to remove the new tests
completely.

Per buildfarm results and ensuing discussion.

65558995

Put back storage/proc.h in postmaster.c. · 57b9bdda

Tom Lane authored Jul 17, 2012

I took this out thinking it wasn't needed anymore, but the EXEC_BACKEND
code still needs it.  Per buildfarm.

57b9bdda

Introduce timeout handling framework · f34c68f0

Alvaro Herrera authored Jul 16, 2012

Management of timeouts was getting a little cumbersome; what we
originally had was more than enough back when we were only concerned
about deadlocks and query cancel; however, when we added timeouts for
standby processes, the code got considerably messier.  Since there are
plans to add more complex timeouts, this seems a good time to introduce
a central timeout handling module.

External modules register their timeout handlers during process
initialization, and later enable and disable them as they see fit using
a simple API; timeout.c is in charge of keeping track of which timeouts
are in effect at any time, installing a common SIGALRM signal handler,
and calling setitimer() as appropriate to ensure timely firing of
external handlers.

timeout.c additionally supports pluggable modules to add their own
timeouts, though this capability isn't exercised anywhere yet.

Additionally, as of this commit, walsender processes are aware of
timeouts; we had a preexisting bug there that made those ignore SIGALRM,
thus being subject to unhandled deadlocks, particularly during the
authentication phase.  This has already been fixed in back branches in
commit 0bf8eb2a, which see for more details.

Main author: Zoltán Böszörményi
Some review and cleanup by Álvaro Herrera
Extensive reworking by Tom Lane

f34c68f0

16 Jul, 2012 3 commits

Remove unreachable code · dd16f948

Peter Eisentraut authored Jul 16, 2012

The Solaris Studio compiler warns about these instances, unlike more
mainstream compilers such as gcc.  But manual inspection showed that
the code is clearly not reachable, and we hope no worthy compiler will
complain about removing this code.

dd16f948

Add comment why seemingly dead code is necessary · a76c857e
Peter Eisentraut authored Jul 16, 2012

a76c857e

Avoid pre-determining index names during CREATE TABLE LIKE parsing. · c92be3c0

Tom Lane authored Jul 16, 2012

Formerly, when trying to copy both indexes and comments, CREATE TABLE LIKE
had to pre-assign names to indexes that had comments, because it made up an
explicit CommentStmt command to apply the comment and so it had to know the
name for the index. This creates bad interactions with other indexes, as
shown in bug #6734 from Daniele Varrazzo: the preassignment logic couldn't
take any other indexes into account so it could choose a conflicting name.

To fix, add a field to IndexStmt that allows it to carry a comment to be
assigned to the new index. (This isn't a user-exposed feature of CREATE
INDEX, only an internal option.) Now we don't need preassignment of index
names in any situation.

I also took the opportunity to refactor DefineIndex to accept the IndexStmt
as such, rather than passing all its fields individually in a mile-long
parameter list.

Back-patch to 9.2, but no further, because it seems too dangerous to change
IndexStmt or DefineIndex's API in released branches. The bug exists back
to 9.0 where CREATE TABLE LIKE grew the ability to copy comments, but given
the lack of prior complaints we'll just let it go unfixed before 9.2.

c92be3c0

15 Jul, 2012 1 commit

Prevent corner-case core dump in rfree(). · 54fd196f

Tom Lane authored Jul 15, 2012

rfree() failed to cope with the case that pg_regcomp() had initialized the
regex_t struct but then failed to allocate any memory for re->re_guts (ie,
the first malloc call in pg_regcomp() failed). It would try to touch the
guts struct anyway, and thus dump core. This is a sufficiently narrow
corner case that it's not surprising it's never been seen in the field;
but still a bug is a bug, so patch all active branches.

Noted while investigating whether we need to call pg_regfree after a
failure return from pg_regcomp. Other than this bug, it turns out we
don't, so adjust comments appropriately.

54fd196f

14 Jul, 2012 3 commits

Don't initialize TLI variable to -1, as TimeLineID is unsigned. · 2686da9d

Heikki Linnakangas authored Jul 14, 2012

This was causing a compiler warning with Solaris compiler. Use 0 instead.
The variable is initialized just for the sake of tidyness and/or debugging,
it's not used for anything before setting it to a real value.

Per report and suggestion from Peter Eisentraut.

2686da9d

Print the name of the WAL file containing latest REDO ptr in pg_controldata. · 6c349a56
Heikki Linnakangas authored Jul 14, 2012
```
This makes it easier to determine how far back you need to keep archived WAL
files, to restore from a backup.

Fujii Masao
```
6c349a56
Add link to PEP 394 regarding python2 vs python3 naming · 8e708e5e
Peter Eisentraut authored Jul 14, 2012

8e708e5e

13 Jul, 2012 2 commits

Add fsync capability to initdb, and use sync_file_range() if available. · b966dd6c

Tom Lane authored Jul 13, 2012

Historically we have not worried about fsync'ing anything during initdb
(in fact, initdb intentionally passes -F to each backend launch to prevent
it from fsync'ing).  But with filesystems getting more aggressive about
caching data, that's not such a good plan anymore.  Make initdb do a pass
over the finished data directory tree to fsync everything.  For testing
purposes, the -N/--nosync flag can be used to restore the old behavior.

Also, testing shows that on Linux, sync_file_range() is much faster than
posix_fadvise() for hinting to the kernel that an fsync is coming,
apparently because the latter blocks on a rather small request queue while
the former doesn't.  So use this function if available in initdb, and also
in the backend's pg_flush_data() (where it currently will affect only the
speed of CREATE DATABASE's cloning step).

We will later make pg_regress invoke initdb with the --nosync flag
to avoid slowing down cases such as "make check" in contrib.  But
let's not do so until we've shaken out any portability issues in this
patch.

Jeff Davis, reviewed by Andres Freund

b966dd6c

Cosmetic cleanup of ginInsertValue(). · 1a9405d2

Tom Lane authored Jul 13, 2012

Make it clearer that the passed stack mustn't be empty, and that we
are not supposed to fall off the end of the stack in the main loop.
Tighten the loop that extracts the root block number, too.

Markus Wanner and Tom Lane

1a9405d2

12 Jul, 2012 4 commits

Avoid extra newlines in XML mapping in table forest mode · a84bf492
Peter Eisentraut authored Jul 12, 2012
```
found by P. Broennimann
```
a84bf492

Skip text->binary conversion of unnecessary columns in contrib/file_fdw. · a36088bc

Tom Lane authored Jul 12, 2012

When reading from a text- or CSV-format file in file_fdw, the datatype
input routines can consume a significant fraction of the runtime.
Often, the query does not need all the columns, so we can get a useful
speed boost by skipping I/O conversion for unnecessary columns.

To support this, add a "convert_selectively" option to the core COPY code.
This is undocumented and not accessible from SQL (for now, anyway).

Etsuro Fujita, reviewed by KaiGai Kohei

a36088bc

Remove 'x =- 1' check for pgindent, not needed, per report from Andrew · 76720bdf
Bruce Momjian authored Jul 12, 2012
```
Dunstan.
```
76720bdf

Fix memory and file descriptor leaks in pg_receivexlog/pg_basebackup · 058a050e

Magnus Hagander authored Jul 12, 2012

When the internal loop mode was added, freeing memory and closing
filedescriptors before returning became important, and a few cases
in the code missed that.

Fujii Masao

058a050e

11 Jul, 2012 3 commits

Add array_remove() and array_replace() functions. · 84a42560

Tom Lane authored Jul 11, 2012

These functions support removing or replacing array element value(s)
matching a given search value. Although intended mainly to support a
future array-foreign-key feature, they seem useful in their own right.

Marco Nenciarini and Gabriele Bartolini, reviewed by Alex Hunsaker

84a42560

Document that Log-Shipping Standby Servers cannot be upgraded by · f9951252
Bruce Momjian authored Jul 10, 2012
```
pg_upgrade.

Backpatch to 9.2.
```
f9951252
Fix bogus macro definition. · 01215d61
Tom Lane authored Jul 10, 2012
```
Per buildfarm complaints.
```
01215d61

10 Jul, 2012 6 commits

Add comments about additional mule-internal charsets from emacs's · 1c7a7faa

Tatsuo Ishii authored Jul 11, 2012

source code(lisp/international/mule-conf.el).  These charsets have not
been supported up to now anyway, so this is just for adding
commentary.  Also add mention that we follow emacs's implementation,
not xemacs's.

1c7a7faa

Fix ASCII case in pg_wchar2mule_with_len. · 60e9c224
Tom Lane authored Jul 10, 2012
```
Also some cosmetic improvements for wchar-to-mblen patch.
```
60e9c224

plperl: Skip setting UTF8 flag when in SQL_ASCII encoding · 379607c9

Alvaro Herrera authored Jul 09, 2012

When in SQL_ASCII encoding, strings passed around are not necessarily
UTF8-safe.  We had already fixed this in some places, but it looks like
we missed some.

I had to backpatch Peter Eisentraut's a8b92b60 to 9.1 in order for this
patch to cherry-pick more cleanly.

Patch from Alex Hunsaker, tweaked by Kyotaro HORIGUCHI and myself.

Some desultory cleanup and comment addition by me, during patch review.

Per bug report from Christoph Berg in
20120209102116.GA14429@msgid.df7cb.de

379607c9

perltidy adjustments to new file · fc4a8a6d
Alvaro Herrera authored Jul 09, 2012

fc4a8a6d

Re-implement extraction of fixed prefixes from regular expressions. · 628cbb50

Tom Lane authored Jul 10, 2012

To generate btree-indexable conditions from regex WHERE conditions (such as
WHERE indexed_col ~ '^foo'), we need to be able to identify any fixed
prefix that a regex might have; that is, find any string that must be a
prefix of all strings satisfying the regex. We used to do that with
entirely ad-hoc code that looked at the source text of the regex. It
didn't know very much about regex syntax, which mostly meant that it would
fail to identify some optimizable cases; but Viktor Rosenfeld reported that
it would produce actively wrong answers for quantified parenthesized
subexpressions, such as '^(foo)?bar'. Rather than trying to extend the
ad-hoc code to cover this, let's get rid of it altogether in favor of
identifying prefixes by examining the compiled form of a regex.

To do this, I've added a new entry point "pg_regprefix" to the regex library;
hopefully it is defined in a sufficiently general fashion that it can remain
in the library when/if that code gets split out as a standalone project.

Since this bug has been there for a very long time, this fix needs to get
back-patched. However it depends on some other recent commits (particularly
the addition of wchar-to-database-encoding conversion), so I'll commit this
separately and then go to work on back-porting the necessary fixes.

628cbb50

Refactor pattern_fixed_prefix() to avoid dealing in incomplete patterns. · 00dac600

Tom Lane authored Jul 09, 2012

Previously, pattern_fixed_prefix() was defined to return whatever fixed
prefix it could extract from the pattern, plus the "rest" of the pattern.
That definition was sensible for LIKE patterns, but not so much for
regexes, where reconstituting a valid pattern minus the prefix could be
quite tricky (certainly the existing code wasn't doing that correctly).
Since the only thing that callers ever did with the "rest" of the pattern
was to pass it to like_selectivity() or regex_selectivity(), let's cut out
the middle-man and just have pattern_fixed_prefix's subroutines do this
directly. Then pattern_fixed_prefix can return a simple selectivity
number, and the question of how to cope with partial patterns is removed
from its API specification.

While at it, adjust the API spec so that callers who don't actually care
about the pattern's selectivity (which is a lot of them) can pass NULL for
the selectivity pointer to skip doing the work of computing a selectivity
estimate.

This patch is only an API refactoring that doesn't actually change any
processing, other than allowing a little bit of useless work to be skipped.
However, it's necessary infrastructure for my upcoming fix to regex prefix
extraction, because after that change there won't be any simple way to
identify the "rest" of the regex, not even to the low level of fidelity
needed by regex_selectivity. We can cope with that if regex_fixed_prefix
and regex_selectivity communicate directly, but not if we have to work
within the old API. Hence, back-patch to all active branches.

00dac600

09 Jul, 2012 1 commit

Fix planner to pass correct collation to operator selectivity estimators. · e7ef6d7e

Tom Lane authored Jul 08, 2012

We can do this without creating an API break for estimation functions
by passing the collation using the existing fmgr functionality for
passing an input collation as a hidden parameter.

The need for this was foreseen at the outset, but we didn't get around to
making it happen in 9.1 because of the decision to sort all pg_statistic
histograms according to the database's default collation. That meant that
selectivity estimators generally need to use the default collation too,
even if they're estimating for an operator that will do something
different. The reason it's suddenly become more interesting is that
regexp interpretation also uses a collation (for its LC_TYPE not LC_COLLATE
property), and we no longer want to use the wrong collation when examining
regexps during planning. It's not that the selectivity estimate is likely
to change much from this; rather that we are thinking of caching compiled
regexps during planner estimation, and we won't get the intended benefit
if we cache them with a different collation than the executor will use.

Back-patch to 9.1, both because the regexp change is likely to get
back-patched and because we might as well get this right in all
collation-supporting branches, in case any third-party code wants to
rely on getting the collation. The patch turns out to be minuscule
now that I've done it ...

e7ef6d7e

07 Jul, 2012 1 commit

Simplify and document regex library's compact-NFA representation. · c6aae304

Tom Lane authored Jul 07, 2012

The previous coding abused the first element of a cNFA state's arcs list
to hold a per-state flag bit, which was confusing, undocumented, and not
even particularly efficient. Get rid of that in favor of a separate
"stflags" vector. Since there's only one bit in use, I chose to allocate a
char per state; we could possibly replace this with a bitmap at some point,
but that would make accesses a little slower. It's already about 8X
smaller than before, so let's not get overly tense.

Also document the representation better than it was before, which is to say
not at all.

This patch is a byproduct of investigations towards extracting a "fixed
prefix" string from the compact-NFA representation of regex patterns.
Might need to back-patch it if we decide to back-patch that fix, but for
now it's just code cleanup so I'll just put it in HEAD.

c6aae304