1. 27 Feb, 2021 1 commit
  2. 26 Feb, 2021 6 commits
    • Doc: further clarify libpq's description of connection string URIs. · 4e90052c
      Tom Lane authored
      Break the synopsis into named parts to make it less confusing.
      Make more than zero effort at applying SGML markup.  Do a bit
      of copy-editing of nearby text.
      
      The synopsis revision is by Alvaro Herrera and Paul Förster,
      the rest is my fault.  Back-patch to v10 where multi-host
      connection strings appeared.
      
      Discussion: https://postgr.es/m/6E752D6B-487C-463E-B6E2-C32E7FB007EA@gmail.com
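
      For illustration, a multi-host connection URI of the form the revised
      synopsis describes (user, host names, ports, and database here are
      placeholder values):

          postgresql://alice@host1:5432,host2:5433/mydb?target_session_attrs=any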
    • Improve memory management in regex compiler. · 0fc1af17
      Tom Lane authored
      The previous logic here created a separate pool of arcs for each
      state, so that the out-arcs of each state were physically stored
      within it.  Perhaps this choice was driven by trying to not include
      a "from" pointer within each arc; but Spencer gave up on that idea
      long ago, and it's hard to see what the value is now.  The approach
      turns out to be fairly disastrous in terms of memory consumption,
      though.  In the first place, NFAs built by this engine seem to have
      about 4 arcs per state on average, with a majority having only one
      or two out-arcs.  So pre-allocating 10 out-arcs for each state
      already bloats memory use by a factor of two or more.  Worse, the NFA
      optimization phase moves arcs around with abandon.  In a large NFA,
      some of the states will have hundreds of out-arcs, so towards the
      end of the optimization phase we have a significant number of states
      whose arc pools have room for hundreds of arcs each, even though only
      a few of those arcs are in use.  We have seen real-world regexes in
      which this effect bloats the memory requirement by 25X or even more.
      
      Hence, get rid of the per-state arc pools in favor of a single arc
      pool for the whole NFA, with variable-sized allocation batches
      instead of always asking for 10 at a time.  While we're at it,
      let's batch the allocations of state structs too, to further reduce
      the malloc traffic.
      
      This incidentally allows moveouts() to be optimized in a similar
      way to moveins(): when moving an arc to another state, it's now
      valid to just re-link the same arc struct into a different outchain,
      where before the code invariants required us to make a physically
      new arc and then free the old one.
      
      These changes reduce the regex compiler's typical space consumption
      for average-size regexes by about a factor of two, and much more for
      large or complicated regexes.  In a large test set of real-world
      regexes, we formerly had half a dozen cases that failed with "regular
      expression too complex" due to exceeding the REG_MAX_COMPILE_SPACE
      limit (about 150MB); we would have had to raise that limit to
      something close to 400MB to make them work with the old code.  Now,
      none of those cases need more than 13MB to compile.  Furthermore,
      the test set is about 10% faster overall due to less malloc traffic.
      
      Discussion: https://postgr.es/m/168861.1614298592@sss.pgh.pa.us
    • Extend a test case a little · b3a9e989
      Peter Eisentraut authored
      This will possibly help a subsequent patch by making sure the notice
      messages are distinct so that it's clear that they come out in the
      right order.
      
      Author: Fabien COELHO <coelho@cri.ensmp.fr>
      Discussion: https://www.postgresql.org/message-id/alpine.DEB.2.21.1904240654120.3407%40lancre
    • doc: Improve {archive,restore}_command for compressed logs · 329784e1
      Michael Paquier authored
      The commands mentioned in the docs with gzip and gunzip did not suffix
      the archives with ".gz" and used inconsistent paths for the archives,
      which can be confusing.
      
      Reported-by: Philipp Gramzow
      Reviewed-by: Fujii Masao
      Discussion: https://postgr.es/m/161397938841.15451.13129264141285167267@wrigleys.postgresql.org
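
      The corrected documentation examples take roughly this form in
      postgresql.conf (the archive directory is a placeholder path):

          archive_command = 'gzip < %p > /mnt/server/archivedir/%f.gz'
          restore_command = 'gunzip < /mnt/server/archivedir/%f.gz > %p'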
    • Revert "pg_collation_actual_version() -> pg_collation_current_version()." · 8556267b
      Thomas Munro authored
      This reverts commit 9cf184cc.  The name change was less well
      received than anticipated.
      
      Discussion: https://postgr.es/m/afcfb97e-88a1-a540-db95-6c573b93bc2b%40eisentraut.org
    • Fix list-manipulation bug in WITH RECURSIVE processing. · 80ca8464
      Tom Lane authored
      makeDependencyGraphWalker and checkWellFormedRecursionWalker
      thought they could hold onto a pointer to a list's first
      cons cell while the list was modified by recursive calls.
      That was okay when the cons cell was actually separately
      palloc'd ... but since commit 1cff1b95, it's quite unsafe,
      leading to core dumps or incorrect complaints of faulty
      WITH nesting.
      
      In the field, this would require at least a seven-deep WITH nest
      to cause an issue, but enabling DEBUG_LIST_MEMORY_USAGE
      allows the bug to be seen with lesser nesting depths.
      
      Per bug #16801 from Alexander Lakhin.  Back-patch to v13.
      
      Michael Paquier and Tom Lane
      
      Discussion: https://postgr.es/m/16801-393c7922143eaa4d@postgresql.org
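
      For illustration, a contrived query of roughly the shape involved
      (not taken from the bug report) would nest WITH clauses seven deep:

          WITH RECURSIVE w1 AS (
            WITH w2 AS (
              WITH w3 AS (
                WITH w4 AS (
                  WITH w5 AS (
                    WITH w6 AS (
                      WITH w7 AS (SELECT 1 AS x)
                      SELECT x FROM w7)
                    SELECT x FROM w6)
                  SELECT x FROM w5)
                SELECT x FROM w4)
              SELECT x FROM w3)
            SELECT x FROM w2)
          SELECT x FROM w1;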
  3. 25 Feb, 2021 8 commits
    • VACUUM VERBOSE: Count "newly deleted" index pages. · 23763618
      Peter Geoghegan authored
      Teach VACUUM VERBOSE to report on pages deleted by the _current_ VACUUM
      operation -- these are newly deleted pages.  VACUUM VERBOSE continues to
      report on the total number of deleted pages in the entire index (no
      change there).  The former is a subset of the latter.
      
      The distinction between these two categories of deleted index page
      only arises with index AMs where page deletion is supported and is
      decoupled from page recycling for performance reasons.
      
      This is follow-up work to commit e5d8a999, which made nbtree store
      64-bit XIDs (not 32-bit XIDs) in pages at the point at which they're
      deleted.  Note that the btm_last_cleanup_num_delpages metapage field
      added by that commit usually gets set to pages_newly_deleted.  The
      exceptions (the scenarios in which they're not equal) all seem to be
      tricky cases for the implementation (of page deletion and recycling) in
      general.
      
      Author: Peter Geoghegan <pg@bowt.ie>
      Discussion: https://postgr.es/m/CAH2-WznpdHvujGUwYZ8sihX%3Dd5u-tRYhi-F4wnV2uN2zHpMUXw%40mail.gmail.com
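
      The new counter shows up in ordinary use, e.g. (the table name is a
      placeholder):

          VACUUM (VERBOSE) some_table;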
    • Doc: remove src/backend/regex/re_syntax.n. · 301ed881
      Tom Lane authored
      We aren't publishing this file as documentation, and it's been
      much more haphazardly maintained than the real docs in func.sgml,
      so let's just drop it.  I think the only reason I included it in
      commit 7bcc6d98 was that the Berkeley-era sources had had a man
      page in this directory.
      
      Discussion: https://postgr.es/m/4099447.1614186542@sss.pgh.pa.us
    • Change regex \D and \W shorthands to always match newlines. · 7dc13a0f
      Tom Lane authored
      Newline is certainly not a digit, nor a word character, so it is
      sensible that it should match these complemented character classes.
      Previously, \D and \W acted that way by default, but in
      newline-sensitive mode ('n' or 'p' flag) they did not match newlines.
      
      This behavior was previously forced because explicit complemented
      character classes don't match newlines in newline-sensitive mode;
      but as of the previous commit that implementation constraint no
      longer exists.  It seems useful to change this because the primary
      real-world use for newline-sensitive mode seems to be to match the
      default behavior of other regex engines such as Perl and JavaScript
      ... and their default behavior is that these match newlines.
      
      The old behavior can be kept by writing an explicit complemented
      character class, i.e. [^[:digit:]] or [^[:word:]].  (This means
      that \D and \W are not exactly equivalent to those strings, but
      they weren't anyway.)
      
      Discussion: https://postgr.es/m/3220564.1613859619@sss.pgh.pa.us
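
      For illustration, in newline-sensitive mode (written here with the
      embedded '(?n)' option, assuming standard_conforming_strings is on):

          SELECT E'\n' ~ '(?n)\D';             -- now true: \D matches newline
          SELECT E'\n' ~ '(?n)[^[:digit:]]';   -- still false, as before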
    • Allow complemented character class escapes within regex brackets. · 2a0af7fe
      Tom Lane authored
      The complement-class escapes \D, \S, \W are now allowed within
      bracket expressions.  There is no semantic difficulty with doing
      that, but the rather hokey macro-expansion-based implementation
      previously used here couldn't cope.
      
      Also, invent "word" as an allowed character class name, thus "\w"
      is now equivalent to "[[:word:]]" outside brackets, or "[:word:]"
      within brackets.  POSIX allows such implementation-specific
      extensions, and the same name is used in e.g. bash.
      
      One surprising compatibility issue this raises is that constructs
      such as "[\w-_]" are now disallowed, as our documentation has always
      said they should be: character classes can't be endpoints of a range.
      Previously, because \w was just a macro for "[:alnum:]_", such a
      construct was read as "[[:alnum:]_-_]", so it was accepted so long as
      the character after "-" was numerically greater than or equal to "_".
      
      Some implementation cleanup along the way:
      
      * Remove the lexnest() hack, and in consequence clean up wordchrs()
      to not interact with the lexer.
      
      * Fix colorcomplement() to not be O(N^2) in the number of colors
      involved.
      
      * Get rid of useless-as-far-as-I-can-see calls to element()
      on single-character character element names in brackpart().
      element() always maps these to the character itself, and things
      would be quite broken if it didn't --- should "[a]" match something
      different than "a" does?  Besides, the shortcut path in brackpart()
      wasn't doing this anyway, making it even more inconsistent.
      
      Discussion: https://postgr.es/m/2845172.1613674385@sss.pgh.pa.us
      Discussion: https://postgr.es/m/3220564.1613859619@sss.pgh.pa.us
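
      For illustration (contrived patterns):

          SELECT 'x!' ~ '[\W]';              -- escape now legal inside brackets
          SELECT 'foo_1' ~ '^[[:word:]]+$';  -- new "word" class name
          -- SELECT 'a' ~ '[\w-_]';          -- now an error: \w can't bound a range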
    • Improve tab-completion for TRUNCATE. · 6b40d9bd
      Fujii Masao authored
      Author: Kota Miyake
      Reviewed-by: Muhammad Usama
      Discussion: https://postgr.es/m/f5d30053d00dcafda3280c9e267ecb0f@oss.nttdata.com
    • doc: Mention PGDATABASE as supported by pgbench · a6f8dc47
      Michael Paquier authored
      PGHOST, PGPORT and PGUSER were already mentioned, but not PGDATABASE.
      Like 5aaa584f, backpatch down to 12.
      
      Reported-by: Christophe Courtois
      Discussion: https://postgr.es/m/161399398648.21711.15387267201764682579@wrigleys.postgresql.org
      Backpatch-through: 12
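
      For illustration, with a placeholder database name:

          PGDATABASE=bench pgbench --select-only --time=10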
    • Use full 64-bit XIDs in deleted nbtree pages. · e5d8a999
      Peter Geoghegan authored
      Otherwise we risk "leaking" deleted pages by making them non-recyclable
      indefinitely.  Commit 6655a729 did the same thing for deleted pages in
      GiST indexes.  That work was used as a starting point here.
      
      Stop storing an XID indicating the oldest btpo.xact across all deleted
      though unrecycled pages in nbtree metapages.  There is no longer any
      reason to track that oldest XID; it only ever mattered when wraparound
      was something _bt_vacuum_needs_cleanup() had to consider.
      
      The btm_oldest_btpo_xact metapage field has been repurposed and renamed.
      It is now btm_last_cleanup_num_delpages, which is used to remember how
      many non-recycled deleted pages remain from the last VACUUM (in practice
      its value is usually the precise number of pages that were _newly
      deleted_ during the specific VACUUM operation that last set the field).
      
      The general idea behind storing btm_last_cleanup_num_delpages is to use
      it to give _some_ consideration to non-recycled deleted pages inside
      _bt_vacuum_needs_cleanup() -- though never too much.  We only really
      need to avoid leaving a truly excessive number of deleted pages in an
      unrecycled state forever.  We only do this to cover certain narrow cases
      where no other factor makes VACUUM do a full scan, and yet the index
      continues to grow (and so actually misses out on recycling existing
      deleted pages).
      
      These metapage changes result in a clear user-visible benefit: We no
      longer trigger full index scans during VACUUM operations solely due to
      the presence of only 1 or 2 known deleted (though unrecycled) blocks
      from a very large index.  All that matters now is keeping the costs and
      benefits in balance over time.
      
      Fix an issue that has been around since commit 857f9c36, which added the
      "skip full scan of index" mechanism (i.e. the _bt_vacuum_needs_cleanup()
      logic).  The accuracy of btm_last_cleanup_num_heap_tuples accidentally
      hinged upon _when_ the source value gets stored.  We now always store
      btm_last_cleanup_num_heap_tuples in btvacuumcleanup().  This fixes the
      issue because IndexVacuumInfo.num_heap_tuples (the source field) is
      expected to accurately indicate the state of the table _after_ the
      VACUUM completes inside btvacuumcleanup().
      
      A backpatchable fix cannot easily be extracted from this commit.  A
      targeted fix for the issue will follow in a later commit, though that
      won't happen today.
      
      I (pgeoghegan) have chosen to remove any mention of deleted pages in the
      documentation of the vacuum_cleanup_index_scale_factor GUC/param, since
      the presence of deleted (though unrecycled) pages is no longer of much
      concern to users.  The vacuum_cleanup_index_scale_factor description in
      the docs now seems rather unclear in any case, and it should probably be
      rewritten in the near future.  Perhaps some passing mention of page
      deletion will be added back at the same time.
      
      Bump XLOG_PAGE_MAGIC due to nbtree WAL records using full XIDs now.
      
      Author: Peter Geoghegan <pg@bowt.ie>
      Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
      Discussion: https://postgr.es/m/CAH2-WznpdHvujGUwYZ8sihX=d5u-tRYhi-F4wnV2uN2zHpMUXw@mail.gmail.com
    • Fix relcache reference leak introduced by ce0fdbfe. · 8a4f9522
      Amit Kapila authored
      Author: Sawada Masahiko
      Reviewed-by: Amit Kapila
      Discussion: https://postgr.es/m/CAD21AoA7ZEfsOXQ9HQqMv3QYGsEm2H5Wk5ic5S=mvzDf-3a3SA@mail.gmail.com
  4. 24 Feb, 2021 3 commits
  5. 23 Feb, 2021 7 commits
  6. 22 Feb, 2021 13 commits
  7. 21 Feb, 2021 2 commits
    • Avoid generating extra subre tree nodes for capturing parentheses. · ea1268f6
      Tom Lane authored
      Previously, each pair of capturing parentheses gave rise to a separate
      subre tree node, whose only function was to identify that we ought to
      capture the match details for this particular sub-expression.  In
      most cases we don't really need that, since we can perfectly well
      put a "capture this" annotation on the child node that does the real
      matching work.  As with the two preceding commits, the main value
      of this is to avoid generating and optimizing an NFA for a tree node
      that's not really pulling its weight.
      
      The chosen data representation only allows one capture annotation
      per subre node.  In the legal-per-spec, but seemingly not very useful,
      case where there are multiple capturing parens around the exact same
      bit of the regex (i.e. "((xyz))"), wrap the child node in N-1 capture
      nodes that act the same as before.  We could work harder at that but
      I'll refrain, pending some evidence that such cases are worth troubling
      over.
      
      In passing, improve the comments in regex.h to say what all the
      different re_info bits mean.  Some of them were pretty obvious
      but others not so much, so reverse-engineer some documentation.
      
      This is part of a patch series that in total reduces the regex engine's
      runtime by about a factor of four on a large corpus of real-world regexes.
      
      Patch by me, reviewed by Joel Jacobson
      
      Discussion: https://postgr.es/m/1340281.1613018383@sss.pgh.pa.us
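
      The doubled-parens case continues to capture as before; for
      illustration:

          SELECT regexp_match('xyz', '((xyz))');   -- returns {xyz,xyz}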
    • Convert regex engine's subre tree from binary to N-ary style. · 58104308
      Tom Lane authored
      Instead of having left and right child links in subre structs,
      have a single child link plus a sibling link.  Multiple children
      of a tree node are now reached by chasing the sibling chain.
      
      The beneficiary of this is alternation tree nodes.  A regular
      expression with N (>1) branches is now represented by one alternation
      node with N children, rather than a tree that includes N alternation
      nodes as well as N children.  While the old representation didn't
      really cost anything extra at execution time, it was pretty horrid
      for compilation purposes, because each of the alternation nodes had
      its own NFA, each of which we then went to the trouble of optimizing
      separately.
      (To make matters worse, all of those NFAs described the entire
      alternation pattern, not just the portion of it that one might
      expect from the tree structure.)
      
      We continue to require concatenation nodes to have exactly two
      children.  This data structure is now prepared to support more,
      but the executor's logic would need some careful redesign, and
      it's not clear that a lot of benefit could be had.
      
      This is part of a patch series that in total reduces the regex engine's
      runtime by about a factor of four on a large corpus of real-world regexes.
      
      Patch by me, reviewed by Joel Jacobson
      
      Discussion: https://postgr.es/m/1340281.1613018383@sss.pgh.pa.us
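
      For illustration, an alternation such as the one below is now compiled
      as a single node with four children instead of a chain of binary
      alternation nodes (matching behavior is unchanged):

          SELECT 'c' ~ '^(a|b|c|d)$';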