Commits · f6b8f19a084ce949522fcbc940dc116c034cfc47 · Abuhujair Javed / Postgres FD Implementation

06 Apr, 2021 1 commit

Allocate access strategy in parallel VACUUM workers. · f6b8f19a

Peter Geoghegan authored Apr 05, 2021

Commit 49f49def took entirely the wrong approach to fixing this issue.
Just allocate a local buffer access strategy in each individual worker
instead of trying to propagate state.  This state was never propagated
by parallel VACUUM in the first place.

It looks like the only reason that this worked following commit 40d964ec
was that it involved static global variables, which are initialized to 0
per the C standard.

A more comprehensive fix may be necessary, even on HEAD.  This fix
should at least get the buildfarm green once again.

Thanks once again to Thomas Munro for continued off-list assistance with
the issue.

f6b8f19a

05 Apr, 2021 9 commits

Support INCLUDE'd columns in SP-GiST. · 09c1c6ab

Tom Lane authored Apr 05, 2021

Not much to say here: does what it says on the tin.
We steal a previously-always-zero bit from the nextOffset
field of leaf index tuples in order to track whether there
is a nulls bitmap.  Otherwise it works about like included
columns in other index types.

Pavel Borisov, reviewed by Andrey Borodin and Anastasia Lubennikova,
and rather heavily editorialized on by me

Discussion: https://postgr.es/m/CALT9ZEFi-vMp4faht9f9Junb1nO3NOSjhpxTmbm1UGLMsLqiEQ@mail.gmail.com

09c1c6ab

Propagate parallel VACUUM's buffer access strategy. · 49f49def

Peter Geoghegan authored Apr 05, 2021

Parallel VACUUM relied on global variable state from the leader process
being propagated to workers on fork().  Commit b4af70cb removed most
uses of global variables inside vacuumlazy.c, but did not account for
the buffer access strategy state.

To fix, propagate the state through shared memory instead.

Per buildfarm failures on elver, curculio, and morepork.

Many thanks to Thomas Munro for off-list assistance with this issue.

49f49def

Simplify state managed by VACUUM. · b4af70cb

Peter Geoghegan authored Apr 05, 2021

Reorganize the state struct used by VACUUM -- group related items
together to make it easier to understand. Also stop relying on stack
variables inside lazy_scan_heap() -- move those into the state struct
instead. Doing things this way simplifies large groups of related
functions whose function signatures had a lot of unnecessary redundancy.

Switch over to using int64 for the struct fields used to count things
that are reported to the user via log_autovacuum and VACUUM VERBOSE
output. We were using double, but that doesn't seem to have any
advantages. Using int64 makes it possible to add assertions that verify
that the first pass over the heap (pruning) encounters precisely the
same number of LP_DEAD items that get deleted from indexes later on, in
the second pass over the heap. These assertions will be added in later
commits.

Finally, adjust the signatures of functions with IndexBulkDeleteResult
pointer arguments in cases where there was ambiguity about whether or
not the argument relates to a single index or all indexes. Functions
now use the idiom that both ambulkdelete() and amvacuumcleanup() have
always used (where appropriate): accept a mutable IndexBulkDeleteResult
pointer argument, and return a result IndexBulkDeleteResult pointer to
caller.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CAH2-WzkeOSYwC6KNckbhk2b1aNnWum6Yyn0NKP9D-Hq1LGTDPw@mail.gmail.com

b4af70cb

Add pg_read_all_data and pg_write_all_data roles · 6c3ffd69

Stephen Frost authored Apr 05, 2021

A commonly requested use-case is to have a role who can run an
unfettered pg_dump without having to explicitly GRANT that user access
to all tables, schemas, et al, without that role being a superuser.
This address that by adding a "pg_read_all_data" role which implicitly
gives any member of this role SELECT rights on all tables, views and
sequences, and USAGE rights on all schemas.

As there may be cases where it's also useful to have a role who has
write access to all objects, pg_write_all_data is also introduced and
gives users implicit INSERT, UPDATE and DELETE rights on all tables,
views and sequences.

These roles can not be logged into directly but instead should be
GRANT'd to a role which is able to log in.  As noted in the
documentation, if RLS is being used then an administrator may (or may
not) wish to set BYPASSRLS on the login role which these predefined
roles are GRANT'd to.

Reviewed-by: Georgios Kokolatos
Discussion: https://postgr.es/m/20200828003023.GU29590@tamriel.snowman.net

6c3ffd69

Shut down transaction tracking at startup process exit. · ad8b6749

Fujii Masao authored Apr 06, 2021

Maxim Orlov reported that the shutdown of standby server could result in
the following assertion failure. The cause of this issue was that,
when the shutdown caused the startup process to exit, recovery-time
transaction tracking was not shut down even if it's already initialized,
and some locks the tracked transactions were holding could not be released.
At this situation, if other process was invoked and the PGPROC entry that
the startup process used was assigned to it, it found such unreleased locks
and caused the assertion failure, during the initialization of it.

    TRAP: FailedAssertion("SHMQueueEmpty(&(MyProc->myProcLocks[i]))"

This commit fixes this issue by making the startup process shut down
transaction tracking and release all locks, at the exit of it.

Back-patch to all supported branches.

Reported-by: Maxim Orlov
Author: Fujii Masao
Reviewed-by: Maxim Orlov
Discussion: https://postgr.es/m/ad4ce692cc1d89a093b471ab1d969b0b@postgrespro.ru

ad8b6749

Align some terms in arch-dev.sgml to glossary · 6734e806

Alvaro Herrera authored Apr 05, 2021

This mostly adds links to the glossary to the existing text, instead of
using <firstterm>.  Heikki left this out of 29ad6595 out of
stylistic concerns; these have since been addressed.

Author: Jürgen Purtz <juergen@purtz.de>
Discussion: https://postgr.es/m/67d7240f-8596-83fc-5e15-af06c128a0f5@purtz.de

6734e806

Renumber cursor option flags · a63dd8af

Peter Eisentraut authored Apr 05, 2021

Move the planner-control flags up so that there is more room for parse
options. Some pending patches need some room there, so do this
renumbering separately so that there is less potential for conflicts.

a63dd8af

Fix typo in collationcmds.c · 9f6f1f9b

Michael Paquier authored Apr 05, 2021

Introduced by 51e225da.

Author: Anton Voloshin
Discussion: https://postgr.es/m/05477da0-703c-7de7-998c-5879738e8f39@postgrespro.ru

9f6f1f9b

Refactor all TAP test suites doing connection checks · c50624cd

Michael Paquier authored Apr 05, 2021

This commit refactors more TAP tests to adapt with the recent
introduction of connect_ok() and connect_fails() in PostgresNode,
introduced by 0d1a3343.  This changes the following test suites to use
the same code paths for connection checks:
- Kerberos
- LDAP
- SSL
- Authentication

Those routines are extended to be able to handle optional parameters
that are set depending on each suite's needs, as of:
- custom SQL query.
- expected stderr matching pattern.
- expected stdout matching pattern.
The new design is extensible with more parameters, and there are some
plans for those routines in the future with checks based on the contents
of the backend logs.

Author: Jacob Champion, Michael Paquier
Discussion: https://postgr.es/m/d17b919e27474abfa55d97786cb9cfadfe2b59e9.camel@vmware.com

c50624cd

04 Apr, 2021 8 commits

Fix more confusion in SP-GiST. · dfc843d4

Tom Lane authored Apr 04, 2021

spg_box_quad_leaf_consistent unconditionally returned the leaf
datum as leafValue, even though in its usage for poly_ops that
value is of completely the wrong type.

In versions before 12, that was harmless because the core code did
nothing with leafValue in non-index-only scans ... but since commit
2a636834, if we were doing a KNN-style scan, spgNewHeapItem would
unconditionally try to copy the value using the wrong datatype
parameters.  Said copying is a waste of time and space if we're not
going to return the data, but it accidentally failed to fail until
I fixed the datatype confusion in ac9099fc.

Hence, change spgNewHeapItem to not copy the datum unless we're
actually going to return it later.  This saves cycles and dodges
the question of whether lossy opclasses are returning the right
type.  Also change spg_box_quad_leaf_consistent to not return
data that might be of the wrong type, as insurance against
somebody introducing a similar bug into the core code in future.

It seems like a good idea to back-patch these two changes into
v12 and v13, although I'm afraid to change spgNewHeapItem's
mistaken idea of which datatype to use in those branches.

Per buildfarm results from ac9099fc.

Discussion: https://postgr.es/m/3728741.1617381471@sss.pgh.pa.us

dfc843d4

Fix confusion in SP-GiST between attribute type and leaf storage type. · ac9099fc

Tom Lane authored Apr 04, 2021

According to the documentation, the attType passed to the opclass
config function (and also relied on by the core code) is the type
of the heap column or expression being indexed. But what was
actually being passed was the type stored for the index column.
This made no difference for user-defined SP-GiST opclasses,
because we weren't allowing the STORAGE clause of CREATE OPCLASS
to be used, so the two types would be the same. But it's silly
not to allow that, seeing that the built-in poly_ops opclass
has a different value for opckeytype than opcintype, and that if you
want to do lossy storage then the types must really be different.
(Thus, user-defined opclasses doing lossy storage had to lie about
what type is in the index.) Hence, remove the restriction, and make
sure that we use the input column type not opckeytype where relevant.

For reasons of backwards compatibility with existing user-defined
opclasses, we can't quite insist that the specified leafType match
the STORAGE clause; instead just add an amvalidate() warning if
they don't match.

Also fix some bugs that would only manifest when trying to return
index entries when attType is different from attLeafType. It's not
too surprising that these have not been reported, because the only
usual reason for such a difference is to store the leaf value
lossily, rendering index-only scans impossible.

Add a src/test/modules module to exercise cases where attType is
different from attLeafType and yet index-only scan is supported.

Discussion: https://postgr.es/m/3728741.1617381471@sss.pgh.pa.us

ac9099fc

Fix bug in brin_minmax_multi_union · d9c5b9a9

Tomas Vondra authored Apr 04, 2021

When calling sort_expanded_ranges() we need to remember the return
value, because the function sorts and also deduplicates the ranges. So
the number of ranges may decrease. brin_minmax_multi_union failed to do
that, which resulted in crashes due to bogus ranges (equal minval/maxval
but not marked as compacted).

Reported-by: Jaime Casanova
Discussion: https://postgr.es/m/20210404052550.GA4376%40ahch-to

d9c5b9a9

Add regression test for minmax-multi macaddr8 type · 4908684d

Tomas Vondra authored Apr 04, 2021

The regression test for BRIN minmax-multi opclasses tested almost all
supported data types, with the exception of macaddr8. So this adds it.

4908684d

Fix order of parameters in BRIN minmax-multi calls · 1dad2a5e

Tomas Vondra authored Apr 04, 2021

The BRIN minmax-multi consistent function incorrectly assumed it can
lookup an operator, and then swap the arguments to get the commutator.
For example <(a,b) would be called as <(b,a) to get >(a,b). This works
when the arguments are of the same type, but with cross-type opclasses
this fails. We can't swap <(float4,float8) arguments, for example.

Fixed by passing arguments in the right order.

Discussion: https://postgr.es/m/CAJKUy5jLZFLCxyxfT%3DMfK5mtPfSzHA1rVLowR-j4RRsFVvKm7A%40mail.gmail.com

1dad2a5e

Fix BRIN minmax-multi distance for inet type · e1fbe118

Tomas Vondra authored Apr 04, 2021

The distance calculation ignored the mask, unlike the inet comparator,
which resulted in negative distance in some cases. Fixed by applying the
mask in brin_minmax_multi_distance_inet. I've considered simply calling
inetmi() to calculate the delta, but that does not consider mask either.

Reviewed-by: Zhihong Yu
Discussion: https://postgr.es/m/1a0a7b9d-9bda-e3a2-7fa4-88f15042a051%40enterprisedb.com

e1fbe118

Fix BRIN minmax-multi distance for timetz type · 7262f242

Tomas Vondra authored Apr 04, 2021

The distance calculation ignored the time zone, so the result of (b-a)
might have ended negative even if (b > a). Fixed by considering the time
zone difference.

Reported-by: Jaime Casanova
Discussion: https://postgr.es/m/CAJKUy5jLZFLCxyxfT%3DMfK5mtPfSzHA1rVLowR-j4RRsFVvKm7A%40mail.gmail.com

7262f242

Fix BRIN minmax-multi distance for interval type · 2b10e0e3

Tomas Vondra authored Apr 04, 2021

The distance calculation for interval type was treating months as having
31 days, which is inconsistent with the interval comparator (using 30
days). Due to this it was possible to get negative distance (b-a) when
(a<b), trigerring an assert.

Fixed by adopting the same logic as interval_cmp_value.

Reported-by: Jaime Casanova
Discussion: https://postgr.es/m/CAJKUy5jKH0Xhneau2mNftNPtTy-BVgQfXc8zQkEvRvBHfeUThQ%40mail.gmail.com

2b10e0e3

03 Apr, 2021 7 commits

Improve psql's behavior when the editor is exited without saving. · 55873a00

Tom Lane authored Apr 03, 2021

When editing the previous query buffer, if the editor is exited
without modifying the temp file then clear the query buffer,
rather than re-loading (and probably re-executing) the previous
query buffer.  This reduces the probability of accidentally
re-executing something you didn't intend to.

Similarly, in "\e file", if the file isn't actually modified
then don't load it into the query buffer.  And in "\ef" and
"\ev", if no changes are made then clear the query buffer
instead of loading the function or view definition into it.

Cases where we fail to invoke the editor at all, or it returns
a nonzero status, are treated like the no-file-modification case.

Laurenz Albe, reviewed by Jacob Champion

Discussion: https://postgr.es/m/0ba3f2a658bac6546d9934ab6ba63a805d46a49b.camel@cybertec.at

55873a00

Improve efficiency of wait event reporting, remove proc.h dependency. · 225a22b1

Andres Freund authored Apr 03, 2021

pgstat_report_wait_start() and pgstat_report_wait_end() required two
conditional branches so far. One to check if MyProc is NULL, the other to
check if pgstat_track_activities is set. As wait events are used around
comparatively lightweight operations, and are inlined (reducing branch
predictor effectiveness), that's not great.

The dependency on MyProc has a second disadvantage: Low-level subsystems, like
storage/file/fd.c, report wait events, but architecturally it is preferable
for them to not depend on inter-process subsystems like proc.h (defining
PGPROC). After this change including pgstat.h (nor obviously its
sub-components like backend_status.h, wait_event.h, ...) does not pull in IPC
related headers anymore.

These goals, efficiency and abstraction, are achieved by having
pgstat_report_wait_start/end() not interact with MyProc, but instead a new
my_wait_event_info variable. At backend startup it points to a local variable,
removing the need to check for MyProc being NULL. During process
initialization my_wait_event_info is redirected to MyProc->wait_event_info. At
shutdown this is reversed. Because wait event reporting now does not need to
know about where the wait event is stored, it does not need to know about
PGPROC anymore.

The removal of the branch for checking pgstat_track_activities is simpler:
Don't check anymore. The cost due to the branch are often higher than the
store - and even if not, pgstat_track_activities is rarely disabled.

The main motivator to commit this work now is that removing the (indirect)
pgproc.h include from pgstat.h simplifies a patch to move statistics reporting
to shared memory (which still has a chance to get into 14).

Author: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20210402194458.2vu324hkk2djq6ce@alap3.anarazel.de

225a22b1

Split backend status and progress related functionality out of pgstat.c. · e1025044

Andres Freund authored Apr 03, 2021

Backend status (supporting pg_stat_activity) and command
progress (supporting pg_stat_progress*) related code is largely
independent from the rest of pgstat.[ch] (supporting views like
pg_stat_all_tables that accumulate data over time). See also
a333476b.

This commit doesn't rename the function names to make the distinction
from the rest of pgstat_ clearer - that'd be more invasive and not
clearly beneficial. If we were to decide to do such a rename at some
point, it's better done separately from moving the code as well.

Robert's review was of an earlier version.
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/20210316195440.twxmlov24rr2nxrg@alap3.anarazel.de

e1025044

Use more verbose matching patterns for errors in SSL TAP tests · 8d3a4c3e

Michael Paquier authored Apr 03, 2021

The TAP tests of src/test/ssl/ have been using rather generic matching
patterns to check some failure scenarios, like "SSL error" or just
"FATAL". These have been introduced in 081bfc19.

Those messages are not wrong per se, but when working on the integration
of new SSL libraries it becomes hard to know if those errors are legit
or not, and existing scenarios may fail in incorrect ways. This commit
makes all those messages more verbose by adding the information
generated by OpenSSL. Fortunately, the same error messages are used for
all the versions supported on HEAD (checked that after running the tests
from 1.0.1 to 1.1.1), so the change is straight-forward.

Reported-by: Jacob Champion, Álvaro Herrera
Discussion: https://postgr.es/m/YGU3AxQh0zBMMW8m@paquier.xyz

8d3a4c3e

Refactor HMAC implementations · e6bdfd97

Michael Paquier authored Apr 03, 2021

Similarly to the cryptohash implementations, this refactors the existing
HMAC code into a single set of APIs that can be plugged with any crypto
libraries PostgreSQL is built with (only OpenSSL currently).  If there
is no such libraries, a fallback implementation is available.  Those new
APIs are designed similarly to the existing cryptohash layer, so there
is no real new design here, with the same logic around buffer bound
checks and memory handling.

HMAC has a dependency on cryptohashes, so all the cryptohash types
supported by cryptohash{_openssl}.c can be used with HMAC.  This
refactoring is an advantage mainly for SCRAM, that included its own
implementation of HMAC with SHA256 without relying on the existing
crypto libraries even if PostgreSQL was built with their support.

This code has been tested on Windows and Linux, with and without
OpenSSL, across all the versions supported on HEAD from 1.1.1 down to
1.0.1.  I have also checked that the implementations are working fine
using some sample results, a custom extension of my own, and doing
cross-checks across different major versions with SCRAM with the client
and the backend.

Author: Michael Paquier
Reviewed-by: Bruce Momjian
Discussion: https://postgr.es/m/X9m0nkEJEzIPXjeZ@paquier.xyz

e6bdfd97

Do not rely on pgstat.h to indirectly include storage/ headers. · 1d9c5d0c

Andres Freund authored Apr 02, 2021

An upcoming patch might remove the (now indirect) proc.h
include (which in turn includes other headers), and it's cleaner for
the modified files to include their dependencies directly anyway...

Discussion: https://postgr.es/m/20210402194458.2vu324hkk2djq6ce@alap3.anarazel.de

1d9c5d0c

Split wait event related code from pgstat.[ch] into wait_event.[ch]. · a333476b

Andres Freund authored Apr 02, 2021

The wait event related code is independent from the rest of the
pgstat.[ch] code, of nontrivial size and changes on a regular
basis. Put it into its own set of files.

As there doesn't seem to be a good pre-existing directory for code
like this, add src/backend/utils/activity.
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/20210316195440.twxmlov24rr2nxrg@alap3.anarazel.de

a333476b

02 Apr, 2021 15 commits

Remove useless Asserts in Result Cache code · 1267d986

David Rowley authored Apr 03, 2021

Testing if an unsigned variable is >= 0 is pretty pointless.

There's likely enough code in remove_cache_entry() to verify the cache
memory accounting is correct in assert enabled builds. These Asserts
were not adding much extra cover, even if they had been checking >= 0 on a
signed variable.

Reported-by: Andres Freund
Discussion: https://postgr.es/m/20210402204734.6mo3nfacnljlicgn@alap3.anarazel.de

1267d986

Use macro MONTHS_PER_YEAR instead of '12' in /ecpg/pgtypeslib · 84bc2b17
Bruce Momjian authored Apr 02, 2021
```
All other places already use MONTHS_PER_YEAR appropriately.

Backpatch-through: 9.6
```
84bc2b17

Detect POLLHUP/POLLRDHUP while running queries. · c30f54ad

Thomas Munro authored Apr 03, 2021

Provide a new GUC check_client_connection_interval that can be used to
check whether the client connection has gone away, while running very
long queries.  It is disabled by default.

For now this uses a non-standard Linux extension (also adopted by at
least one other OS).  POLLRDHUP is not defined by POSIX, and other OSes
don't have a reliable way to know if a connection was closed without
actually trying to read or write.

In future we might consider trying to send a no-op/heartbeat message
instead, but that could require protocol changes.

Author: Sergey Cherkashin <s.cherkashin@postgrespro.ru>
Author: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Tatsuo Ishii <ishii@sraoss.co.jp>
Reviewed-by: Konstantin Knizhnik <k.knizhnik@postgrespro.ru>
Reviewed-by: Zhihong Yu <zyu@yugabyte.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Maksim Milyutin <milyutinma@gmail.com>
Reviewed-by: Tsunakawa, Takayuki/綱川 貴之 <tsunakawa.takay@fujitsu.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> (much earlier version)
Discussion: https://postgr.es/m/77def86b27e41f0efcba411460e929ae%40postgrespro.ru

c30f54ad

Clarify documentation of RESET ROLE · 174edbe9

Joe Conway authored Apr 02, 2021

Command-line options, or previous "ALTER (ROLE|DATABASE) ...
SET ROLE ..." commands, can change the value of the default role
for a session. In the presence of one of these, RESET ROLE will
change the current user identifier to the default role rather
than the session user identifier. Fix the documentation to
reflect this reality. Backpatch to all supported versions.

Author: Nathan Bossart
Reviewed-By: Laurenz Albe, David G. Johnston, Joe Conway
Reported by: Nathan Bossart
Discussion: https://postgr.es/m/flat/925134DB-8212-4F60-8AB1-B1231D750CB4%40amazon.com
Backpatch-through: 9.6

174edbe9

pg_checksums: Fix progress reporting. · 2eb1fc8b

Fujii Masao authored Apr 03, 2021

pg_checksums uses two counters, total size and current size,
to calculate the progress. Previously the progress that
pg_checksums reported could not reach 100% at the end.
The cause of this issue was that the sizes of only pages excluding
new ones in each file were counted as the current size
while the size of each file is counted as the total size.
That is, the total size of all new pages could be reported
as the difference between the total size and current size.

This commit fixes this issue by making pg_checksums count
the sizes of all pages including new ones in each file as
the current size.

Back-patch to v12 where progress reporting was added to pg_checksums.

Reported-by: Shinya Kato
Author: Shinya Kato
Reviewed-by: Fujii Masao
Discussion: https://postgr.es/m/TYAPR01MB289656B1ACA0A5E7CAD07BE3C47A9@TYAPR01MB2896.jpnprd01.prod.outlook.com

2eb1fc8b

Strip file names reported in error messages on Windows, too. · 53aafdb9

Tom Lane authored Apr 02, 2021

Commit dd136052 established a policy that error message FILE items
should include only the base name of the reporting source file, for
uniformity and succinctness. We now observe that some Windows compilers
use backslashes in __FILE__ strings, so truncate at backslashes as well.

This is expected to fix some platform variation in the results of the
new libpq_pipeline test module.

Discussion: https://postgr.es/m/3650140.1617372290@sss.pgh.pa.us

53aafdb9

Fix typo in 6d7a6fea · 1877c9ac
Andrew Dunstan authored Apr 02, 2021
```
Per gripe from Daniel Gustafsson
```
1877c9ac

postgres_fdw: Add option to control whether to keep connections open. · b1be3074

Fujii Masao authored Apr 02, 2021

This commit adds a new option keep_connections that controls
whether postgres_fdw keeps the connections to the foreign server open
so that the subsequent queries can re-use them. This option can only be
specified for a foreign server. The default is on. If set to off,
all connections to the foreign server will be discarded
at the end of transaction. Closed connections will be re-established
when they are necessary by future queries using a foreign table.

This option is useful, for example, when users want to prevent
the connections from eating up the foreign servers connections
capacity.

Author: Bharath Rupireddy
Reviewed-by: Alexey Kondratov, Vignesh C, Fujii Masao
Discussion: https://postgr.es/m/CALj2ACVvrp5=AVp2PupEm+nAC8S4buqR3fJMmaCoc7ftT0aD2A@mail.gmail.com

b1be3074

Add support for NullIfExpr in eval_const_expressions · 9c5f67fd

Peter Eisentraut authored Apr 02, 2021

Author: Hou Zhijie <houzj.fnst@cn.fujitsu.com>
Discussion: https://www.postgresql.org/message-id/flat/7ea5ce773bbc4eea9ff1a381acd3b102@G08CNEXMBPEKD05.g08.fujitsu.local

9c5f67fd

Fix pgstat_report_replslot() to use proper data types for its arguments. · 96bdb7e1

Fujii Masao authored Apr 02, 2021

The caller of pgstat_report_replslot() passes int64 values to the function.
Also the function stores those values in PgStat_Counter (i.e., int64) fields
of PgStat_MsgReplSlot struct. But previously the function used "int" as
the data types of some arguments for those values, which could lead to
the overflow of values.

To avoid this risk, this commit fixes pgstat_report_replslot() to use
PgStat_Counter type for the arguments. Since they are the statistics counters,
PgStat_Counter, the data type used for counters, is used for them
instead of int64.

Reported-by: Vignesh C
Author: Vignesh C
Reviewed-by: Jeevan Ladhe, Fujii Masao
Discussion: https://postgr.es/m/CALDaNm080OpG=ZwOb0i8EyChH5SyHAMFWJCKaKTXmrfvJLbgaA@mail.gmail.com

96bdb7e1

doc: Clarify how to generate backup files with non-exclusive backups · 6fb66c26

Michael Paquier authored Apr 02, 2021

The current instructions describing how to write the backup_label and
tablespace_map files are confusing.  For example, opening a file in text
mode on Windows and copy-pasting the file's contents would result in a
failure at recovery because of the extra CRLF characters generated.  The
documentation was not stating that clearly, and per discussion this is
not considered as a supported scenario.

This commit extends a bit the documentation to mention that it may be
required to open the file in binary mode before writing its data.

Reported-by: Wang Shenhao
Author: David Steele
Reviewed-by: Andrew Dunstan, Magnus Hagander
Discussion: https://postgr.es/m/8373f61426074f2cb6be92e02f838389@G08CNEXMBPEKD06.g08.fujitsu.local
Backpatch-through: 9.6

6fb66c26

Fix typos in comments. · 98e5bd10

Fujii Masao authored Apr 02, 2021

Author: Masahiko Sawada
Discussion: https://postgr.es/m/CAD21AoA1YL7t0nzVSEySx6zOaE7xO3r0jyu8hkitGL2_XbaMxQ@mail.gmail.com

98e5bd10

Attempt to fix unstable Result Cache regression tests · a4fac4ff

David Rowley authored Apr 02, 2021

force_parallel_mode = regress is causing a few more problems than I
thought. It seems that both the leader and the single worker can
contribute to the execution. I had mistakenly thought that only the worker
process would do any work. Since it's not deterministic as to which
of the two processes will get a chance to work on the plan, it seems just
better to disable force_parallel_mode for these tests. At least doing
this seems better than changing to EXPLAIN only rather than EXPLAIN
ANALYZE.

Additionally, I overlooked the fact that the number of executions of the
sub-plan below a Result Cache will execute a varying number of times
depending on cache eviction. 32-bit machines will use less memory and
evict fewer tuples from the cache. That results in the subnode being
executed fewer times on 32-bit machines. Let's just blank out the number
of loops in each node.

a4fac4ff

doc: mention that intervening major releases can be skipped · 2bda93f8

Bruce Momjian authored Apr 01, 2021

Also mention that you should read the intervening major releases notes.
This change was also applied to the website.

Discussion: https://postgr.es/m/20210330144949.GA8259@momjian.us

Backpatch-through: 9.6

2bda93f8

Add Result Cache executor node (take 2) · 9eacee2e

David Rowley authored Apr 02, 2021

Here we add a new executor node type named "Result Cache". The planner
can include this node type in the plan to have the executor cache the
results from the inner side of parameterized nested loop joins. This
allows caching of tuples for sets of parameters so that in the event that
the node sees the same parameter values again, it can just return the
cached tuples instead of rescanning the inner side of the join all over
again. Internally, result cache uses a hash table in order to quickly
find tuples that have been previously cached.

For certain data sets, this can significantly improve the performance of
joins. The best cases for using this new node type are for join problems
where a large portion of the tuples from the inner side of the join have
no join partner on the outer side of the join. In such cases, hash join
would have to hash values that are never looked up, thus bloating the hash
table and possibly causing it to multi-batch. Merge joins would have to
skip over all of the unmatched rows. If we use a nested loop join with a
result cache, then we only cache tuples that have at least one join
partner on the outer side of the join. The benefits of using a
parameterized nested loop with a result cache increase when there are
fewer distinct values being looked up and the number of lookups of each
value is large. Also, hash probes to lookup the cache can be much faster
than the hash probe in a hash join as it's common that the result cache's
hash table is much smaller than the hash join's due to result cache only
caching useful tuples rather than all tuples from the inner side of the
join. This variation in hash probe performance is more significant when
the hash join's hash table no longer fits into the CPU's L3 cache, but the
result cache's hash table does. The apparent "random" access of hash
buckets with each hash probe can cause a poor L3 cache hit ratio for large
hash tables. Smaller hash tables generally perform better.

The hash table used for the cache limits itself to not exceeding work_mem
* hash_mem_multiplier in size. We maintain a dlist of keys for this cache
and when we're adding new tuples and realize we've exceeded the memory
budget, we evict cache entries starting with the least recently used ones
until we have enough memory to add the new tuples to the cache.

For parameterized nested loop joins, we now consider using one of these
result cache nodes in between the nested loop node and its inner node. We
determine when this might be useful based on cost, which is primarily
driven off of what the expected cache hit ratio will be. Estimating the
cache hit ratio relies on having good distinct estimates on the nested
loop's parameters.

For now, the planner will only consider using a result cache for
parameterized nested loop joins. This works for both normal joins and
also for LATERAL type joins to subqueries. It is possible to use this new
node for other uses in the future. For example, to cache results from
correlated subqueries. However, that's not done here due to some
difficulties obtaining a distinct estimation on the outer plan to
calculate the estimated cache hit ratio. Currently we plan the inner plan
before planning the outer plan so there is no good way to know if a result
cache would be useful or not since we can't estimate the number of times
the subplan will be called until the outer plan is generated.

The functionality being added here is newly introducing a dependency on
the return value of estimate_num_groups() during the join search.
Previously, during the join search, we only ever needed to perform
selectivity estimations. With this commit, we need to use
estimate_num_groups() in order to estimate what the hit ratio on the
result cache will be. In simple terms, if we expect 10 distinct values
and we expect 1000 outer rows, then we'll estimate the hit ratio to be
99%. Since cache hits are very cheap compared to scanning the underlying
nodes on the inner side of the nested loop join, then this will
significantly reduce the planner's cost for the join. However, it's
fairly easy to see here that things will go bad when estimate_num_groups()
incorrectly returns a value that's significantly lower than the actual
number of distinct values. If this happens then that may cause us to make
use of a nested loop join with a result cache instead of some other join
type, such as a merge or hash join. Our distinct estimations have been
known to be a source of trouble in the past, so the extra reliance on them
here could cause the planner to choose slower plans than it did previous
to having this feature. Distinct estimations are also fairly hard to
estimate accurately when several tables have been joined already or when a
WHERE clause filters out a set of values that are correlated to the
expressions we're estimating the number of distinct value for.

For now, the costing we perform during query planning for result caches
does put quite a bit of faith in the distinct estimations being accurate.
When these are accurate then we should generally see faster execution
times for plans containing a result cache. However, in the real world, we
may find that we need to either change the costings to put less trust in
the distinct estimations being accurate or perhaps even disable this
feature by default. There's always an element of risk when we teach the
query planner to do new tricks that it decides to use that new trick at
the wrong time and causes a regression. Users may opt to get the old
behavior by turning the feature off using the enable_resultcache GUC.
Currently, this is enabled by default. It remains to be seen if we'll
maintain that setting for the release.

Additionally, the name "Result Cache" is the best name I could think of
for this new node at the time I started writing the patch. Nobody seems
to strongly dislike the name. A few people did suggest other names but no
other name seemed to dominate in the brief discussion that there was about
names. Let's allow the beta period to see if the current name pleases
enough people. If there's some consensus on a better name, then we can
change it before the release. Please see the 2nd discussion link below
for the discussion on the "Result Cache" name.

Author: David Rowley
Reviewed-by: Andy Fan, Justin Pryzby, Zhihong Yu, Hou Zhijie
Tested-By: Konstantin Knizhnik
Discussion: https://postgr.es/m/CAApHDvrPcQyQdWERGYWx8J%2B2DLUNgXu%2BfOSbQ1UscxrunyXyrQ%40mail.gmail.com
Discussion: https://postgr.es/m/CAApHDvq=yQXr5kqhRviT2RhNKwToaWr9JAN5t+5_PzhuRJ3wvg@mail.gmail.com

9eacee2e