Commit ca8217c1 authored by Tom Lane

Add a test module for the regular expression package.

This module provides a function test_regex() that is functionally
rather like regexp_matches(), but with additional debugging-oriented
options and additional output.  The debug options are somewhat obscure;
they are chosen to match the API of the test harness that Henry Spencer
wrote way-back-when for use in Tcl.  With this, we can import all the
test cases that Spencer wrote originally, even for regex functionality
that we don't currently expose in Postgres.  This seems necessary
because we can no longer rely on Tcl to act as upstream and verify
any fixes or improvements that we make.

In addition to Spencer's tests, I added a few for lookbehind
constraints (which we added in 2015, and Tcl still hasn't absorbed)
that are modeled on his tests for lookahead constraints.  After looking
at code coverage reports, I also threw in a couple of tests to more
fully exercise our "high colormap" logic.

According to my testing, this brings the check-world coverage
for src/backend/regex/ from 71.1% to 86.7% of lines.
(coverage.postgresql.org shows a slightly different number,
which I think is because it measures a non-assert build.)

Discussion: https://postgr.es/m/2873268.1609732164@sss.pgh.pa.us
parent 4656e3d6
@@ -22,6 +22,7 @@ SUBDIRS = \
 test_pg_dump \
 test_predtest \
 test_rbtree \
+test_regex \
 test_rls_hooks \
 test_shm_mq \
 unsafe_tests \
# Generated subdirectories
/log/
/results/
/tmp_check/
# src/test/modules/test_regex/Makefile

MODULE_big = test_regex
OBJS = \
	$(WIN32RES) \
	test_regex.o
PGFILEDESC = "test_regex - test code for backend/regex/"

EXTENSION = test_regex
DATA = test_regex--1.0.sql

REGRESS = test_regex test_regex_utf8

ifdef USE_PGXS
PG_CONFIG = pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)
else
subdir = src/test/modules/test_regex
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
include $(top_srcdir)/contrib/contrib-global.mk
endif
test_regex is a module for testing the regular expression package.
It is mostly meant to allow us to absorb Tcl's regex test suite.
Therefore, there are provisions to exercise regex features that
aren't currently exposed at the SQL level by PostgreSQL.

Currently, one function is provided:

test_regex(pattern text, string text, flags text) returns setof text[]

Reports an error if the pattern is an invalid regex.  Otherwise,
the first row of output contains the number of subexpressions,
followed by words reporting set bit(s) in the regex's re_info field.
If the pattern doesn't match the string, that's all.
If the pattern does match, the next row contains the whole match
as the first array element.  If there are parenthesized subexpression(s),
following array elements contain the matches to those subexpressions.
If the "g" (global) flag is set, then additional row(s) of output similarly
report any additional matches.
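
The row layout described above can be sketched outside the server. The following is a rough model only: it uses Python's re module as a stand-in engine (not the Spencer regex package), the name sketch_test_regex is hypothetical, and it omits the re_info words because Python has no equivalent of that field. Only the "g" flag is modeled.

```python
import re

def sketch_test_regex(pattern, string, flags=""):
    """Model the row layout of test_regex(); hypothetical helper, not the C code."""
    # Assumption: Python's re stands in for the backend regex engine.
    rx = re.compile(pattern)
    # First row: the number of subexpressions (re_info words are omitted here).
    rows = [[str(rx.groups)]]
    if "g" in flags:
        # "g": report every match, one row per match.
        matches = list(rx.finditer(string))
    else:
        first = rx.search(string)
        matches = [first] if first is not None else []
    for m in matches:
        # Whole match first, then each parenthesized subexpression.
        rows.append([m.group(0)] + list(m.groups()))
    return rows

if __name__ == "__main__":
    for row in sketch_test_regex("a(b)c", "abc abc", "g"):
        print(row)
```

A pattern that compiles but does not match yields only the first row, which mirrors the "that's all" case in the description.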

The "flags" argument is a string of zero or more single-character
flags that modify the behavior of the regex package or the test
function.  As described in Tcl's reg.test file:

    The flag characters are complex and a bit eclectic.  Generally speaking,
    lowercase letters are compile options, uppercase are expected re_info
    bits, and nonalphabetics are match options, controls for how the test is
    run, or testing options.  The one small surprise is that AREs are the
    default, and you must explicitly request lesser flavors of RE.  The flags
    are as follows.  It is admitted that some are not very mnemonic.

    -   no-op (placeholder)
    0   report indices not actual strings
        (This substitutes for Tcl's -indices switch)
    !   expect partial match, report start position anyway
    %   force small state-set cache in matcher (to test cache replace)
    ^   beginning of string is not beginning of line
    $   end of string is not end of line
    *   test is Unicode-specific, needs big character set
    +   provide fake xy equivalence class and ch collating element
        (Note: the equivalence class is implemented, the
        collating element is not; so references to [.ch.] fail)
    ,   set REG_PROGRESS (only useful in REG_DEBUG builds)
    .   set REG_DUMP (only useful in REG_DEBUG builds)
    :   set REG_MTRACE (only useful in REG_DEBUG builds)
    ;   set REG_FTRACE (only useful in REG_DEBUG builds)

    &   test as both ARE and BRE
        (Not implemented in Postgres, we use separate tests)
    b   BRE
    e   ERE
    a   turn advanced-features bit on (error unless ERE already)
    q   literal string, no metacharacters at all

    g   global match (find all matches)
    i   case-independent matching
    o   ("opaque") do not return match locations
    p   newlines are half-magic, excluded from . and [^ only
    w   newlines are half-magic, significant to ^ and $ only
    n   newlines are fully magic, both effects
    x   expanded RE syntax
    t   incomplete-match reporting
    c   canmatch (equivalent to "t0!", in Postgres implementation)
    s   match only at start (REG_BOSONLY)

    A   backslash-_a_lphanumeric seen
    B   ERE/ARE literal-_b_race heuristic used
    E   backslash (_e_scape) seen within []
    H   looka_h_ead constraint seen
    I   _i_mpossible to match
    L   _l_ocale-specific construct seen
    M   unportable (_m_achine-specific) construct seen
    N   RE can match empty (_n_ull) string
    P   non-_P_OSIX construct seen
    Q   {} _q_uantifier seen
    R   back _r_eference seen
    S   POSIX-un_s_pecified syntax seen
    T   prefers shortest (_t_iny)
    U   saw original-POSIX botch: unmatched right paren in ERE (_u_gh)
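
As a reading aid for the table above, here is a small sketch of the "generally speaking" rule: lowercase letters are compile options, uppercase letters are expected re_info bits, and everything else is a match or test control. The function name classify_flags and the dictionary INFO_WORDS_SEEN are hypothetical. The four letter-to-word pairs are inferred only from the expected output shown elsewhere in this commit ('EMP*', 'ELMP' and 'P'), and are not taken from the C source.

```python
# Assumption: pairs inferred from the expected-output rows in this commit only.
INFO_WORDS_SEEN = {
    "E": "REG_UBBS",
    "L": "REG_ULOCALE",
    "M": "REG_UUNPORT",
    "P": "REG_UNONPOSIX",
}

def classify_flags(flags):
    """Split a flags string by the general rule quoted from Tcl's reg.test."""
    groups = {"compile": [], "re_info": [], "other": []}
    for ch in flags:
        if ch.islower():
            groups["compile"].append(ch)   # compile option
        elif ch.isupper():
            groups["re_info"].append(ch)   # expected re_info bit
        else:
            groups["other"].append(ch)     # match option or test control
    return groups

def expected_info_words(flags):
    """Return the re_info words for the uppercase flags this sketch knows."""
    letters = classify_flags(flags)["re_info"]
    return {INFO_WORDS_SEEN[ch] for ch in letters if ch in INFO_WORDS_SEEN}

if __name__ == "__main__":
    print(classify_flags("xEMP"))
    print(sorted(expected_info_words("ELMP")))
```

The result is a set because the order in which the real function prints the words is not derived here.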
/*
* This test must be run in a database with UTF-8 encoding,
* because other encodings don't support all the characters used.
*/
SELECT getdatabaseencoding() <> 'UTF8'
AS skip_test \gset
\if :skip_test
\quit
\endif
set client_encoding = utf8;
set standard_conforming_strings = on;
-- Run the Tcl test cases that require Unicode
-- expectMatch 9.44 EMP* {a[\u00fe-\u0507][\u00ff-\u0300]b} \
-- "a\u0102\u02ffb" "a\u0102\u02ffb"
select * from test_regex('a[\u00fe-\u0507][\u00ff-\u0300]b', E'a\u0102\u02ffb', 'EMP*');
test_regex
----------------------------------------
{0,REG_UBBS,REG_UNONPOSIX,REG_UUNPORT}
{aĂ˿b}
(2 rows)
-- expectMatch 13.27 P "a\\U00001234x" "a\u1234x" "a\u1234x"
select * from test_regex('a\U00001234x', E'a\u1234x', 'P');
test_regex
-------------------
{0,REG_UNONPOSIX}
{aሴx}
(2 rows)
-- expectMatch 13.28 P {a\U00001234x} "a\u1234x" "a\u1234x"
select * from test_regex('a\U00001234x', E'a\u1234x', 'P');
test_regex
-------------------
{0,REG_UNONPOSIX}
{aሴx}
(2 rows)
-- expectMatch 13.29 P "a\\U0001234x" "a\u1234x" "a\u1234x"
-- Tcl has relaxed their code to allow 1-8 hex digits, but Postgres hasn't
select * from test_regex('a\U0001234x', E'a\u1234x', 'P');
ERROR: invalid regular expression: invalid escape \ sequence
-- expectMatch 13.30 P {a\U0001234x} "a\u1234x" "a\u1234x"
-- Tcl has relaxed their code to allow 1-8 hex digits, but Postgres hasn't
select * from test_regex('a\U0001234x', E'a\u1234x', 'P');
ERROR: invalid regular expression: invalid escape \ sequence
-- expectMatch 13.31 P "a\\U000012345x" "a\u12345x" "a\u12345x"
select * from test_regex('a\U000012345x', E'a\u12345x', 'P');
test_regex
-------------------
{0,REG_UNONPOSIX}
{aሴ5x}
(2 rows)
-- expectMatch 13.32 P {a\U000012345x} "a\u12345x" "a\u12345x"
select * from test_regex('a\U000012345x', E'a\u12345x', 'P');
test_regex
-------------------
{0,REG_UNONPOSIX}
{aሴ5x}
(2 rows)
-- expectMatch 13.33 P "a\\U1000000x" "a\ufffd0x" "a\ufffd0x"
-- Tcl allows this as a standalone character, but Postgres doesn't
select * from test_regex('a\U1000000x', E'a\ufffd0x', 'P');
ERROR: invalid regular expression: invalid escape \ sequence
-- expectMatch 13.34 P {a\U1000000x} "a\ufffd0x" "a\ufffd0x"
-- Tcl allows this as a standalone character, but Postgres doesn't
select * from test_regex('a\U1000000x', E'a\ufffd0x', 'P');
ERROR: invalid regular expression: invalid escape \ sequence
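
-- The \U cases above (13.27 through 13.34) are consistent with a rule that \U
-- must be followed by exactly eight hexadecimal digits. The sketch below
-- illustrates that rule only; expand_big_u is a hypothetical helper, it does not
-- mirror the backend's lexer, and it ignores escaped backslashes for brevity.

```python
import re

# A \U escape followed by exactly eight hexadecimal digits.
_BIG_U = re.compile(r"\\U([0-9A-Fa-f]{8})")

def expand_big_u(pattern):
    """Expand eight-digit \\U escapes; reject any other use of \\U."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("\\U", i):
            m = _BIG_U.match(pattern, i)
            if m is None:
                # Fewer than eight hex digits follow, as in tests 13.29 and 13.33.
                raise ValueError("invalid escape \\ sequence")
            out.append(chr(int(m.group(1), 16)))
            i = m.end()
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)

if __name__ == "__main__":
    # Eight digits are consumed; the trailing "5x" stays literal (tests 13.31/13.32).
    print(expand_big_u(r"a\U000012345x"))
```

-- Under this rule, 'a\U000012345x' reads as U+1234 followed by the literal
-- characters "5x", while 'a\U0001234x' and 'a\U1000000x' supply only seven hex
-- digits before the "x" and are rejected.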
-- Additional tests, not derived from Tcl
-- Exercise logic around high character ranges a bit more
select * from test_regex('a
[\u1000-\u1100]*
[\u3000-\u3100]*
[\u1234-\u25ff]+
[\u2000-\u35ff]*
[\u2600-\u2f00]*
\u1236\u1236x',
E'a\u1234\u1236\u1236x', 'xEMP');
test_regex
----------------------------------------
{0,REG_UBBS,REG_UNONPOSIX,REG_UUNPORT}
{aሴሶሶx}
(2 rows)
select * from test_regex('[[:alnum:]]*[[:upper:]]*[\u1000-\u2000]*\u1237',
E'\u1500\u1237', 'ELMP');
test_regex
----------------------------------------------------
{0,REG_UBBS,REG_UNONPOSIX,REG_UUNPORT,REG_ULOCALE}
{ᔀሷ}
(2 rows)
select * from test_regex('[[:alnum:]]*[[:upper:]]*[\u1000-\u2000]*\u1237',
E'A\u1239', 'ELMP');
test_regex
----------------------------------------------------
{0,REG_UBBS,REG_UNONPOSIX,REG_UUNPORT,REG_ULOCALE}
(1 row)
/*
* This test must be run in a database with UTF-8 encoding,
* because other encodings don't support all the characters used.
*/
SELECT getdatabaseencoding() <> 'UTF8'
AS skip_test \gset
\if :skip_test
\quit
/*
* This test must be run in a database with UTF-8 encoding,
* because other encodings don't support all the characters used.
*/
SELECT getdatabaseencoding() <> 'UTF8'
AS skip_test \gset
\if :skip_test
\quit
\endif
set client_encoding = utf8;
set standard_conforming_strings = on;
-- Run the Tcl test cases that require Unicode
-- expectMatch 9.44 EMP* {a[\u00fe-\u0507][\u00ff-\u0300]b} \
-- "a\u0102\u02ffb" "a\u0102\u02ffb"
select * from test_regex('a[\u00fe-\u0507][\u00ff-\u0300]b', E'a\u0102\u02ffb', 'EMP*');
-- expectMatch 13.27 P "a\\U00001234x" "a\u1234x" "a\u1234x"
select * from test_regex('a\U00001234x', E'a\u1234x', 'P');
-- expectMatch 13.28 P {a\U00001234x} "a\u1234x" "a\u1234x"
select * from test_regex('a\U00001234x', E'a\u1234x', 'P');
-- expectMatch 13.29 P "a\\U0001234x" "a\u1234x" "a\u1234x"
-- Tcl has relaxed their code to allow 1-8 hex digits, but Postgres hasn't
select * from test_regex('a\U0001234x', E'a\u1234x', 'P');
-- expectMatch 13.30 P {a\U0001234x} "a\u1234x" "a\u1234x"
-- Tcl has relaxed their code to allow 1-8 hex digits, but Postgres hasn't
select * from test_regex('a\U0001234x', E'a\u1234x', 'P');
-- expectMatch 13.31 P "a\\U000012345x" "a\u12345x" "a\u12345x"
select * from test_regex('a\U000012345x', E'a\u12345x', 'P');
-- expectMatch 13.32 P {a\U000012345x} "a\u12345x" "a\u12345x"
select * from test_regex('a\U000012345x', E'a\u12345x', 'P');
-- expectMatch 13.33 P "a\\U1000000x" "a\ufffd0x" "a\ufffd0x"
-- Tcl allows this as a standalone character, but Postgres doesn't
select * from test_regex('a\U1000000x', E'a\ufffd0x', 'P');
-- expectMatch 13.34 P {a\U1000000x} "a\ufffd0x" "a\ufffd0x"
-- Tcl allows this as a standalone character, but Postgres doesn't
select * from test_regex('a\U1000000x', E'a\ufffd0x', 'P');
-- Additional tests, not derived from Tcl
-- Exercise logic around high character ranges a bit more
select * from test_regex('a
[\u1000-\u1100]*
[\u3000-\u3100]*
[\u1234-\u25ff]+
[\u2000-\u35ff]*
[\u2600-\u2f00]*
\u1236\u1236x',
E'a\u1234\u1236\u1236x', 'xEMP');
select * from test_regex('[[:alnum:]]*[[:upper:]]*[\u1000-\u2000]*\u1237',
E'\u1500\u1237', 'ELMP');
select * from test_regex('[[:alnum:]]*[[:upper:]]*[\u1000-\u2000]*\u1237',
E'A\u1239', 'ELMP');
/* src/test/modules/test_regex/test_regex--1.0.sql */
-- complain if script is sourced in psql, rather than via CREATE EXTENSION
\echo Use "CREATE EXTENSION test_regex" to load this file. \quit
CREATE FUNCTION test_regex(pattern text, string text, flags text)
RETURNS SETOF text[]
STRICT
AS 'MODULE_PATHNAME' LANGUAGE C;
comment = 'Test code for backend/regex/'
default_version = '1.0'
module_pathname = '$libdir/test_regex'
relocatable = true