Commit 0a97edb1 authored by Tom Lane's avatar Tom Lane

Doc: improve documentation around jsonpath regular expressions.

Provide some documentation about the differences between XQuery
regular expressions and those supported by Spencer's regex engine.
Since SQL now exposes XQuery regexps with the LIKE_REGEX operator,
I made this a standalone section designed to help somebody who
has to translate a LIKE_REGEX query to Postgres.  (Eventually we might
extend Spencer's engine to allow precise implementation of XQuery,
but not today.)

Reference that in the jsonpath docs, provide definitions of the
XQuery flag letters, and add a description of the JavaScript-inspired
string literal syntax used within jsonpath.  Also point out explicitly
that backslashes used within like_regex patterns will need to be doubled.

This also syncs the docs with the decision implemented in commit
d5b90cd6 to desupport XQuery's 'x' flag for now.

Jonathan Katz and Tom Lane

Discussion: https://postgr.es/m/CAPpHfdvDci4iqNF9fhRkTqhe-5_8HmzeLt56drH%2B_Rv2rNRqfg@mail.gmail.com
parent e1c8743e
...@@ -5970,6 +5970,145 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}'); ...@@ -5970,6 +5970,145 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
<!-- end re_syntax.n man page --> <!-- end re_syntax.n man page -->
<sect3 id="posix-vs-xquery">
<title>Differences From XQuery (<literal>LIKE_REGEX</literal>)</title>
<indexterm zone="posix-vs-xquery">
<primary><literal>LIKE_REGEX</literal></primary>
</indexterm>
<indexterm zone="posix-vs-xquery">
<primary>XQuery regular expressions</primary>
</indexterm>
<para>
Since SQL:2008, the SQL standard includes
a <literal>LIKE_REGEX</literal> operator that performs pattern
matching according to the XQuery regular expression
standard. <productname>PostgreSQL</productname> does not yet
implement this operator, but you can get very similar behavior using
the <function>regexp_match()</function> function, since XQuery
regular expressions are quite close to the ARE syntax described above.
</para>
<para>
Notable differences between the existing POSIX-based
regular-expression feature and XQuery regular expressions include:
<itemizedlist>
<listitem>
<para>
XQuery character class subtraction is not supported. An example of
this feature is using the following to match only English
consonants: <literal>[a-z-[aeiou]]</literal>.
</para>
</listitem>
<listitem>
<para>
XQuery character class shorthands <literal>\c</literal>,
<literal>\C</literal>, <literal>\i</literal>,
and <literal>\I</literal> are not supported.
</para>
</listitem>
<listitem>
<para>
XQuery character class elements
using <literal>\p{UnicodeProperty}</literal> or the
inverse <literal>\P{UnicodeProperty}</literal> are not supported.
</para>
</listitem>
<listitem>
<para>
POSIX interprets character classes such as <literal>\w</literal>
(see <xref linkend="posix-class-shorthand-escapes-table"/>)
according to the prevailing locale (which you can control by
attaching a <literal>COLLATE</literal> clause to the operator or
function). XQuery specifies these classes by reference to Unicode
character properties, so equivalent behavior is obtained only with
a locale that follows the Unicode rules.
</para>
</listitem>
<listitem>
<para>
The SQL standard (not XQuery itself) attempts to cater for more
variants of <quote>newline</quote> than POSIX does. The
newline-sensitive matching options described above consider only
ASCII NL (<literal>\n</literal>) to be a newline, but SQL would have
us treat CR (<literal>\r</literal>), CRLF (<literal>\r\n</literal>)
(a Windows-style newline), and some Unicode-only characters like
LINE SEPARATOR (U+2028) as newlines as well.
Notably, <literal>.</literal> and <literal>\s</literal> should
count <literal>\r\n</literal> as one character not two according to
SQL.
</para>
</listitem>
<listitem>
<para>
Of the character-entry escapes described in
<xref linkend="posix-character-entry-escapes-table"/>,
XQuery supports only <literal>\n</literal>, <literal>\r</literal>,
and <literal>\t</literal>.
</para>
</listitem>
<listitem>
<para>
XQuery does not support
the <literal>[:<replaceable>name</replaceable>:]</literal> syntax
for character classes within bracket expressions.
</para>
</listitem>
<listitem>
<para>
XQuery does not have lookahead or lookbehind constraints,
nor any of the constraint escapes described in
<xref linkend="posix-constraint-escapes-table"/>.
</para>
</listitem>
<listitem>
<para>
The metasyntax forms described in <xref linkend="posix-metasyntax"/>
do not exist in XQuery.
</para>
</listitem>
<listitem>
<para>
The regular expression flag letters defined by XQuery are
related to but not the same as the option letters for POSIX
(<xref linkend="posix-embedded-options-table"/>). While the
<literal>i</literal> and <literal>q</literal> options behave the
same, others do not:
<itemizedlist>
<listitem>
<para>
XQuery's <literal>s</literal> (allow dot to match newline)
and <literal>m</literal> (allow <literal>^</literal>
and <literal>$</literal> to match at newlines) flags provide
access to the same behaviors as
POSIX's <literal>n</literal>, <literal>p</literal>
and <literal>w</literal> flags, but they
do <emphasis>not</emphasis> match the behavior of
POSIX's <literal>s</literal> and <literal>m</literal> flags.
Note in particular that dot-matches-newline is the default
behavior in POSIX but not XQuery.
</para>
</listitem>
<listitem>
<para>
XQuery's <literal>x</literal> (ignore whitespace in pattern) flag
is noticeably different from POSIX's expanded-mode flag.
POSIX's <literal>x</literal> flag also
allows <literal>#</literal> to begin a comment in the pattern,
and POSIX will not ignore a whitespace character after a
backslash.
</para>
</listitem>
</itemizedlist>
</para>
</listitem>
</itemizedlist>
</para>
</sect3>
</sect2> </sect2>
</sect1> </sect1>
...@@ -11793,6 +11932,14 @@ table2-mapping ...@@ -11793,6 +11932,14 @@ table2-mapping
</programlisting> </programlisting>
</para> </para>
</listitem> </listitem>
<listitem>
<para>
There are minor differences in the interpretation of regular
expression patterns used in <literal>like_regex</literal> filters, as
described in <xref linkend="jsonpath-regular-expressions"/>.
</para>
</listitem>
</itemizedlist> </itemizedlist>
<sect3 id="strict-and-lax-modes"> <sect3 id="strict-and-lax-modes">
...@@ -11872,6 +12019,63 @@ table2-mapping ...@@ -11872,6 +12019,63 @@ table2-mapping
</sect3> </sect3>
<sect3 id="jsonpath-regular-expressions">
<title>Regular Expressions</title>
<indexterm zone="jsonpath-regular-expressions">
<primary><literal>LIKE_REGEX</literal></primary>
<secondary>in SQL/JSON</secondary>
</indexterm>
<para>
SQL/JSON path expressions allow matching text to a regular expression
with the <literal>like_regex</literal> filter. For example, the
following SQL/JSON path query would case-insensitively match all
strings in an array that start with an English vowel:
<programlisting>
'$[*] ? (@ like_regex "^[aeiou]" flag "i")'
</programlisting>
</para>
<para>
The optional <literal>flag</literal> string may include one or more of
the characters
<literal>i</literal> for case-insensitive match,
<literal>m</literal> to allow <literal>^</literal>
and <literal>$</literal> to match at newlines,
<literal>s</literal> to allow <literal>.</literal> to match a newline,
and <literal>q</literal> to quote the whole pattern (reducing the
behavior to a simple substring match).
</para>
<para>
The SQL/JSON standard borrows its definition for regular expressions
from the <literal>LIKE_REGEX</literal> operator, which in turn uses the
XQuery standard. PostgreSQL does not currently support the
<literal>LIKE_REGEX</literal> operator. Therefore,
the <literal>like_regex</literal> filter is implemented using the
POSIX regular expression engine described in
<xref linkend="functions-posix-regexp"/>. This leads to various minor
discrepancies from standard SQL/JSON behavior, which are cataloged in
<xref linkend="posix-vs-xquery"/>.
Note, however, that the flag-letter incompatibilities described there
do not apply to SQL/JSON, as it translates the XQuery flag letters to
match what the POSIX engine expects.
</para>
<para>
Keep in mind that the pattern argument of <literal>like_regex</literal>
is a JSON path string literal, written according to the rules given in
<xref linkend="datatype-jsonpath"/>. This means in particular that any
backslashes you want to use in the regular expression must be doubled.
For example, to match strings that contain only digits:
<programlisting>
'$ ? (@ like_regex "^\\d+$")'
</programlisting>
</para>
</sect3>
<sect3 id="functions-sqljson-path-operators"> <sect3 id="functions-sqljson-path-operators">
<title>SQL/JSON Path Operators and Methods</title> <title>SQL/JSON Path Operators and Methods</title>
...@@ -12113,10 +12317,11 @@ table2-mapping ...@@ -12113,10 +12317,11 @@ table2-mapping
<row> <row>
<entry><literal>like_regex</literal></entry> <entry><literal>like_regex</literal></entry>
<entry> <entry>
Tests pattern matching with POSIX regular expressions Tests whether the first operand matches the regular expression
(see <xref linkend="functions-posix-regexp"/>). Supported flags given by the second operand, optionally with modifications
are <literal>i</literal>, <literal>s</literal>, <literal>m</literal>, described by a string of <literal>flag</literal> characters (see
<literal>x</literal>, and <literal>q</literal>.</entry> <xref linkend="jsonpath-regular-expressions"/>)
</entry>
<entry><literal>["abc", "abd", "aBdC", "abdacb", "babc"]</literal></entry> <entry><literal>["abc", "abd", "aBdC", "abdacb", "babc"]</literal></entry>
<entry><literal>$[*] ? (@ like_regex "^ab.*c" flag "i")</literal></entry> <entry><literal>$[*] ? (@ like_regex "^ab.*c" flag "i")</literal></entry>
<entry><literal>"abc", "aBdC", "abdacb"</literal></entry> <entry><literal>"abc", "aBdC", "abdacb"</literal></entry>
......
...@@ -666,13 +666,32 @@ SELECT jdoc-&gt;'guid', jdoc-&gt;'name' FROM api WHERE jdoc @&gt; '{"tags": ["qu ...@@ -666,13 +666,32 @@ SELECT jdoc-&gt;'guid', jdoc-&gt;'name' FROM api WHERE jdoc @&gt; '{"tags": ["qu
</itemizedlist> </itemizedlist>
<para> <para>
An SQL/JSON path expression is an SQL character string literal, An SQL/JSON path expression is typically written in an SQL query as an
so it must be enclosed in single quotes when passed to an SQL/JSON SQL character string literal, so it must be enclosed in single quotes,
query function. Following the JavaScript and any single quotes desired within the value must be doubled
conventions, character string literals within the path expression (see <xref linkend="sql-syntax-strings"/>).
must be enclosed in double quotes. Any single quotes within this Some forms of path expressions require string literals within them.
character string literal must be escaped with a single quote These embedded string literals follow JavaScript/ECMAScript conventions:
by the SQL convention. they must be surrounded by double quotes, and backslash escapes may be
used within them to represent otherwise-hard-to-type characters.
In particular, the way to write a double quote within an embedded string
literal is <literal>\"</literal>, and to write a backslash itself, you
must write <literal>\\</literal>. Other special backslash sequences
include those recognized in JSON strings:
<literal>\b</literal>,
<literal>\f</literal>,
<literal>\n</literal>,
<literal>\r</literal>,
<literal>\t</literal>,
<literal>\v</literal>
for various ASCII control characters, and
<literal>\u<replaceable>NNNN</replaceable></literal> for a Unicode
character identified by its 4-hex-digit code point. The backslash
syntax also includes two cases not allowed by JSON:
<literal>\x<replaceable>NN</replaceable></literal> for a character code
written with only two hex digits, and
<literal>\u{<replaceable>N...</replaceable>}</literal> for a character
code written with 1 to 6 hex digits.
</para> </para>
<para> <para>
......
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment