Commit cf92226e authored by Tom Lane

Doc: improve description of regexp character classes.

Define the meanings of the POSIX-spec character classes in line,
rather than referring to the ctype(3) man page.  That man page
doesn't even exist on many modern systems, and if it does exist
it probably says the wrong things about non-ASCII characters.
Also document our non-POSIX-spec "ascii" character class.

Also, point out here that this behavior is controlled by collation or
LC_CTYPE, since the existing text explaining that is pretty far away.

Per gripe from Geert Lobbestael.  Given the lack of prior complaints,
I'm not excited about back-patching this.

Discussion: https://postgr.es/m/155837022049.1359.2948065118562813468@wrigleys.postgresql.org
parent a240570b
@@ -5104,18 +5104,37 @@ SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo;
 <para>
 Within a bracket expression, the name of a character class
 enclosed in <literal>[:</literal> and <literal>:]</literal> stands
-for the list of all characters belonging to that class. Standard
-character class names are: <literal>alnum</literal>,
-<literal>alpha</literal>, <literal>blank</literal>,
-<literal>cntrl</literal>, <literal>digit</literal>,
-<literal>graph</literal>, <literal>lower</literal>,
-<literal>print</literal>, <literal>punct</literal>,
-<literal>space</literal>, <literal>upper</literal>,
-<literal>xdigit</literal>. These stand for the character classes
-defined in
-<citerefentry><refentrytitle>ctype</refentrytitle><manvolnum>3</manvolnum></citerefentry>.
-A locale can provide others. A character class cannot be used as
-an endpoint of a range.
+for the list of all characters belonging to that class. A character
+class cannot be used as an endpoint of a range.
+The <acronym>POSIX</acronym> standard defines these character class
+names:
+<literal>alnum</literal> (letters and numeric digits),
+<literal>alpha</literal> (letters),
+<literal>blank</literal> (space and tab),
+<literal>cntrl</literal> (control characters),
+<literal>digit</literal> (numeric digits),
+<literal>graph</literal> (printable characters except space),
+<literal>lower</literal> (lower-case letters),
+<literal>print</literal> (printable characters including space),
+<literal>punct</literal> (punctuation),
+<literal>space</literal> (any white space),
+<literal>upper</literal> (upper-case letters),
+and <literal>xdigit</literal> (hexadecimal digits).
+The behavior of these standard character classes is generally
+consistent across platforms for characters in the 7-bit ASCII set.
+Whether a given non-ASCII character is considered to belong to one
+of these classes depends on the <firstterm>collation</firstterm>
+that is used for the regular-expression function or operator
+(see <xref linkend="collation"/>), or by default on the
+database's <envar>LC_CTYPE</envar> locale setting (see
+<xref linkend="locale"/>). The classification of non-ASCII
+characters can vary across platforms even in similarly-named
+locales. (But the <literal>C</literal> locale never considers any
+non-ASCII characters to belong to any of these classes.)
+In addition to these standard character
+classes, <productname>PostgreSQL</productname> defines
+the <literal>ascii</literal> character class, which contains exactly
+the 7-bit ASCII set.
 </para>

 <para>
@@ -5126,8 +5145,7 @@ SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo;
 and end of a word respectively. A word is defined as a sequence
 of word characters that is neither preceded nor followed by word
 characters. A word character is an <literal>alnum</literal> character (as
-defined by
-<citerefentry><refentrytitle>ctype</refentrytitle><manvolnum>3</manvolnum></citerefentry>)
+defined by the <acronym>POSIX</acronym> character class described above)
 or an underscore. This is an extension, compatible with but not
 specified by <acronym>POSIX</acronym> 1003.2, and should be used with
 caution in software intended to be portable to other systems.
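
The class definitions added by this patch can be modeled for the 7-bit ASCII range with a short Python sketch. It is illustrative only and is not PostgreSQL's implementation: the names CLASSES, in_class and is_word_char are hypothetical, and the table assumes C-locale behavior, in which no non-ASCII character belongs to any POSIX class. Under another collation or LC_CTYPE setting, non-ASCII membership would differ.

```python
import string

# Illustrative model only: ASCII-range membership for each class name,
# assuming C-locale behavior (no non-ASCII character is in any POSIX class).
ASCII = {chr(c) for c in range(128)}
LETTERS = set(string.ascii_letters)
DIGITS = set(string.digits)
GRAPH = {chr(c) for c in range(0x21, 0x7F)}  # printable characters except space

# Hypothetical lookup table keyed by the class names listed in the patch.
CLASSES = {
    "alnum": LETTERS | DIGITS,                           # letters and numeric digits
    "alpha": LETTERS,                                    # letters
    "blank": {" ", "\t"},                                # space and tab
    "cntrl": {chr(c) for c in range(0x20)} | {"\x7f"},   # control characters
    "digit": DIGITS,                                     # numeric digits
    "graph": GRAPH,                                      # printable except space
    "lower": set(string.ascii_lowercase),                # lower-case letters
    "print": GRAPH | {" "},                              # printable including space
    "punct": GRAPH - LETTERS - DIGITS,                   # punctuation
    "space": set(" \t\n\v\f\r"),                         # any white space
    "upper": set(string.ascii_uppercase),                # upper-case letters
    "xdigit": set(string.hexdigits),                     # hexadecimal digits
    "ascii": ASCII,                                      # PostgreSQL-specific class
}


def in_class(ch: str, name: str) -> bool:
    """Return True if the single character ch is in the named class."""
    return ch in CLASSES[name]


def is_word_char(ch: str) -> bool:
    """A word character is an alnum character or an underscore."""
    return in_class(ch, "alnum") or ch == "_"
```

In this model, in_class("é", "alpha") is false because the table covers only ASCII, which matches the C-locale note in the patch. A server using a different collation may classify the same character as a letter.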