Commit 97c40ce6 authored by Tom Lane

Allow empty replacement strings in contrib/unaccent.

This is useful in languages where diacritic signs are represented as
separate characters; it's also one step towards letting unaccent be used
for arbitrary substring substitutions.

In passing, improve the user documentation for unaccent, which was sadly
vague about some important details.

Mohammad Alhashash, reviewed by Abhijit Menon-Sen
parent 55863274
@@ -104,11 +104,21 @@ initTrie(char *filename)
         while ((line = tsearch_readline(&trst)) != NULL)
         {
-            /*
-             * The format of each line must be "src trg" where src and trg
-             * are sequences of one or more non-whitespace characters,
-             * separated by whitespace. Whitespace at start or end of
-             * line is ignored.
+            /*----------
+             * The format of each line must be "src" or "src trg", where
+             * src and trg are sequences of one or more non-whitespace
+             * characters, separated by whitespace. Whitespace at start
+             * or end of line is ignored. If trg is omitted, an empty
+             * string is used as the replacement.
+             *
+             * We use a simple state machine, with states
+             *    0    initial (before src)
+             *    1    in src
+             *    2    in whitespace after src
+             *    3    in trg
+             *    4    in whitespace after trg
+             *    -1   syntax error detected (line will be ignored)
+             *----------
              */
             int state;
             char *ptr;
@@ -160,7 +170,14 @@ initTrie(char *filename)
                 }
             }
-            if (state >= 3)
+            if (state == 1 || state == 2)
+            {
+                /* trg was omitted, so use "" */
+                trg = "";
+                trglen = 0;
+            }
+            if (state > 0)
                 rootTrie = placeChar(rootTrie,
                                      (unsigned char *) src, srclen,
                                      trg, trglen);
......
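The state machine described in the new comment can be sketched as a standalone function. This is an illustration only, not the code from the commit: the part of `initTrie` that drives the state variable is elided from the hunks above, so `ParsedRule`, `is_space`, and `parse_rule_line` are hypothetical names, and the sketch assumes single-byte ASCII whitespace rather than any multibyte-aware scanning. Only the final two checks (`state == 1 || state == 2` and `state > 0`) are taken directly from the diff.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type for this sketch; not part of unaccent.c. */
typedef struct
{
    const char *src;        /* start of src inside the line, or NULL */
    size_t      srclen;     /* length of src in bytes */
    const char *trg;        /* start of trg, or "" when trg was omitted */
    size_t      trglen;     /* length of trg in bytes */
} ParsedRule;

/* Assumption: only ASCII blanks count as whitespace in this sketch. */
static int
is_space(char c)
{
    return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

/*
 * Parse one rules-file line of the form "src" or "src trg".
 * Returns 1 if the line yields a rule, 0 if it is blank or malformed.
 *
 * States, as listed in the commit's comment:
 *   0 initial, 1 in src, 2 whitespace after src,
 *   3 in trg, 4 whitespace after trg, -1 syntax error.
 */
static int
parse_rule_line(const char *line, ParsedRule *rule)
{
    int         state = 0;
    const char *ptr;

    assert(line != NULL && rule != NULL);
    rule->src = rule->trg = NULL;
    rule->srclen = rule->trglen = 0;

    for (ptr = line; *ptr != '\0' && state >= 0; ptr++)
    {
        if (is_space(*ptr))
        {
            if (state == 1)
                state = 2;          /* src just ended */
            else if (state == 3)
                state = 4;          /* trg just ended */
            continue;               /* states 0, 2, 4 are unchanged */
        }

        switch (state)
        {
            case 0:                 /* first byte of src */
                rule->src = ptr;
                rule->srclen = 1;
                state = 1;
                break;
            case 1:                 /* further byte of src */
                rule->srclen++;
                break;
            case 2:                 /* first byte of trg */
                rule->trg = ptr;
                rule->trglen = 1;
                state = 3;
                break;
            case 3:                 /* further byte of trg */
                rule->trglen++;
                break;
            default:                /* a third token: syntax error */
                state = -1;
                break;
        }
    }

    if (state == 1 || state == 2)
    {
        /* trg was omitted, so use "" */
        rule->trg = "";
        rule->trglen = 0;
    }

    return state > 0;
}
```

Under these assumptions, `parse_rule_line("x", &r)` succeeds with an empty replacement, while a blank line (state 0) or a line with three tokens (state -1) is rejected, which mirrors the change from `state >= 3` to `state > 0`.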
@@ -45,9 +45,9 @@
   <itemizedlist>
    <listitem>
     <para>
-     Each line represents a pair, consisting of a character with accent
-     followed by a character without accent. The first is translated into
-     the second. For example,
+     Each line represents one translation rule, consisting of a character with
+     accent followed by a character without accent. The first is translated
+     into the second. For example,
 <programlisting>
 &Agrave; A
 &Aacute; A
@@ -57,6 +57,27 @@
 &Aring; A
 &AElig; A
 </programlisting>
+     The two characters must be separated by whitespace, and any leading or
+     trailing whitespace on a line is ignored.
+    </para>
+   </listitem>
+   <listitem>
+    <para>
+     Alternatively, if only one character is given on a line, instances of
+     that character are deleted; this is useful in languages where accents
+     are represented by separate characters.
+    </para>
+   </listitem>
+   <listitem>
+    <para>
+     As with other <productname>PostgreSQL</> text search configuration files,
+     the rules file must be stored in UTF-8 encoding. The data is
+     automatically translated into the current database's encoding when
+     loaded. Any lines containing untranslatable characters are silently
+     ignored, so that rules files can contain rules that are not applicable in
+     the current encoding.
     </para>
    </listitem>
   </itemizedlist>
@@ -132,8 +153,8 @@ mydb=# select ts_headline('fr','H&ocirc;tel de la Mer',to_tsquery('fr','Hotels')
   <para>
    The <function>unaccent()</> function removes accents (diacritic signs) from
-   a given string. Basically, it's a wrapper around the
-   <filename>unaccent</> dictionary, but it can be used outside normal
+   a given string. Basically, it's a wrapper around
+   <filename>unaccent</>-type dictionaries, but it can be used outside normal
    text search contexts.
   </para>
@@ -145,6 +166,11 @@ mydb=# select ts_headline('fr','H&ocirc;tel de la Mer',to_tsquery('fr','Hotels')
 unaccent(<optional><replaceable class="PARAMETER">dictionary</replaceable>, </optional> <replaceable class="PARAMETER">string</replaceable>) returns <type>text</type>
 </synopsis>
+  <para>
+   If the <replaceable class="PARAMETER">dictionary</replaceable> argument is
+   omitted, <literal>unaccent</> is assumed.
+  </para>
   <para>
    For example:
 <programlisting>
......
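To make the documented rule semantics concrete, here is a small C sketch of how a rule with an empty replacement deletes a separately encoded accent. It is an assumption-laden illustration, not how the module works internally: the diff shows rules being stored through `placeChar` into a trie, whereas this sketch swaps in a plain linear scan over a rule array with first-match-wins ordering. The `Rule` type and `apply_rules` function are hypothetical names, and the input is assumed to be valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical rule record for this sketch; not a type from unaccent.c. */
typedef struct
{
    const char *src;    /* bytes to match (one UTF-8 character) */
    const char *trg;    /* replacement bytes; "" means "delete src" */
} Rule;

/*
 * Copy `in` to `out`, applying the first rule whose src matches at each
 * position. Bytes that match no rule are copied unchanged.
 * Returns the number of bytes written (excluding the terminating NUL),
 * or (size_t) -1 if `out` is too small.
 */
static size_t
apply_rules(const Rule *rules, size_t nrules,
            const char *in, char *out, size_t outsize)
{
    size_t used = 0;

    assert(rules != NULL || nrules == 0);
    assert(in != NULL && out != NULL);

    while (*in != '\0')
    {
        const char *piece = in;     /* default: copy one input byte */
        size_t      piecelen = 1;
        size_t      advance = 1;
        size_t      i;

        for (i = 0; i < nrules; i++)
        {
            size_t srclen = strlen(rules[i].src);

            if (srclen > 0 && strncmp(in, rules[i].src, srclen) == 0)
            {
                piece = rules[i].trg;
                piecelen = strlen(rules[i].trg);    /* 0 for a deletion */
                advance = srclen;
                break;
            }
        }

        if (used + piecelen >= outsize)
            return (size_t) -1;     /* no room for piece plus NUL */
        memcpy(out + used, piece, piecelen);
        used += piecelen;
        in += advance;
    }

    if (used >= outsize)
        return (size_t) -1;
    out[used] = '\0';
    return used;
}
```

With the two rules `{"\xC3\x80", "A"}` (precomposed À to A) and `{"\xCC\x81", ""}` (combining acute accent U+0301, deleted), the second rule corresponds to a rules-file line that contains only one character.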