Commit 0ba77c14 authored by Peter Eisentraut

Convert more charset/locale documentation to DocBook

parent 333cbc2d
PostgreSQL Charsets README
Josef Balatka, <balatka@email.cz>
Draft v0.1, Tue Jul 20 15:49:07 CEST 1999
This document is a brief overview of the national charsets support
that PostgreSQL ver. 6.5 has implemented. Various compilation options
and setup tips are mentioned here to be helpful in the particular use.
---------------------------------------------------------------------------
Table of Contents
1. Locale awareness
2. Single-byte charsets recoding
3. Multi-byte support/recoding
4. Credits
---------------------------------------------------------------------------
1. Locale awareness
The PostgreSQL server supports both locale-aware and non-locale-aware
(default) operational modes. You select the mode during the
configuration stage of the installation with the --enable-locale option.
If you don't use --enable-locale, the multi-language code will not be
compiled and PostgreSQL will behave as an ASCII compliant application.
This mode is useful for its speed but only provided that you don't
have to consider national specific chars.
With --enable-locale you will get a locale aware server using LC_*
environment variables to determine how to process national specifics.
In this case strcoll(3) and similar functions are used internally
so speed is somewhat lower.
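To make the difference between the two modes concrete, here is a small sketch in Python (not PostgreSQL code) that sorts strings through the C library's collation machinery, which is the same family of functions as strcoll(3). The function name sort_with_collation is made up for this illustration, and which locale names other than "C" are available depends on your system.

```python
import locale


def sort_with_collation(words, locale_name="C"):
    """Sort words using the collation rules of the given locale.

    locale.strxfrm transforms a string so that ordinary comparison of the
    results matches what strcoll() would report for the originals.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        return sorted(words, key=locale.strxfrm)
    finally:
        # Restore the collation locale that was active before the call.
        locale.setlocale(locale.LC_COLLATE, previous)


if __name__ == "__main__":
    # Under the "C" locale the order follows character codes, so upper-case
    # ASCII letters sort before lower-case ones.
    print(sort_with_collation(["b", "a", "B", "A"], "C"))
```

Passing a national locale name instead of "C" (if one is installed) shows how the order changes once locale rules are applied; the extra transformation step is also where the speed cost mentioned above comes from.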
Notice here that --enable-locale is sufficient when all your clients
use the same single-byte encoding as the database server does.
When your clients use an encoding different from the server's, then you
also have to use the --enable-recode or --with-mb=<encoding> option on
the server side, or a particular client that does the recoding itself (e.g.
there exists a PostgreSQL ODBC driver for Win32 with various Cyrillic
encoding capabilities). The --with-mb=<encoding> option is necessary for
multi-byte charset support.
2. Single-byte charsets recoding
You can set up this feature with --enable-recode option. This option
is described as 'enable Cyrillic recode support' which doesn't express
all its power. It can be used for *any* single-byte charset recoding.
This method uses charset.conf file located in the $PGDATA directory.
It's a typical configuration text file where spaces and newlines
separate items and records and # specifies comments. Three keywords
with the following syntax are recognized here:
BaseCharset <server_charset>
RecodeTable <from_charset> <to_charset> <file_name>
HostCharset <host_spec> <host_charset>
BaseCharset defines encoding of the database server. All charset
names are only used for mapping inside the charset.conf so you can
freely use typing-friendly names.
RecodeTable records specify translation table between server and client.
The file name is relative to the $PGDATA directory. Table file format
is very simple. There are no keywords and characters are represented by
a pair of decimal or hexadecimal (0x prefixed) values on single lines:
<char_value> <translated_char_value>
HostCharset records define IP address and charset. You can use a single
IP address, an IP mask range starting from the given address or an IP
interval (e.g. 127.0.0.1, 192.168.1.100/24, 192.168.1.20-192.168.1.40)
The charset.conf is always processed up to the end, so you can easily
specify exceptions from the previous rules. In the src/data you will
find charset.conf example and a few recoding tables.
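The recoding table format described above is simple enough to read in a few lines. The following Python sketch is only an illustration, not the server's implementation. It assumes that bytes not listed in the table map to themselves and that '#' starts a comment in table files as it does in charset.conf; neither point is stated in this README. The sample entries are for demonstration only.

```python
def parse_value(token):
    """Parse a decimal or 0x-prefixed hexadecimal byte value."""
    if token.lower().startswith("0x"):
        return int(token, 16)
    return int(token, 10)


def parse_recode_table(text):
    """Build a 256-byte translation table from recode-table text."""
    table = list(range(256))  # assumption: unlisted bytes stay unchanged
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # assumption: '#' comments
        if not line:
            continue
        char_value, translated_char_value = line.split()
        table[parse_value(char_value)] = parse_value(translated_char_value)
    return bytes(table)


SAMPLE_TABLE = """
# sample entries only, one pair per line
0xC1 0xE0
194 225
"""

if __name__ == "__main__":
    table = parse_recode_table(SAMPLE_TABLE)
    # bytes.translate applies the 256-entry table to every byte.
    print(b"\xc1\xc2 ok".translate(table))
```

Keeping the table as a 256-byte string means recoding a whole buffer is a single translate call, which is why this approach only works for single-byte charsets.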
As this solution is based on the client's IP address / charset mapping
there are obviously some restrictions as well. You can't use different
encoding on the same host at the same time. It's also inconvenient when
you boot your client hosts into more than one operating system.
Nevertheless, when these restrictions are not limiting and you don't
need multi-byte chars, then it's a simple and effective solution.
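The HostCharset lookup described in this section can be sketched as follows. This is a Python illustration under stated assumptions, not the server's code: the mask form is treated as an ordinary CIDR network (the README's wording "range starting from the given address" may mean something slightly different), and "processed up to the end" is read as "the last matching record wins". The function names and the charset names in the example are hypothetical.

```python
import ipaddress


def host_matches(host_spec, client_ip):
    """Return True if client_ip is covered by a HostCharset host_spec."""
    addr = ipaddress.ip_address(client_ip)
    if "-" in host_spec:
        # IP interval, e.g. 192.168.1.20-192.168.1.40 (inclusive)
        low, high = (ipaddress.ip_address(p.strip())
                     for p in host_spec.split("-", 1))
        return low <= addr <= high
    if "/" in host_spec:
        # Assumption: address/mask is interpreted as a CIDR network.
        return addr in ipaddress.ip_network(host_spec, strict=False)
    # Single IP address
    return addr == ipaddress.ip_address(host_spec)


def charset_for_client(rules, client_ip, default):
    """Scan all (host_spec, charset) rules; the last match wins."""
    chosen = default
    for host_spec, charset in rules:
        if host_matches(host_spec, client_ip):
            chosen = charset  # later records override earlier ones
    return chosen


if __name__ == "__main__":
    rules = [
        ("192.168.1.100/24", "koi"),           # whole subnet
        ("192.168.1.20-192.168.1.40", "win"),  # exception for an interval
        ("127.0.0.1", "alt"),
    ]
    for ip in ("192.168.1.30", "192.168.1.200", "10.0.0.1"):
        print(ip, charset_for_client(rules, ip, default="koi"))
```

Because every record is examined, a broad rule can be listed first and narrower exceptions after it. The sketch also makes the stated restriction visible: the only input is the client's IP address, so two clients on one host cannot get different charsets.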
3. Multi-byte support/recoding
It's a new generation of charset encoding in PostgreSQL designed as a
more complex solution supporting both single-byte and multi-byte chars.
You can set up this feature with --with-mb=<encoding> option.
There is no IP mapping file and recoding is controlled through the new
SQL statements. Recoding tables are included in the code. Many national
charsets are already supported and further will follow.
See doc/README.mb, doc/README.mb.jp to get detailed instruction on how
to use the multibyte support. In the file doc/README.locale there is
a particular instruction on usage of the multibyte support with Cyrillic.
4. Credits
I'd like to thank the PostgreSQL development team and all contributors
for creating PostgreSQL. Thanks to Oleg Bartunov, Oleg Broytmann and
Tatsuo Ishii for opening the door into the multi-language world.
===========
1999 Jul 21
===========
Josef Balatka, <balatka@email.cz> asked us not to remove RECODE and sent me
Czech ISO-8859-2 -> WIN-1250 translation table.
RECODE no longer contains just Cyrillic RECODE and will stay in
PostgreSQL.
He also created some bits of documentation, mostly concerning RECODE -
see README.Charsets.
===========
1999 Apr 14
===========
Tatsuo Ishii <t-ishii@sra.co.jp> updated Multibyte support extending it
to Cyrillic language. Now PostgreSQL supports KOI8-R, WIN-1251, ISO8859-5
and CP866 (ALT) encodings.
Short instruction on using this feature follows. Longer discussion of
Multibyte support is in README.mb.
WARNING! Now that Multibyte support exists, Cyrillic RECODE is declared
obsolete and will be removed from Postgres. If you are using RECODE,
consider switching to Multibyte support.
Instructions on how to prepare Postgres for Cyrillic Multibyte support.
----------------------------------------------------------------------
First, you need to back up all your databases. I recommend backing up the
entire Postgres directory, including binaries and libraries - thus you can
easily restore if something goes wrong.
Dump your data: pg_dumpall > dump.db
Stop postmaster.
Configure, compile and install Postgres. (I'll mostly talk about KOI8-R
encoding, this is just to make examples a little more clear; you can use
any supported encoding.)
cd src
./configure --enable-locale --with-mb=KOI8
make
make install
Make sure you've backed up your databases. Doublecheck your backup. I
really mean it - make regular backups and test your backups sometimes by
fake restore.
Remove your data directory (better, rename or move it).
Run initdb saying your primary encoding: initdb -e KOI8. If you omit
encoding, primary encoding from configure will be taken.
Start postmaster.
Create databases: createdb -e KOI8. Again, you can omit encoding -
default encoding will be used. You are not forced to use the same encoding
for all your databases - you can create different databases with different
encodings.
Load your data from the dump you've created: psql < dump.db
That's all! Now you are ready to enjoy the full power of Multibyte
support.
To use Multibyte support you do not need to do anything special - just
execute your queries. If the client program does not set an encoding, it
will get the data in the database encoding. But the client may ask Postgres
to do automatic server-to-client and client-to-server conversions. There
are two ways a client program can declare its encoding:
1) client explicitly executes the query SET CLIENT_ENCODING TO 'win';
2) client started with environment variable set. Examples -
using sh syntax:
PGCLIENTENCODING='win'; export PGCLIENTENCODING
using csh syntax:
setenv PGCLIENTENCODING 'win'
Setting PGCLIENTENCODING even if you use the same client encoding as the
database avoids the overhead of asking for the database encoding while
initiating the connection, so it is a good idea to set it in any case.
Now you may run test suite and see Multibyte support in action. Go to
.../src/test/locale and run
make clean all test-koi2win
===========
1998 Nov 20
===========
I extended locale support, originally written by Oleg Bartunov
<oleg@sai.msu.su>. Now ORDER BY (if PostgreSQL configured with
--enable-locale) uses strcoll() for all text fields: char(n), varchar(n),
text.
I included a test suite in .../src/test/locale. I didn't include this in
the regression test because not many people require locale support. Read
.../src/test/locale/README for details on the test suite.
Many thanks to Oleg Bartunov (oleg@sai.msu.su) and Thomas G. Lockhart
(lockhart@alumni.caltech.edu) for hints, tips, help and discussion.
Oleg.
<!--
$Header: /cvsroot/pgsql/doc/src/sgml/Attic/admin.sgml,v 1.26 2000/09/12 05:37:07 thomas Exp $
$Header: /cvsroot/pgsql/doc/src/sgml/Attic/admin.sgml,v 1.27 2000/09/30 16:58:20 petere Exp $
Postgres Administrator's Guide.
Derived from postgres.sgml.
......@@ -98,9 +98,9 @@ Derived from postgres.sgml.
&intro-ag;
&installation;
&installw;
&charset;
&runtime;
&client-auth;
&charset;
&manage-ag;
&user-manag;
&backup;
......
<chapter id="charset">
<title>Character Sets</title>
<!-- $Header: /cvsroot/pgsql/doc/src/sgml/charset.sgml,v 2.3 2000/09/30 16:58:20 petere Exp $ -->
<abstract>
<para>
Describes the available language and character set support in
<productname>Postgres</productname>.
</para>
</abstract>
<chapter id="charset">
<title>Localization</>
<abstract>
<para>
Describes the available localization features from the point of
view of the administrator.
</para>
</abstract>
<para>
<productname>Postgres</productname> supports non-ASCII character
sets with two approaches:
<productname>Postgres</productname> supports localization with
three approaches:
<itemizedlist>
<listitem>
<para>
Using locale features in underlying
system libraries. This allows single-byte character sets to be
configured with a locale-specific collation order, provided that
the underlying system supports the required locale. This
technique supports only one character set per server, and can
not support multi-byte character sets.
Using the locale features of the operating system to provide
locale-specific collation order, number formatting, and other
aspects.
</para>
</listitem>
<listitem>
<para>
Using explicit multiple-byte character sets defined in the
<productname>Postgres</productname> server. These character sets
are also known to some client libraries. The number of character
sets is fixed at the time the server is compiled, and internal
operations such as string comparisons require expansion of each
character into a 32-bit word.
<productname>Postgres</productname> server to support languages
that require more characters than will fit into a single byte,
and to provide character set recoding between client and server.
The number of supported character sets is fixed at the time the
server is compiled, and internal operations such as string
comparisons require expansion of each character into a 32-bit
word.
</para>
</listitem>
<listitem>
<para>
Single byte character recoding provides a more light-weight
solution for users of multiple, yet single-byte character sets.
</para>
</listitem>
</itemizedlist>
</para>
<sect1 id="locale">
<title>Locale Support</title>
<para>
<firstterm>Locale</> support refers to an application respecting
cultural preferences regarding alphabets, sorting, number
formatting, etc. <productname>PostgreSQL</> uses the standard ISO
C and POSIX-like locale facilities provided by the server operating
system. For additional information refer to the documentation of your
system.
</para>
<sect2>
<title>Overview</>
<para>
Locale support is not built into <productname>PostgreSQL</> by
default; to enable it, supply the <option>--enable-locale</> option
to the <filename>configure</> script:
<informalexample>
<screen>
<prompt>$ </><userinput>./configure --enable-locale</>
</screen>
</informalexample>
Locale support only affects the server; all clients are compatible
with servers with or without locale support.
</para>
<para>
The information about which particular cultural rules to use is
determined by standard environment variables. If you are getting
localized behavior from other programs you probably have them set
up already. The simplest way to set the localization information
is the <envar>LANG</> variable, for example:
<programlisting>
export LANG=sv_SE
</programlisting>
This sets the locale to Swedish (<literal>sv</>) as spoken in
Sweden (<literal>SE</>). Other possibilities might be
<literal>en_US</> (U.S. English) and <literal>fr_CA</> (Canada,
French). If more than one character set can be useful for a locale
then the specifications look like this:
<literal>cs_CZ.ISO8859-2</>. What locales are available under what
names on your system depends on what was provided by the operating
system vendor and what was installed.
</para>
<para>
Occasionally it is useful to mix rules from several locales, e.g.,
use U.S. rules but Spanish messages. To do that a set of
environment variables exist that override the default of
<envar>LANG</> for a particular category:
<informaltable>
<tgroup cols="2">
<tbody>
<row>
<entry>LC_COLLATE</>
<entry>String sort order</>
</row>
<row>
<entry>LC_CTYPE</>
<entry>Character classification (What is a letter? What is the upper-case equivalent of this letter?)</>
</row>
<row>
<entry>LC_MESSAGES</>
<entry>Language of messages</>
</row>
<row>
<entry>LC_MONETARY</>
<entry>Formatting of currency amounts</>
</row>
<row>
<entry>LC_NUMERIC</>
<entry>Formatting of numbers</>
</row>
<row>
<entry>LC_TIME</>
<entry>Formatting of dates and times</>
</row>
</tbody>
</tgroup>
</informaltable>
<envar>LC_MESSAGES</> only affects the messages that come from the
operating system, not <productname>PostgreSQL</>.
</para>
<para>
If you want the system to behave as if it had no locale support,
use the special locale <literal>C</> or <literal>POSIX</>, or
simply unset all locale related variables.
</para>
<para>
Once you have chosen a set of localization rules this way you must
keep them fixed for any particular database cluster. That means
that the locales that were active when you ran <filename>initdb</>
must be kept the same when you start the postmaster. Otherwise,
the changed sort order can corrupt indexes or make your data
disappear mysteriously. It is currently not possible to change the
locales after database initialization or to use more than one set
of locales for a given database cluster.
</para>
</sect2>
<sect2>
<title>Benefits</>
<para>
Locale support influences in particular the following features:
<itemizedlist>
<listitem>
<para>
Sort order in <command>ORDER BY</> queries.
</para>
</listitem>
<listitem>
<para>
The <function>to_char</> family of functions
</para>
</listitem>
<listitem>
<para>
The <literal>LIKE</> and <literal>~</> operators for pattern
matching
</para>
</listitem>
</itemizedlist>
</para>
<para>
The only severe drawback of using the locale support in
<productname>PostgreSQL</> is its speed. So use locale only if you
actually need it.
</para>
</sect2>
<sect2>
<title>Problems</>
<para>
If locale support doesn't work in spite of the explanation above,
check that the locale support in your operating system is okay.
To check whether a given locale is installed and functional you
can use <application>Perl</>, for example. Perl also has support
for locales, and if a locale is broken <command>perl -v</> will
complain with something like this:
<screen>
<prompt>$</> <userinput>export LC_CTYPE='not_exist'</>
<prompt>$</> <userinput>perl -v</>
<computeroutput>
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LC_ALL = (unset),
LC_CTYPE = "not_exist",
LANG = (unset)
are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
</computeroutput>
</screen>
</para>
<para>
Check that your locale files are in the right location. Possible
locations include: <filename>/usr/lib/locale</filename> (Linux,
Solaris), <filename>/usr/share/locale</filename> (Linux),
<filename>/usr/lib/nls/loc</filename> (DUX 4.0). Check the locale
man page of your system if you are not sure.
</para>
<para>
The directory <filename>src/test/locale</> contains a test suite
for <productname>PostgreSQL</>'s locale support.
</para>
</sect2>
</sect1>
<sect1 id="multibyte">
<title>Multi-byte Support</title>
<title>Multibyte Support</title>
<note>
<title>Author</title>
......@@ -53,7 +244,7 @@
</note>
<para>
Multi-byte (<acronym>MB</acronym>) support is intended to allow
Multibyte (<acronym>MB</acronym>) support is intended to allow
<productname>Postgres</productname> to handle
multiple-byte character sets such as EUC (Extended Unix Code), Unicode and
Mule internal code. With <acronym>MB</acronym> enabled you can use multi-byte
......@@ -680,7 +871,78 @@ SET CLIENT_ENCODING = 'WIN1250';
</procedure>
</sect2>
</sect1>
</chapter>
<sect1 id="recode">
<title>Single-byte character set recoding</>
<!-- formerly in README.charsets, by Josef Balatka, <balatka@email.cz> -->
<para>
You can set up this feature with the <option>--enable-recode</> option
to <filename>configure</>. This option was formerly described as
<quote>Cyrillic recode support</> which doesn't express all its
power. It can be used for <emphasis>any</> single-byte character
set recoding.
</para>
<para>
This method uses a <filename>charset.conf</> file located in
the database directory (<envar>PGDATA</>). It's a typical
configuration text file where spaces and newlines separate items
and records and # specifies comments. Three keywords with the
following syntax are recognized here:
<synopsis>
BaseCharset <replaceable>server_charset</>
RecodeTable <replaceable>from_charset</> <replaceable>to_charset</> <replaceable>file_name</>
HostCharset <replaceable>host_spec</> <replaceable>host_charset</>
</synopsis>
</para>
<para>
<token>BaseCharset</> defines the encoding of the database server.
All character set names are only used for mapping inside of
<filename>charset.conf</> so you can freely use typing-friendly
names.
</para>
<para>
<token>RecodeTable</> records specify translation tables between
server and client. The file name is relative to the
<envar>PGDATA</> directory. The table file format is very
simple. There are no keywords and characters are represented by a
pair of decimal or hexadecimal (0x prefixed) values on single
lines:
<synopsis>
<replaceable>char_value</> <replaceable>translated_char_value</>
</synopsis>
</para>
<para>
<token>HostCharset</> records define the client character set by IP
address. You can use a single IP address, an IP mask range starting
from the given address or an IP interval (e.g., 127.0.0.1,
192.168.1.100/24, 192.168.1.20-192.168.1.40).
</para>
<para>
The <filename>charset.conf</> file is always processed up to the
end, so you can easily specify exceptions from the previous
rules. In the src/data you will find charset.conf example and a few
recoding tables.
</para>
<para>
As this solution is based on the client's IP address and character
set mapping there are obviously some restrictions as well. You
cannot use different encodings on the same host at the same
time. It is also inconvenient when you boot your client hosts into
more than one operating system. Nevertheless, when these restrictions are
not limiting and you do not need multi-byte characters, then it is a
simple and effective solution.
</para>
</sect1>
</chapter>
<!-- Keep this comment at the end of the file
Local variables:
......
<!-- $Header: /cvsroot/pgsql/doc/src/sgml/installation.sgml,v 1.21 2000/09/29 20:21:34 petere Exp $ -->
<!-- $Header: /cvsroot/pgsql/doc/src/sgml/installation.sgml,v 1.22 2000/09/30 16:58:20 petere Exp $ -->
<chapter id="installation">
<title><![%flattext-install-include[<productname>PostgreSQL</> ]]>Installation Instructions</title>
......@@ -447,8 +447,9 @@ su - postgres
<term>--enable-recode</term>
<listitem>
<para>
Enables character set recode support. See
<filename>doc/README.Charsets</> for details on this feature.
Enables single-byte character set recode support. See
<![%flattext-install-include[the <citetitle>Administrator's Guide</citetitle>]]>
<![%flattext-install-ignore[<xref linkend="recode">]]> about this feature.
</para>
</listitem>
</varlistentry>
......@@ -459,7 +460,10 @@ su - postgres
<para>
Allows the use of multibyte character encodings. This is
primarily for languages like Japanese, Korean, and Chinese.
Read <filename>doc/README.mb</> for details.
Read
<![%flattext-install-include[the <citetitle>Administrator's Guide</citetitle>]]>
<![%flattext-install-ignore[<xref linkend="multibyte">]]>
for details.
</para>
</listitem>
</varlistentry>
......
<!--
$Header: /cvsroot/pgsql/doc/src/sgml/postgres.sgml,v 1.41 2000/09/12 05:37:09 thomas Exp $
$Header: /cvsroot/pgsql/doc/src/sgml/postgres.sgml,v 1.42 2000/09/30 16:58:20 petere Exp $
-->
<!doctype set PUBLIC "-//OASIS//DTD DocBook V3.1//EN" [
......@@ -173,9 +173,9 @@ $Header: /cvsroot/pgsql/doc/src/sgml/postgres.sgml,v 1.41 2000/09/12 05:37:09 th
-->
&installation;
&installw;
&charset;
&runtime;
&client-auth;
&charset;
&manage-ag;
&user-manag;
&backup;
......
<!--
$Header: /cvsroot/pgsql/doc/src/sgml/runtime.sgml,v 1.25 2000/09/29 20:21:34 petere Exp $
$Header: /cvsroot/pgsql/doc/src/sgml/runtime.sgml,v 1.26 2000/09/30 16:58:20 petere Exp $
-->
<Chapter Id="runtime">
......@@ -1553,126 +1553,6 @@ set semsys:seminfo_semmsl=32
</sect1>
<sect1 id="locale">
<title>Locale Support</title>
<note>
<title>Acknowledgement</title>
<para>
Written by Oleg Bartunov. See <ulink
url="http://www.sai.msu.su/~megera/postgres/">Oleg's web
page</ulink> for additional information on locale and Russian
language support.
</para>
</note>
<para>
While doing a project for a company in Moscow, Russia, I
encountered the problem that <productname>Postgres</> had no
support of national alphabets. After looking for possible
workarounds I decided to develop support of locale myself. I'm not
a C programmer but already had some experience with locale
programming when I work with <productname>Perl</> (debugging) and
<productname>Glimpse</>. After several days of digging through the
<productname>Postgres</> source tree I made very minor corrections
to <filename>src/backend/utils/adt/varlena.c</> and
<filename>src/backend/main/main.c</> and got what I needed! I did
support only for <envar>LC_CTYPE</envar> and
<envar>LC_COLLATE</envar>, but later <envar>LC_MONETARY</envar> was
added by others. I got many messages from people about this patch
so I decided to send it to developers and (to my surprise) it was
incorporated into the <productname>Postgres</> distribution.
</para>
<para>
People often complain that locale doesn't work for them. There are
several common mistakes:
<itemizedlist>
<listitem>
<para>
Didn't properly configure <productname>Postgres</> before
compilation. You must run <filename>configure</> with the
<option>--enable-locale</> option to enable locale support.
</para>
</listitem>
<listitem>
<para>
Didn't setup environment correctly when starting postmaster. You
must define environment variables <envar>LC_CTYPE</envar> and
<envar>LC_COLLATE</envar> before running postmaster because
backend gets information about locale from environment. I use
following shell script:
<programlisting>
#!/bin/sh
export LC_CTYPE=koi8-r
export LC_COLLATE=koi8-r
postmaster -B 1024 -S -D/usr/local/pgsql/data/ -o '-Fe'
</programlisting>
</para>
</listitem>
<listitem>
<para>
Broken locale support in the operating system (for example,
locale support in libc under Linux several times has changed and
this caused a lot of problems). Perl has also support of locale
and if locale is broken <command>perl -v</> will complain
something like:
<screen>
<prompt>$</> <userinput>export LC_CTYPE='not_exist'</>
<prompt>$</> <userinput>perl -v</>
<computeroutput>
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LC_ALL = (unset),
LC_CTYPE = "not_exist",
LANG = (unset)
are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
</computeroutput>
</screen>
</para>
</listitem>
<listitem>
<para>
Wrong location of locale files. Possible locations include:
<filename>/usr/lib/locale</filename> (Linux, Solaris),
<filename>/usr/share/locale</filename> (Linux),
<filename>/usr/lib/nls/loc</filename> (DUX 4.0).
Check <command>man locale</command> to find the correct
location. Under Linux I made a symbolic link between
<filename>/usr/lib/locale</filename> and
<filename>/usr/share/locale</filename> to be sure that the next
libc will not break my locale.
</para>
</listitem>
</itemizedlist>
</para>
<formalpara>
<title>What are the Benefits?</title>
<para>
You can use the ~* and ORDER BY operators on strings that contain
characters from national alphabets. Non-English users definitely
need that.
</para>
</formalpara>
<formalpara>
<title>What are the Drawbacks?</title>
<para>
There is one evident drawback of using locale - its speed! So, use
locale only if you really need it.
</para>
</formalpara>
</sect1>
<sect1 id="postmaster-shutdown">
<title>Shutting down the server</title>
......