<!-- $PostgreSQL: pgsql/doc/src/sgml/charset.sgml,v 2.51 2005/03/13 01:26:30 momjian Exp $ -->

<chapter id="charset">
 <title>Localization</>

 <para>
  This chapter describes the available localization features from the
  point of view of the administrator.
  <productname>PostgreSQL</productname> supports localization with
  two approaches:

   <itemizedlist>
    <listitem>
     <para>
      Using the locale features of the operating system to provide
      locale-specific collation order, number formatting, translated
      messages, and other aspects.
     </para>
    </listitem>

    <listitem>
     <para>
      Providing a number of different character sets defined in the
      <productname>PostgreSQL</productname> server, including
      multiple-byte character sets, to support storing text in all
      kinds of languages, and providing character set translation between
      client and server.
     </para>
    </listitem>
   </itemizedlist>
  </para>


 <sect1 id="locale">
  <title>Locale Support</title>
  
  <indexterm zone="locale"><primary>locale</></>

  <para>
   <firstterm>Locale</> support refers to an application respecting
   cultural preferences regarding alphabets, sorting, number
   formatting, etc.  <productname>PostgreSQL</> uses the standard ISO
   C and <acronym>POSIX</acronym> locale facilities provided by the server operating
   system.  For additional information refer to the documentation of your
   system.
  </para>

  <sect2>
   <title>Overview</>

   <para>
    Locale support is automatically initialized when a database
    cluster is created using <command>initdb</command>.
    <command>initdb</command> will initialize the database cluster
    with the locale setting of its execution environment by default,
    so if your system is already set to use the locale that you want
    in your database cluster then there is nothing else you need to
    do.  If you want to use a different locale (or you are not sure
    which locale your system is set to), you can instruct
    <command>initdb</command> exactly which locale to use by
    specifying the <option>--locale</option> option. For example:
<screen>
initdb --locale=sv_SE
</screen>
   </para>

   <para>
    This example sets the locale to Swedish (<literal>sv</>) as spoken
    in Sweden (<literal>SE</>).  Other possibilities might be
    <literal>en_US</> (U.S. English) and <literal>fr_CA</> (French
    Canadian).  If more than one character set can be useful for a
    locale then the specifications look like this:
    <literal>cs_CZ.ISO8859-2</>. What locales are available under what
    names on your system depends on what was provided by the operating
    system vendor and what was installed.  (On most systems, the command
    <literal>locale -a</> will provide a list of available locales.)
   </para>
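
   <para>
    As a rough illustration of how such locale names are put together,
    the following shell sketch splits a name into its language,
    territory, and optional character set parts.  The function name
    <function>parse_locale</> is made up for this example and is not
    part of <productname>PostgreSQL</> or of the operating system;
    modifiers such as <literal>@euro</> are not handled.
<programlisting>
# Split a locale name of the form language[_TERRITORY][.codeset]
# into its parts.  Hypothetical helper, for illustration only.
# The body runs in a subshell so its variables do not leak out.
parse_locale() (
    name=$1
    codeset=
    territory=
    # Everything after the first "." is the character set.
    case $name in
        *.*) codeset=${name#*.}; name=${name%%.*} ;;
    esac
    # What follows the first "_" in the remainder is the territory.
    case $name in
        *_*) territory=${name#*_}; name=${name%%_*} ;;
    esac
    echo "language=$name territory=$territory codeset=$codeset"
)

parse_locale cs_CZ.ISO8859-2
</programlisting>
    For <literal>cs_CZ.ISO8859-2</> this prints
    <literal>language=cs territory=CZ codeset=ISO8859-2</>.
   </para>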

   <para>
    Occasionally it is useful to mix rules from several locales, e.g.,
    use English collation rules but Spanish messages.  To support that, a
    set of locale subcategories exists, each controlling only a certain
    aspect of the localization rules:

    <informaltable>
     <tgroup cols="2">
      <tbody>
       <row>
        <entry><envar>LC_COLLATE</></>
        <entry>String sort order</>
       </row>
       <row>
        <entry><envar>LC_CTYPE</></>
        <entry>Character classification (What is a letter? Its upper-case equivalent?)</>
       </row>
       <row>
        <entry><envar>LC_MESSAGES</></>
        <entry>Language of messages</>
       </row>
       <row>
        <entry><envar>LC_MONETARY</></>
        <entry>Formatting of currency amounts</>
       </row>
       <row>
        <entry><envar>LC_NUMERIC</></>
        <entry>Formatting of numbers</>
       </row>
       <row>
        <entry><envar>LC_TIME</></>
        <entry>Formatting of dates and times</>
       </row>
      </tbody>
     </tgroup>
    </informaltable>

    The category names translate into names of
    <command>initdb</command> options to override the locale choice
    for a specific category.  For instance, to set the locale to
    French Canadian, but use U.S. rules for formatting currency, use
    <literal>initdb --locale=fr_CA --lc-monetary=en_US</literal>.
   </para>

   <para>
    If you want the system to behave as if it had no locale support,
    use the special locale <literal>C</> or <literal>POSIX</>.
   </para>

   <para>
    The nature of some locale categories is that their value has to be
    fixed for the lifetime of a database cluster.  That is, once
    <command>initdb</command> has run, you cannot change them anymore.
    <literal>LC_COLLATE</literal> and <literal>LC_CTYPE</literal> are
    those categories.  They affect the sort order of indexes, so they
    must be kept fixed, or indexes on text columns will become corrupt.
    <productname>PostgreSQL</productname> enforces this by recording
    the values of <envar>LC_COLLATE</> and <envar>LC_CTYPE</> that are
    seen by <command>initdb</>.  The server automatically adopts
    those two values when it is started.
   </para>

   <para>
    The other locale categories can be changed as desired whenever the
    server is running by setting the run-time configuration variables
    that have the same name as the locale categories (see <xref
    linkend="runtime-config-client-format"> for details).  The defaults that are
    chosen by <command>initdb</command> are actually only written into
    the configuration file <filename>postgresql.conf</filename> to
    serve as defaults when the server is started.  If you delete the
    assignments from <filename>postgresql.conf</filename> then the
    server will inherit the settings from the execution environment.
   </para>

   <para>
    Note that the locale behavior of the server is determined by the
    environment variables seen by the server, not by the environment
    of any client.  Therefore, be careful to configure the correct locale settings
    before starting the server.  A consequence of this is that if
    client and server are set up in different locales, messages may
    appear in different languages depending on where they originated.
   </para>

   <note>
    <para>
     When we speak of inheriting the locale from the execution
     environment, this means the following on most operating systems:
     For a given locale category, say the collation, the following
     environment variables are consulted in this order until one is
     found to be set: <envar>LC_ALL</envar>, <envar>LC_COLLATE</envar>
     (the variable corresponding to the respective category),
     <envar>LANG</envar>.  If none of these environment variables are
     set then the locale defaults to <literal>C</literal>.
    </para>

    <para>
     Some message localization libraries also look at the environment
     variable <envar>LANGUAGE</envar> which overrides all other locale
     settings for the purpose of setting the language of messages.  If
     in doubt, please refer to the documentation of your operating
     system, in particular the documentation about
     <application>gettext</>, for more information.
    </para>
   </note>
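
   <para>
    The lookup order described in the note above can be sketched as a
    small shell function.  This is only an illustration, under the
    assumption that the platform follows the usual
    <acronym>POSIX</acronym> rules; the name
    <function>effective_locale</> is invented for this example, and an
    empty value is treated the same as an unset variable.
<programlisting>
# Print the locale that applies to one category.
#   $1 = value of LC_ALL
#   $2 = value of the category's own variable (for example LC_COLLATE)
#   $3 = value of LANG
# The first non-empty value wins; if all are empty the result is C.
effective_locale() {
    if [ -n "$1" ]; then
        echo "$1"
    elif [ -n "$2" ]; then
        echo "$2"
    elif [ -n "$3" ]; then
        echo "$3"
    else
        echo C
    fi
}

# Collation locale implied by the current shell environment.
effective_locale "${LC_ALL:-}" "${LC_COLLATE:-}" "${LANG:-}"
</programlisting>
    The last line prints the locale that would apply to collation for a
    program started from the current shell.
   </para>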

   <para>
    To enable messages to be translated to the user's preferred language,
    <acronym>NLS</acronym> must have been enabled at build time.  This
    choice is independent of the other locale support.
   </para>
  </sect2>

  <sect2>
   <title>Behavior</>

   <para>
    Locale support influences the following features:

    <itemizedlist>
     <listitem>
      <para>
       Sort order in queries using <literal>ORDER BY</>
       <indexterm><primary>ORDER BY</><secondary>and locales</></indexterm>
      </para>
     </listitem>

     <listitem>
      <para>
       The ability to use indexes with <literal>LIKE</> clauses
       <indexterm><primary>LIKE</><secondary>and locales</></indexterm>
      </para>
     </listitem>

     <listitem>
      <para>
       The <function>to_char</> family of functions
      </para>
     </listitem>
    </itemizedlist>
   </para>

   <para>
    The drawback of using locales other than <literal>C</> or
    <literal>POSIX</> in <productname>PostgreSQL</> is the performance
    impact: doing so slows character handling and prevents ordinary
    indexes from being used by <literal>LIKE</>. For this reason, use
    locales only if you actually need them.
   </para>
  </sect2>

  <sect2>
   <title>Problems</>

   <para>
    If locale support doesn't work in spite of the explanation above,
    check that the locale support in your operating system is
    correctly configured.  To check what locales are installed on your
    system, you may use the command <literal>locale -a</literal> if
    your operating system provides it.
   </para>
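
   <para>
    One possible way to script this check is sketched below.  The helper
    <function>locale_listed</> is a hypothetical name chosen for this
    example; it reads a list of locale names, one per line, in the form
    printed by <literal>locale -a</>, and succeeds only on an exact
    match.  Exactness matters because some systems list a name such as
    <literal>sv_SE.utf8</> without listing plain <literal>sv_SE</>.
<programlisting>
# Succeed if the locale name given as $1 appears on standard input
# as a complete line.  Hypothetical helper, for illustration only.
locale_listed() {
    grep -Fqx -- "$1"
}

# Demonstration with a fixed sample list.  On a real system the input
# would come from the operating system instead, for example:
#   locale -a | locale_listed sv_SE
if printf 'C\nPOSIX\nsv_SE\n' | locale_listed sv_SE; then
    echo "sv_SE is listed"
fi
</programlisting>
   </para>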

   <para>
    Check that <productname>PostgreSQL</> is actually using the locale
    that you think it is.  <envar>LC_COLLATE</> and <envar>LC_CTYPE</>
    settings are determined at <command>initdb</> time and cannot be
    changed without repeating <command>initdb</>.  Other locale
    settings including <envar>LC_MESSAGES</> and <envar>LC_MONETARY</>
    are initially determined by the environment the server is started
    in, but can be changed on-the-fly.  You can check the active locale
    settings using the <command>SHOW</> command.
   </para>

   <para>
    The directory <filename>src/test/locale</> in the source
    distribution contains a test suite for
    <productname>PostgreSQL</>'s locale support.
   </para>

   <para>
    Client applications that handle server-side errors by parsing the
    text of the error message will obviously have problems when the
    server's messages are in a different language.  Authors of such
    applications are advised to make use of the error code scheme
    instead.
   </para>

   <para>
    Maintaining catalogs of message translations requires the ongoing
    efforts of many volunteers who want to see
    <productname>PostgreSQL</> speak their preferred language well.
    If messages in your language are currently not available or not fully
    translated, your assistance would be appreciated.  If you want to
    help, refer to <xref linkend="nls"> or write to the developers'
    mailing list.
   </para>
  </sect2>
 </sect1>


 <sect1 id="multibyte">
  <title>Character Set Support</title>

  <indexterm zone="multibyte"><primary>character set</></>

  <para>
   The character set support in <productname>PostgreSQL</productname>
   allows you to store text in a variety of character sets, including
   single-byte character sets such as the ISO 8859 series and
   multiple-byte character sets such as <acronym>EUC</> (Extended Unix
   Code), UTF8, and Mule internal code.  All character sets can be
   used transparently throughout the server.  (Whether extension
   functions from other sources handle all character sets correctly
   depends on how their authors wrote them.)  The default character
   set is selected while
   initializing your <productname>PostgreSQL</productname> database
   cluster using <command>initdb</>.  It can be overridden when you
   create a database using <command>createdb</command> or by using the
   SQL command <command>CREATE DATABASE</>. So you can have multiple
   databases each with a different character set.
  </para>

   <sect2 id="multibyte-charset-supported">
    <title>Supported Character Sets</title>

    <para>
     <xref linkend="charset-table"> shows the character sets available
     for use in the server.
    </para>

     <table id="charset-table">
      <title>Server Character Sets</title>
      <tgroup cols="3">
       <thead>
        <row>
         <entry>Name</entry>
         <entry>Description</entry>
         <entry>Aliases</entry>
        </row>
       </thead>
       <tbody>
        <row>
         <entry><literal>SQL_ASCII</literal></entry>
         <entry><acronym>ASCII</acronym></entry>
        </row>
        <row>
         <entry><literal>BIG5</literal></entry>
         <entry>Chinese</entry>
         <entry>Aliases: WIN950, Windows950</entry>
        </row>
        <row>
         <entry><literal>EUC_JP</literal></entry>
         <entry>Japanese <acronym>EUC</></entry>
        </row>
        <row>
         <entry><literal>EUC_CN</literal></entry>
         <entry>Chinese <acronym>EUC</></entry>
        </row>
        <row>
         <entry><literal>EUC_KR</literal></entry>
         <entry>Korean <acronym>EUC</></entry>
        </row>
        <row>
         <entry><literal>JOHAB</literal></entry>
         <entry>Korean <acronym>EUC</> (Hangul-based)</entry>
        </row>
        <row>
         <entry><literal>EUC_TW</literal></entry>
         <entry>Taiwan <acronym>EUC</acronym></entry>
        </row>
        <row>
         <entry><literal>GBK</literal></entry>
         <entry>Chinese <acronym>EUC</acronym></entry>
         <entry>Aliases: WIN936, Windows936</entry>
        </row>
        <row>
         <entry><literal>GB18030</literal></entry>
         <entry>Chinese </entry>
        </row>
        <row>
         <entry><literal>UTF8</literal></entry>
         <entry>UTF-8 (Unicode, 8-bit)</entry>
         <entry>Aliases: Unicode</entry>
        </row>
        <row>
         <entry><literal>MULE_INTERNAL</literal></entry>
         <entry>Mule internal code (Multi-lingual Emacs)</entry>
        </row>
        <row>
         <entry><literal>LATIN1</literal></entry>
         <entry>ISO 8859-1/<acronym>ECMA</> 94 (Western European)</entry>
         <entry>Aliases: ISO88591</entry>
        </row>
        <row>
         <entry><literal>LATIN2</literal></entry>
         <entry>ISO 8859-2/<acronym>ECMA</> 94 (Central European)</entry>
         <entry>Aliases: ISO88592</entry>
        </row>
        <row>
         <entry><literal>LATIN3</literal></entry>
         <entry>ISO 8859-3/<acronym>ECMA</> 94 (South European)</entry>
         <entry>Aliases: ISO88593</entry>
        </row>
        <row>
         <entry><literal>LATIN4</literal></entry>
         <entry>ISO 8859-4/<acronym>ECMA</> 94 (North European)</entry>
         <entry>Aliases: ISO88594</entry>
        </row>
        <row>
         <entry><literal>LATIN5</literal></entry>
         <entry>ISO 8859-9/<acronym>ECMA</> 128 (Turkish)</entry>
         <entry>Aliases: ISO88599</entry>
        </row>
        <row>
         <entry><literal>LATIN6</literal></entry>
         <entry>ISO 8859-10/<acronym>ECMA</> 144 (Nordic)</entry>
         <entry>Aliases: ISO885910</entry>
        </row>
        <row>
         <entry><literal>LATIN7</literal></entry>
         <entry>ISO 8859-13 (Baltic)</entry>
         <entry>Aliases: ISO885913</entry>
        </row>
        <row>
         <entry><literal>LATIN8</literal></entry>
         <entry>ISO 8859-14 (Celtic)</entry>
         <entry>Aliases: ISO885914</entry>
        </row>
        <row>
         <entry><literal>LATIN9</literal></entry>
         <entry>ISO 8859-15 (LATIN1 with Euro and accents)</entry>
         <entry>Aliases: ISO885915</entry>
        </row>
        <row>
         <entry><literal>LATIN10</literal></entry>
         <entry>ISO 8859-16/<acronym>ASRO</> SR 14111 (Romanian)</entry>
         <entry>Aliases: ISO885916</entry>
        </row>
        <row>
         <entry><literal>ISO_8859_5</literal></entry>
         <entry>ISO 8859-5/<acronym>ECMA</> 113 (Latin/Cyrillic)</entry>
        </row>
        <row>
         <entry><literal>ISO_8859_6</literal></entry>
         <entry>ISO 8859-6/<acronym>ECMA</> 114 (Latin/Arabic)</entry>
        </row>
        <row>
         <entry><literal>ISO_8859_7</literal></entry>
         <entry>ISO 8859-7/<acronym>ECMA</> 118 (Latin/Greek)</entry>
        </row>
        <row>
         <entry><literal>ISO_8859_8</literal></entry>
         <entry>ISO 8859-8/<acronym>ECMA</> 121 (Latin/Hebrew)</entry>
        </row>
        <row>
         <entry><literal>KOI8</literal></entry>
         <entry><acronym>KOI</acronym>8-R(U) (Cyrillic)</entry>
         <entry>Aliases: KOI8</entry>
        </row>
        <row>
         <entry><literal>SJIS</literal></entry>
         <entry>SJIS (Japanese)</entry>
         <entry>Aliases: Mskanji, ShiftJIS, WIN932, Windows932</entry>
        </row>
        <row>
         <entry><literal>UHC</literal></entry>
         <entry>Unified Hangul Code (Korean)</entry>
         <entry>Aliases: WIN949, Windows949</entry>
        </row>
        <row>
         <entry><literal>WIN866</literal></entry>
         <entry>Windows CP866 (Cyrillic)</entry>
         <entry>Aliases: ALT</entry>
        </row>
        <row>
         <entry><literal>WIN874</literal></entry>
         <entry>Windows CP874 (Thai)</entry>
        </row>
        <row>
         <entry><literal>WIN1250</literal></entry>
         <entry>Windows CP1250 (Central European)</entry>
        </row>
        <row>
         <entry><literal>WIN1251</literal></entry>
         <entry>Windows CP1251 (Cyrillic)</entry>
         <entry>Aliases: WIN</entry>
        </row>
        <row>
         <entry><literal>WIN1256</literal></entry>
         <entry>Windows CP1256 (Arabic)</entry>
        </row>
        <row>
         <entry><literal>WIN1258</literal></entry>
         <entry>Windows CP1258 (Vietnamese)/<acronym>TCVN</>-5712</entry>
         <entry>Aliases: ABC, TCVN, TCVN5712, VSCII</entry>
        </row>
       </tbody>
      </tgroup>
     </table>

    <important>
     <para>
      Before <productname>PostgreSQL</> 7.2, <literal>LATIN5</>
      mistakenly meant ISO 8859-5.  From 7.2 on, <literal>LATIN5</>
      means ISO 8859-9. If you have a <literal>LATIN5</> database
      created on 7.1 or earlier and want to migrate to 7.2 or later,
      keep in mind that the name now refers to a different character set.
     </para>
    </important>

     <para>
      Not all <acronym>API</>s support all the listed character sets. For example, the
      <productname>PostgreSQL</>
      JDBC driver does not support <literal>MULE_INTERNAL</>, <literal>LATIN6</>,
      <literal>LATIN8</>, and <literal>LATIN10</>.
     </para>
    </sect2>
    
   <sect2>
    <title>Setting the Character Set</title>

    <para>
     <command>initdb</> defines the default character set
     for a <productname>PostgreSQL</productname> cluster. For example,

<screen>
initdb -E EUC_JP
</screen>

     sets the default character set (encoding) to
     <literal>EUC_JP</literal> (Extended Unix Code for Japanese).  You
     can use <option>--encoding</option> instead of
     <option>-E</option> if you prefer to type longer option strings.
     If no <option>-E</> or <option>--encoding</option> option is
     given, <literal>SQL_ASCII</> is used.
    </para>

    <para>
     You can create a database with a different character set:

<screen>
createdb -E EUC_KR korean
</screen>

     This will create a database named <literal>korean</literal> that
     uses the character set <literal>EUC_KR</literal>.  Another way to
     accomplish this is to use this SQL command:

<programlisting>
CREATE DATABASE korean WITH ENCODING 'EUC_KR';
</programlisting>

     The encoding for a database is stored in the system catalog
     <literal>pg_database</literal>.  You can see that by using the
     <option>-l</option> option or the <command>\l</command> command
     of <command>psql</command>.

<screen>
$ <userinput>psql -l</userinput>
            List of databases
   Database    |  Owner  |   Encoding    
---------------+---------+---------------
 euc_cn        | t-ishii | EUC_CN
 euc_jp        | t-ishii | EUC_JP
 euc_kr        | t-ishii | EUC_KR
 euc_tw        | t-ishii | EUC_TW
 mule_internal | t-ishii | MULE_INTERNAL
 regression    | t-ishii | SQL_ASCII
 template1     | t-ishii | EUC_JP
 test          | t-ishii | EUC_JP
 utf8          | t-ishii | UTF8
(9 rows)
</screen>
    </para>

    <important>
     <para>
      Although you can specify any encoding you want for a database, it is
      unwise to choose an encoding that is not what is expected by the locale
      you have selected.  The <literal>LC_COLLATE</literal> and
      <literal>LC_CTYPE</literal> settings imply a particular encoding,
      and locale-dependent operations (such as sorting) are likely to
      misinterpret data that is in an incompatible encoding.
     </para>

     <para>
      Since these locale settings are frozen by <command>initdb</>, the
      apparent flexibility to use different encodings in different databases
      of a cluster is more theoretical than real.  It is likely that these
      mechanisms will be revisited in future versions of
      <productname>PostgreSQL</productname>.
     </para>

     <para>
      One way to use multiple encodings safely is to set the locale to
      <literal>C</> or <literal>POSIX</> during <command>initdb</>, thus
      disabling any real locale awareness.
     </para>
    </important>
   </sect2>

   <sect2>
    <title>Automatic Character Set Conversion Between Server and Client</title>

    <para>
     <productname>PostgreSQL</productname> supports automatic
     character set conversion between server and client for certain
     character sets. The conversion information is stored in the
     <literal>pg_conversion</> system catalog. You can create a new
     conversion by using the SQL command <command>CREATE
     CONVERSION</command>. <productname>PostgreSQL</> comes with some
     predefined conversions. They are listed in <xref
     linkend="multibyte-translation-table">.
    </para>

     <table id="multibyte-translation-table">
      <title>Client/Server Character Set Conversions</title>
      <tgroup cols="2">
       <thead>
        <row>
         <entry>Server Character Set</entry>
         <entry>Available Client Character Sets</entry>
        </row>
       </thead>
       <tbody>
        <row>
         <entry><literal>SQL_ASCII</literal></entry>
         <entry><literal>SQL_ASCII</literal>, <literal>UTF8</literal>, <literal>MULE_INTERNAL</literal>
         </entry>
        </row>
        <row>
         <entry><literal>EUC_JP</literal></entry>
         <entry><literal>EUC_JP</literal>, <literal>SJIS</literal>,
         <literal>UTF8</literal>, <literal>MULE_INTERNAL</literal>
         </entry>
        </row>
        <row>
         <entry><literal>EUC_CN</literal></entry>
         <entry><literal>EUC_CN</literal>, <literal>UTF8</literal>, <literal>MULE_INTERNAL</literal>
         </entry>
        </row>
        <row>
         <entry><literal>EUC_KR</literal></entry>
         <entry><literal>EUC_KR</literal>, <literal>UTF8</literal>, <literal>MULE_INTERNAL</literal>
         </entry>
        </row>
        <row>
         <entry><literal>JOHAB</literal></entry>
         <entry><literal>JOHAB</literal>, <literal>UTF8</literal>
         </entry>
        </row>
        <row>
         <entry><literal>EUC_TW</literal></entry>
         <entry><literal>EUC_TW</literal>, <literal>BIG5</literal>,
         <literal>UTF8</literal>, <literal>MULE_INTERNAL</literal>
         </entry>
        </row>
        <row>  
         <entry><literal>LATIN1</literal></entry>
         <entry><literal>LATIN1</literal>, <literal>UTF8</literal>,
         <literal>MULE_INTERNAL</literal>
         </entry>
        </row>
        <row>  
         <entry><literal>LATIN2</literal></entry>
         <entry><literal>LATIN2</literal>, <literal>WIN1250</literal>,
         <literal>UTF8</literal>,
         <literal>MULE_INTERNAL</literal>
         </entry>
        </row>
        <row>  
         <entry><literal>LATIN3</literal></entry>
         <entry><literal>LATIN3</literal>, <literal>UTF8</literal>,
         <literal>MULE_INTERNAL</literal>
         </entry>
        </row>
        <row>  
         <entry><literal>LATIN4</literal></entry>
         <entry><literal>LATIN4</literal>, <literal>UTF8</literal>,
         <literal>MULE_INTERNAL</literal>
         </entry>
        </row>
        <row>  
         <entry><literal>LATIN5</literal></entry>
         <entry><literal>LATIN5</literal>, <literal>UTF8</literal>
         </entry>
        </row>
        <row>  
         <entry><literal>LATIN6</literal></entry>
         <entry><literal>LATIN6</literal>, <literal>UTF8</literal>,
         <literal>MULE_INTERNAL</literal>
         </entry>
        </row>
        <row>  
         <entry><literal>LATIN7</literal></entry>
         <entry><literal>LATIN7</literal>, <literal>UTF8</literal>,
         <literal>MULE_INTERNAL</literal>
         </entry>
        </row>
        <row>  
         <entry><literal>LATIN8</literal></entry>
         <entry><literal>LATIN8</literal>, <literal>UTF8</literal>,
         <literal>MULE_INTERNAL</literal>
         </entry>
        </row>
        <row>  
         <entry><literal>LATIN9</literal></entry>
         <entry><literal>LATIN9</literal>, <literal>UTF8</literal>,
         <literal>MULE_INTERNAL</literal>
         </entry>
        </row>
        <row>  
         <entry><literal>LATIN10</literal></entry>
         <entry><literal>LATIN10</literal>, <literal>UTF8</literal>,
         <literal>MULE_INTERNAL</literal>
         </entry>
        </row>
        <row>
         <entry><literal>ISO_8859_5</literal></entry>
         <entry><literal>ISO_8859_5</literal>,
         <literal>UTF8</literal>,
         <literal>MULE_INTERNAL</literal>,
         <literal>WIN1251</literal>,
         <literal>WIN866</literal>,
         <literal>KOI8</literal>
         </entry>
        </row>
        <row>
         <entry><literal>ISO_8859_6</literal></entry>
         <entry><literal>ISO_8859_6</literal>,
         <literal>UTF8</literal>
         </entry>
        </row>
        <row>
         <entry><literal>ISO_8859_7</literal></entry>
         <entry><literal>ISO_8859_7</literal>,
         <literal>UTF8</literal>
         </entry>
        </row>
        <row>
         <entry><literal>ISO_8859_8</literal></entry>
         <entry><literal>ISO_8859_8</literal>,
         <literal>UTF8</literal>
         </entry>
        </row>
        <row>
         <entry><literal>UTF8</literal></entry>
         <entry>
         <literal>EUC_JP</literal>, <literal>SJIS</literal>, 
         <literal>EUC_KR</literal>, <literal>UHC</literal>, <literal>JOHAB</literal>,
         <literal>EUC_CN</literal>, <literal>GBK</literal>,
         <literal>EUC_TW</literal>, <literal>BIG5</literal>, 
         <literal>LATIN1</literal> to <literal>LATIN10</literal>, 
         <literal>ISO_8859_5</literal>, 
         <literal>ISO_8859_6</literal>,
         <literal>ISO_8859_7</literal>, 
         <literal>ISO_8859_8</literal>, 
         <literal>WIN1251</literal>, <literal>WIN866</literal>, 
         <literal>KOI8</literal>, 
         <literal>WIN1256</literal>,
         <literal>WIN1258</literal>,
         <literal>WIN874</literal>,
         <literal>GB18030</literal>,
         <literal>WIN1250</literal>
         </entry>
        </row>
        <row>
         <entry><literal>MULE_INTERNAL</literal></entry>
         <entry><literal>EUC_JP</literal>, <literal>SJIS</literal>, <literal>EUC_KR</literal>, <literal>EUC_CN</literal>, 
          <literal>EUC_TW</literal>, <literal>BIG5</literal>, <literal>LATIN1</literal> to <literal>LATIN5</literal>, 
          <literal>WIN1251</literal>, <literal>WIN866</literal>,
          <literal>WIN1250</literal>,
          <literal>ISO_8859_5</literal>, <literal>KOI8</literal></entry>
        </row>
        <row>
         <entry><literal>KOI8</literal></entry>
         <entry><literal>ISO_8859_5</literal>, <literal>WIN1251</literal>, 
         <literal>WIN866</literal>, <literal>KOI8</literal>,
         <literal>UTF8</literal>, <literal>MULE_INTERNAL</literal>
         </entry>
        </row>
        <row>
         <entry><literal>WIN866</literal></entry>
         <entry><literal>ISO_8859_5</literal>, <literal>WIN1251</literal>, 
         <literal>WIN866</literal>, <literal>KOI8</literal>,
         <literal>UTF8</literal>, <literal>MULE_INTERNAL</literal>
         </entry>
        </row>
        <row>
         <entry><literal>WIN874</literal></entry>
         <entry><literal>WIN874</literal>,
         <literal>UTF8</literal>
         </entry>
        </row>
        <row>
         <entry><literal>WIN1250</literal></entry>
         <entry><literal>LATIN2</literal>, <literal>WIN1250</literal>,
         <literal>UTF8</literal>, <literal>MULE_INTERNAL</literal>
         </entry>
        </row>
        <row>
         <entry><literal>WIN1251</literal></entry>
         <entry><literal>ISO_8859_5</literal>, <literal>WIN1251</literal>, 
         <literal>WIN866</literal>, <literal>KOI8</literal>,
         <literal>UTF8</literal>, <literal>MULE_INTERNAL</literal>
         </entry>
        </row>
        <row>
         <entry><literal>WIN1256</literal></entry>
         <entry><literal>WIN1256</literal>,
         <literal>UTF8</literal>
         </entry>
        </row>
        <row>
         <entry><literal>WIN1258</literal></entry>
         <entry><literal>WIN1258</literal>,
         <literal>UTF8</literal>
         </entry>
        </row>
       </tbody>
      </tgroup>
     </table>

    <para>
     To enable the automatic character set conversion, you have to
     tell <productname>PostgreSQL</productname> the character set
     (encoding) you would like to use in the client. There are several
     ways to accomplish this:

     <itemizedlist>
      <listitem>
       <para>
        Using the <command>\encoding</command> command in
        <application>psql</application>.
        <command>\encoding</command> allows you to change client
        encoding on the fly. For
        example, to change the encoding to <literal>SJIS</literal>, type:

<programlisting>
\encoding SJIS
</programlisting>
       </para>
      </listitem>

      <listitem>
       <para>
        Using <application>libpq</> functions.
        <command>\encoding</command> itself calls
        <function>PQsetClientEncoding()</function> to do this.

<synopsis>
int PQsetClientEncoding(PGconn *<replaceable>conn</replaceable>, const char *<replaceable>encoding</replaceable>);
</synopsis>

        where <replaceable>conn</replaceable> is a connection to the server,
        and <replaceable>encoding</replaceable> is the encoding you
        want to use. If the function successfully sets the encoding, it returns 0,
        otherwise -1. The current encoding for this connection can be determined by
        using:

<synopsis>
int PQclientEncoding(const PGconn *<replaceable>conn</replaceable>);
</synopsis>

        Note that it returns the encoding ID, not a symbolic string
        such as <literal>EUC_JP</literal>. To convert an encoding ID to an encoding name, you
        can use:

<synopsis>
char *pg_encoding_to_char(int <replaceable>encoding_id</replaceable>);
</synopsis>
       </para>
      </listitem>

      <listitem>
       <para>
        Using <command>SET client_encoding TO</command>.

        Setting the client encoding can be done with this SQL command:

<programlisting>
SET CLIENT_ENCODING TO '<replaceable>value</>';
</programlisting>

        You can also use the standard SQL syntax <literal>SET NAMES</literal> for this purpose:

<programlisting>
SET NAMES '<replaceable>value</>';
</programlisting>

        To query the current client encoding:

<programlisting>
SHOW client_encoding;
</programlisting>

        To return to the default encoding:

<programlisting>
RESET client_encoding;
</programlisting>
       </para>
      </listitem>

      <listitem>
       <para>
        Using <envar>PGCLIENTENCODING</envar>. If the environment variable
        <envar>PGCLIENTENCODING</envar> is defined in the client's
        environment, that client encoding is automatically selected
        when a connection to the server is made.  (This can
        subsequently be overridden using any of the other methods
        mentioned above.)
       </para>
      </listitem>

      <listitem>
      <para>
       Using the configuration variable <xref
       linkend="guc-client-encoding">. If the
       <varname>client_encoding</> variable is set, that client
       encoding is automatically selected when a connection to the
       server is made.  (This can subsequently be overridden using any
       of the other methods mentioned above.)
       </para>
      </listitem>

     </itemizedlist>
    </para>
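    <para>
     As a rough illustration of how the two connection-time settings in
     the list above interact, the following C sketch picks the encoding
     in effect immediately after connecting. It is not
     <application>libpq</> code: the function name
     <function>initial_client_encoding</function> is hypothetical, and
     the sketch assumes that a non-empty
     <envar>PGCLIENTENCODING</envar> value takes precedence over the
     server's <varname>client_encoding</> setting and that an empty
     value is treated as unset.
    </para>

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of libpq: choose the client encoding
 * that applies right after a connection is made.
 *
 * Assumption: a non-empty PGCLIENTENCODING value (env_value) wins over
 * the server's client_encoding setting (server_default); NULL or ""
 * means the environment variable is not set.
 */
static const char *
initial_client_encoding(const char *env_value, const char *server_default)
{
    if (env_value != NULL && env_value[0] != '\0')
        return env_value;
    return server_default;
}
```

    <para>
     A caller would typically pass
     <literal>getenv("PGCLIENTENCODING")</literal> as the first
     argument. Whatever this returns is only the starting point: a later
     <command>SET client_encoding</command> or
     <function>PQsetClientEncoding()</function> call still overrides it.
    </para>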

    <para>
     If a particular character cannot be converted (suppose you chose
     <literal>EUC_JP</literal> for the server and
     <literal>LATIN1</literal> for the client; then some Japanese
     characters have no <literal>LATIN1</literal> equivalent), it is
     replaced by its hexadecimal byte values in parentheses, e.g.,
     <literal>(826C)</literal>.
    </para>
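    <para>
     The parenthesized form can be sketched in C as follows. This only
     illustrates the output format described above, not the server's
     actual conversion code; the function name
     <function>format_unconvertible</function> is hypothetical.
    </para>

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical helper: write the bytes of one unconvertible character
 * as uppercase hexadecimal digits enclosed in parentheses.
 * Returns 0 on success, -1 if the output buffer is too small.
 */
static int
format_unconvertible(const unsigned char *bytes, size_t len,
                     char *out, size_t outsize)
{
    /* '(' + two hex digits per byte + ')' + terminating NUL */
    size_t needed = 2 * len + 3;
    size_t i;
    char  *p = out;

    if (outsize < needed)
        return -1;

    *p++ = '(';
    for (i = 0; i < len; i++)
    {
        /* snprintf writes two digits plus a NUL that the next step overwrites */
        snprintf(p, 3, "%02X", (unsigned int) bytes[i]);
        p += 2;
    }
    *p++ = ')';
    *p = '\0';
    return 0;
}
```

    <para>
     Called with the two bytes <literal>0x82</literal> and
     <literal>0x6C</literal>, the function writes
     <literal>(826C)</literal> into the buffer.
    </para>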
   </sect2>

   <sect2>
    <title>Further Reading</title>

    <para>
     These are good starting points for learning about the various
     encoding systems.

     <variablelist>
      <varlistentry>
       <term><ulink url="http://www.i18ngurus.com/docs/984813247.html"></ulink></term>

       <listitem>
        <para>
         An extensive collection of documents about character sets, encodings,
         and code pages.
        </para>
       </listitem>
      </varlistentry>

      <varlistentry>
       <term><ulink url="ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf"></ulink></term>

       <listitem>
        <para>
         Detailed explanations of <literal>EUC_JP</literal>,
         <literal>EUC_CN</literal>, <literal>EUC_KR</literal>, and
         <literal>EUC_TW</literal> appear in section 3.2.
        </para>
       </listitem>
      </varlistentry>

      <varlistentry>
       <term><ulink url="http://www.unicode.org/"></ulink></term>

       <listitem>
        <para>
         The web site of the Unicode Consortium.
        </para>
       </listitem>
      </varlistentry>

      <varlistentry>
       <term>RFC 3629</term>

       <listitem>
        <para>
         <acronym>UTF</acronym>-8 is defined here. (RFC 3629 supersedes
         the earlier definitions in RFC 2279 and RFC 2044.)
        </para>
       </listitem>
      </varlistentry>
     </variablelist>
    </para>
   </sect2>

  </sect1>

</chapter>

<!-- Keep this comment at the end of the file
Local variables:
mode:sgml
sgml-omittag:nil
sgml-shorttag:t
sgml-minimize-attributes:nil
sgml-always-quote-attributes:t
sgml-indent-step:1
sgml-indent-data:t
sgml-parent-document:nil
sgml-default-dtd-file:"./reference.ced"
sgml-exposed-tags:nil
sgml-local-catalogs:("/usr/lib/sgml/catalog")
sgml-local-ecat-files:nil
End:
-->