tsearch2-guide.html 38.5 KB
Newer Older
Teodor Sigaev's avatar
Teodor Sigaev committed
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<link type="text/css" rel="stylesheet" href="/~megera/postgres/gist/tsearch/tsearch.css">
<title>tsearch2 guide</title>
</head>
<body>
<h1 align=center>The tsearch2 Guide</h1>

<p align=center>
Brandon Craig Rhodes<br>30 June 2003
<p>
This Guide introduces the reader to the PostgreSQL tsearch2 module,
version&nbsp;2.
More formal descriptions of the module's types and functions
are provided in the <a href="tsearch2-ref.html">tsearch2 Reference</a>,
which is a companion to this document.
You can retrieve a beta copy of the tsearch2 module from the
<a href="http://www.sai.msu.su/~megera/postgres/gist/">GiST for PostgreSQL</a>
page &mdash; look under the section entitled <i>Development History</i>
for the current version.
<p>
First we will examine the <tt>tsvector</tt> and <tt>tsquery</tt> types
and how they are used to search documents;
next, we will use them to build a simple search engine in&nbsp;SQL;
and finally, we will study the internals of document conversion
and how you might tune the internals to accommodate various searching needs.
<p>
Once you have tsearch2 working with PostgreSQL,
you should be able to run the examples here exactly as they are typed.
<p>
<hr>
<h2>Table of Contents</h2>
<blockquote>
<a href="#vectors_queries">Vectors and Queries</a><br>
<a href="#simple_search">A Simple Search Engine</a><br>
<a href="#weights">Ranking and Position Weights</a><br>
<a href="#casting">Casting Vectors and Queries</a><br>
<a href="#parsing_lexing">Parsing and Lexing</a><br>
</blockquote>

<hr>

<h2><a name=vectors_queries>Vectors and Queries</a></h2>

<blockquote>
<i>This section introduces
the two data types upon which tsearch2 search engines are based,
and illustrates their interaction using the simplest possible case.
The complex examples we present later on
are merely variations and elaborations of this basic mechanism.</i>
</blockquote>
<p>
The tsearch2 module allows you to index documents by the words they contain,
and then perform very efficient searches
for documents that contain a given combination of words.
Preparing your document index involves two steps:
<ul>
<li><b>Making a list of the words each document contains.</b>
 You must reduce each document to a <tt>tsvector</tt>
 which lists each word that appears in the document.
 This process offers many options,
 because there is no requirement
 that you must copy words into the vector
 exactly as they appear in the document.
 For example,
 many developers omit frequent and content-free <b>stop words</b>
 like <i>the</i> to reduce the size of their index;
 others reduce different forms of the same word
 (<i>forked</i>, <i>forking</i>, <i>forks</i>)
 to a common form (<i>fork</i>)
 to make search results independent of tense and case.
 Because words are very often stored in a modified form,
 we use the special term <b>lexemes</b>
 for the word forms we actually store in the vector.
<li><b>Creating an index of the documents by lexeme.</b>
 This is managed automatically by tsearch2
 when you creat a <tt>gist()</tt> index
 on the <tt>tsvector</tt> column of a table,
 which implements a form of the Berkeley
 <a href="http://gist.cs.berkeley.edu/"><i>Generalized Search Tree</i></a>.
</ul>
Once your documents are indexed,
performing a search involves:
<ul>
<li><b>Reducing the search terms to lexemes.</b>
 You must express each search you want to perform
 as a <tt>tsquery</tt> specifying a boolean combination of lexemes.
 Note that tsearch2 only finds <i>exact</i> matches
 between the lexemes in your query and the ones in each vector &mdash;
 even capitalization counts as a difference
 (which is why all lexemes are usually kept lowercase).
 So you must process search words the same way you processed document words;
 if <i>forking</i> became <i>fork</i> in the document's <tt>tsvector</tt>,
 then the search term <i>forking</i> must also become <i>fork</i>
 or the search will not find the document.
<li><b>Retrieving the documents that match the query.</b>
 Running a <tt>SELECT</tt> ... <tt>WHERE</tt>
 <tt><i>query</i></tt> <tt>@@</tt> <tt><i>vector</i></tt>
 on the table with the <tt><i>vector</i></tt> column
 will return the documents that match your query.
<li><b>Presenting your results.</b>
 This final stage offers as many options
 as turning documents into vectors.
 You can order documents by how well they matched the search terms;
 create a headline for each document
 showing some of the phrases in which it uses the search terms;
 and restrict the number of results retrieved.
 You will of course want some way to identify each document,
 so the user can ask for the full text of the ones he wants to read.
</ul>
And beyond deciding upon rules for turning documents into vectors
and for presenting search results to users,
you have to decide <i>where</i> to perform these operations &mdash;
whether one database server
will parse documents, perform searches, and prepare search results,
or whether to spread the load of these operations across several machines.
These are complicated design issues
which we will explore later;
in this section and the next,
we will illustrate what can be accomplished
using a single database server.
<p>
The <tt>default</tt> tsearch2 configuration,
which we will learn more about later,
provides a good example of a process for reducing documents to vectors:

<pre>
=# <b>SELECT set_curcfg('default')</b>
=# <b>SELECT to_tsvector('The air smells of sea water.')</b>
             to_tsvector             
-------------------------------------
 'air':2 'sea':5 'smell':3 'water':6
(1 row)
</pre>

Note the complex relationship between this document and its vector.
The vector lists only words from the document &mdash;
spaces and punctuation have disappeared.
Common words like <i>the</i> and <i>of</i> have been eliminated.
The&nbsp;<i>-s</i> that makes <i>smells</i> a plural has been removed,
leaving a lexeme that represents the word in its simplest form.
And finally,
though the vector remembers the positions in which each word appeared,
it does not store the lexemes in that order.
<p>
Keeping word positions in your vectors is optional, by the way.
The positions are necessary for the tsearch2 ranking functions,
which you can use to prioritize documents
based on how often each document uses the search terms
and whether they appear in close proximity.
But if you do not perform ranking,
or use your own process that ignores the word positions stored in the vector,
then you can save space by stripping them from your vectors:

<pre>
=# <b>SELECT strip(to_tsvector('The air smells of sea water.'))</b>
            strip            
-----------------------------
 'air' 'sea' 'smell' 'water'
(1 row)
</pre>

Now that we have a procedure for creating vectors,
we can build an indexed table of vectors very simply:

<pre>
=# <b>CREATE TABLE vectors ( vector tsvector )</b>
=# <b>CREATE INDEX vector_index ON vectors USING gist(vector)</b>
=# <b>INSERT INTO vectors VALUES (to_tsvector('The path forks here'))</b>
=# <b>INSERT INTO vectors VALUES (to_tsvector('A crawl leads west'))</b>
=# <b>INSERT INTO vectors VALUES (to_tsvector('The left fork leads northeast'))</b>
=# <b>SELECT * FROM vectors</b>
                  vector                  
------------------------------------------
 'fork':3 'path':2
 'lead':3 'west':4 'crawl':2
 'fork':3 'lead':4 'left':2 'northeast':5
(3 rows)
</pre>

Now we can search this collection of document vectors
using the <tt>@@</tt> operator and a <tt>tsquery</tt>
that specifies the combination of lexemes we are looking for.
Note that while vectors simply list lexemes,
queries always combine them with the operators
&lsquo;<tt>&amp;</tt>&rsquo;&nbsp;and,
&lsquo;<tt>|</tt>&rsquo;&nbsp;or,
and &nbsp;&lsquo;<tt>!</tt>&rsquo;&nbsp;not,
plus parentheses for grouping.
Some examples of the query syntax:
<table align=center>
<tr>
 <td>&lsquo;find documents with the word <i>forks</i> in them&rsquo;<br>
 <td><tt>'forks'</tt>
<tr>
 <td>&lsquo;... with both <i>forks</i> and <i>leads</i>&rsquo;<br>
 <td><tt>'forks & leads'</tt>
<tr>
 <td>&lsquo;... with either <i>forks</i> or <i>leads</i>&rsquo;<br>
 <td><tt>'forks | leads'</tt>
<tr>
 <td>&lsquo;... with either <i>forks</i> or <i>leads</i>,
  but without <i>crawl</i>&rsquo;<br>
 <td><tt>'(forks|leads) & !crawl'</tt>
</table>
The tsearch2 module
provides a <tt>to_tsquery()</tt> function for creating queries
that uses the same process as <tt>to_tsvector()</tt> uses
to reduce words to lexemes.
For instance,
it will remove the&nbsp;<i>-s</i> from the plurals in the last example above:

<pre>
=# <b>SELECT to_tsquery('(leads|forks) & !crawl')</b>
           to_tsquery           
--------------------------------
 ( 'lead' | 'fork' ) & !'crawl'
(1 row)
</pre>

Again,
this is critically important because the search operator <tt>@@</tt>
only finds <i>exact</i> matches
between the words in a query and the words in a vector;
if the document vector lists the lexeme <i>fork</i>
but the query looks for the plural form <i>forks</i>,
the query would not match that document.
Thanks to the symmetry between our process
for producing vectors and queries, however,
the above searches return correct results:

<pre>
=# <b>SELECT * FROM vectors WHERE vector @@ to_tsquery('(leads|forks) & !crawl')</b>
                  vector                  
------------------------------------------
 'fork':3 'path':2
 'fork':3 'lead':4 'left':2 'northeast':5
(2 rows)
</pre>

You may want to try the other queries shown above,
and perhaps invent some of your own.
<p>
You should not include stop words in a query,
since you cannot search for words you have discarded.
If you throw out the word <i>the</i> when building vectors, for example,
your index will obviously not know which documents included it.
The <tt>to_tsquery()</tt> function will automatically detect this
and give you an error to prevent this mistake:

<pre>
=# <b>SELECT to_tsquery('the')</b>
NOTICE:  Query contains only stopword(s) or doesn't contain lexem(s), ignored
 to_tsquery 
------------
 
(1 row)
</pre>

But if you every build vectors and queries using your own routines,
a possibility we will discuss later,
then you will need to enforce this rule yourself.

<blockquote><i>
Now that you understand how vectors and queries work together,
you are prepared to tackle many additional topics:
how to distribute searching across many servers;
how to customize the process
by which tsearch2 turns documents and queries into lexemes,
or use a process of your own;
and how to sort and display search results to your users.
But before discussing these detailed questions,
we will build a simple search engine
to see how easily its basic features work together.
</i></blockquote>

<h2><a name=simple_search>A Simple Search Engine</a></h2>

<blockquote><i>
In this section we build a simple search engine out of SQL functions
that use the vector and query types described in the previous section.
While this example is simpler
than a search engine that has to interface with the outside world,
it will illustrate the basic principles of building a search engine,
and better prepare you for developing your own.
</i></blockquote>
Building a search engine involves only a few improvements
upon the rudimentary vector searches described in the last section.
<ul>
<li>Because the user wants to read documents, not vectors,
 you must provide some way
 for the full text of each document to be accessed &mdash;
 either by storing the entire text of each document in the database,
 or storing an identifier
 like a URL, file name, or document routing number
 that lets you fetch the document from other storage.
<li>You can make it easier for user interface code to refer to each document
 by providing a unique identifier for each document,
 perhaps with a <tt>SERIAL</tt> column.
<li>Search results should be ordered by relevance.
 If you leave word positions in your vectors,
 you can either have PostgreSQL <tt>ORDER</tt> your results
 <tt>BY</tt> a ranking function,
 or you can fetch the vectors yourself and perform your own sort.
 If you choose to ignore word positions or strip them from your vectors,
 you will have to determine relevance yourself,
 using either the full text of the document
 or other information about each document you may possess.
<li>For each document returned by a search,
 you will usually want to display a summary called a <i>headline</i>
 that shows short excerpts
 illustrating how the document uses the query words.
 Headlines are usually generated from the full text of the document,
 not from position information in the <tt>tsvector</tt>,
 since excerpts lacking stop words, punctuation, and suffixes
 would not be comprehensible.
 If you store the full text of each document in the database,
 headlines can be generated very simply by a tsearch2 function.
 If you store your documents elsewhere,
 then you will either have to transmit each document to the database
 every time you want to run the headline function on it,
 or use your own headline code outside of the database.
</ul>
<p>
We can easily construct a simple search engine
that accomplishes these goals.
First we build a table that, for each document,
stores a unique identifier, the full text of the document,
and its <tt>tsvector</tt>:

<pre>
=# <b>CREATE TABLE docs ( id SERIAL, doc TEXT, vector tsvector )</b>
=# <b>CREATE INDEX docs_index ON docs USING gist(vector);</b>
</pre>

Note that although searches will still work
on tables where you have neglected
to create a <tt>gist()</tt> index over your vectors,
they will run much more slowly
since they will have to compare the query
against every document vector in the table.
<p>
Because the table we have created
stores each document in two different ways &mdash;
both as text and as a vector &mdash;
our <tt>INSERT</tt> statements must provide the document in both forms.
While more advanced PostgreSQL programmers
might accomplish this with a database trigger or rule,
for this simple example we will use a small SQL function:

<pre>
=# <b>CREATE FUNCTION insdoc(text) RETURNS void LANGUAGE sql AS
  'INSERT INTO docs (doc, vector) VALUES ($1, to_tsvector($1));'</b>
</pre>

Now, by calling <tt>insdoc()</tt> several times,
we can populate our table with documents:

<pre>
=# <b>SELECT insdoc('A low crawl over cobbles leads inward to the west.')</b>
=# <b>SELECT insdoc('The canyon runs into a mass of boulders -- dead end.')</b>
=# <b>SELECT insdoc('You are crawling over cobbles in a low passage.')</b>
=# <b>SELECT insdoc('Cavernous passages lead east, north, and south.')</b>
=# <b>SELECT insdoc('To the east a low wide crawl slants up.')</b>
=# <b>SELECT insdoc('You are in the south side chamber.')</b>
=# <b>SELECT insdoc('The passage here is blocked by a recent cave-in.')</b>
=# <b>SELECT insdoc('You are in a splendid chamber thirty feet high.')</b>
</pre>

Now we can build a search function.
Its <tt>SELECT</tt> statement is based upon
the same <tt>@@</tt> operation illustrated in the previous section.
But instead of returning matching vectors,
we return for each document
its <tt>SERIAL</tt> identifier, so the user can retrieve it later;
a headline that illustrates its use of the search terms;
and a ranking with which we also order the results.
Our search operation can be coded as a single <tt>SELECT</tt> statement
returning its own kind of table row,
which we call a&nbsp;<tt>finddoc_t</tt>:

<pre>
=# <b>CREATE TYPE finddoc_t AS (id INTEGER, headline TEXT, rank REAL)</b>
=# <b>CREATE FUNCTION finddoc(text) RETURNS SETOF finddoc_t LANGUAGE sql AS '
   SELECT id, headline(doc, q), rank(vector, q)
     FROM docs, to_tsquery($1) AS q
     WHERE vector @@ q ORDER BY rank(vector, q) DESC'</b>
</pre>

This function is a rather satisfactory search engine.
Here is one example search,
after which the user fetches the top-ranking document itself;
with similar commands you can try queries of your own:

<pre>
=# <b>SELECT * FROM finddoc('passage|crawl')</b>
 id |                       headline                        | rank 
----+-------------------------------------------------------+------
  3 | &lt;b&gt;crawling&lt;/b&gt; over cobbles in a low &lt;b&gt;passage&lt;/b&gt;. | 0.19
  1 | &lt;b&gt;crawl&lt;/b&gt; over cobbles leads inward to the west.   |  0.1
  4 | &lt;b&gt;passages&lt;/b&gt; lead east, north, and south.          |  0.1
  5 | &lt;b&gt;crawl&lt;/b&gt; slants up.                               |  0.1
  7 | &lt;b&gt;passage&lt;/b&gt; here is blocked by a recent  cave-in.  |  0.1
(5 rows)
=# <b>SELECT doc FROM docs WHERE id = 3</b>
                       doc                       
-------------------------------------------------
 You are crawling over cobbles in a low passage.
(1 row)
</pre>

While by default the <tt>headline()</tt> function
surrounds matching words with <tt>&lt;b&gt;</tt> and <tt>&lt;/b&gt;</tt>
in order to distinguish them from the surrounding text,
you can provide options that change its behavior;
consult the tsearch2 Reference for more details about
<a href="tsearch2-ref.html#headlines">Headline Functions</a>.
<p>
Though a search may match hundreds or thousands of documents,
you will usually present only ten or twenty results to the user at a time.
This can be most easily accomplished
by limiting your query with a <tt>LIMIT</tt>
and an <tt>OFFSET</tt> clause &mdash;
to display results ten at a time, for example,
your would generate your first page of results
with <tt>LIMIT</tt> <tt>10</tt> <tt>OFFSET</tt> <tt>0</tt>,
your second page
with <tt>LIMIT</tt> <tt>10</tt> <tt>OFFSET</tt> <tt>10</tt>,
your third page
with <tt>LIMIT</tt> <tt>10</tt> <tt>OFFSET</tt> <tt>20</tt>,
and so forth.
There are two problems with this approach, however.
<p>
The first problem is the strain of running the query over again
for every page of results the user views.
For small document collections or lightly loaded servers,
this may not be a problem;
but the impact can be high
when a search must repeatedly rank and sort
the same ten thousand results
on an already busy server.
So instead of selecting only one page of results,
you will probably use <tt>LIMIT</tt> and <tt>OFFSET</tt>
to return a few dozen or few hundred results,
which you can cache and display to the user one page at a time.
Whether a result cache rewards your effort
will depend principally on the behavior of your users &mdash;
how often they even view the second page of results, for instance.
<p>
The second issue solved by caching involves consistency.
If the database is changing while the user browses their results,
then documents might appear and disappear as they page through them.
In some cases the user might even miss a particular result &mdash;
perhaps the one they were looking for &mdash;
if, say, its rank improves from 31th to 30th
after they load results 21&ndash;30 but before they view results 31&ndash;40.
While many databases are static or infrequently updated,
and will not present this problem,
users searching very dymanic document collections
might benefit from the stable results that caches yield.

<blockquote><i>
Having seen the features of a search engine
implemented entirely within the database,
we will learn about some specific tsearch2 features.
First we will look in more detail at document ranking.
</i></blockquote>

<h2><a name=weights>Ranking and Position Weights</a></h2>

<blockquote><i>
When we built our simple search engine,
we used the </i><tt>rank()</tt><i> function to order our results.
Here we describe tsearch2 ranking in more detail.
</i></blockquote>

There are two functions with which tsearch2 can rank search results.
They both use the lexeme positions listed in the <tt>tsvector</tt>,
so you cannot rank vectors
from which these have been removed with <tt>strip()</tt>.
The <tt>rank()</tt> function existed in older versions of OpenFTS,
and has the feature that you can assign different weights
to words from different sections of your document.
The <tt>rank_cd()</tt> uses a recent technique for weighting results
but does not allow different weight to be given
to different sections of your document.
<p>
Both ranking functions allow you to specify,
as an optional last argument,
whether you want their results <i>normalized</i> &mdash;
whether the rank returned should be adjusted for document length.
Specifying a last argument of <tt>0</tt> (zero) makes no adjustment;
<tt>1</tt> (one) divides the document rank
by the logarithm of the document length;
and <tt>2</tt> divides it by the plain length.
In all of these examples we omit this optional argument,
which is the same as specifying zero &mdash;
we are making no adjustment for document length.
<p>
The <tt>rank_cd()</tt> function uses an experimental measurement
called <i>cover density ranking</i> that rewards documents
when they make frequent use of the search terms
that are close together in the document.
You can read about the algorithm in more detail
in Clarke&nbsp;et&nbsp;al.,
 &ldquo;<a href="http://citeseer.nj.nec.com/clarke00relevance.html"
>Relevance Ranking for One to Three Term Queries</a>.&rdquo;
An optional first argument allows you to tune their formula;
for details
see the <a href="tsearch2-ref.html#ranking">section on ranking</a>
in the Reference.
<p>
The <tt>rank()</tt> function offers more flexibility
because it pays attention to the <i>weights</i>
with which you have labelled lexeme positions.
Currently tsearch2 supports four different weight labels:
<tt>'D'</tt>, the default weight;
and <tt>'A'</tt>, <tt>'B'</tt>, and <tt>'C'</tt>.
All vectors created with <tt>to_tsvector()</tt>
assign the weight <tt>'D'</tt> to each position,
which as the default is not displayed when you print a vector out.
<p>
If you want positions with weights other than <tt>'D'</tt>,
you have two options:
either you can author a vector directly through the <tt>::tsvector</tt>
casting operation,
as described in the following section,
which lets you give each position whichever weight you want;
or you can pass a vector through the <tt>setweight()</tt> function
which sets all of its position weights to a single value.
An example of the latter:


<pre>
=# <b>SELECT vector FROM docs WHERE id = 3</b>
                 vector                 
----------------------------------------
 'low':8 'cobbl':5 'crawl':3 'passag':9
(1 row)
=# <b>SELECT setweight(vector, 'A') FROM docs WHERE id = 3</b>
                 setweight                  
--------------------------------------------
 'low':8A 'cobbl':5A 'crawl':3A 'passag':9A
(1 row)
</pre>


Merely changing all of the weights in a vector is not very useful,
of course,
since this results still in all words having the same weight.
But if we parse different parts of a document separately,
giving each section its own weight,
and then concatenate the vectors of each part into a single vector,
the result can be very useful.
We can construct a simple example
in which document titles are given greater weight
that text in the body of the document:


<pre>
=# <b>CREATE TABLE tdocs ( id SERIAL, title TEXT, doc TEXT, vector tsvector )</b>
=# <b>CREATE INDEX tdocs_index ON tdocs USING gist(vector);</b>
=# <b>CREATE FUNCTION instdoc(text, text) RETURNS void LANGUAGE sql AS
  'INSERT INTO tdocs (title, doc, vector)
   VALUES ($1, $2, setweight(to_tsvector($1), ''A'') || to_tsvector($2));'</b>
</pre>


Now words from a document title will be weighted differently
than those in the main text
if we provide the title and body as separate arguments:


<pre>
=# <b>SELECT instdoc('Spendid Chamber',
 'The walls are frozen rivers of orange stone.')</b>
 instdoc 
---------
 
(1 row)
=# <b>SELECT vector FROM tdocs</b>
                                    vector                                    
------------------------------------------------------------------------------
 'wall':4 'orang':9 'river':7 'stone':10 'frozen':6 'chamber':2A 'spendid':1A
(1 row)
</pre>


Note that although the necessity is unusual,
you can constrain search terms
to only match words from certain sections
by following them with a colon
and a list of the sections in which the word can occur;
by default this list is <tt>'ABCD'</tt>
so that search terms match words from all sections.
For example,
here we search for a word both generally,
and then looking only for specific weights:


<pre>
=# <b>SELECT title, doc FROM tdocs WHERE vector @@ to_tsquery('spendid')</b>
      title      |                     doc                      
-----------------+----------------------------------------------
 Spendid Chamber | The walls are frozen rivers of orange stone.
(1 row)
=# <b>SELECT title, doc FROM tdocs WHERE vector @@ to_tsquery('spendid:A')</b>
      title      |                     doc                      
-----------------+----------------------------------------------
 Spendid Chamber | The walls are frozen rivers of orange stone.
(1 row)
=# <b>SELECT title, doc FROM tdocs WHERE vector @@ to_tsquery('spendid:D')</b>
 title | doc 
-------+-----
(0 rows)
</pre>




<blockquote><i>
Our examples so far use tsearch2 to parse our documents into vectors.
When your application needs absolute control over vector content,
you will want to use direct type casting,
which is described in the next section.
</i></blockquote>

<h2><a name=casting>Casting Vectors and Queries</a></h2>

<blockquote><i>
While tsearch2 has powerful and flexible ways
to process documents and turn them into document vectors,
you will sometimes want to parse documents on your own
and place the results directly in vectors.
Here we show you how.
</i></blockquote>

In the preceding examples,
we used the <tt>to_tsvector()</tt> function
when we needed a document's text reduced to a document vector.
We saw that the function stripped whitespace and punctuation,
eliminated common words,
and altered suffixes to reduce words to a common form.
While these operations are often desirable,
and while in the sections below
we will gain precise control over this process,
there are occasions on which
you want to avoid the changes that <tt>to_tsvector()</tt> makes to text
and specify explicitly the words that you want in your vectors.
Or you may want to create queries directly
rather than through <tt>to_tsquery()</tt>.
<p>
For example,
you may have already developed your own routine
for reducing your documents to searchable lexemes,
and do not want your carefully generated terms altered
by passing them through <tt>to_tsvector()</tt>.
Or you might be developing and debugging parsing routines of your own
that you are not ready to load into the database.
In either case,
you will find that direct insertion is easily accomplished
if you simply follow some simple rules.
<p>
Vectors are created directly
when you cast a string of whitespace separated lexemes
to the <tt>tsvector</tt> type:


<pre>
=# <b>select 'the only exit is the way you came in'::tsvector</b>
                     tsvector                     
--------------------------------------------------
 'in' 'is' 'the' 'way' 'you' 'came' 'exit' 'only'
(1 row)
</pre>


Notice that the conversion interpreted the string
simply as a list of lexemes to be included in the vector.
Their order was lost,
as was the number of times each lexeme appeared.
You must keep in mind that directly creating vectors with casting
is <i>not</i> an alternate means of parsing;
it is a way of directly entering lexemes into a vector <i>without</i> parsing.
<p>
Queries can also be created through casting,
if you separate lexemes with boolean operators
rather than with whitespace.
When creating your own vectors and queries,
remember that the search operator <tt>@@</tt>
finds only <i>exact</i> matches between query lexemes and vector lexemes
&mdash;
if they are not exactly the same string,
they will not be considered a match.
<p>
To include lexeme positions in your vector,
write the positions exactly the way tsearch2 displays them
when it prints vectors:
by following each lexeme with a colon
and a comma-separated list of integer positions.
If you list a lexeme more than once,
then all the positions listed for it are combined into a single list.
For example,
here are two ways of writing the same vector,
depending on whether you mention &lsquo;<tt>the</tt>&rsquo; twice
or combine its positions into a list yourself:


<pre>
=# <b>select 'the:1 only:2 exit:3 is:4 the:5 way:6 you:7 came:8 in:9'::tsvector</b>
                              tsvector                              
--------------------------------------------------------------------
 'in':9 'is':4 'the':1,5 'way':6 'you':7 'came':8 'exit':3 'only':2
(1 row)
=# <b>select 'the:1,5 only:2 exit:3 is:4 way:6 you:7 came:8 in:9'::tsvector</b>
                              tsvector                              
--------------------------------------------------------------------
 'in':9 'is':4 'the':1,5 'way':6 'you':7 'came':8 'exit':3 'only':2
(1 row)
</pre>


Things can get slightly tricky
if you want to include apostrophes, backslashes, or spaces
inside your lexemes
(wanting to include either of the latter would be unusual,
but they can be included if you follow the rules).
The main problem is that the apostrophe and backslash
are important <i>both</i> to PostgreSQL when it is interpreting a string,
<i>and</i> to the <tt>tsvector</tt> conversion function.
You may want to review section
<a href="http://www.postgresql.org/docs/view.php?version=7.3&idoc=0&file=sql-syntax.html#SQL-SYNTAX-STRINGS">1.1.2.1,
&ldquo;String Constants&rdquo;</a>
in the PostgreSQL documentation before proceeding.
<p>
When you cast strings directly into vectors:
<ul>
<li>The string is interpreted as a whitespace-separated list of lexemes,
 any of which can be suffixed with a colon and a list of positions.
<li>A lexeme can be quoted by preceding it with an apostrophe,
 in which case it runs until the next apostrophe;
 otherwise a lexeme ends with the first whitespace or colon encountered.
<li>Any character preceded by a backslash,
 including whitespace, the apostrophe, the colon, and the backslash itself,
 loses its normal meaning and is treated as a letter.
 Backslashes are effective
 both inside and outside of apostrophe-quoted lexemes.
<li>A lexeme can be suffixed with a list of positions
 by appending a colon and a comma-separated list of integers,
 each of which can itself be followed by a letter
 to designate a position weight
 (position weights are <a href="#weights">described below</a>).
</ul>

Here are some example strings,
showing the lexeme you want to insert
together with the string that the <tt>::tsvector</tt> operator
needs to see,
and how you would type that string at the PostgreSQL prompt:

<table align=center>
<tr>
<td><i>For the lexeme...</i>
<td><i>you need the string...</i>
<td><i>which you can type as:</i>
<tr>
<td><tt>nugget</tt>
<td><tt>nugget</tt>
<td><tt>'nugget'</tt>
<tr>
<td><tt>won't</tt>
<td><tt>won't</tt>
<td><tt>'won''t'</tt>
<tr>
<td><tt>pinin'</tt>
<td><tt>pinin'</tt>
<td><tt>'pinin'''</tt>
<tr>
<td><tt>'bout</tt>
<td><tt>\'bout</tt>
<td><tt>'\\''bout'</tt>
<tr>
<td><tt>white mist</tt>
<td><tt>white\ mist</tt>
<td><tt>'white\\ mist'</tt>
<tr>
<td align=right><tt><i>or:</i></tt>
<td><tt>'white mist'</tt>
<td><tt>'''white mist'''</tt>
<tr>
<td><tt>won't budge</tt>
<td><tt>won\'t\ budge</tt>
<td><tt>'won\\''t\\ budge'</tt>
<tr>
<td align=right><tt><i>or:</i></tt>
<td><tt>'won\'t budge'</tt>
<td><tt>'''won\\''t budge'''</tt>
<tr>
<td><tt>back\slashed</tt>
<td><tt>back\\slashed</tt>
<td><tt>'back\\\\slashed'</tt>
</table>

Remember to use the quoted quoting shown at the right
only when typing in strings as part of a PostgreSQL query.
If you are providing strings through a library
that automatically quotes them
or provides them in binary form to PostgreSQL,
then you can use the strings in the middle instead &mdash;
suitably quoted in the language you are using, of course.
<p>
Position weights are <a href="#weights">described below</a>
and can be written exactly as they will be displayed
when you select a weighted vector:

<pre>
=# <b>select 'weighty:1,3A trivial:2B,4'::tsvector</b>
           tsvector            
-------------------------------
 'trivial':2B,4 'weighty':1,3A
(1 row)
</pre>

<p>
Note that if you are composing SQL queries
in a scripting language like Perl or Python,
that itself considers quotes and backslashes special,
then you may have another quoting layer to deal with
on top of the two layers already shown above.
In such cases you may want to write a function
that performs the necessary quoting for you.

<blockquote><i>
Having seen how to create vectors of your own,
it is time to learn how the native tsearch2 parser
reduces documents to vectors.
</i></blockquote>

<h2><a name=parsing_lexing>Parsing and Lexing</a></h2>

<blockquote><i>
The previous section
described how you can bypass the parser provided by tsearch2
and populate your table of documents
with vectors of your own devising.
But for those interested in the native tsearch2 facilities,
we present here an overview of how it goes about
reducing documents to vectors.
</i></blockquote>

The <tt>to_tsvector()</tt> function reduces documents to vectors
in two stages.
First, a <i>parser</i> breaks the input document
into short sequences of text called <i>tokens</i>.
Each token is usually a word, space, or piece of punctuation,
though some parsers return larger and more exotic items
like HTML tags as single tokens.
Each token returned by the parser
is either discarded
or passed to a <i>dictionary</i> that converts it into a lexeme.
The resulting lexemes are collected into a vector and returned.
<p>
The choice of which parser and dictionaries <tt>to_tsvector()</tt> should use
is controlled by your choice of <i>configuration</i>.
The tsearch2 module comes with several configurations,
and you can define more of your own;
in fact the creation of a new configuration is illustrated below,
in the section on position weights.
<p>
To learn about parsing in more detail,
we will study this example:

<pre>
=# <b>select to_tsvector('default',
     'The walls extend upward for well over 100 feet.')</b>
                       to_tsvector                        
----------------------------------------------------------
 '100':8 'feet':9 'wall':2 'well':6 'extend':3 'upward':4
(1 row)
</pre>

Unlike the <tt>to_tsvector()</tt> calls used in the above examples,
this one specifies the <tt>'default'</tt> configuration explicitly.
When we called <tt>to_tsvector()</tt> in earlier examples
with only one argument,
it used the <i>current</i> configuration,
which is chosen automatically based on your <tt>LOCALE</tt>
if that locale is mentioned in the <tt>pg_ts_cfg</tt> table
(which is shown under the first bullet in the description below).
If your locale is not listed in the table,
your attempts to use the current configuration will return:

<pre>
ERROR:  Can't find tsearch2 config by locale
</pre>

You can always change the current configuration manually
by calling the <tt>set_curcfg()</tt> function
described in the section on
<a href="tsearch2-ref.html#configurations">Configurations</a>
in the Reference.
<p>
Each configuration serves as an index into two different tables:
in <tt>pg_ts_cfg</tt> it determines
which parser will break our text into tokens,
and in <tt>pg_ts_cfgmap</tt>
it directs each token to a dictionary for processing.
The steps in detail are:

<ul>
<li class=big>
<p>First, our text is parsed,
using the parser listed for our configuration in the <tt>pg_ts_cfg</tt> table.
We are using the <tt>'default'</tt> configuration,
so the table tells us to use the <tt>'default'</tt> parser:

<pre>
=# <b>SELECT * FROM pg_ts_cfg WHERE ts_name = 'default'</b>
 ts_name | prs_name | locale 
---------+----------+--------
 default | default  | C
(1 row)
</pre>

So our text will be parsed as though we had called:

<pre>
=# <b>select * from parse('default',
     'The walls extend upward for well over 100 feet.')</b>
</pre>

This breaks the text into a list of tokens
which are each labelled with an integer type:
<p align=center>
The<sub>1</sub>&diams;<sub>12</sub
>walls<sub>1</sub>&diams;<sub>12</sub
>extend<sub>1</sub>&diams;<sub>12</sub
>upward<sub>1</sub>&diams;<sub>12</sub
>for<sub>1</sub>&diams;<sub>12</sub
>well<sub>1</sub>&diams;<sub>12</sub
>over<sub>1</sub>&diams;<sub>12</sub
>100<sub>22</sub>&diams;<sub>12</sub
>feet<sub>1</sub>.<sub>12</sub>
<p>
Each word has been assigned type&nbsp;1;
each space (represented here by a diamond) and the period, type&nbsp;12;
and the number one hundred, type&nbsp;22.
We can retrieve the alias for each type
through the <tt>token_type</tt> function:

<pre>
=# <b>select * from token_type('default')
     where tokid = 1 or tokid = 12 or tokid = 22</b>
 tokid | alias |      descr       
-------+-------+------------------
     1 | lword | Latin word
    12 | blank | Space symbols
    22 | uint  | Unsigned integer
(3 rows)
</pre>


<li class=big>
Next, the tokens are assigned to dictionaries
by looking up their type aliases in <tt>pg_ts_cfgmap</tt>
to determine which dictionary should process each token.
Since we are using the <tt>'default'</tt> configuration:

<pre>
=# <b>select * from pg_ts_cfgmap where ts_name = 'default' and
      (tok_alias = 'lword' or tok_alias = 'blank' or tok_alias = 'uint')</b>
 ts_name | tok_alias | dict_name 
---------+-----------+-----------
 default | lword     | {en_stem}
 default | uint      | {simple}
(2 rows)
</pre>

Since this map provides no dictionary for <tt>blank</tt> tokens,
the spaces and period are simply discarded,
leaving nine tokens,
which are then numbered by their position:
<p align=center>
The<sup>1</sup>
walls<sup>2</sup>
extend<sup>3</sup>
upward<sup>4</sup>
for<sup>5</sup>
well<sup>6</sup>
over<sup>7</sup>
100<sup>8</sup>
feet<sup>9</sup>

<li class=big>
Finally, the words are reduced to lexemes by their respective dictionaries.
The <tt>100</tt> is submitted to the <tt>simple</tt> dictionary,
which returns tokens unaltered except for making them lowercase:

<pre>
=# <b>select lexize('simple', '100')</b>
 lexize 
--------
 {100}
(1 row)
</pre>

The other words are submitted to <tt>en_stem</tt>
which reduces each English word to a linguistic stem,
and then discards stems which belong to its list of stop words;
you can see the list of stop words
in the file whose path is in the <tt>dict_initoption</tt> field
of the <tt>pg_ts_dict</tt> table entry for <tt>en_stem</tt>.
The first three words of our text illustrate respectively
an <tt>en_stem</tt> stop word,
a word which <tt>en_stem</tt> alters by stemming,
and a word which <tt>en_stem</tt> leaves alone:

<pre>
=# <b>select lexize('en_stem', 'The')</b>
 lexize 
--------
 {}
(1 row)
=# <b>select lexize('en_stem', 'walls')</b>
 lexize 
--------
 {wall}
(1 row)
=# <b>select lexize('en_stem', 'extend')</b>
  lexize  
----------
 {extend}
(1 row)
</pre>

Once <tt>en_stem</tt> is done discarding stop words and stemming the rest,
we are left with:
<p align=center>
wall<sup>2</sup>
extend<sup>3</sup>
upward<sup>4</sup>
well<sup>6</sup>
100<sup>8</sup>
feet<sup>9</sup>
<p>
Which is precisely the result of the example that began this section.
</ul>
Query words are stemmed by the <tt>to_tsquery()</tt> function
using the same scheme to determine the dictionary for each token,
with the difference that the query parser recognizes as special
the boolean operators that separate query words.

</body>
</html>