Commit 092ed294 authored Nov 08, 2006 by Teodor Sigaev
New README, forgotten when docs was updated
parent 0c96e427
Showing 1 changed file with 167 additions and 166 deletions
contrib/tsearch2/README.tsearch2
 Tsearch2 - full text search extension for PostgreSQL
-[10][Online version] of this document is available
+[1]Online version of this document is available
-This module is sponsored by Delta-Soft Ltd., Moscow, Russia.
-Notice: This version is fully incompatible with old tsearch (V1),
-which was deprecated in 7.4 and obsoleted in 8.0.
-The Tsearch2 contrib module contains an implementation of a new data
-type tsvector - a searchable data type with indexed access. In a
-nutshell, tsvector is a set of unique words along with their
-positional information in the document, organized in a special
-structure optimized for fast access and lookup. Actually, each word
-entry, besides its position in the document, could have a weight
-attribute, describing importance of this word (at a specific) position
-in document. A set of bit-signatures of a fixed length, representing
-tsvectors, are stored in a search tree (developed using PostgreSQL
-GiST), which provides online update of full text index and fast query
-lookup. The module provides indexed access methods, queries,
-operations and supporting routines for the tsvector data type and easy
-conversion of text data to tsvector. Table driven configuration allows
-creation of custom configuration optimized for specific searches using
+Tsearch2 - is the full text engine, fully integrated into PostgreSQL
+RDBMS.
+Main features
+* Full online update
+* Supports multiple table driven configurations
+* flexible and rich linguistic support (dictionaries, stop words),
+  thesaurus
+* full multibyte (UTF-8) support
+* Sophisticated ranking functions with support of proximity and
+  structure information (rank, rank_cd)
+* Index support (GiST and Gin) with concurrency and recovery support
+* Rich query language with query rewriting support
+* Headline support (text fragments with highlighted search terms)
+* Ability to plug-in custom dictionaries and parsers
+* Template generator for tsearch2 dictionaries with [2]snowball
+  stemmer support
+* It is mature (5 years of development)
+Tsearch2, in a nutshell, provides FTS operator (contains) for the new
+data types, representing document (tsvector) and query (tsquery).
+Table driven configuration allows creation of custom searches using
 standard SQL commands.
-Configuration allows you to:
-* specify the type of lexemes to be indexed and the way they are
-  processed.
-* specify dictionaries to be used along with stop words recognition.
-* specify the parser used to process a document.
-See [11]Documentation Roadmap for links to documentation.
+tsvector is a searchable data type, representing document. It is a set
+of unique words along with their positional information in the
+document, organized in a special structure optimized for fast access
+and lookup. Each entry could be labelled to reflect its importance in
+document.
+tsquery is a data type for textual queries with support of boolean
+operators. It consists of lexemes (optionally labelled) with boolean
+operators between.
+Table driven configuration allows to specify:
+* parser, which used to break document onto lexemes
+* what lexemes to index and the way they are processed
+* dictionaries to be used along with stop words recognition.
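As a reading aid for the tsvector and tsquery description in the README text, the following is a minimal Python sketch, not tsearch2 code. It models a document as a mapping from each unique word to its positions and evaluates a boolean query against that mapping. The names to_vector and matches, the regular-expression tokenizer and the nested-tuple query form are all assumptions made for illustration; the parser, dictionaries, stop words and labels that the README mentions are left out.

```python
import re
from collections import defaultdict


def to_vector(text):
    """Map each unique lowercased word to its 1-based positions.

    A toy stand-in for the tsvector idea: unique words plus positions.
    """
    vector = defaultdict(list)
    words = re.findall(r"[a-z0-9]+", text.lower())
    for position, word in enumerate(words, start=1):
        vector[word].append(position)
    return dict(vector)


def matches(vector, query):
    """Evaluate a boolean query against a vector.

    A query is either a bare string (a lexeme) or a tuple:
    ("and", a, b, ...), ("or", a, b, ...) or ("not", a).
    """
    if isinstance(query, str):
        return query in vector
    op, *args = query
    if op == "and":
        return all(matches(vector, arg) for arg in args)
    if op == "or":
        return any(matches(vector, arg) for arg in args)
    if op == "not":
        return not matches(vector, args[0])
    raise ValueError(f"unknown operator: {op}")


doc = to_vector("The cat sat on the mat and the cat slept")
print(doc["cat"])                                    # positions of "cat"
print(matches(doc, ("and", "cat", ("not", "dog"))))  # cat AND NOT dog
```

The sketch keeps positions only to mirror the description; a real ranking function would use them, which is outside the scope of this illustration.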
 OpenFTS vs Tsearch2
-OpenFTS is a middleware between application and database, so it uses
-tsearch2 as a storage, while database engine is used as a query executor
-(searching). Everything else (parsing of documents, query processing,
-linguistics) carry outs on client side. That's why OpenFTS has its own
-configuration table (fts_conf) and works with its own set of dictionaries.
-OpenFTS is more flexible, because it could be used in multi-server
-architecture with separated machines for repository of documents
-(documents could be stored in file system), database and query engine.
+[3]OpenFTS is a middleware between application and database. OpenFTS
+uses tsearch2 as a storage and database engine as a query executor
+(searching). Everything else, i.e. parsing of documents, query
+processing, linguistics, carry outs on client side. That's why OpenFTS
+has its own configuration table (fts_conf) and works with its own set
+of dictionaries. OpenFTS is more flexible, because it could be used in
+multi-server architecture with separate machines for repository of
+documents (documents could be stored in filesystem), database and
+query engine.
+See [4]Documentation Roadmap for links to documentation.
 Authors
 * Oleg Bartunov <oleg@sai.msu.su>, Moscow, Moscow University, Russia
-* Teodor Sigaev <teodor@sigaev.ru>, Moscow, Delta-Soft Ltd.,Russia
+* Teodor Sigaev <teodor@sigaev.ru>, Moscow, Moscow University,Russia
 Contributors
 * Robert John Shepherd and Andrew J. Kopciuch submitted
   "Introduction to tsearch" (Robert - tsearch v1, Andrew - tsearch
   v2)
 * Brandon Craig Rhodes wrote "Tsearch2 Guide" and "Tsearch2
   Reference" and proposed new naming convention for tsearch V2
-Features Added with Tsearch2
-* Relevance ranking of search results
-* Table driven configuration
-* Morphology support (ispell dictionaries, snowball stemmers)
-* Headline support (text fragments with highlighted search terms)
-* Ability to plug-in custom dictionaries and parsers
-* Synonym dictionary
-* Generator of templates for dictionaries (built-in snowball stemmer
-  support)
-* Statistics of indexed words is available
+Sponsors
+* ABC Startsiden - compound words support
+* University of Mannheim for UTF-8 support (in 8.2)
+* jfg:networks ([5]http:www.jfg-networks.com/) for Gin - Generalized
+  Inverted index (in 8.2)
+* Georgia Public Library Service and LibLime, Inc. for Thesaurus
+  dictionary
+* PostGIS community - GiST Concurrency and Recovery
+The authors are grateful to the Russian Foundation for Basic Research
+and Delta-Soft Ltd., Moscow, Russia for support.
 Limitations
-* Lexeme should be not longer than 2048 bytes
-* The number of lexemes is limited by 2^32. Note, that actual
-  capacity of tsvector is depends on whether positional information
-  is stored or not.
-* tsvector - the size is limited by approximately 2^20 bytes.
-* tsquery - the number of entries (lexemes and operations) < 32768
-* Positional information
-    + maximal position of lexeme < 2^14 (16384)
-    + lexeme could have maximum 256 positions
+* Length of lexeme < 2K
+* Length of tsvector (lexemes + positions) < 1Mb
+* The number of lexemes < 4^32
+* 0< Positional information < 16383
+* No more than 256 positions per lexeme
+* The number of nodes ( lexemes + operations) in tsquery < 32768
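To make the per-lexeme limits concrete, here is a small hedged Python sketch that checks a {lexeme: [positions]} mapping against three of the listed bounds. It is not how tsearch2 enforces them. The constant names and the function limit_violations are hypothetical, the values are copied from the newer wording of the list, and treating each bound as strict or inclusive is an assumption, since the old and new wordings differ at the boundary (2^14 = 16384 versus 16383 for positions).

```python
MAX_LEXEME_BYTES = 2048          # "Length of lexeme < 2K"
MAX_POSITION = 16383             # "0< Positional information < 16383"
MAX_POSITIONS_PER_LEXEME = 256   # "No more than 256 positions per lexeme"


def limit_violations(vector):
    """Return descriptions of the limits that a vector exceeds.

    `vector` maps each lexeme (str) to a list of integer positions.
    An empty list means no limit in this sketch is violated.
    """
    problems = []
    for lexeme, positions in vector.items():
        size = len(lexeme.encode("utf-8"))
        if size >= MAX_LEXEME_BYTES:
            problems.append(f"lexeme of {size} bytes is too long")
        if len(positions) > MAX_POSITIONS_PER_LEXEME:
            problems.append(f"{lexeme!r} has {len(positions)} positions")
        if any(not 0 < p < MAX_POSITION for p in positions):
            problems.append(f"{lexeme!r} has a position out of range")
    return problems


print(limit_violations({"cat": [2, 9]}))      # nothing exceeded
print(limit_violations({"cat": [20000]}))     # position out of range
```

The total tsvector size and the tsquery node count are not checked here because they depend on the internal storage layout, which the list does not describe.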
 References
 * GiST development site -
-  [12]http://www.sai.msu.su/~megera/postgres/gist
+  [6]http://www.sai.msu.su/~megera/postgres/gist
-* OpenFTS home page - [13]http://openfts.sourceforge.net/
+* GiN development - [7]http://www.sigaev.ru/gin/
+* OpenFTS home page - [8]http://openfts.sourceforge.net/
 * Mailing list -
-  [14]http://sourceforge.net/mailarchive/forum.php?forum=openfts-general
+  [9]http://sourceforge.net/mailarchive/forum.php?forum=openfts-general
-[15]Documentation Roadmap
+Documentation Roadmap
 * Several docs are available from docs/ subdirectory
@@ -97,113 +108,103 @@ Documentation Roadmap
     + "Tsearch2 Guide" by Brandon Rhodes
     + "Tsearch2 Reference" by Brandon Rhodes
 * Readme.gendict in gendict/ subdirectory
-    + [16][Gendict tutorial]
+    + Also, check [10]Gendict tutorial
+* Check [11]tsearch2 Wiki pages for various documentation
-Online version of documentation is always available from Tsearch V2
-home page -
-[17]http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/
 Support
-Authors urgently recommend people to use [18][openfts-general] or
-[19][pgsql-general] mailing lists for questions and discussions.
+Authors urgently recommend people to use [12]openfts-general or
+[13]pgsql-general mailing lists for questions and discussions.
-Caution
-In spite of apparent easy full text searching with our tsearch module
-(authors hope it's so), any serious search engine require profound
-study of various aspects, such as stop words, dictionaries, special
-parsers. Tsearch module was designed to facilitate both those cases.
 Development History
+Latest news
+To the PostgreSQL 8.2 release we added:
+* multibyte (UTF-8) support
+* Thesaurus dictionary
+* Query rewriting
+* rank_cd relevation function now support different weights of
+  lexemes
+* GiN support adds scalability of tsearch2
 Pre-tsearch era
 Development of OpenFTS began in 2000 after realizing that we
-needed a search engine optimized for online updates and able to
-access metadata from the database. This is essential for online
+need a search engine optimized for online updates with access
+to metadata from the database. This is essential for online
 news agencies, web portals, digital libraries, etc. Most search
 engines available utilize an inverted index which is very fast
 for searching but very slow for online updates. Incremental
 updates of an inverted index is a complex engineering task
 while we needed something light, free and with the ability to
-access metadata from the database. The last requirement is very
-important because in a real life application a search engine
+access metadata from the database. The last requirement was
+very important because in a real life application search engine
 should always consult metadata ( topic, permissions, date
 range, version, etc.). We extensively use PostgreSQL as a
 database backend and have no intention to move from it, so the
 problem was to find a data structure and a fast way to access
 it. PostgreSQL has rather unique data type for storing sets
-(think about words) - arrays, but lacks index access to them. A
-document is parsed into lexemes, which are identified in
-various ways (e.g. stemming, morphology, dictionary), and as a
-result is reduced to an array of integer numbers. During our
-research we found a paper of Joseph Hellerstein which
-introduced an interesting data structure suitable for sets -
-RD-tree (Russian Doll tree). It looked very attractive, but
-implementing it in PostgreSQL seemed difficult because of our
-ignorance of database internals. Further research lead us to
-the idea to use GiST for implementing RD-tree, but at that time
-the GiST code had for a long while remained untouched and
-contained several bugs. After work on improving GiST for
-version 7.0.3 of PostgreSQL was done, we were able to implement
-RD-Tree and use it for index access to arrays of integers. This
-implementation was ideally suited for small arrays and
-eliminated complex joins, but was practically useless for
-indexing large arrays. The next improvement came from an idea
-to represent a document by a single bit-signature, a so-called
-superimposed signature (see "Index Structures for Databases
-Containing Data Items with Set-valued Attributes", 1997, Sven
-Helmer for details). We developeded the contrib/intarray module
-and used it for full text indexing.
+(think about words) - arrays, but lacks index access to them.
+During our research we found a paper of Joseph Hellerstein, who
+introduced an interesting data structure suitable for sets -
+RD-tree (Russian Doll tree). Further research lead us to the
+idea to use GiST for implementing RD-tree, but at that time the
+GiST code was intouched for a long time and contained several
+bugs. After work on improving GiST for version 7.0.3 of
+PostgreSQL was done, we were able to implement RD-Tree and use
+it for index access to arrays of integers. This implementation
+was ideally suited for small arrays and eliminated complex
+joins, but was practically useless for indexing large arrays.
+The next improvement came from an idea to represent a document
+by a single bit-signature, a so-called superimposed signature
+(see "Index Structures for Databases Containing Data Items with
+Set-valued Attributes", 1997, Sven Helmer for details). We
+developeded the contrib/intarray module and used it for full
+text indexing.
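The superimposed-signature idea mentioned in this paragraph can be sketched in a few lines of Python. This is an illustration of the general technique only, not the contrib/intarray or tsearch2 implementation: the signature length, the number of bits set per word, the use of SHA-256 and the names word_bits, signature and may_contain are all assumptions chosen for the example.

```python
import hashlib

SIGNATURE_BITS = 64   # hypothetical fixed signature length
BITS_PER_WORD = 2     # hypothetical number of hash bits per word


def word_bits(word):
    """Hash a word to at most BITS_PER_WORD bit positions, as an int mask."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    mask = 0
    for i in range(BITS_PER_WORD):
        mask |= 1 << (digest[i] % SIGNATURE_BITS)
    return mask


def signature(words):
    """Superimpose (bitwise OR) the masks of all words into one signature."""
    sig = 0
    for word in words:
        sig |= word_bits(word)
    return sig


def may_contain(doc_sig, query_words):
    """True if every bit of the query signature is set in doc_sig.

    A True result can be a false positive and must be rechecked against
    the real document; a False result is always correct.
    """
    query_sig = signature(query_words)
    return (doc_sig & query_sig) == query_sig


doc_sig = signature(["full", "text", "search"])
print(may_contain(doc_sig, ["text", "search"]))
```

Because many words share few bits, the signature stays a fixed size however large the document is, at the cost of false positives that grow as more words are superimposed.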
 tsearch v1
 It was inconvenient to use integer id's instead of words, so we
 introduced a new data type called 'txtidx' - a searchable data
 type (textual) with indexed access. This was a first step of
 our work on an implementation of a built-in PostgreSQL full
 text search engine. Even though tsearch v1 had many features of
 a search engine it lacked configuration support and relevance
 ranking. People were encouraged to use OpenFTS, which provided
-relevance ranking based on coordinate information and flexible
+relevance ranking based on positional information and flexible
 configuration. OpenFTS v.0.34 is the last version based on
 tsearch v1.
 tsearch V2
 People recognized tsearch as a powerful tool for full text
 searching and insisted on adding ranking support, better
 configurability, etc. We already thought about moving most of
 the features of OpenFTS to tsearch, and in the early 2003 we
-decided to work on a new version of tsearch - tsearch v2. We've
-abandoned auxiliary index tables which were used by OpenFTS to
-store coordinate information and modified the txtidx type to
-store them internally. Also, we've added table-driven
-configuration, support of ispell dictionaries, snowball
-stemmers and the ability to specify which types of lexemes to
-index. Also, it's now possible to generate headlines of
-documents with highlighted search terms. These changes make
-tsearch more user friendly and turn it into a really powerful
-full text search engine. After announcing the alpha version, we
-received a proposal from Brandon Rhodes to rename tsearch
-functions to be more consistent. So, we have renamed txtidx
-type to tsvector and other things as well.
-To allow users of tsearch v1 smooth upgrade, we named the module as
-tsearch2.
-Future release of OpenFTS (v.0.35) will be based on tsearch2. Brave
-people could download it from OpenFTS CVS (see link from [20][OpenFTS
-page]
+decided to work on a new version of tsearch. We abandoned
+auxiliary index tables which were used by OpenFTS to store
+positional information and modified the txtidx type to store
+them internally. We added table-driven configuration, support
+of ispell dictionaries, snowball stemmers and the ability to
+specify which types of lexemes to index. Now, it's possible to
+generate headlines of documents with highlighted search terms.
+These changes make tsearch more user friendly and turn it into
+a really powerful full text search engine. Brandon Rhodes
+proposed to rename tsearch functions for consistency and we
+renamed txtidx type to tsvector and other things as well. To
+allow users of tsearch v1 smooth upgrade, we named the module
+as tsearch2. Since version 0.35 OpenFTS uses tsearch2.
 References
-10. http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/Tsearch_V2_Readme.html
-11. http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Readme#Documentation_Roadmap
-12. http://www.sai.msu.su/~megera/postgres/gist
-13. http://openfts.sourceforge.net/
-14. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general
-15. http://www.sai.msu.su/~megera/oddmuse/index.cgi?action=anchor&id=Documentation_Roadmap#Documentation_Roadmap
-16. http://www.sai.msu.su/~megera/oddmuse/index.cgi?Gendict
-17. http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/
-18. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general
-19. http://archives.postgresql.org/pgsql-general/
-20. http://openfts.sourceforge.net/
+1. http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/Tsearch_V2_Readme.html
+2. http://snowball.tartarus.org/
+3. http://openfts.sourceforge.net/
+4. file://localhost/u/megera/WWW/postgres/gist/tsearch/V2/docs/Tsearch_V2_Readme82.html#dm
+5. http:www.jfg-networks.com/
+6. http://www.sai.msu.su/~megera/postgres/gist
+7. http://www.sigaev.ru/gin/
+8. http://openfts.sourceforge.net/
+9. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general
+10. http://www.sai.msu.su/~megera/wiki/Gendict
+11. http://www.sai.msu.su/~megera/wiki/Tsearch2
+12. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general
+13. http://archives.postgresql.org/pgsql-general/