|
- <?xml version="1.0" encoding="UTF-8" standalone="no"?>
- <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>12.6. Dictionaries</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets V1.79.1" /><link rel="prev" href="textsearch-parsers.html" title="12.5. Parsers" /><link rel="next" href="textsearch-configuration.html" title="12.7. Configuration Example" /></head><body><div xmlns="http://www.w3.org/TR/xhtml1/transitional" class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">12.6. Dictionaries</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="textsearch-parsers.html" title="12.5. Parsers">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="textsearch.html" title="Chapter 12. Full Text Search">Up</a></td><th width="60%" align="center">Chapter 12. Full Text Search</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 12.4 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="textsearch-configuration.html" title="12.7. Configuration Example">Next</a></td></tr></table><hr></hr></div><div class="sect1" id="TEXTSEARCH-DICTIONARIES"><div class="titlepage"><div><div><h2 class="title" style="clear: both">12.6. Dictionaries</h2></div></div></div><div class="toc"><dl class="toc"><dt><span class="sect2"><a href="textsearch-dictionaries.html#TEXTSEARCH-STOPWORDS">12.6.1. Stop Words</a></span></dt><dt><span class="sect2"><a href="textsearch-dictionaries.html#TEXTSEARCH-SIMPLE-DICTIONARY">12.6.2. Simple Dictionary</a></span></dt><dt><span class="sect2"><a href="textsearch-dictionaries.html#TEXTSEARCH-SYNONYM-DICTIONARY">12.6.3. 
Synonym Dictionary</a></span></dt><dt><span class="sect2"><a href="textsearch-dictionaries.html#TEXTSEARCH-THESAURUS">12.6.4. Thesaurus Dictionary</a></span></dt><dt><span class="sect2"><a href="textsearch-dictionaries.html#TEXTSEARCH-ISPELL-DICTIONARY">12.6.5. <span class="application">Ispell</span> Dictionary</a></span></dt><dt><span class="sect2"><a href="textsearch-dictionaries.html#TEXTSEARCH-SNOWBALL-DICTIONARY">12.6.6. <span class="application">Snowball</span> Dictionary</a></span></dt></dl></div><p>
- Dictionaries are used to eliminate words that should not be considered in a
- search (<em class="firstterm">stop words</em>), and to <em class="firstterm">normalize</em> words so
- that different derived forms of the same word will match. A successfully
- normalized word is called a <em class="firstterm">lexeme</em>. Aside from
- improving search quality, normalization and removal of stop words reduce the
- size of the <code class="type">tsvector</code> representation of a document, thereby
- improving performance. Normalization does not always have linguistic meaning
- and usually depends on application semantics.
- </p><p>
- Some examples of normalization:
-
- </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
- Linguistic - Ispell dictionaries try to reduce input words to a
- normalized form; stemmer dictionaries remove word endings
- </p></li><li class="listitem" style="list-style-type: disc"><p>
- <acronym class="acronym">URL</acronym> locations can be canonicalized to make
- equivalent URLs match:
-
- </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
- http://www.pgsql.ru/db/mw/index.html
- </p></li><li class="listitem" style="list-style-type: disc"><p>
- http://www.pgsql.ru/db/mw/
- </p></li><li class="listitem" style="list-style-type: disc"><p>
- http://www.pgsql.ru/db/../db/mw/index.html
- </p></li></ul></div><p>
- </p></li><li class="listitem" style="list-style-type: disc"><p>
- Color names can be replaced by their hexadecimal values, e.g.,
- <code class="literal">red, green, blue, magenta -> FF0000, 00FF00, 0000FF, FF00FF</code>
- </p></li><li class="listitem" style="list-style-type: disc"><p>
- If indexing numbers, we can
- remove some fractional digits to reduce the range of possible
- numbers, so for example <span class="emphasis"><em>3.14</em></span>159265359,
- <span class="emphasis"><em>3.14</em></span>15926, <span class="emphasis"><em>3.14</em></span> will be the same
- after normalization if only two digits are kept after the decimal point.
- </p></li></ul></div><p>
-
- </p><p>
- A dictionary is a program that accepts a token as
- input and returns:
- </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
- an array of lexemes if the input token is known to the dictionary
- (notice that one token can produce more than one lexeme)
- </p></li><li class="listitem" style="list-style-type: disc"><p>
- a single lexeme with the <code class="literal">TSL_FILTER</code> flag set, to replace
- the original token with a new token to be passed to subsequent
- dictionaries (a dictionary that does this is called a
- <em class="firstterm">filtering dictionary</em>)
- </p></li><li class="listitem" style="list-style-type: disc"><p>
- an empty array if the dictionary knows the token, but it is a stop word
- </p></li><li class="listitem" style="list-style-type: disc"><p>
- <code class="literal">NULL</code> if the dictionary does not recognize the input token
- </p></li></ul></div><p>
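- A minimal sketch of the first and third outcomes, using the built-in
- <code class="literal">english_stem</code> dictionary and the <code class="function">ts_lexize</code>
- function (a Snowball stemmer recognizes every token, so it cannot be used to show the
- <code class="literal">NULL</code> case):
- </p><pre class="screen">
- SELECT ts_lexize('english_stem', 'stars');
- ts_lexize
- -----------
- {star}
-
- SELECT ts_lexize('english_stem', 'a');
- ts_lexize
- -----------
- {}
- </pre><p>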
- </p><p>
- <span class="productname">PostgreSQL</span> provides predefined dictionaries for
- many languages. There are also several predefined templates that can be
- used to create new dictionaries with custom parameters. Each predefined
- dictionary template is described below. If no existing
- template is suitable, it is possible to create new ones; see the
- <code class="filename">contrib/</code> area of the <span class="productname">PostgreSQL</span> distribution
- for examples.
- </p><p>
- A text search configuration binds a parser together with a set of
- dictionaries to process the parser's output tokens. For each token
- type that the parser can return, a separate list of dictionaries is
- specified by the configuration. When a token of that type is found
- by the parser, each dictionary in the list is consulted in turn,
- until some dictionary recognizes it as a known word. If it is identified
- as a stop word, or if no dictionary recognizes the token, it will be
- discarded and not indexed or searched for.
- Normally, the first dictionary that returns a non-<code class="literal">NULL</code>
- output determines the result, and any remaining dictionaries are not
- consulted; but a filtering dictionary can replace the given word
- with a modified word, which is then passed to subsequent dictionaries.
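- </p><p>
- One way to observe which dictionary ends up handling each token is the
- <code class="function">ts_debug</code> function. The query below is only a sketch, assuming the
- built-in <code class="literal">english</code> configuration; its output is omitted here because it
- depends on the mappings currently in effect:
- </p><pre class="programlisting">
- -- "dictionaries" lists every dictionary consulted for the token type;
- -- "dictionary" names the one that recognized the token, if any
- SELECT alias, token, dictionaries, dictionary, lexemes
- FROM ts_debug('english', 'The stars');
- </pre><p>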
- </p><p>
- The general rule for configuring a list of dictionaries
- is to place first the most narrow, most specific dictionary, then the more
- general dictionaries, finishing with a very general dictionary, like
- a <span class="application">Snowball</span> stemmer or <code class="literal">simple</code>, which
- recognizes everything. For example, for an astronomy-specific search
- (<code class="literal">astro_en</code> configuration) one could bind token type
- <code class="type">asciiword</code> (ASCII word) to a synonym dictionary of astronomical
- terms, a general English dictionary and a <span class="application">Snowball</span> English
- stemmer:
-
- </p><pre class="programlisting">
- ALTER TEXT SEARCH CONFIGURATION astro_en
- ADD MAPPING FOR asciiword WITH astrosyn, english_ispell, english_stem;
- </pre><p>
- </p><p>
- A filtering dictionary can be placed anywhere in the list, except at the
- end where it'd be useless. Filtering dictionaries are useful to partially
- normalize words to simplify the task of later dictionaries. For example,
- a filtering dictionary could be used to remove accents from accented
- letters, as is done by the <a class="xref" href="unaccent.html" title="F.43. unaccent">unaccent</a> module.
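- </p><p>
- A sketch of such a setup, assuming the <code class="literal">unaccent</code> extension is available
- for installation; the configuration name <code class="literal">fr</code> is chosen only for
- illustration:
- </p><pre class="programlisting">
- CREATE EXTENSION unaccent;
-
- -- hypothetical configuration based on the built-in french one
- CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );
-
- -- unaccent filters the token first, then french_stem normalizes the result
- ALTER TEXT SEARCH CONFIGURATION fr
- ALTER MAPPING FOR hword, hword_part, word
- WITH unaccent, french_stem;
-
- SELECT to_tsvector('fr', 'Hôtels de la Mer');
- </pre><p>
- With this mapping, accented words should be indexed under their unaccented, stemmed form.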
- </p><div class="sect2" id="TEXTSEARCH-STOPWORDS"><div class="titlepage"><div><div><h3 class="title">12.6.1. Stop Words</h3></div></div></div><p>
- Stop words are words that are very common, appear in almost every
- document, and have no discrimination value. Therefore, they can be ignored
- in the context of full text searching. For example, every English text
- contains words like <code class="literal">a</code> and <code class="literal">the</code>, so it is
- useless to store them in an index. However, stop words do affect the
- positions in <code class="type">tsvector</code>, which in turn affect ranking:
-
- </p><pre class="screen">
- SELECT to_tsvector('english','in the list of stop words');
- to_tsvector
- ----------------------------
- 'list':3 'stop':5 'word':6
- </pre><p>
-
- Positions 1, 2, and 4 are missing because they were occupied by stop words. Ranks
- calculated for documents with and without stop words are quite different:
-
- </p><pre class="screen">
- SELECT ts_rank_cd (to_tsvector('english','in the list of stop words'), to_tsquery('list &amp; stop'));
- ts_rank_cd
- ------------
- 0.05
-
- SELECT ts_rank_cd (to_tsvector('english','list stop words'), to_tsquery('list &amp; stop'));
- ts_rank_cd
- ------------
- 0.1
- </pre><p>
-
- </p><p>
- It is up to the specific dictionary how it treats stop words. For example,
- <code class="literal">ispell</code> dictionaries first normalize words and then
- look at the list of stop words, while <code class="literal">Snowball</code> stemmers
- first check the list of stop words. The reason for the different
- behavior is an attempt to decrease noise.
- </p></div><div class="sect2" id="TEXTSEARCH-SIMPLE-DICTIONARY"><div class="titlepage"><div><div><h3 class="title">12.6.2. Simple Dictionary</h3></div></div></div><p>
- The <code class="literal">simple</code> dictionary template operates by converting the
- input token to lower case and checking it against a file of stop words.
- If it is found in the file then an empty array is returned, causing
- the token to be discarded. If not, the lower-cased form of the word
- is returned as the normalized lexeme. Alternatively, the dictionary
- can be configured to report non-stop-words as unrecognized, allowing
- them to be passed on to the next dictionary in the list.
- </p><p>
- Here is an example of a dictionary definition using the <code class="literal">simple</code>
- template:
-
- </p><pre class="programlisting">
- CREATE TEXT SEARCH DICTIONARY public.simple_dict (
- TEMPLATE = pg_catalog.simple,
- STOPWORDS = english
- );
- </pre><p>
-
- Here, <code class="literal">english</code> is the base name of a file of stop words.
- The file's full name will be
- <code class="filename">$SHAREDIR/tsearch_data/english.stop</code>,
- where <code class="literal">$SHAREDIR</code> means the
- <span class="productname">PostgreSQL</span> installation's shared-data directory,
- often <code class="filename">/usr/local/share/postgresql</code> (use <code class="command">pg_config
- --sharedir</code> to determine it if you're not sure).
- The file format is simply a list
- of words, one per line. Blank lines and trailing spaces are ignored,
- and upper case is folded to lower case, but no other processing is done
- on the file contents.
- </p><p>
- Now we can test our dictionary:
-
- </p><pre class="screen">
- SELECT ts_lexize('public.simple_dict','YeS');
- ts_lexize
- -----------
- {yes}
-
- SELECT ts_lexize('public.simple_dict','The');
- ts_lexize
- -----------
- {}
- </pre><p>
- </p><p>
- We can also choose to return <code class="literal">NULL</code>, instead of the lower-cased
- word, if it is not found in the stop words file. This behavior is
- selected by setting the dictionary's <code class="literal">Accept</code> parameter to
- <code class="literal">false</code>. Continuing the example:
-
- </p><pre class="screen">
- ALTER TEXT SEARCH DICTIONARY public.simple_dict ( Accept = false );
-
- SELECT ts_lexize('public.simple_dict','YeS');
- ts_lexize
- -----------
-
-
- SELECT ts_lexize('public.simple_dict','The');
- ts_lexize
- -----------
- {}
- </pre><p>
- </p><p>
- With the default setting of <code class="literal">Accept</code> = <code class="literal">true</code>,
- it is only useful to place a <code class="literal">simple</code> dictionary at the end
- of a list of dictionaries, since it will never pass on any token to
- a following dictionary. Conversely, <code class="literal">Accept</code> = <code class="literal">false</code>
- is only useful when there is at least one following dictionary.
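- </p><p>
- Continuing the example, a <code class="literal">simple</code> dictionary with
- <code class="literal">Accept</code> = <code class="literal">false</code> could be placed ahead of a stemmer.
- This is a sketch only; <code class="literal">simple_first</code> is a hypothetical configuration name:
- </p><pre class="programlisting">
- -- hypothetical configuration copied from the built-in english one
- CREATE TEXT SEARCH CONFIGURATION public.simple_first ( COPY = pg_catalog.english );
-
- -- simple_dict discards its stop words and passes everything else to english_stem
- ALTER TEXT SEARCH CONFIGURATION public.simple_first
- ALTER MAPPING FOR asciiword
- WITH public.simple_dict, english_stem;
- </pre><p>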
- </p><div class="caution"><h3 class="title">Caution</h3><p>
- Most types of dictionaries rely on configuration files, such as files of
- stop words. These files <span class="emphasis"><em>must</em></span> be stored in UTF-8 encoding.
- They will be translated to the actual database encoding, if that is
- different, when they are read into the server.
- </p></div><div class="caution"><h3 class="title">Caution</h3><p>
- Normally, a database session will read a dictionary configuration file
- only once, when it is first used within the session. If you modify a
- configuration file and want to force existing sessions to pick up the
- new contents, issue an <code class="command">ALTER TEXT SEARCH DICTIONARY</code> command
- on the dictionary. This can be a <span class="quote">“<span class="quote">dummy</span>”</span> update that doesn't
- actually change any parameter values.
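- </p><p>
- For example, assuming the <code class="literal">public.simple_dict</code> dictionary defined earlier
- in this section, re-specifying a parameter with its existing value would serve as such an update:
- </p><pre class="programlisting">
- -- sets STOPWORDS to the value it already has
- ALTER TEXT SEARCH DICTIONARY public.simple_dict ( STOPWORDS = english );
- </pre><p>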
- </p></div></div><div class="sect2" id="TEXTSEARCH-SYNONYM-DICTIONARY"><div class="titlepage"><div><div><h3 class="title">12.6.3. Synonym Dictionary</h3></div></div></div><p>
- This dictionary template is used to create dictionaries that replace a
- word with a synonym. Phrases are not supported (use the thesaurus
- template (<a class="xref" href="textsearch-dictionaries.html#TEXTSEARCH-THESAURUS" title="12.6.4. Thesaurus Dictionary">Section 12.6.4</a>) for that). A synonym
- dictionary can be used to overcome linguistic problems, for example, to
- prevent an English stemmer dictionary from reducing the word <span class="quote">“<span class="quote">Paris</span>”</span> to
- <span class="quote">“<span class="quote">pari</span>”</span>. It is enough to have a <code class="literal">Paris paris</code> line in the
- synonym dictionary and put it before the <code class="literal">english_stem</code>
- dictionary. For example:
-
- </p><pre class="screen">
- SELECT * FROM ts_debug('english', 'Paris');
- alias | description | token | dictionaries | dictionary | lexemes
- -----------+-----------------+-------+----------------+--------------+---------
- asciiword | Word, all ASCII | Paris | {english_stem} | english_stem | {pari}
-
- CREATE TEXT SEARCH DICTIONARY my_synonym (
- TEMPLATE = synonym,
- SYNONYMS = my_synonyms
- );
-
- ALTER TEXT SEARCH CONFIGURATION english
- ALTER MAPPING FOR asciiword
- WITH my_synonym, english_stem;
-
- SELECT * FROM ts_debug('english', 'Paris');
- alias | description | token | dictionaries | dictionary | lexemes
- -----------+-----------------+-------+---------------------------+------------+---------
- asciiword | Word, all ASCII | Paris | {my_synonym,english_stem} | my_synonym | {paris}
- </pre><p>
- </p><p>
- The only parameter required by the <code class="literal">synonym</code> template is
- <code class="literal">SYNONYMS</code>, which is the base name of its configuration file
- — <code class="literal">my_synonyms</code> in the above example.
- The file's full name will be
- <code class="filename">$SHAREDIR/tsearch_data/my_synonyms.syn</code>
- (where <code class="literal">$SHAREDIR</code> means the
- <span class="productname">PostgreSQL</span> installation's shared-data directory).
- The file format is just one line
- per word to be substituted, with the word followed by its synonym,
- separated by white space. Blank lines and trailing spaces are ignored.
- </p><p>
- The <code class="literal">synonym</code> template also has an optional parameter
- <code class="literal">CaseSensitive</code>, which defaults to <code class="literal">false</code>. When
- <code class="literal">CaseSensitive</code> is <code class="literal">false</code>, words in the synonym file
- are folded to lower case, as are input tokens. When it is
- <code class="literal">true</code>, words and tokens are not folded to lower case,
- but are compared as-is.
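- </p><p>
- A sketch of a case-sensitive variant of the earlier definition; the dictionary name
- <code class="literal">my_synonym_cs</code> is hypothetical, and the
- <code class="filename">my_synonyms.syn</code> file is assumed to exist:
- </p><pre class="programlisting">
- CREATE TEXT SEARCH DICTIONARY my_synonym_cs (
- TEMPLATE = synonym,
- SYNONYMS = my_synonyms,
- CaseSensitive = true
- );
- </pre><p>
- With this setting, a <code class="literal">Paris paris</code> line in the file would apply to the
- token <code class="literal">Paris</code> but not to <code class="literal">PARIS</code>.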
- </p><p>
- An asterisk (<code class="literal">*</code>) can be placed at the end of a synonym
- in the configuration file. This indicates that the synonym is a prefix.
- The asterisk is ignored when the entry is used in
- <code class="function">to_tsvector()</code>, but when it is used in
- <code class="function">to_tsquery()</code>, the result will be a query item with
- the prefix match marker (see
- <a class="xref" href="textsearch-controls.html#TEXTSEARCH-PARSING-QUERIES" title="12.3.2. Parsing Queries">Section 12.3.2</a>).
- For example, suppose we have these entries in
- <code class="filename">$SHAREDIR/tsearch_data/synonym_sample.syn</code>:
- </p><pre class="programlisting">
- postgres pgsql
- postgresql pgsql
- postgre pgsql
- gogle googl
- indices index*
- </pre><p>
- Then we will get these results:
- </p><pre class="screen">
- mydb=# CREATE TEXT SEARCH DICTIONARY syn (template=synonym, synonyms='synonym_sample');
- mydb=# SELECT ts_lexize('syn','indices');
- ts_lexize
- -----------
- {index}
- (1 row)
-
- mydb=# CREATE TEXT SEARCH CONFIGURATION tst (copy=simple);
- mydb=# ALTER TEXT SEARCH CONFIGURATION tst ALTER MAPPING FOR asciiword WITH syn;
- mydb=# SELECT to_tsvector('tst','indices');
- to_tsvector
- -------------
- 'index':1
- (1 row)
-
- mydb=# SELECT to_tsquery('tst','indices');
- to_tsquery
- ------------
- 'index':*
- (1 row)
-
- mydb=# SELECT 'indexes are very useful'::tsvector;
- tsvector
- ---------------------------------
- 'are' 'indexes' 'useful' 'very'
- (1 row)
-
- mydb=# SELECT 'indexes are very useful'::tsvector @@ to_tsquery('tst','indices');
- ?column?
- ----------
- t
- (1 row)
- </pre><p>
- </p></div><div class="sect2" id="TEXTSEARCH-THESAURUS"><div class="titlepage"><div><div><h3 class="title">12.6.4. Thesaurus Dictionary</h3></div></div></div><p>
- A thesaurus dictionary (sometimes abbreviated as <acronym class="acronym">TZ</acronym>) is
- a collection of words that includes information about the relationships
- of words and phrases, i.e., broader terms (<acronym class="acronym">BT</acronym>), narrower
- terms (<acronym class="acronym">NT</acronym>), preferred terms, non-preferred terms, related
- terms, etc.
- </p><p>
- Basically, a thesaurus dictionary replaces all non-preferred terms with one
- preferred term and, optionally, preserves the original terms for indexing
- as well. <span class="productname">PostgreSQL</span>'s current implementation of the
- thesaurus dictionary is an extension of the synonym dictionary with added
- <em class="firstterm">phrase</em> support. A thesaurus dictionary requires
- a configuration file of the following format:
-
- </p><pre class="programlisting">
- # this is a comment
- sample word(s) : indexed word(s)
- more sample word(s) : more indexed word(s)
- ...
- </pre><p>
-
- where the colon (<code class="symbol">:</code>) symbol acts as a delimiter between a
- phrase and its replacement.
- </p><p>
- A thesaurus dictionary uses a <em class="firstterm">subdictionary</em> (which
- is specified in the dictionary's configuration) to normalize the input
- text before checking for phrase matches. It is only possible to select one
- subdictionary. An error is reported if the subdictionary fails to
- recognize a word. In that case, you should remove the use of the word or
- teach the subdictionary about it. You can place an asterisk
- (<code class="symbol">*</code>) at the beginning of an indexed word to skip applying
- the subdictionary to it, but all sample words <span class="emphasis"><em>must</em></span> be known
- to the subdictionary.
- </p><p>
- The thesaurus dictionary chooses the longest match if there are multiple
- phrases matching the input, and ties are broken by using the last
- definition.
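- </p><p>
- To illustrate the longest-match rule, consider these two entries (hypothetical, written only for
- this illustration). The input <code class="literal">supernovae stars</code> matches both sample
- phrases, and the second, longer one would be chosen:
- </p><pre class="programlisting">
- # hypothetical entries for illustration
- supernovae : sn
- supernovae stars : snstars
- </pre><p>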
- </p><p>
- Specific stop words recognized by the subdictionary cannot be
- specified; instead use <code class="literal">?</code> to mark the location where any
- stop word can appear. For example, assuming that <code class="literal">a</code> and
- <code class="literal">the</code> are stop words according to the subdictionary:
-
- </p><pre class="programlisting">
- ? one ? two : swsw
- </pre><p>
-
- matches <code class="literal">a one the two</code> and <code class="literal">the one a two</code>;
- both would be replaced by <code class="literal">swsw</code>.
- </p><p>
- Since a thesaurus dictionary can recognize phrases, it
- must remember its state and interact with the parser. It
- uses the configuration's token-type assignments to decide whether to handle the next word or stop
- accumulating. The thesaurus dictionary must therefore be configured
- carefully. For example, if the thesaurus dictionary is assigned to handle
- only the <code class="literal">asciiword</code> token, then a thesaurus dictionary
- definition like <code class="literal">one 7</code> will not work since token type
- <code class="literal">uint</code> is not assigned to the thesaurus dictionary.
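- </p><p>
- One possible remedy is to assign the thesaurus dictionary to both token types. The statement below
- is a sketch under assumptions: <code class="literal">my_config</code> and
- <code class="literal">my_thesaurus</code> are hypothetical names for an existing configuration and
- thesaurus dictionary, and <code class="literal">simple</code> is used as the fallback dictionary:
- </p><pre class="programlisting">
- -- both words and unsigned integers now reach the thesaurus dictionary
- ALTER TEXT SEARCH CONFIGURATION my_config
- ALTER MAPPING FOR asciiword, uint
- WITH my_thesaurus, simple;
- </pre><p>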
- </p><div class="caution"><h3 class="title">Caution</h3><p>
- Thesauruses are used during indexing so any change in the thesaurus
- dictionary's parameters <span class="emphasis"><em>requires</em></span> reindexing.
- For most other dictionary types, small changes such as adding or
- removing stop words do not force reindexing.
- </p></div><div class="sect3" id="TEXTSEARCH-THESAURUS-CONFIG"><div class="titlepage"><div><div><h4 class="title">12.6.4.1. Thesaurus Configuration</h4></div></div></div><p>
- To define a new thesaurus dictionary, use the <code class="literal">thesaurus</code>
- template. For example:
-
- </p><pre class="programlisting">
- CREATE TEXT SEARCH DICTIONARY thesaurus_simple (
- TEMPLATE = thesaurus,
- DictFile = mythesaurus,
- Dictionary = pg_catalog.english_stem
- );
- </pre><p>
-
- Here:
- </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
- <code class="literal">thesaurus_simple</code> is the new dictionary's name
- </p></li><li class="listitem" style="list-style-type: disc"><p>
- <code class="literal">mythesaurus</code> is the base name of the thesaurus
- configuration file.
- (Its full name will be <code class="filename">$SHAREDIR/tsearch_data/mythesaurus.ths</code>,
- where <code class="literal">$SHAREDIR</code> means the installation shared-data
- directory.)
- </p></li><li class="listitem" style="list-style-type: disc"><p>
- <code class="literal">pg_catalog.english_stem</code> is the subdictionary (here,
- a Snowball English stemmer) to use for thesaurus normalization.
- Notice that the subdictionary will have its own
- configuration (for example, stop words), which is not shown here.
- </p></li></ul></div><p>
-
- Now it is possible to bind the thesaurus dictionary <code class="literal">thesaurus_simple</code>
- to the desired token types in a configuration, for example:
-
- </p><pre class="programlisting">
- ALTER TEXT SEARCH CONFIGURATION russian
- ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
- WITH thesaurus_simple;
- </pre><p>
- </p></div><div class="sect3" id="TEXTSEARCH-THESAURUS-EXAMPLES"><div class="titlepage"><div><div><h4 class="title">12.6.4.2. Thesaurus Example</h4></div></div></div><p>
- Consider a simple astronomical thesaurus <code class="literal">thesaurus_astro</code>,
- which contains some astronomical word combinations:
-
- </p><pre class="programlisting">
- supernovae stars : sn
- crab nebulae : crab
- </pre><p>
-
- Below we create a dictionary and bind some token types to
- an astronomical thesaurus and English stemmer:
-
- </p><pre class="programlisting">
- CREATE TEXT SEARCH DICTIONARY thesaurus_astro (
- TEMPLATE = thesaurus,
- DictFile = thesaurus_astro,
- Dictionary = english_stem
- );
-
- ALTER TEXT SEARCH CONFIGURATION russian
- ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
- WITH thesaurus_astro, english_stem;
- </pre><p>
-
- Now we can see how it works.
- <code class="function">ts_lexize</code> is not very useful for testing a thesaurus,
- because it treats its input as a single token. Instead we can use
- <code class="function">plainto_tsquery</code> and <code class="function">to_tsvector</code>
- which will break their input strings into multiple tokens:
-
- </p><pre class="screen">
- SELECT plainto_tsquery('supernova star');
- plainto_tsquery
- -----------------
- 'sn'
-
- SELECT to_tsvector('supernova star');
- to_tsvector
- -------------
- 'sn':1
- </pre><p>
-
- In principle, you can use <code class="function">to_tsquery</code> if you quote
- the argument:
-
- </p><pre class="screen">
- SELECT to_tsquery('''supernova star''');
- to_tsquery
- ------------
- 'sn'
- </pre><p>
-
- Notice that <code class="literal">supernova star</code> matches <code class="literal">supernovae
- stars</code> in <code class="literal">thesaurus_astro</code> because we specified
- the <code class="literal">english_stem</code> stemmer in the thesaurus definition.
- The stemmer removed the <code class="literal">e</code> and <code class="literal">s</code>.
- </p><p>
- To index the original phrase as well as the substitute, just include it
- in the right-hand part of the definition:
-
- </p><pre class="screen">
- supernovae stars : sn supernovae stars
-
- SELECT plainto_tsquery('supernova star');
- plainto_tsquery
- -----------------------------
- 'sn' &amp; 'supernova' &amp; 'star'
- </pre><p>
- </p></div></div><div class="sect2" id="TEXTSEARCH-ISPELL-DICTIONARY"><div class="titlepage"><div><div><h3 class="title">12.6.5. <span class="application">Ispell</span> Dictionary</h3></div></div></div><p>
- The <span class="application">Ispell</span> dictionary template supports
- <em class="firstterm">morphological dictionaries</em>, which can normalize many
- different linguistic forms of a word into the same lexeme. For example,
- an English <span class="application">Ispell</span> dictionary can match all declensions and
- conjugations of the search term <code class="literal">bank</code>, e.g.,
- <code class="literal">banking</code>, <code class="literal">banked</code>, <code class="literal">banks</code>,
- <code class="literal">banks'</code>, and <code class="literal">bank's</code>.
- </p><p>
- The standard <span class="productname">PostgreSQL</span> distribution does
- not include any <span class="application">Ispell</span> configuration files.
- Dictionaries for a large number of languages are available from <a class="ulink" href="https://www.cs.hmc.edu/~geoff/ispell.html" target="_top">Ispell</a>.
- Also, some more modern dictionary file formats are supported: <a class="ulink" href="https://en.wikipedia.org/wiki/MySpell" target="_top">MySpell</a> (OO &lt; 2.0.1)
- and <a class="ulink" href="https://sourceforge.net/projects/hunspell/" target="_top">Hunspell</a>
- (OO &gt;= 2.0.2). A large list of dictionaries is available on the <a class="ulink" href="https://wiki.openoffice.org/wiki/Dictionaries" target="_top">OpenOffice
- Wiki</a>.
- </p><p>
- To create an <span class="application">Ispell</span> dictionary perform these steps:
- </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
- download the dictionary configuration files. <span class="productname">OpenOffice</span>
- extension files have the <code class="filename">.oxt</code> extension. Extract the
- <code class="filename">.aff</code> and <code class="filename">.dic</code> files and change their
- extensions to <code class="filename">.affix</code> and <code class="filename">.dict</code>. Some
- dictionary files must also be converted to the UTF-8
- encoding, for example with these commands for a Norwegian language dictionary:
- </p><pre class="programlisting">
- iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff
- iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic
- </pre><p>
- </p></li><li class="listitem" style="list-style-type: disc"><p>
- copy files to the <code class="filename">$SHAREDIR/tsearch_data</code> directory
- </p></li><li class="listitem" style="list-style-type: disc"><p>
- load files into PostgreSQL with the following command:
- </p><pre class="programlisting">
- CREATE TEXT SEARCH DICTIONARY english_hunspell (
- TEMPLATE = ispell,
- DictFile = en_us,
- AffFile = en_us,
- Stopwords = english);
- </pre><p>
- </p></li></ul></div><p>
- Here, <code class="literal">DictFile</code>, <code class="literal">AffFile</code>, and <code class="literal">StopWords</code>
- specify the base names of the dictionary, affixes, and stop-words files.
- The stop-words file has the same format explained above for the
- <code class="literal">simple</code> dictionary type. The format of the other files is
- not specified here but is available from the above-mentioned web sites.
- </p><p>
- Ispell dictionaries usually recognize a limited set of words, so they
- should be followed by another broader dictionary; for
- example, a Snowball dictionary, which recognizes everything.
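- </p><p>
- For instance, the <code class="literal">english_hunspell</code> dictionary created above could be
- placed ahead of the built-in <code class="literal">english_stem</code> stemmer. This is only a sketch:
- <code class="literal">english_hs</code> is a hypothetical configuration name, and the list of token
- types is an assumption that should be adapted to the application:
- </p><pre class="programlisting">
- -- hypothetical configuration copied from the built-in english one
- CREATE TEXT SEARCH CONFIGURATION public.english_hs ( COPY = pg_catalog.english );
-
- -- words unknown to english_hunspell fall through to the stemmer
- ALTER TEXT SEARCH CONFIGURATION public.english_hs
- ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part
- WITH english_hunspell, english_stem;
- </pre><p>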
- </p><p>
- The <code class="filename">.affix</code> file of <span class="application">Ispell</span> has the following
- structure:
- </p><pre class="programlisting">
- prefixes
- flag *A:
- . > RE # As in enter > reenter
- suffixes
- flag T:
- E > ST # As in late > latest
- [^AEIOU]Y > -Y,IEST # As in dirty > dirtiest
- [AEIOU]Y > EST # As in gray > grayest
- [^EY] > EST # As in small > smallest
- </pre><p>
- </p><p>
- And the <code class="filename">.dict</code> file has the following structure:
- </p><pre class="programlisting">
- lapse/ADGRS
- lard/DGRS
- large/PRTY
- lark/MRS
- </pre><p>
- </p><p>
- The format of the <code class="filename">.dict</code> file is:
- </p><pre class="programlisting">
- basic_form/affix_class_name
- </pre><p>
- </p><p>
- In the <code class="filename">.affix</code> file every affix flag is described in the
- following format:
- </p><pre class="programlisting">
- condition > [-stripping_letters,] adding_affix
- </pre><p>
- </p><p>
- Here, condition has a format similar to the format of regular expressions.
- It can use groupings <code class="literal">[...]</code> and <code class="literal">[^...]</code>.
- For example, <code class="literal">[AEIOU]Y</code> means that the last letter of the word
- is <code class="literal">"y"</code> and the penultimate letter is <code class="literal">"a"</code>,
- <code class="literal">"e"</code>, <code class="literal">"i"</code>, <code class="literal">"o"</code> or <code class="literal">"u"</code>.
- <code class="literal">[^EY]</code> means that the last letter is neither <code class="literal">"e"</code>
- nor <code class="literal">"y"</code>.
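- </p><p>
- As a loose analogy only (the Ispell template does not use SQL regular expressions to evaluate
- affix conditions), a condition behaves like a pattern anchored at the end of the word.
- The sketch below mimics the rule <code class="literal">[^AEIOU]Y &gt; -Y,IEST</code> from the sample
- affix file with ordinary SQL functions:
- </p><pre class="programlisting">
- -- condition [^AEIOU]Y: the word ends in "y" preceded by a non-vowel
- SELECT 'dirty' ~* '[^AEIOU]Y$' AS condition_matches; -- true
-
- -- rule -Y,IEST: strip the trailing "y", then append "iest"
- SELECT regexp_replace('dirty', 'y$', 'iest') AS result; -- dirtiest
- </pre><p>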
- </p><p>
- Ispell dictionaries support splitting compound words,
- a useful feature.
- Notice that the affix file should specify a special flag using the
- <code class="literal">compoundwords controlled</code> statement that marks dictionary
- words that can participate in compound formation:
-
- </p><pre class="programlisting">
- compoundwords controlled z
- </pre><p>
-
- Here are some examples for the Norwegian language:
-
- </p><pre class="programlisting">
- SELECT ts_lexize('norwegian_ispell', 'overbuljongterningpakkmesterassistent');
- {over,buljong,terning,pakk,mester,assistent}
- SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
- {sjokoladefabrikk,sjokolade,fabrikk}
- </pre><p>
- </p><p>
- <span class="application">MySpell</span> format is a subset of <span class="application">Hunspell</span>.
- The <code class="filename">.affix</code> file of <span class="application">Hunspell</span> has the following
- structure:
- </p><pre class="programlisting">
- PFX A Y 1
- PFX A 0 re .
- SFX T N 4
- SFX T 0 st e
- SFX T y iest [^aeiou]y
- SFX T 0 est [aeiou]y
- SFX T 0 est [^ey]
- </pre><p>
- </p><p>
- The first line of an affix class is the header. The fields of each affix rule are
- listed after the header:
- </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
- parameter name (PFX or SFX)
- </p></li><li class="listitem" style="list-style-type: disc"><p>
- flag (name of the affix class)
- </p></li><li class="listitem" style="list-style-type: disc"><p>
- characters to strip from the beginning (for a prefix) or end (for a suffix) of the
- word
- </p></li><li class="listitem" style="list-style-type: disc"><p>
- affix to add
- </p></li><li class="listitem" style="list-style-type: disc"><p>
- condition that has a format similar to the format of regular expressions.
- </p></li></ul></div><p>
- The <code class="filename">.dict</code> file looks like the <code class="filename">.dict</code> file of
- <span class="application">Ispell</span>:
- </p><pre class="programlisting">
- larder/M
- lardy/RT
- large/RSPMYT
- largehearted
- </pre><p>
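- Files in <span class="application">MySpell</span> or <span class="application">Hunspell</span> format
- are loaded through the same <code class="literal">ispell</code> template. A minimal sketch, assuming
- files named <code class="filename">en_us.dict</code> and <code class="filename">en_us.affix</code>
- (hypothetical names) are stored in UTF-8 encoding in <code class="filename">$SHAREDIR/tsearch_data</code>;
- the dictionary name <code class="literal">english_hunspell</code> is likewise only illustrative:
- </p><pre class="programlisting">
- -- en_us is an assumed file base name; substitute the files actually installed
- CREATE TEXT SEARCH DICTIONARY english_hunspell (
-     TEMPLATE = ispell,
-     DictFile = en_us,
-     AffFile = en_us,
-     StopWords = english
- );
- -- Check how a word is normalized; the result depends on the installed files
- SELECT ts_lexize('english_hunspell', 'largest');
- </pre><p>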
- </p><div class="note"><h3 class="title">Note</h3><p>
- <span class="application">MySpell</span> does not support compound words.
- <span class="application">Hunspell</span> has sophisticated support for compound words. At
- present, <span class="productname">PostgreSQL</span> implements only the basic
- compound word operations of Hunspell.
- </p></div></div><div class="sect2" id="TEXTSEARCH-SNOWBALL-DICTIONARY"><div class="titlepage"><div><div><h3 class="title">12.6.6. <span class="application">Snowball</span> Dictionary</h3></div></div></div><p>
- The <span class="application">Snowball</span> dictionary template is based on a project
- by Martin Porter, inventor of the popular Porter stemming algorithm
- for the English language. Snowball now provides stemming algorithms for
- many languages (see the <a class="ulink" href="https://snowballstem.org/" target="_top">Snowball
- site</a> for more information). Each algorithm understands how to
- reduce common variant forms of words to a base, or stem, spelling within
- its language. A Snowball dictionary requires a <code class="literal">language</code>
- parameter to identify which stemmer to use, and optionally can specify a
- <code class="literal">stopword</code> file name that gives a list of words to eliminate.
- (<span class="productname">PostgreSQL</span>'s standard stopword lists are also
- provided by the Snowball project.)
- For example, there is a built-in definition equivalent to
-
- </p><pre class="programlisting">
- CREATE TEXT SEARCH DICTIONARY english_stem (
- TEMPLATE = snowball,
- Language = english,
- StopWords = english
- );
- </pre><p>
-
- The stopword file format is the same as already explained.
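- </p><p>
- The built-in <code class="literal">english_stem</code> dictionary can be tried out with
- <code class="function">ts_lexize</code>. In this short sketch the first call stems a word,
- and the second passes a stop word, which yields an empty array:
- </p><pre class="programlisting">
- SELECT ts_lexize('english_stem', 'stars');
- {star}
- SELECT ts_lexize('english_stem', 'a');
- {}
- </pre><p>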
- </p><p>
- A <span class="application">Snowball</span> dictionary recognizes everything, whether
- or not it is able to simplify the word, so it should be placed
- at the end of the dictionary list. It is useless to have it
- before any other dictionary because a token will never pass through it to
- the next dictionary.
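- </p><p>
- One way to arrange such a dictionary list is sketched here, assuming an Ispell dictionary named
- <code class="literal">english_ispell</code> already exists; the configuration name
- <code class="literal">my_english</code> is hypothetical:
- </p><pre class="programlisting">
- CREATE TEXT SEARCH CONFIGURATION my_english ( COPY = pg_catalog.english );
- -- Consult english_ispell first; english_stem, listed last, handles whatever is left
- ALTER TEXT SEARCH CONFIGURATION my_english
-     ALTER MAPPING FOR asciiword, word
-     WITH english_ispell, english_stem;
- </pre><p>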
- </p></div></div></body></html>
|