|
- <?xml version="1.0" encoding="UTF-8" standalone="no"?>
- <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>12.1. Introduction</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets V1.79.1" /><link rel="prev" href="textsearch.html" title="Chapter 12. Full Text Search" /><link rel="next" href="textsearch-tables.html" title="12.2. Tables and Indexes" /></head><body><div xmlns="http://www.w3.org/TR/xhtml1/transitional" class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">12.1. Introduction</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="textsearch.html" title="Chapter 12. Full Text Search">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="textsearch.html" title="Chapter 12. Full Text Search">Up</a></td><th width="60%" align="center">Chapter 12. Full Text Search</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 12.4 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="textsearch-tables.html" title="12.2. Tables and Indexes">Next</a></td></tr></table><hr></hr></div><div class="sect1" id="TEXTSEARCH-INTRO"><div class="titlepage"><div><div><h2 class="title" style="clear: both">12.1. Introduction</h2></div></div></div><div class="toc"><dl class="toc"><dt><span class="sect2"><a href="textsearch-intro.html#TEXTSEARCH-DOCUMENT">12.1.1. What Is a Document?</a></span></dt><dt><span class="sect2"><a href="textsearch-intro.html#TEXTSEARCH-MATCHING">12.1.2. Basic Text Matching</a></span></dt><dt><span class="sect2"><a href="textsearch-intro.html#TEXTSEARCH-INTRO-CONFIGURATIONS">12.1.3. Configurations</a></span></dt></dl></div><p>
- Full Text Searching (or just <em class="firstterm">text search</em>) provides
- the capability to identify natural-language <em class="firstterm">documents</em> that
- satisfy a <em class="firstterm">query</em>, and optionally to sort them by
- relevance to the query. The most common type of search
- is to find all documents containing given <em class="firstterm">query terms</em>
- and return them in order of their <em class="firstterm">similarity</em> to the
- query. Notions of <code class="varname">query</code> and
- <code class="varname">similarity</code> are very flexible and depend on the specific
- application. The simplest search considers <code class="varname">query</code> as a
- set of words and <code class="varname">similarity</code> as the frequency of query
- words in the document.
- </p><p>
- Textual search operators have existed in databases for years.
- <span class="productname">PostgreSQL</span> has
- <code class="literal">~</code>, <code class="literal">~*</code>, <code class="literal">LIKE</code>, and
- <code class="literal">ILIKE</code> operators for textual data types, but they lack
- many essential properties required by modern information systems:
- </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
- There is no linguistic support, even for English. Regular expressions
- are not sufficient because they cannot easily handle derived words, e.g.,
- <code class="literal">satisfies</code> and <code class="literal">satisfy</code>. You might
- miss documents that contain <code class="literal">satisfies</code>, although you
- probably would like to find them when searching for
- <code class="literal">satisfy</code>. It is possible to use <code class="literal">OR</code>
- to search for multiple derived forms, but this is tedious and error-prone
- (some words can have several thousand derivatives).
- </p></li><li class="listitem" style="list-style-type: disc"><p>
- They provide no ordering (ranking) of search results, which makes them
- ineffective when thousands of matching documents are found.
- </p></li><li class="listitem" style="list-style-type: disc"><p>
- They tend to be slow because there is no index support, so they must
- process all documents for every search.
- </p></li></ul></div><p>
- Full text indexing allows documents to be <span class="emphasis"><em>preprocessed</em></span>
- and an index saved for later rapid searching. Preprocessing includes:
- </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: none; "><li class="listitem" style="list-style-type: none"><p>
- <span class="emphasis"><em>Parsing documents into <em class="firstterm">tokens</em></em></span>. It is
- useful to identify various classes of tokens, e.g., numbers, words,
- complex words, email addresses, so that they can be processed
- differently. In principle token classes depend on the specific
- application, but for most purposes it is adequate to use a predefined
- set of classes.
- <span class="productname">PostgreSQL</span> uses a <em class="firstterm">parser</em> to
- perform this step. A standard parser is provided, and custom parsers
- can be created for specific needs.
- </p></li><li class="listitem" style="list-style-type: none"><p>
- <span class="emphasis"><em>Converting tokens into <em class="firstterm">lexemes</em></em></span>.
- A lexeme is a string, just like a token, but it has been
- <em class="firstterm">normalized</em> so that different forms of the same word
- are made alike. For example, normalization almost always includes
- folding upper-case letters to lower-case, and often involves removal
- of suffixes (such as <code class="literal">s</code> or <code class="literal">es</code> in English).
- This allows searches to find variant forms of the
- same word, without tediously entering all the possible variants.
- Also, this step typically eliminates <em class="firstterm">stop words</em>, which
- are words that are so common that they are useless for searching.
- (In short, then, tokens are raw fragments of the document text, while
- lexemes are words that are believed useful for indexing and searching.)
- <span class="productname">PostgreSQL</span> uses <em class="firstterm">dictionaries</em> to
- perform this step. Various standard dictionaries are provided, and
- custom ones can be created for specific needs.
- </p></li><li class="listitem" style="list-style-type: none"><p>
- <span class="emphasis"><em>Storing preprocessed documents optimized for
- searching</em></span>. For example, each document can be represented
- as a sorted array of normalized lexemes. Along with the lexemes it is
- often desirable to store positional information to use for
- <em class="firstterm">proximity ranking</em>, so that a document that
- contains a more <span class="quote">“<span class="quote">dense</span>”</span> region of query words is
- assigned a higher rank than one with scattered query words.
- </p></li></ul></div><p>
- Dictionaries allow fine-grained control over how tokens are normalized.
- With appropriate dictionaries, you can:
- </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
- Define stop words that should not be indexed.
- </p></li><li class="listitem" style="list-style-type: disc"><p>
- Map synonyms to a single word using <span class="application">Ispell</span>.
- </p></li><li class="listitem" style="list-style-type: disc"><p>
- Map phrases to a single word using a thesaurus.
- </p></li><li class="listitem" style="list-style-type: disc"><p>
- Map different variations of a word to a canonical form using
- an <span class="application">Ispell</span> dictionary.
- </p></li><li class="listitem" style="list-style-type: disc"><p>
- Map different variations of a word to a canonical form using
- <span class="application">Snowball</span> stemmer rules.
- </p></li></ul></div><p>
- A data type <code class="type">tsvector</code> is provided for storing preprocessed
- documents, along with a type <code class="type">tsquery</code> for representing processed
- queries (<a class="xref" href="datatype-textsearch.html" title="8.11. Text Search Types">Section 8.11</a>). There are many
- functions and operators available for these data types
- (<a class="xref" href="functions-textsearch.html" title="9.13. Text Search Functions and Operators">Section 9.13</a>), the most important of which is
- the match operator <code class="literal">@@</code>, which we introduce in
- <a class="xref" href="textsearch-intro.html#TEXTSEARCH-MATCHING" title="12.1.2. Basic Text Matching">Section 12.1.2</a>. Full text searches can be accelerated
- using indexes (<a class="xref" href="textsearch-indexes.html" title="12.9. GIN and GiST Index Types">Section 12.9</a>).
- </p><div class="sect2" id="TEXTSEARCH-DOCUMENT"><div class="titlepage"><div><div><h3 class="title">12.1.1. What Is a Document?</h3></div></div></div><a id="id-1.5.11.4.10.2" class="indexterm"></a><p>
- A <em class="firstterm">document</em> is the unit of searching in a full text search
- system; for example, a magazine article or email message. The text search
- engine must be able to parse documents and store associations of lexemes
- (key words) with their parent document. Later, these associations are
- used to search for documents that contain query words.
- </p><p>
- For searches within <span class="productname">PostgreSQL</span>,
- a document is normally a textual field within a row of a database table,
- or possibly a combination (concatenation) of such fields, perhaps stored
- in several tables or obtained dynamically. In other words, a document can
- be constructed from different parts for indexing and it might not be
- stored anywhere as a whole. For example:
-
- </p><pre class="programlisting">
- SELECT title || ' ' || author || ' ' || abstract || ' ' || body AS document
- FROM messages
- WHERE mid = 12;
-
- SELECT m.title || ' ' || m.author || ' ' || m.abstract || ' ' || d.body AS document
- FROM messages m, docs d
- WHERE m.mid = d.did AND m.mid = 12;
- </pre><p>
- </p><div class="note"><h3 class="title">Note</h3><p>
- Actually, in these example queries, <code class="function">coalesce</code>
- should be used to prevent a single <code class="literal">NULL</code> attribute from
- causing a <code class="literal">NULL</code> result for the whole document.
- </p></div><p>
- Another possibility is to store the documents as simple text files in the
- file system. In this case, the database can be used to store the full text
- index and to execute searches, and some unique identifier can be used to
- retrieve the document from the file system. However, retrieving files
- from outside the database requires superuser permissions or special
- function support, so this is usually less convenient than keeping all
- the data inside <span class="productname">PostgreSQL</span>. Also, keeping
- everything inside the database allows easy access
- to document metadata to assist in indexing and display.
- </p><p>
- For text search purposes, each document must be reduced to the
- preprocessed <code class="type">tsvector</code> format. Searching and ranking
- are performed entirely on the <code class="type">tsvector</code> representation
- of a document — the original text need only be retrieved
- when the document has been selected for display to a user.
- We therefore often speak of the <code class="type">tsvector</code> as being the
- document, but of course it is only a compact representation of
- the full document.
- </p></div><div class="sect2" id="TEXTSEARCH-MATCHING"><div class="titlepage"><div><div><h3 class="title">12.1.2. Basic Text Matching</h3></div></div></div><p>
- Full text searching in <span class="productname">PostgreSQL</span> is based on
- the match operator <code class="literal">@@</code>, which returns
- <code class="literal">true</code> if a <code class="type">tsvector</code>
- (document) matches a <code class="type">tsquery</code> (query).
- It doesn't matter which data type is written first:
-
- </p><pre class="programlisting">
- SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat & rat'::tsquery;
- ?column?
- ----------
- t
-
- SELECT 'fat & cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
- ?column?
- ----------
- f
- </pre><p>
- </p><p>
- As the above example suggests, a <code class="type">tsquery</code> is not just raw
- text, any more than a <code class="type">tsvector</code> is. A <code class="type">tsquery</code>
- contains search terms, which must be already-normalized lexemes, and
- may combine multiple terms using AND, OR, NOT, and FOLLOWED BY operators.
- (For syntax details see <a class="xref" href="datatype-textsearch.html#DATATYPE-TSQUERY" title="8.11.2. tsquery">Section 8.11.2</a>.) There are
- functions <code class="function">to_tsquery</code>, <code class="function">plainto_tsquery</code>,
- and <code class="function">phraseto_tsquery</code>
- that are helpful in converting user-written text into a proper
- <code class="type">tsquery</code>, primarily by normalizing words appearing in
- the text. Similarly, <code class="function">to_tsvector</code> is used to parse and
- normalize a document string. So in practice a text search match would
- look more like this:
-
- </p><pre class="programlisting">
- SELECT to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat & rat');
- ?column?
- ----------
- t
- </pre><p>
-
- Observe that this match would not succeed if written as
-
- </p><pre class="programlisting">
- SELECT 'fat cats ate fat rats'::tsvector @@ to_tsquery('fat & rat');
- ?column?
- ----------
- f
- </pre><p>
-
- since here no normalization of the word <code class="literal">rats</code> will occur.
- The elements of a <code class="type">tsvector</code> are lexemes, which are assumed
- already normalized, so <code class="literal">rats</code> does not match <code class="literal">rat</code>.
- </p><p>
- The <code class="literal">@@</code> operator also
- supports <code class="type">text</code> input, allowing explicit conversion of a text
- string to <code class="type">tsvector</code> or <code class="type">tsquery</code> to be skipped
- in simple cases. The variants available are:
-
- </p><pre class="programlisting">
- tsvector @@ tsquery
- tsquery @@ tsvector
- text @@ tsquery
- text @@ text
- </pre><p>
- </p><p>
- The first two of these we saw already.
- The form <code class="type">text</code> <code class="literal">@@</code> <code class="type">tsquery</code>
- is equivalent to <code class="literal">to_tsvector(x) @@ y</code>.
- The form <code class="type">text</code> <code class="literal">@@</code> <code class="type">text</code>
- is equivalent to <code class="literal">to_tsvector(x) @@ plainto_tsquery(y)</code>.
- </p><p>
- Within a <code class="type">tsquery</code>, the <code class="literal">&</code> (AND) operator
- specifies that both its arguments must appear in the document to have a
- match. Similarly, the <code class="literal">|</code> (OR) operator specifies that
- at least one of its arguments must appear, while the <code class="literal">!</code> (NOT)
- operator specifies that its argument must <span class="emphasis"><em>not</em></span> appear in
- order to have a match.
- For example, the query <code class="literal">fat & ! rat</code> matches documents that
- contain <code class="literal">fat</code> but not <code class="literal">rat</code>.
- </p><p>
- Searching for phrases is possible with the help of
- the <code class="literal"><-></code> (FOLLOWED BY) <code class="type">tsquery</code> operator, which
- matches only if its arguments have matches that are adjacent and in the
- given order. For example:
-
- </p><pre class="programlisting">
- SELECT to_tsvector('fatal error') @@ to_tsquery('fatal <-> error');
- ?column?
- ----------
- t
-
- SELECT to_tsvector('error is not fatal') @@ to_tsquery('fatal <-> error');
- ?column?
- ----------
- f
- </pre><p>
-
- There is a more general version of the FOLLOWED BY operator having the
- form <code class="literal"><<em class="replaceable"><code>N</code></em>></code>,
- where <em class="replaceable"><code>N</code></em> is an integer standing for the difference between
- the positions of the matching lexemes. <code class="literal"><1></code> is
- the same as <code class="literal"><-></code>, while <code class="literal"><2></code>
- allows exactly one other lexeme to appear between the matches, and so
- on. The <code class="literal">phraseto_tsquery</code> function makes use of this
- operator to construct a <code class="literal">tsquery</code> that can match a multi-word
- phrase when some of the words are stop words. For example:
-
- </p><pre class="programlisting">
- SELECT phraseto_tsquery('cats ate rats');
- phraseto_tsquery
- -------------------------------
- 'cat' <-> 'ate' <-> 'rat'
-
- SELECT phraseto_tsquery('the cats ate the rats');
- phraseto_tsquery
- -------------------------------
- 'cat' <-> 'ate' <2> 'rat'
- </pre><p>
- </p><p>
- A special case that's sometimes useful is that <code class="literal"><0></code>
- can be used to require that two patterns match the same word.
- </p><p>
- Parentheses can be used to control nesting of the <code class="type">tsquery</code>
- operators. Without parentheses, <code class="literal">|</code> binds least tightly,
- then <code class="literal">&</code>, then <code class="literal"><-></code>,
- and <code class="literal">!</code> most tightly.
- </p><p>
- It's worth noticing that the AND/OR/NOT operators mean something subtly
- different when they are within the arguments of a FOLLOWED BY operator
- than when they are not, because within FOLLOWED BY the exact position of
- the match is significant. For example, normally <code class="literal">!x</code> matches
- only documents that do not contain <code class="literal">x</code> anywhere.
- But <code class="literal">!x <-> y</code> matches <code class="literal">y</code> if it is not
- immediately after an <code class="literal">x</code>; an occurrence of <code class="literal">x</code>
- elsewhere in the document does not prevent a match. Another example is
- that <code class="literal">x & y</code> normally only requires that <code class="literal">x</code>
- and <code class="literal">y</code> both appear somewhere in the document, but
- <code class="literal">(x & y) <-> z</code> requires <code class="literal">x</code>
- and <code class="literal">y</code> to match at the same place, immediately before
- a <code class="literal">z</code>. Thus this query behaves differently from
- <code class="literal">x <-> z & y <-> z</code>, which will match a
- document containing two separate sequences <code class="literal">x z</code> and
- <code class="literal">y z</code>. (This specific query is useless as written,
- since <code class="literal">x</code> and <code class="literal">y</code> could not match at the same place;
- but with more complex situations such as prefix-match patterns, a query
- of this form could be useful.)
- </p></div><div class="sect2" id="TEXTSEARCH-INTRO-CONFIGURATIONS"><div class="titlepage"><div><div><h3 class="title">12.1.3. Configurations</h3></div></div></div><p>
- The above are all simple text search examples. As mentioned before, full
- text search functionality includes the ability to do many more things:
- skip indexing certain words (stop words), process synonyms, and use
- sophisticated parsing, e.g., parse based on more than just white space.
- This functionality is controlled by <em class="firstterm">text search
- configurations</em>. <span class="productname">PostgreSQL</span> comes with predefined
- configurations for many languages, and you can easily create your own
- configurations. (<span class="application">psql</span>'s <code class="command">\dF</code> command
- shows all available configurations.)
- </p><p>
- During installation an appropriate configuration is selected and
- <a class="xref" href="runtime-config-client.html#GUC-DEFAULT-TEXT-SEARCH-CONFIG">default_text_search_config</a> is set accordingly
- in <code class="filename">postgresql.conf</code>. If you are using the same text search
- configuration for the entire cluster you can use the value in
- <code class="filename">postgresql.conf</code>. To use different configurations
- throughout the cluster but the same configuration within any one database,
- use <code class="command">ALTER DATABASE ... SET</code>. Otherwise, you can set
- <code class="varname">default_text_search_config</code> in each session.
- </p><p>
- Each text search function that depends on a configuration has an optional
- <code class="type">regconfig</code> argument, so that the configuration to use can be
- specified explicitly. <code class="varname">default_text_search_config</code>
- is used only when this argument is omitted.
- </p><p>
- To make it easier to build custom text search configurations, a
- configuration is built up from simpler database objects.
- <span class="productname">PostgreSQL</span>'s text search facility provides
- four types of configuration-related database objects:
- </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
- <em class="firstterm">Text search parsers</em> break documents into tokens
- and classify each token (for example, as words or numbers).
- </p></li><li class="listitem" style="list-style-type: disc"><p>
- <em class="firstterm">Text search dictionaries</em> convert tokens to normalized
- form and reject stop words.
- </p></li><li class="listitem" style="list-style-type: disc"><p>
- <em class="firstterm">Text search templates</em> provide the functions underlying
- dictionaries. (A dictionary simply specifies a template and a set
- of parameters for the template.)
- </p></li><li class="listitem" style="list-style-type: disc"><p>
- <em class="firstterm">Text search configurations</em> select a parser and a set
- of dictionaries to use to normalize the tokens produced by the parser.
- </p></li></ul></div><p>
- Text search parsers and templates are built from low-level C functions;
- therefore it requires C programming ability to develop new ones, and
- superuser privileges to install one into a database. (There are examples
- of add-on parsers and templates in the <code class="filename">contrib/</code> area of the
- <span class="productname">PostgreSQL</span> distribution.) Since dictionaries and
- configurations just parameterize and connect together some underlying
- parsers and templates, no special privilege is needed to create a new
- dictionary or configuration. Examples of creating custom dictionaries and
- configurations appear later in this chapter.
- </p></div></div><div class="navfooter"><hr /><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="textsearch.html">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="textsearch.html">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="textsearch-tables.html">Next</a></td></tr><tr><td width="40%" align="left" valign="top">Chapter 12. Full Text Search </td><td width="20%" align="center"><a accesskey="h" href="index.html">Home</a></td><td width="40%" align="right" valign="top"> 12.2. Tables and Indexes</td></tr></table></div></body></html>
|