<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>9.7. Pattern Matching</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets V1.79.1" /><link rel="prev" href="functions-bitstring.html" title="9.6. Bit String Functions and Operators" /><link rel="next" href="functions-formatting.html" title="9.8. Data Type Formatting Functions" /></head><body><div xmlns="http://www.w3.org/TR/xhtml1/transitional" class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">9.7. Pattern Matching</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="functions-bitstring.html" title="9.6. Bit String Functions and Operators">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="functions.html" title="Chapter 9. Functions and Operators">Up</a></td><th width="60%" align="center">Chapter 9. Functions and Operators</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 12.4 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="functions-formatting.html" title="9.8. Data Type Formatting Functions">Next</a></td></tr></table><hr></hr></div><div class="sect1" id="FUNCTIONS-MATCHING"><div class="titlepage"><div><div><h2 class="title" style="clear: both">9.7. Pattern Matching</h2></div></div></div><div class="toc"><dl class="toc"><dt><span class="sect2"><a href="functions-matching.html#FUNCTIONS-LIKE">9.7.1. <code class="function">LIKE</code></a></span></dt><dt><span class="sect2"><a href="functions-matching.html#FUNCTIONS-SIMILARTO-REGEXP">9.7.2. 
<code class="function">SIMILAR TO</code> Regular Expressions</a></span></dt><dt><span class="sect2"><a href="functions-matching.html#FUNCTIONS-POSIX-REGEXP">9.7.3. <acronym class="acronym">POSIX</acronym> Regular Expressions</a></span></dt></dl></div><a id="id-1.5.8.12.2" class="indexterm"></a><p>
There are three separate approaches to pattern matching provided
by <span class="productname">PostgreSQL</span>: the traditional
<acronym class="acronym">SQL</acronym> <code class="function">LIKE</code> operator, the
more recent <code class="function">SIMILAR TO</code> operator (added in
SQL:1999), and <acronym class="acronym">POSIX</acronym>-style regular
expressions. Aside from the basic <span class="quote">“<span class="quote">does this string match
this pattern?</span>”</span> operators, functions are available to extract
or replace matching substrings and to split a string at matching
locations.
</p><div class="tip"><h3 class="title">Tip</h3><p>
If you have pattern matching needs that go beyond this,
consider writing a user-defined function in Perl or Tcl.
</p></div><div class="caution"><h3 class="title">Caution</h3><p>
While most regular-expression searches can be executed very quickly,
regular expressions can be contrived that take arbitrary amounts of
time and memory to process. Be wary of accepting regular-expression
search patterns from hostile sources. If you must do so, it is
advisable to impose a statement timeout.
</p><p>
Searches using <code class="function">SIMILAR TO</code> patterns have the same
security hazards, since <code class="function">SIMILAR TO</code> provides many
of the same capabilities as <acronym class="acronym">POSIX</acronym>-style regular
expressions.
</p><p>
<code class="function">LIKE</code> searches, being much simpler than the other
two options, are safer to use with possibly-hostile pattern sources.
</p></div><p>
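The statement timeout advised in the caution above can be scoped to a
single transaction with <code class="literal">SET LOCAL</code>. The following is
a minimal sketch, not a complete defense: the table
<code class="literal">documents</code>, its column <code class="literal">body</code>,
and the two-second limit are hypothetical, and the pattern literal stands in
for user-supplied input.
</p><pre class="programlisting">
BEGIN;
-- The setting applies only until the end of this transaction.
SET LOCAL statement_timeout = '2s';
-- A statement that runs longer than the limit is canceled with an error.
SELECT count(*) FROM documents WHERE body ~ '(a+)+$';
COMMIT;
</pre><p>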
The pattern matching operators of all three kinds do not support
nondeterministic collations. If required, apply a different collation to
the expression to work around this limitation.
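</p><p>
As a sketch of that workaround, a deterministic collation such as
<code class="literal">"C"</code> can be attached to the expression with a
<code class="literal">COLLATE</code> clause. The table <code class="literal">people</code>
and its column <code class="literal">name</code> are hypothetical, and
<code class="literal">name</code> is assumed to be declared with a nondeterministic
collation:
</p><pre class="programlisting">
-- Match against name under the deterministic "C" collation.
SELECT name FROM people WHERE name COLLATE "C" LIKE 'Ann%';
</pre><p>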
</p><div class="sect2" id="FUNCTIONS-LIKE"><div class="titlepage"><div><div><h3 class="title">9.7.1. <code class="function">LIKE</code></h3></div></div></div><a id="id-1.5.8.12.7.2" class="indexterm"></a><pre class="synopsis">
<em class="replaceable"><code>string</code></em> LIKE <em class="replaceable"><code>pattern</code></em> [<span class="optional">ESCAPE <em class="replaceable"><code>escape-character</code></em></span>]
<em class="replaceable"><code>string</code></em> NOT LIKE <em class="replaceable"><code>pattern</code></em> [<span class="optional">ESCAPE <em class="replaceable"><code>escape-character</code></em></span>]
</pre><p>
The <code class="function">LIKE</code> expression returns true if the
<em class="replaceable"><code>string</code></em> matches the supplied
<em class="replaceable"><code>pattern</code></em>. (As
expected, the <code class="function">NOT LIKE</code> expression returns
false if <code class="function">LIKE</code> returns true, and vice versa.
An equivalent expression is
<code class="literal">NOT (<em class="replaceable"><code>string</code></em> LIKE
<em class="replaceable"><code>pattern</code></em>)</code>.)
</p><p>
If <em class="replaceable"><code>pattern</code></em> does not contain percent
signs or underscores, then the pattern only represents the string
itself; in that case <code class="function">LIKE</code> acts like the
equals operator. An underscore (<code class="literal">_</code>) in
<em class="replaceable"><code>pattern</code></em> stands for (matches) any single
character; a percent sign (<code class="literal">%</code>) matches any sequence
of zero or more characters.
</p><p>
Some examples:
</p><pre class="programlisting">
'abc' LIKE 'abc'    <em class="lineannotation"><span class="lineannotation">true</span></em>
'abc' LIKE 'a%'     <em class="lineannotation"><span class="lineannotation">true</span></em>
'abc' LIKE '_b_'    <em class="lineannotation"><span class="lineannotation">true</span></em>
'abc' LIKE 'c'      <em class="lineannotation"><span class="lineannotation">false</span></em>
</pre><p>
</p><p>
<code class="function">LIKE</code> pattern matching always covers the entire
string. Therefore, if it's desired to match a sequence anywhere within
a string, the pattern must start and end with a percent sign.
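</p><p>
For instance, an unanchored search for <code class="literal">b</code> succeeds
only when the pattern carries a percent sign on both sides:
</p><pre class="programlisting">
'abc' LIKE 'b'      <em class="lineannotation"><span class="lineannotation">false</span></em>
'abc' LIKE '%b%'    <em class="lineannotation"><span class="lineannotation">true</span></em>
</pre><p>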
</p><p>
To match a literal underscore or percent sign without matching
other characters, the respective character in
<em class="replaceable"><code>pattern</code></em> must be
preceded by the escape character. The default escape
character is the backslash but a different one can be selected by
using the <code class="literal">ESCAPE</code> clause. To match the escape
character itself, write two escape characters.
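</p><p>
The escaping rules can be sketched with a few cases. These assume
<code class="literal">standard_conforming_strings</code> is on, so each backslash in
a string constant is a single literal backslash; the choice of
<code class="literal">#</code> as the alternative escape character is arbitrary.
</p><pre class="programlisting">
'100%' LIKE '100\%'             <em class="lineannotation"><span class="lineannotation">true</span></em>
'1000' LIKE '100\%'             <em class="lineannotation"><span class="lineannotation">false</span></em>
'a_b' LIKE 'a#_b' ESCAPE '#'    <em class="lineannotation"><span class="lineannotation">true</span></em>
'axb' LIKE 'a#_b' ESCAPE '#'    <em class="lineannotation"><span class="lineannotation">false</span></em>
'a\b' LIKE 'a\\b'               <em class="lineannotation"><span class="lineannotation">true</span></em>
</pre><p>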
</p><div class="note"><h3 class="title">Note</h3><p>
If you have <a class="xref" href="runtime-config-compatible.html#GUC-STANDARD-CONFORMING-STRINGS">standard_conforming_strings</a> turned off,
any backslashes you write in literal string constants will need to be
doubled. See <a class="xref" href="sql-syntax-lexical.html#SQL-SYNTAX-STRINGS" title="4.1.2.1. String Constants">Section 4.1.2.1</a> for more information.
</p></div><p>
It's also possible to select no escape character by writing
<code class="literal">ESCAPE ''</code>. This effectively disables the
escape mechanism, which makes it impossible to turn off the
special meaning of underscore and percent signs in the pattern.
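</p><p>
A short sketch of the difference, again assuming
<code class="literal">standard_conforming_strings</code> is on: with
<code class="literal">ESCAPE ''</code> the backslash is an ordinary pattern
character, and the underscore after it remains a wildcard.
</p><pre class="programlisting">
'a\b' LIKE 'a\b'                <em class="lineannotation"><span class="lineannotation">false</span></em>
'a\b' LIKE 'a\b' ESCAPE ''      <em class="lineannotation"><span class="lineannotation">true</span></em>
'a_b' LIKE 'a\_b' ESCAPE ''     <em class="lineannotation"><span class="lineannotation">false</span></em>
'a\xb' LIKE 'a\_b' ESCAPE ''    <em class="lineannotation"><span class="lineannotation">true</span></em>
</pre><p>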
</p><p>
The key word <code class="token">ILIKE</code> can be used instead of
<code class="token">LIKE</code> to make the match case-insensitive according
to the active locale. This is not in the <acronym class="acronym">SQL</acronym> standard but is a
<span class="productname">PostgreSQL</span> extension.
</p><p>
The operator <code class="literal">~~</code> is equivalent to
<code class="function">LIKE</code>, and <code class="literal">~~*</code> corresponds to
<code class="function">ILIKE</code>. There are also
<code class="literal">!~~</code> and <code class="literal">!~~*</code> operators that
represent <code class="function">NOT LIKE</code> and <code class="function">NOT
ILIKE</code>, respectively. All of these operators are
<span class="productname">PostgreSQL</span>-specific. You may see these
operator names in <code class="command">EXPLAIN</code> output and similar
places, since the parser actually translates <code class="function">LIKE</code>
et al. to these operators.
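</p><p>
The key words and their operator spellings can be compared side by side; the
following cases are a sketch using plain ASCII strings:
</p><pre class="programlisting">
'ABC' LIKE 'a%'      <em class="lineannotation"><span class="lineannotation">false</span></em>
'ABC' ILIKE 'a%'     <em class="lineannotation"><span class="lineannotation">true</span></em>
'abc' ~~ 'a%'        <em class="lineannotation"><span class="lineannotation">true</span></em>
'ABC' ~~* 'a%'       <em class="lineannotation"><span class="lineannotation">true</span></em>
'abc' !~~ 'a%'       <em class="lineannotation"><span class="lineannotation">false</span></em>
'ABC' !~~* 'x%'      <em class="lineannotation"><span class="lineannotation">true</span></em>
</pre><p>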
</p><p>
The phrases <code class="function">LIKE</code>, <code class="function">ILIKE</code>,
<code class="function">NOT LIKE</code>, and <code class="function">NOT ILIKE</code> are
generally treated as operators
in <span class="productname">PostgreSQL</span> syntax; for example they can
be used in <em class="replaceable"><code>expression</code></em>
<em class="replaceable"><code>operator</code></em> ANY
(<em class="replaceable"><code>subquery</code></em>) constructs, although
an <code class="literal">ESCAPE</code> clause cannot be included there. In some
obscure cases it may be necessary to use the underlying operator names
instead.
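</p><p>
A minimal sketch of the <code class="literal">ANY</code> form follows. The inline
<code class="literal">VALUES</code> list and the alias <code class="literal">t(p)</code>
are illustrative stand-ins for a real table of patterns.
</p><pre class="programlisting">
-- True when at least one pattern returned by the subquery matches.
SELECT 'abc' LIKE ANY (SELECT p FROM (VALUES ('x%'), ('a%')) AS t(p));
-- The same test written with the underlying operator name.
SELECT 'abc' ~~ ANY (SELECT p FROM (VALUES ('x%'), ('a%')) AS t(p));
</pre><p>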
</p><p>
There is also the prefix operator <code class="literal">^@</code> and the corresponding
<code class="function">starts_with</code> function, which cover cases where only
a match against the beginning of the string is needed.
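</p><p>
For illustration, the operator and the function give the same answer on a
prefix test:
</p><pre class="programlisting">
'alphabet' ^@ 'alph'               <em class="lineannotation"><span class="lineannotation">true</span></em>
starts_with('alphabet', 'alph')    <em class="lineannotation"><span class="lineannotation">true</span></em>
'alphabet' ^@ 'bet'                <em class="lineannotation"><span class="lineannotation">false</span></em>
</pre><p>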
</p></div><div class="sect2" id="FUNCTIONS-SIMILARTO-REGEXP"><div class="titlepage"><div><div><h3 class="title">9.7.2. <code class="function">SIMILAR TO</code> Regular Expressions</h3></div></div></div><a id="id-1.5.8.12.8.2" class="indexterm"></a><a id="id-1.5.8.12.8.3" class="indexterm"></a><a id="id-1.5.8.12.8.4" class="indexterm"></a><pre class="synopsis">
<em class="replaceable"><code>string</code></em> SIMILAR TO <em class="replaceable"><code>pattern</code></em> [<span class="optional">ESCAPE <em class="replaceable"><code>escape-character</code></em></span>]
<em class="replaceable"><code>string</code></em> NOT SIMILAR TO <em class="replaceable"><code>pattern</code></em> [<span class="optional">ESCAPE <em class="replaceable"><code>escape-character</code></em></span>]
</pre><p>
The <code class="function">SIMILAR TO</code> operator returns true or
false depending on whether its pattern matches the given string.
It is similar to <code class="function">LIKE</code>, except that it
interprets the pattern using the SQL standard's definition of a
regular expression. SQL regular expressions are a curious cross
between <code class="function">LIKE</code> notation and common regular
expression notation.
</p><p>
Like <code class="function">LIKE</code>, the <code class="function">SIMILAR TO</code>
operator succeeds only if its pattern matches the entire string;
this is unlike common regular expression behavior where the pattern
can match any part of the string.
Also like
<code class="function">LIKE</code>, <code class="function">SIMILAR TO</code> uses
<code class="literal">_</code> and <code class="literal">%</code> as wildcard characters denoting
any single character and any string, respectively (these are
comparable to <code class="literal">.</code> and <code class="literal">.*</code> in POSIX regular
expressions).
</p><p>
In addition to these facilities borrowed from <code class="function">LIKE</code>,
<code class="function">SIMILAR TO</code> supports these pattern-matching
metacharacters borrowed from POSIX regular expressions:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
<code class="literal">|</code> denotes alternation (either of two alternatives).
</p></li><li class="listitem"><p>
<code class="literal">*</code> denotes repetition of the previous item zero
or more times.
</p></li><li class="listitem"><p>
<code class="literal">+</code> denotes repetition of the previous item one
or more times.
</p></li><li class="listitem"><p>
<code class="literal">?</code> denotes repetition of the previous item zero
or one time.
</p></li><li class="listitem"><p>
<code class="literal">{</code><em class="replaceable"><code>m</code></em><code class="literal">}</code> denotes repetition
of the previous item exactly <em class="replaceable"><code>m</code></em> times.
</p></li><li class="listitem"><p>
<code class="literal">{</code><em class="replaceable"><code>m</code></em><code class="literal">,}</code> denotes repetition
of the previous item <em class="replaceable"><code>m</code></em> or more times.
</p></li><li class="listitem"><p>
<code class="literal">{</code><em class="replaceable"><code>m</code></em><code class="literal">,</code><em class="replaceable"><code>n</code></em><code class="literal">}</code>
denotes repetition of the previous item at least <em class="replaceable"><code>m</code></em> and
not more than <em class="replaceable"><code>n</code></em> times.
</p></li><li class="listitem"><p>
Parentheses <code class="literal">()</code> can be used to group items into
a single logical item.
</p></li><li class="listitem"><p>
A bracket expression <code class="literal">[...]</code> specifies a character
class, just as in POSIX regular expressions.
</p></li></ul></div><p>
Notice that the period (<code class="literal">.</code>) is not a metacharacter
for <code class="function">SIMILAR TO</code>.
</p><p>
As with <code class="function">LIKE</code>, a backslash disables the special meaning
of any of these metacharacters; or a different escape character can
be specified with <code class="literal">ESCAPE</code>.
</p><p>
Some examples:
</p><pre class="programlisting">
'abc' SIMILAR TO 'abc'      <em class="lineannotation"><span class="lineannotation">true</span></em>
'abc' SIMILAR TO 'a'        <em class="lineannotation"><span class="lineannotation">false</span></em>
'abc' SIMILAR TO '%(b|d)%'  <em class="lineannotation"><span class="lineannotation">true</span></em>
'abc' SIMILAR TO '(b|c)%'   <em class="lineannotation"><span class="lineannotation">false</span></em>
</pre><p>
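A few further cases, sketched from the rules above, exercise the quantifiers,
grouping, bracket expressions, and the literal period:
</p><pre class="programlisting">
'aaa' SIMILAR TO 'a{3}'       <em class="lineannotation"><span class="lineannotation">true</span></em>
'aaaa' SIMILAR TO 'a{2,3}'    <em class="lineannotation"><span class="lineannotation">false</span></em>
'abab' SIMILAR TO '(ab)+'     <em class="lineannotation"><span class="lineannotation">true</span></em>
'ac' SIMILAR TO 'ab?c'        <em class="lineannotation"><span class="lineannotation">true</span></em>
'abc' SIMILAR TO '[a-c]*'     <em class="lineannotation"><span class="lineannotation">true</span></em>
'a.c' SIMILAR TO 'a.c'        <em class="lineannotation"><span class="lineannotation">true</span></em>
'abc' SIMILAR TO 'a.c'        <em class="lineannotation"><span class="lineannotation">false</span></em>
</pre><p>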
</p><p>
The <code class="function">substring</code> function with three parameters
provides extraction of a substring that matches an SQL
regular expression pattern. The function can be written according
to SQL:1999 syntax:
</p><pre class="synopsis">
substring(<em class="replaceable"><code>string</code></em> from <em class="replaceable"><code>pattern</code></em> for <em class="replaceable"><code>escape-character</code></em>)
</pre><p>
or as a plain three-argument function:
</p><pre class="synopsis">
substring(<em class="replaceable"><code>string</code></em>, <em class="replaceable"><code>pattern</code></em>, <em class="replaceable"><code>escape-character</code></em>)
</pre><p>
As with <code class="literal">SIMILAR TO</code>, the
specified pattern must match the entire data string, or else the
function fails and returns null. To indicate the part of the
pattern for which the matching data sub-string is of interest,
the pattern should contain
two occurrences of the escape character followed by a double quote
(<code class="literal">"</code>).
The text matching the portion of the pattern
between these separators is returned when the match is successful.
</p><p>
The escape-double-quote separators actually
divide <code class="function">substring</code>'s pattern into three independent
regular expressions; for example, a vertical bar (<code class="literal">|</code>)
in any of the three sections affects only that section. Also, the first
and third of these regular expressions are defined to match the smallest
possible amount of text, not the largest, when there is any ambiguity
about how much of the data string matches which pattern. (In POSIX
parlance, the first and third regular expressions are forced to be
non-greedy.)
</p><p>
As an extension to the SQL standard, <span class="productname">PostgreSQL</span>
allows there to be just one escape-double-quote separator, in which case
the third regular expression is taken as empty; or no separators, in which
case the first and third regular expressions are taken as empty.
</p><p>
Some examples, with <code class="literal">#"</code> delimiting the return string:
</p><pre class="programlisting">
substring('foobar' from '%#"o_b#"%' for '#')   <em class="lineannotation"><span class="lineannotation">oob</span></em>
substring('foobar' from '#"o_b#"%' for '#')    <em class="lineannotation"><span class="lineannotation">NULL</span></em>
</pre><p>
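The same extractions can be written in the plain three-argument form; this
sketch simply restates the two examples above:
</p><pre class="programlisting">
substring('foobar', '%#"o_b#"%', '#')   <em class="lineannotation"><span class="lineannotation">oob</span></em>
substring('foobar', '#"o_b#"%', '#')    <em class="lineannotation"><span class="lineannotation">NULL</span></em>
</pre><p>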
</p></div><div class="sect2" id="FUNCTIONS-POSIX-REGEXP"><div class="titlepage"><div><div><h3 class="title">9.7.3. <acronym class="acronym">POSIX</acronym> Regular Expressions</h3></div></div></div><a id="id-1.5.8.12.9.2" class="indexterm"></a><a id="id-1.5.8.12.9.3" class="indexterm"></a><a id="id-1.5.8.12.9.4" class="indexterm"></a><a id="id-1.5.8.12.9.5" class="indexterm"></a><a id="id-1.5.8.12.9.6" class="indexterm"></a><a id="id-1.5.8.12.9.7" class="indexterm"></a><a id="id-1.5.8.12.9.8" class="indexterm"></a><p>
<a class="xref" href="functions-matching.html#FUNCTIONS-POSIX-TABLE" title="Table 9.15. Regular Expression Match Operators">Table 9.15</a> lists the available
operators for pattern matching using POSIX regular expressions.
</p><div class="table" id="FUNCTIONS-POSIX-TABLE"><p class="title"><strong>Table 9.15. Regular Expression Match Operators</strong></p><div class="table-contents"><table class="table" summary="Regular Expression Match Operators" border="1"><colgroup><col /><col /><col /></colgroup><thead><tr><th>Operator</th><th>Description</th><th>Example</th></tr></thead><tbody><tr><td> <code class="literal">~</code> </td><td>Matches regular expression, case sensitive</td><td><code class="literal">'thomas' ~ '.*thomas.*'</code></td></tr><tr><td> <code class="literal">~*</code> </td><td>Matches regular expression, case insensitive</td><td><code class="literal">'thomas' ~* '.*Thomas.*'</code></td></tr><tr><td> <code class="literal">!~</code> </td><td>Does not match regular expression, case sensitive</td><td><code class="literal">'thomas' !~ '.*Thomas.*'</code></td></tr><tr><td> <code class="literal">!~*</code> </td><td>Does not match regular expression, case insensitive</td><td><code class="literal">'thomas' !~* '.*vadim.*'</code></td></tr></tbody></table></div></div><br class="table-break" /><p>
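Evaluated as expressions, the four examples in the table above all yield
true. The last two lines of this sketch add contrasting cases that yield
false:
</p><pre class="programlisting">
'thomas' ~ '.*thomas.*'      <em class="lineannotation"><span class="lineannotation">true</span></em>
'thomas' ~* '.*Thomas.*'     <em class="lineannotation"><span class="lineannotation">true</span></em>
'thomas' !~ '.*Thomas.*'     <em class="lineannotation"><span class="lineannotation">true</span></em>
'thomas' !~* '.*vadim.*'     <em class="lineannotation"><span class="lineannotation">true</span></em>
'thomas' ~ '.*Thomas.*'      <em class="lineannotation"><span class="lineannotation">false</span></em>
'thomas' !~* '.*Thomas.*'    <em class="lineannotation"><span class="lineannotation">false</span></em>
</pre><p>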
<acronym class="acronym">POSIX</acronym> regular expressions provide a more
powerful means for pattern matching than the <code class="function">LIKE</code> and
<code class="function">SIMILAR TO</code> operators.
Many Unix tools such as <code class="command">egrep</code>,
<code class="command">sed</code>, or <code class="command">awk</code> use a pattern
matching language that is similar to the one described here.
</p><p>
A regular expression is a character sequence that is an
abbreviated definition of a set of strings (a <em class="firstterm">regular
set</em>). A string is said to match a regular expression
if it is a member of the regular set described by the regular
expression. As with <code class="function">LIKE</code>, pattern characters
match string characters exactly unless they are special characters
in the regular expression language, but regular expressions use
different special characters than <code class="function">LIKE</code> does.
Unlike <code class="function">LIKE</code> patterns, a
regular expression is allowed to match anywhere within a string, unless
the regular expression is explicitly anchored to the beginning or
end of the string.
</p><p>
Some examples:
</p><pre class="programlisting">
'abc' ~ 'abc'    <em class="lineannotation"><span class="lineannotation">true</span></em>
'abc' ~ '^a'     <em class="lineannotation"><span class="lineannotation">true</span></em>
'abc' ~ '(b|d)'  <em class="lineannotation"><span class="lineannotation">true</span></em>
'abc' ~ '^(b|c)' <em class="lineannotation"><span class="lineannotation">false</span></em>
</pre><p>
</p><p>
The <acronym class="acronym">POSIX</acronym> pattern language is described in much
greater detail below.
</p><p>
The <code class="function">substring</code> function with two parameters,
<code class="function">substring(<em class="replaceable"><code>string</code></em> from
<em class="replaceable"><code>pattern</code></em>)</code>, provides extraction of a
substring
that matches a POSIX regular expression pattern. It returns null if
there is no match, otherwise the portion of the text that matched the
pattern. But if the pattern contains any parentheses, the portion
of the text that matched the first parenthesized subexpression (the
one whose left parenthesis comes first) is
returned. You can put parentheses around the whole expression
if you want to use parentheses within it without triggering this
exception. If you need parentheses in the pattern before the
subexpression you want to extract, see the non-capturing parentheses
described below.
</p><p>
Some examples:
</p><pre class="programlisting">
substring('foobar' from 'o.b')     <em class="lineannotation"><span class="lineannotation">oob</span></em>
substring('foobar' from 'o(.)b')   <em class="lineannotation"><span class="lineannotation">o</span></em>
</pre><p>
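The parenthesis rules can be sketched with a few more cases: no match,
parentheses around the whole expression, and a leading group written first
as capturing and then as non-capturing with <code class="literal">(?:...)</code>:
</p><pre class="programlisting">
substring('foobar' from 'xyz')           <em class="lineannotation"><span class="lineannotation">NULL</span></em>
substring('foobar' from '(o(.)b)')       <em class="lineannotation"><span class="lineannotation">oob</span></em>
substring('foobar' from '(fo)(o.)')      <em class="lineannotation"><span class="lineannotation">fo</span></em>
substring('foobar' from '(?:fo)(o.)')    <em class="lineannotation"><span class="lineannotation">ob</span></em>
</pre><p>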
</p><p>
The <code class="function">regexp_replace</code> function provides substitution of
new text for substrings that match POSIX regular expression patterns.
It has the syntax
<code class="function">regexp_replace</code>(<em class="replaceable"><code>source</code></em>,
<em class="replaceable"><code>pattern</code></em>, <em class="replaceable"><code>replacement</code></em>
[<span class="optional">, <em class="replaceable"><code>flags</code></em> </span>]).
The <em class="replaceable"><code>source</code></em> string is returned unchanged if
there is no match to the <em class="replaceable"><code>pattern</code></em>. If there is a
match, the <em class="replaceable"><code>source</code></em> string is returned with the
<em class="replaceable"><code>replacement</code></em> string substituted for the matching
substring. The <em class="replaceable"><code>replacement</code></em> string can contain
<code class="literal">\</code><em class="replaceable"><code>n</code></em>, where <em class="replaceable"><code>n</code></em> is 1
through 9, to indicate that the source substring matching the
<em class="replaceable"><code>n</code></em>'th parenthesized subexpression of the pattern should be
inserted, and it can contain <code class="literal">\&amp;</code> to indicate that the
substring matching the entire pattern should be inserted. Write
<code class="literal">\\</code> if you need to put a literal backslash in the replacement
text.
The <em class="replaceable"><code>flags</code></em> parameter is an optional text
string containing zero or more single-letter flags that change the
function's behavior. Flag <code class="literal">i</code> specifies case-insensitive
matching, while flag <code class="literal">g</code> specifies replacement of each matching
substring rather than only the first one. Supported flags (though
not <code class="literal">g</code>) are
described in <a class="xref" href="functions-matching.html#POSIX-EMBEDDED-OPTIONS-TABLE" title="Table 9.23. ARE Embedded-Option Letters">Table 9.23</a>.
</p><p>
Some examples:
</p><pre class="programlisting">
regexp_replace('foobarbaz', 'b..', 'X')
                                   <em class="lineannotation"><span class="lineannotation">fooXbaz</span></em>
regexp_replace('foobarbaz', 'b..', 'X', 'g')
                                   <em class="lineannotation"><span class="lineannotation">fooXX</span></em>
regexp_replace('foobarbaz', 'b(..)', 'X\1Y', 'g')
                                   <em class="lineannotation"><span class="lineannotation">fooXarYXazY</span></em>
</pre><p>
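The remaining features described above can be sketched the same way: the
<code class="literal">\&amp;</code> reference, the <code class="literal">i</code> flag,
a literal backslash, and the no-match case. The examples assume
<code class="literal">standard_conforming_strings</code> is on.
</p><pre class="programlisting">
regexp_replace('foobarbaz', 'b..', '[\&amp;]', 'g')
                                   <em class="lineannotation"><span class="lineannotation">foo[bar][baz]</span></em>
regexp_replace('FOOBARBAZ', 'b..', 'x', 'i')
                                   <em class="lineannotation"><span class="lineannotation">FOOxBAZ</span></em>
regexp_replace('foobarbaz', 'b..', '\\')
                                   <em class="lineannotation"><span class="lineannotation">foo\baz</span></em>
regexp_replace('foobarbaz', 'xyz', 'X')
                                   <em class="lineannotation"><span class="lineannotation">foobarbaz</span></em>
</pre><p>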
</p><p>
The <code class="function">regexp_match</code> function returns a text array of
captured substring(s) resulting from the first match of a POSIX
regular expression pattern to a string. It has the syntax
<code class="function">regexp_match</code>(<em class="replaceable"><code>string</code></em>,
<em class="replaceable"><code>pattern</code></em> [<span class="optional">, <em class="replaceable"><code>flags</code></em> </span>]).
If there is no match, the result is <code class="literal">NULL</code>.
If a match is found, and the <em class="replaceable"><code>pattern</code></em> contains no
parenthesized subexpressions, then the result is a single-element text
array containing the substring matching the whole pattern.
If a match is found, and the <em class="replaceable"><code>pattern</code></em> contains
parenthesized subexpressions, then the result is a text array
whose <em class="replaceable"><code>n</code></em>'th element is the substring matching
the <em class="replaceable"><code>n</code></em>'th parenthesized subexpression of
the <em class="replaceable"><code>pattern</code></em> (not counting <span class="quote">“<span class="quote">non-capturing</span>”</span>
parentheses; see below for details).
The <em class="replaceable"><code>flags</code></em> parameter is an optional text string
containing zero or more single-letter flags that change the function's
behavior. Supported flags are described
in <a class="xref" href="functions-matching.html#POSIX-EMBEDDED-OPTIONS-TABLE" title="Table 9.23. ARE Embedded-Option Letters">Table 9.23</a>.
</p><p>
Some examples:
</p><pre class="programlisting">
SELECT regexp_match('foobarbequebaz', 'bar.*que');
 regexp_match
--------------
 {barbeque}
(1 row)

SELECT regexp_match('foobarbequebaz', '(bar)(beque)');
 regexp_match
--------------
 {bar,beque}
(1 row)
</pre><p>
In the common case where you just want the whole matching substring
or <code class="literal">NULL</code> for no match, write something like
</p><pre class="programlisting">
SELECT (regexp_match('foobarbequebaz', 'bar.*que'))[1];
 regexp_match
--------------
 barbeque
(1 row)
</pre><p>
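Two further cases, sketched from the description above, show the
<code class="literal">i</code> flag and the <code class="literal">NULL</code> result
for a pattern that does not match:
</p><pre class="programlisting">
SELECT regexp_match('FooBarBaz', '(bar)(baz)', 'i');
 regexp_match
--------------
 {Bar,Baz}
(1 row)

SELECT regexp_match('foobar', 'xyz') IS NULL;
 ?column?
----------
 t
(1 row)
</pre><p>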
</p><p>
The <code class="function">regexp_matches</code> function returns a set of text arrays
of captured substring(s) resulting from matching a POSIX regular
expression pattern to a string. It has the same syntax as
<code class="function">regexp_match</code>.
This function returns no rows if there is no match, one row if there is
a match and the <code class="literal">g</code> flag is not given, or <em class="replaceable"><code>N</code></em>
rows if there are <em class="replaceable"><code>N</code></em> matches and the <code class="literal">g</code> flag
is given. Each returned row is a text array containing the whole
matched substring or the substrings matching parenthesized
subexpressions of the <em class="replaceable"><code>pattern</code></em>, just as described above
for <code class="function">regexp_match</code>.
<code class="function">regexp_matches</code> accepts all the flags shown
in <a class="xref" href="functions-matching.html#POSIX-EMBEDDED-OPTIONS-TABLE" title="Table 9.23. ARE Embedded-Option Letters">Table 9.23</a>, plus
the <code class="literal">g</code> flag which commands it to return all matches, not
just the first one.
</p><p>
Some examples:
</p><pre class="programlisting">
SELECT regexp_matches('foo', 'not there');
 regexp_matches
----------------
(0 rows)

SELECT regexp_matches('foobarbequebazilbarfbonk', '(b[^b]+)(b[^b]+)', 'g');
 regexp_matches
----------------
 {bar,beque}
 {bazil,barf}
(2 rows)
</pre><p>
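For comparison, the same call without the <code class="literal">g</code> flag
returns only the first match, as a single row:
</p><pre class="programlisting">
SELECT regexp_matches('foobarbequebazilbarfbonk', '(b[^b]+)(b[^b]+)');
 regexp_matches
----------------
 {bar,beque}
(1 row)
</pre><p>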
</p><div class="tip"><h3 class="title">Tip</h3><p>
In most cases <code class="function">regexp_matches()</code> should be used with
the <code class="literal">g</code> flag, since if you only want the first match, it's
easier and more efficient to use <code class="function">regexp_match()</code>.
However, <code class="function">regexp_match()</code> only exists
in <span class="productname">PostgreSQL</span> version 10 and up. When working in older
versions, a common trick is to place a <code class="function">regexp_matches()</code>
call in a sub-select, for example:
</p><pre class="programlisting">
SELECT col1, (SELECT regexp_matches(col2, '(bar)(beque)')) FROM tab;
</pre><p>
This produces a text array if there's a match, or <code class="literal">NULL</code> if
not, the same as <code class="function">regexp_match()</code> would do. Without the
sub-select, this query would produce no output at all for table rows
without a match, which is typically not the desired behavior.
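</p><p>
On version 10 and up, the same result can be sketched without the
sub-select, reusing the placeholder table and columns from the example above:
</p><pre class="programlisting">
-- Yields a text array per row, or NULL for rows without a match.
SELECT col1, regexp_match(col2, '(bar)(beque)') FROM tab;
</pre><p>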
</p></div><p>
The <code class="function">regexp_split_to_table</code> function splits a string using a POSIX
regular expression pattern as a delimiter. It has the syntax
<code class="function">regexp_split_to_table</code>(<em class="replaceable"><code>string</code></em>, <em class="replaceable"><code>pattern</code></em>
[<span class="optional">, <em class="replaceable"><code>flags</code></em> </span>]).
If there is no match to the <em class="replaceable"><code>pattern</code></em>, the function returns the
<em class="replaceable"><code>string</code></em>. If there is at least one match, for each match it returns
the text from the end of the last match (or the beginning of the string)
to the beginning of the match. When there are no more matches, it
returns the text from the end of the last match to the end of the string.
The <em class="replaceable"><code>flags</code></em> parameter is an optional text string containing
zero or more single-letter flags that change the function's behavior.
<code class="function">regexp_split_to_table</code> supports the flags described in
<a class="xref" href="functions-matching.html#POSIX-EMBEDDED-OPTIONS-TABLE" title="Table 9.23. ARE Embedded-Option Letters">Table 9.23</a>.
</p><p>
The <code class="function">regexp_split_to_array</code> function behaves the same as
<code class="function">regexp_split_to_table</code>, except that <code class="function">regexp_split_to_array</code>
returns its result as an array of <code class="type">text</code>. It has the syntax
<code class="function">regexp_split_to_array</code>(<em class="replaceable"><code>string</code></em>, <em class="replaceable"><code>pattern</code></em>
[<span class="optional">, <em class="replaceable"><code>flags</code></em> </span>]).
The parameters are the same as for <code class="function">regexp_split_to_table</code>.
</p><p>
Some examples:
</p><pre class="programlisting">
SELECT foo FROM regexp_split_to_table('the quick brown fox jumps over the lazy dog', '\s+') AS foo;
  foo
-------
 the
 quick
 brown
 fox
 jumps
 over
 the
 lazy
 dog
(9 rows)

SELECT regexp_split_to_array('the quick brown fox jumps over the lazy dog', '\s+');
              regexp_split_to_array
-----------------------------------------------
 {the,quick,brown,fox,jumps,over,the,lazy,dog}
(1 row)

SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo;
 foo
-----
 t
 h
 e
 q
 u
 i
 c
 k
 b
 r
 o
 w
 n
 f
 o
 x
(16 rows)
</pre><p>
  465. </p><p>
  466. As the last example demonstrates, the regexp split functions ignore
  467. zero-length matches that occur at the start or end of the string
  468. or immediately after a previous match. This is contrary to the strict
  469. definition of regexp matching that is implemented by
  470. <code class="function">regexp_match</code> and
  471. <code class="function">regexp_matches</code>, but is usually the most convenient behavior
  472. in practice. Other software systems such as Perl use similar definitions.
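</p><p>
A few further calls, offered as an illustrative sketch with arbitrary input
strings, show the no-match case, the optional
<em class="replaceable"><code>flags</code></em> argument, and the treatment of
zero-length matches:
</p><pre class="programlisting">
-- No match for the pattern: the whole string comes back as the only element.
SELECT regexp_split_to_array('helloworld', ',');           -- {helloworld}

-- The i flag makes the delimiter match case-insensitively.
SELECT regexp_split_to_array('oneXtwoxthree', 'x', 'i');   -- {one,two,three}

-- Zero-length matches at the start and end of the string are ignored,
-- so no empty elements appear in the result.
SELECT regexp_split_to_array('abc', 'x*');                 -- {a,b,c}
</pre><p>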
  473. </p><div class="sect3" id="POSIX-SYNTAX-DETAILS"><div class="titlepage"><div><div><h4 class="title">9.7.3.1. Regular Expression Details</h4></div></div></div><p>
  474. <span class="productname">PostgreSQL</span>'s regular expressions are implemented
  475. using a software package written by Henry Spencer. Much of
  476. the description of regular expressions below is copied verbatim from his
  477. manual.
  478. </p><p>
  479. Regular expressions (<acronym class="acronym">RE</acronym>s), as defined in
  480. <acronym class="acronym">POSIX</acronym> 1003.2, come in two forms:
  481. <em class="firstterm">extended</em> <acronym class="acronym">RE</acronym>s or <acronym class="acronym">ERE</acronym>s
  482. (roughly those of <code class="command">egrep</code>), and
  483. <em class="firstterm">basic</em> <acronym class="acronym">RE</acronym>s or <acronym class="acronym">BRE</acronym>s
  484. (roughly those of <code class="command">ed</code>).
  485. <span class="productname">PostgreSQL</span> supports both forms, and
  486. also implements some extensions
  487. that are not in the POSIX standard, but have become widely used
  488. due to their availability in programming languages such as Perl and Tcl.
  489. <acronym class="acronym">RE</acronym>s using these non-POSIX extensions are called
  490. <em class="firstterm">advanced</em> <acronym class="acronym">RE</acronym>s or <acronym class="acronym">ARE</acronym>s
  491. in this documentation. AREs are almost an exact superset of EREs,
  492. but BREs have several notational incompatibilities (as well as being
  493. much more limited).
  494. We first describe the ARE and ERE forms, noting features that apply
  495. only to AREs, and then describe how BREs differ.
  496. </p><div class="note"><h3 class="title">Note</h3><p>
  497. <span class="productname">PostgreSQL</span> always initially presumes that a regular
  498. expression follows the ARE rules. However, the more limited ERE or
  499. BRE rules can be chosen by prepending an <em class="firstterm">embedded option</em>
  500. to the RE pattern, as described in <a class="xref" href="functions-matching.html#POSIX-METASYNTAX" title="9.7.3.4. Regular Expression Metasyntax">Section 9.7.3.4</a>.
  501. This can be useful for compatibility with applications that expect
  502. exactly the <acronym class="acronym">POSIX</acronym> 1003.2 rules.
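</p><p>
As a hedged illustration, assuming the <code class="literal">(?e)</code> embedded option
described in that section and the default setting of
<code class="varname">standard_conforming_strings</code>, the same pattern text can behave
differently under ARE and ERE rules:
</p><pre class="programlisting">
SELECT '5' ~ '\d';       -- true: under ARE rules, \d is a class shorthand for a digit
SELECT '5' ~ '(?e)\d';   -- false: under ERE rules, \d just stands for the letter d
SELECT 'd' ~ '(?e)\d';   -- true
</pre><p>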
  503. </p></div><p>
  504. A regular expression is defined as one or more
  505. <em class="firstterm">branches</em>, separated by
  506. <code class="literal">|</code>. It matches anything that matches one of the
  507. branches.
  508. </p><p>
  509. A branch is zero or more <em class="firstterm">quantified atoms</em> or
  510. <em class="firstterm">constraints</em>, concatenated.
  511. It matches a match for the first item, followed by a match for the second, and so on;
  512. an empty branch matches the empty string.
  513. </p><p>
  514. A quantified atom is an <em class="firstterm">atom</em> possibly followed
  515. by a single <em class="firstterm">quantifier</em>.
  516. Without a quantifier, it matches a match for the atom.
  517. With a quantifier, it can match some number of matches of the atom.
  518. An <em class="firstterm">atom</em> can be any of the possibilities
  519. shown in <a class="xref" href="functions-matching.html#POSIX-ATOMS-TABLE" title="Table 9.16. Regular Expression Atoms">Table 9.16</a>.
  520. The possible quantifiers and their meanings are shown in
  521. <a class="xref" href="functions-matching.html#POSIX-QUANTIFIERS-TABLE" title="Table 9.17. Regular Expression Quantifiers">Table 9.17</a>.
  522. </p><p>
  523. A <em class="firstterm">constraint</em> matches an empty string, but matches only when
  524. specific conditions are met. A constraint can be used where an atom
  525. could be used, except it cannot be followed by a quantifier.
  526. The simple constraints are shown in
  527. <a class="xref" href="functions-matching.html#POSIX-CONSTRAINTS-TABLE" title="Table 9.18. Regular Expression Constraints">Table 9.18</a>;
  528. some more constraints are described later.
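</p><p>
The following sketch, using arbitrary sample strings, shows how branches,
quantified atoms, and simple constraints combine:
</p><pre class="programlisting">
-- Two branches separated by |; the match succeeds if either branch matches.
SELECT 'dog' ~ 'cat|dog';        -- true

-- The atom (ab) followed by the quantifier +, between the constraints ^ and $.
SELECT 'ababab' ~ '^(ab)+$';     -- true
SELECT 'ababa'  ~ '^(ab)+$';     -- false: the trailing a is left over
</pre><p>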
  529. </p><div class="table" id="POSIX-ATOMS-TABLE"><p class="title"><strong>Table 9.16. Regular Expression Atoms</strong></p><div class="table-contents"><table class="table" summary="Regular Expression Atoms" border="1"><colgroup><col /><col /></colgroup><thead><tr><th>Atom</th><th>Description</th></tr></thead><tbody><tr><td> <code class="literal">(</code><em class="replaceable"><code>re</code></em><code class="literal">)</code> </td><td> (where <em class="replaceable"><code>re</code></em> is any regular expression)
  530. matches a match for
  531. <em class="replaceable"><code>re</code></em>, with the match noted for possible reporting </td></tr><tr><td> <code class="literal">(?:</code><em class="replaceable"><code>re</code></em><code class="literal">)</code> </td><td> as above, but the match is not noted for reporting
  532. (a <span class="quote">“<span class="quote">non-capturing</span>”</span> set of parentheses)
  533. (AREs only) </td></tr><tr><td> <code class="literal">.</code> </td><td> matches any single character </td></tr><tr><td> <code class="literal">[</code><em class="replaceable"><code>chars</code></em><code class="literal">]</code> </td><td> a <em class="firstterm">bracket expression</em>,
  534. matching any one of the <em class="replaceable"><code>chars</code></em> (see
  535. <a class="xref" href="functions-matching.html#POSIX-BRACKET-EXPRESSIONS" title="9.7.3.2. Bracket Expressions">Section 9.7.3.2</a> for more detail) </td></tr><tr><td> <code class="literal">\</code><em class="replaceable"><code>k</code></em> </td><td> (where <em class="replaceable"><code>k</code></em> is a non-alphanumeric character)
  536. matches that character taken as an ordinary character,
  537. e.g., <code class="literal">\\</code> matches a backslash character </td></tr><tr><td> <code class="literal">\</code><em class="replaceable"><code>c</code></em> </td><td> where <em class="replaceable"><code>c</code></em> is alphanumeric
  538. (possibly followed by other characters)
  539. is an <em class="firstterm">escape</em>, see <a class="xref" href="functions-matching.html#POSIX-ESCAPE-SEQUENCES" title="9.7.3.3. Regular Expression Escapes">Section 9.7.3.3</a>
  540. (AREs only; in EREs and BREs, this matches <em class="replaceable"><code>c</code></em>) </td></tr><tr><td> <code class="literal">{</code> </td><td> when followed by a character other than a digit,
  541. matches the left-brace character <code class="literal">{</code>;
  542. when followed by a digit, it is the beginning of a
  543. <em class="replaceable"><code>bound</code></em> (see below) </td></tr><tr><td> <em class="replaceable"><code>x</code></em> </td><td> where <em class="replaceable"><code>x</code></em> is a single character with no other
  544. significance, matches that character </td></tr></tbody></table></div></div><br class="table-break" /><p>
  545. An RE cannot end with a backslash (<code class="literal">\</code>).
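</p><p>
A brief sketch of a few atoms from the table above, with arbitrary sample
strings and assuming <code class="varname">standard_conforming_strings</code> is on:
</p><pre class="programlisting">
-- . matches any single character; \. matches only a literal dot.
SELECT 'axb' ~ '^a.b$';     -- true
SELECT 'axb' ~ '^a\.b$';    -- false
SELECT 'a.b' ~ '^a\.b$';    -- true

-- (re) notes its match for reporting, while (?:re) does not.
SELECT regexp_match('foobar', '(?:foo)(bar)');   -- {bar}

-- { followed by a character other than a digit is an ordinary left brace.
SELECT 'a{b' ~ '^a{b$';     -- true
</pre><p>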
  546. </p><div class="note"><h3 class="title">Note</h3><p>
  547. If you have <a class="xref" href="runtime-config-compatible.html#GUC-STANDARD-CONFORMING-STRINGS">standard_conforming_strings</a> turned off,
  548. any backslashes you write in literal string constants will need to be
  549. doubled. See <a class="xref" href="sql-syntax-lexical.html#SQL-SYNTAX-STRINGS" title="4.1.2.1. String Constants">Section 4.1.2.1</a> for more information.
  550. </p></div><div class="table" id="POSIX-QUANTIFIERS-TABLE"><p class="title"><strong>Table 9.17. Regular Expression Quantifiers</strong></p><div class="table-contents"><table class="table" summary="Regular Expression Quantifiers" border="1"><colgroup><col /><col /></colgroup><thead><tr><th>Quantifier</th><th>Matches</th></tr></thead><tbody><tr><td> <code class="literal">*</code> </td><td> a sequence of 0 or more matches of the atom </td></tr><tr><td> <code class="literal">+</code> </td><td> a sequence of 1 or more matches of the atom </td></tr><tr><td> <code class="literal">?</code> </td><td> a sequence of 0 or 1 matches of the atom </td></tr><tr><td> <code class="literal">{</code><em class="replaceable"><code>m</code></em><code class="literal">}</code> </td><td> a sequence of exactly <em class="replaceable"><code>m</code></em> matches of the atom </td></tr><tr><td> <code class="literal">{</code><em class="replaceable"><code>m</code></em><code class="literal">,}</code> </td><td> a sequence of <em class="replaceable"><code>m</code></em> or more matches of the atom </td></tr><tr><td>
  551. <code class="literal">{</code><em class="replaceable"><code>m</code></em><code class="literal">,</code><em class="replaceable"><code>n</code></em><code class="literal">}</code> </td><td> a sequence of <em class="replaceable"><code>m</code></em> through <em class="replaceable"><code>n</code></em>
  552. (inclusive) matches of the atom; <em class="replaceable"><code>m</code></em> cannot exceed
  553. <em class="replaceable"><code>n</code></em> </td></tr><tr><td> <code class="literal">*?</code> </td><td> non-greedy version of <code class="literal">*</code> </td></tr><tr><td> <code class="literal">+?</code> </td><td> non-greedy version of <code class="literal">+</code> </td></tr><tr><td> <code class="literal">??</code> </td><td> non-greedy version of <code class="literal">?</code> </td></tr><tr><td> <code class="literal">{</code><em class="replaceable"><code>m</code></em><code class="literal">}?</code> </td><td> non-greedy version of <code class="literal">{</code><em class="replaceable"><code>m</code></em><code class="literal">}</code> </td></tr><tr><td> <code class="literal">{</code><em class="replaceable"><code>m</code></em><code class="literal">,}?</code> </td><td> non-greedy version of <code class="literal">{</code><em class="replaceable"><code>m</code></em><code class="literal">,}</code> </td></tr><tr><td>
  554. <code class="literal">{</code><em class="replaceable"><code>m</code></em><code class="literal">,</code><em class="replaceable"><code>n</code></em><code class="literal">}?</code> </td><td> non-greedy version of <code class="literal">{</code><em class="replaceable"><code>m</code></em><code class="literal">,</code><em class="replaceable"><code>n</code></em><code class="literal">}</code> </td></tr></tbody></table></div></div><br class="table-break" /><p>
  555. The forms using <code class="literal">{</code><em class="replaceable"><code>...</code></em><code class="literal">}</code>
  556. are known as <em class="firstterm">bounds</em>.
  557. The numbers <em class="replaceable"><code>m</code></em> and <em class="replaceable"><code>n</code></em> within a bound are
  558. unsigned decimal integers with permissible values from 0 to 255 inclusive.
  559. </p><p>
  560. <em class="firstterm">Non-greedy</em> quantifiers (available in AREs only) match the
  561. same possibilities as their corresponding normal (<em class="firstterm">greedy</em>)
  562. counterparts, but prefer the smallest number rather than the largest
  563. number of matches.
  564. See <a class="xref" href="functions-matching.html#POSIX-MATCHING-RULES" title="9.7.3.5. Regular Expression Matching Rules">Section 9.7.3.5</a> for more detail.
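</p><p>
For instance, given an input of four <code class="literal">a</code> characters (an
arbitrary sample), a bound and its non-greedy form select different
substrings:
</p><pre class="programlisting">
SELECT regexp_match('aaaa', 'a{2,3}');    -- {aaa}: greedy, takes as many as the bound allows
SELECT regexp_match('aaaa', 'a{2,3}?');   -- {aa}: non-greedy, takes as few as the bound allows
SELECT regexp_match('aaaa', 'a{2}');      -- {aa}: exactly two
SELECT 'aaaa' ~ '^a{5,}$';                -- false: fewer than five are present
</pre><p>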
  565. </p><div class="note"><h3 class="title">Note</h3><p>
  566. A quantifier cannot immediately follow another quantifier, e.g.,
  567. <code class="literal">**</code> is invalid.
  568. A quantifier cannot
  569. begin an expression or subexpression or follow
  570. <code class="literal">^</code> or <code class="literal">|</code>.
  571. </p></div><div class="table" id="POSIX-CONSTRAINTS-TABLE"><p class="title"><strong>Table 9.18. Regular Expression Constraints</strong></p><div class="table-contents"><table class="table" summary="Regular Expression Constraints" border="1"><colgroup><col /><col /></colgroup><thead><tr><th>Constraint</th><th>Description</th></tr></thead><tbody><tr><td> <code class="literal">^</code> </td><td> matches at the beginning of the string </td></tr><tr><td> <code class="literal">$</code> </td><td> matches at the end of the string </td></tr><tr><td> <code class="literal">(?=</code><em class="replaceable"><code>re</code></em><code class="literal">)</code> </td><td> <em class="firstterm">positive lookahead</em> matches at any point
  572. where a substring matching <em class="replaceable"><code>re</code></em> begins
  573. (AREs only) </td></tr><tr><td> <code class="literal">(?!</code><em class="replaceable"><code>re</code></em><code class="literal">)</code> </td><td> <em class="firstterm">negative lookahead</em> matches at any point
  574. where no substring matching <em class="replaceable"><code>re</code></em> begins
  575. (AREs only) </td></tr><tr><td> <code class="literal">(?&lt;=</code><em class="replaceable"><code>re</code></em><code class="literal">)</code> </td><td> <em class="firstterm">positive lookbehind</em> matches at any point
  576. where a substring matching <em class="replaceable"><code>re</code></em> ends
  577. (AREs only) </td></tr><tr><td> <code class="literal">(?&lt;!</code><em class="replaceable"><code>re</code></em><code class="literal">)</code> </td><td> <em class="firstterm">negative lookbehind</em> matches at any point
  578. where no substring matching <em class="replaceable"><code>re</code></em> ends
  579. (AREs only) </td></tr></tbody></table></div></div><br class="table-break" /><p>
  580. Lookahead and lookbehind constraints cannot contain <em class="firstterm">back
  581. references</em> (see <a class="xref" href="functions-matching.html#POSIX-ESCAPE-SEQUENCES" title="9.7.3.3. Regular Expression Escapes">Section 9.7.3.3</a>),
  582. and all parentheses within them are considered non-capturing.
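</p><p>
A minimal sketch of the lookahead and lookbehind constraints, with arbitrary
sample strings. The constraint itself consumes no characters, so it is
not part of the reported match:
</p><pre class="programlisting">
SELECT regexp_match('foobar', 'foo(?=bar)');       -- {foo}
SELECT regexp_match('foobaz', 'foo(?=bar)');       -- NULL: bar does not follow foo
SELECT regexp_match('foobaz', 'foo(?!bar)');       -- {foo}
SELECT regexp_match('price: 42', '(?&lt;=: )\d+');    -- {42}
SELECT regexp_match('a1 b2', '(?&lt;!a)\d');          -- {2}: the 1 is preceded by a
</pre><p>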
  583. </p></div><div class="sect3" id="POSIX-BRACKET-EXPRESSIONS"><div class="titlepage"><div><div><h4 class="title">9.7.3.2. Bracket Expressions</h4></div></div></div><p>
  584. A <em class="firstterm">bracket expression</em> is a list of
  585. characters enclosed in <code class="literal">[]</code>. It normally matches
  586. any single character from the list (but see below). If the list
  587. begins with <code class="literal">^</code>, it matches any single character
  588. <span class="emphasis"><em>not</em></span> from the rest of the list.
  589. If two characters
  590. in the list are separated by <code class="literal">-</code>, this is
  591. shorthand for the full range of characters between those two
  592. (inclusive) in the collating sequence,
  593. e.g., <code class="literal">[0-9]</code> in <acronym class="acronym">ASCII</acronym> matches
  594. any decimal digit. It is illegal for two ranges to share an
  595. endpoint, e.g., <code class="literal">a-c-e</code>. Ranges are very
  596. collating-sequence-dependent, so portable programs should avoid
  597. relying on them.
  598. </p><p>
  599. To include a literal <code class="literal">]</code> in the list, make it the
  600. first character (after <code class="literal">^</code>, if that is used). To
  601. include a literal <code class="literal">-</code>, make it the first or last
  602. character, or the second endpoint of a range. To use a literal
  603. <code class="literal">-</code> as the first endpoint of a range, enclose it
  604. in <code class="literal">[.</code> and <code class="literal">.]</code> to make it a
  605. collating element (see below). With the exception of these characters,
  606. some combinations using <code class="literal">[</code>
  607. (see next paragraphs), and escapes (AREs only), all other special
  608. characters lose their special significance within a bracket expression.
  609. In particular, <code class="literal">\</code> is not special when following
  610. ERE or BRE rules, though it is special (as introducing an escape)
  611. in AREs.
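</p><p>
The placement rules can be sketched as follows (the sample strings are
arbitrary):
</p><pre class="programlisting">
SELECT 'b' ~ '^[a-c]$';     -- true: b lies within the range a-c
SELECT 'd' ~ '^[^a-c]$';    -- true: a leading ^ negates the list
SELECT ']' ~ '^[]a]$';      -- true: ] is literal when it comes first
SELECT 'x' ~ '^[^]a]$';     -- true: ] is literal when it comes first after ^
SELECT '-' ~ '^[a-]$';      -- true: - is literal when it comes last
SELECT '*' ~ '^[*.]$';      -- true: * and . are ordinary inside brackets
</pre><p>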
  612. </p><p>
  613. Within a bracket expression, a collating element (a character, a
  614. multiple-character sequence that collates as if it were a single
  615. character, or a collating-sequence name for either) enclosed in
  616. <code class="literal">[.</code> and <code class="literal">.]</code> stands for the
  617. sequence of characters of that collating element. The sequence is
  618. treated as a single element of the bracket expression's list. This
  619. allows a bracket
  620. expression containing a multiple-character collating element to
  621. match more than one character, e.g., if the collating sequence
  622. includes a <code class="literal">ch</code> collating element, then the RE
  623. <code class="literal">[[.ch.]]*c</code> matches the first five characters of
  624. <code class="literal">chchcc</code>.
  625. </p><div class="note"><h3 class="title">Note</h3><p>
  626. <span class="productname">PostgreSQL</span> currently does not support multi-character collating
  627. elements. This information describes possible future behavior.
  628. </p></div><p>
  629. Within a bracket expression, a collating element enclosed in
  630. <code class="literal">[=</code> and <code class="literal">=]</code> is an <em class="firstterm">equivalence
  631. class</em>, standing for the sequences of characters of all collating
  632. elements equivalent to that one, including itself. (If there are
  633. no other equivalent collating elements, the treatment is as if the
  634. enclosing delimiters were <code class="literal">[.</code> and
  635. <code class="literal">.]</code>.) For example, if <code class="literal">o</code> and
  636. <code class="literal">^</code> are the members of an equivalence class, then
  637. <code class="literal">[[=o=]]</code>, <code class="literal">[[=^=]]</code>, and
  638. <code class="literal">[o^]</code> are all synonymous. An equivalence class
  639. cannot be an endpoint of a range.
  640. </p><p>
  641. Within a bracket expression, the name of a character class
  642. enclosed in <code class="literal">[:</code> and <code class="literal">:]</code> stands
  643. for the list of all characters belonging to that class. A character
  644. class cannot be used as an endpoint of a range.
  645. The <acronym class="acronym">POSIX</acronym> standard defines these character class
  646. names:
  647. <code class="literal">alnum</code> (letters and numeric digits),
  648. <code class="literal">alpha</code> (letters),
  649. <code class="literal">blank</code> (space and tab),
  650. <code class="literal">cntrl</code> (control characters),
  651. <code class="literal">digit</code> (numeric digits),
  652. <code class="literal">graph</code> (printable characters except space),
  653. <code class="literal">lower</code> (lower-case letters),
  654. <code class="literal">print</code> (printable characters including space),
  655. <code class="literal">punct</code> (punctuation),
  656. <code class="literal">space</code> (any white space),
  657. <code class="literal">upper</code> (upper-case letters),
  658. and <code class="literal">xdigit</code> (hexadecimal digits).
  659. The behavior of these standard character classes is generally
  660. consistent across platforms for characters in the 7-bit ASCII set.
  661. Whether a given non-ASCII character is considered to belong to one
  662. of these classes depends on the <em class="firstterm">collation</em>
  663. that is used for the regular-expression function or operator
  664. (see <a class="xref" href="collation.html" title="23.2. Collation Support">Section 23.2</a>), or by default on the
  665. database's <code class="envar">LC_CTYPE</code> locale setting (see
  666. <a class="xref" href="locale.html" title="23.1. Locale Support">Section 23.1</a>). The classification of non-ASCII
  667. characters can vary across platforms even in similarly-named
  668. locales. (But the <code class="literal">C</code> locale never considers any
  669. non-ASCII characters to belong to any of these classes.)
  670. In addition to these standard character
  671. classes, <span class="productname">PostgreSQL</span> defines
  672. the <code class="literal">ascii</code> character class, which contains exactly
  673. the 7-bit ASCII set.
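</p><p>
As a short sketch restricted to ASCII input, where the classes behave
consistently across platforms:
</p><pre class="programlisting">
SELECT 'abc123'  ~ '^[[:alnum:]]+$';             -- true
SELECT 'abc 123' ~ '^[[:alnum:]]+$';             -- false: the space is not in alnum
SELECT 'abc 123' ~ '^[[:alnum:][:blank:]]+$';    -- true: two classes in one bracket expression
SELECT 'ff0A'    ~ '^[[:xdigit:]]+$';            -- true
SELECT 'abc'     ~ '^[[:ascii:]]+$';             -- true
</pre><p>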
  674. </p><p>
  675. There are two special cases of bracket expressions: the bracket
  676. expressions <code class="literal">[[:&lt;:]]</code> and
  677. <code class="literal">[[:&gt;:]]</code> are constraints,
  678. matching empty strings at the beginning
  679. and end of a word respectively. A word is defined as a sequence
  680. of word characters that is neither preceded nor followed by word
  681. characters. A word character is an <code class="literal">alnum</code> character (as
  682. defined by the <acronym class="acronym">POSIX</acronym> character class described above)
  683. or an underscore. This is an extension, compatible with but not
  684. specified by <acronym class="acronym">POSIX</acronym> 1003.2, and should be used with
  685. caution in software intended to be portable to other systems.
  686. The constraint escapes described below are usually preferable; they
  687. are no more standard, but are easier to type.
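</p><p>
A sketch of the two word-boundary bracket expressions, using an arbitrary
sample string:
</p><pre class="programlisting">
-- Only whole words are replaced when the pattern is bounded on both sides.
SELECT regexp_replace('cat concat cat', '[[:&lt;:]]cat[[:&gt;:]]', 'dog', 'g');   -- dog concat dog

-- Without the constraints, the cat inside concat is replaced as well.
SELECT regexp_replace('cat concat cat', 'cat', 'dog', 'g');                 -- dog condog dog
</pre><p>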
  688. </p></div><div class="sect3" id="POSIX-ESCAPE-SEQUENCES"><div class="titlepage"><div><div><h4 class="title">9.7.3.3. Regular Expression Escapes</h4></div></div></div><p>
  689. <em class="firstterm">Escapes</em> are special sequences beginning with <code class="literal">\</code>
  690. followed by an alphanumeric character. Escapes come in several varieties:
  691. character entry, class shorthands, constraint escapes, and back references.
  692. A <code class="literal">\</code> followed by an alphanumeric character but not constituting
  693. a valid escape is illegal in AREs.
  694. In EREs, there are no escapes: outside a bracket expression,
  695. a <code class="literal">\</code> followed by an alphanumeric character merely stands for
  696. that character as an ordinary character, and inside a bracket expression,
  697. <code class="literal">\</code> is an ordinary character.
  698. (The latter is the one actual incompatibility between EREs and AREs.)
  699. </p><p>
  700. <em class="firstterm">Character-entry escapes</em> exist to make it easier to specify
  701. non-printing and other inconvenient characters in REs. They are
  702. shown in <a class="xref" href="functions-matching.html#POSIX-CHARACTER-ENTRY-ESCAPES-TABLE" title="Table 9.19. Regular Expression Character-Entry Escapes">Table 9.19</a>.
  703. </p><p>
  704. <em class="firstterm">Class-shorthand escapes</em> provide shorthands for certain
  705. commonly-used character classes. They are
  706. shown in <a class="xref" href="functions-matching.html#POSIX-CLASS-SHORTHAND-ESCAPES-TABLE" title="Table 9.20. Regular Expression Class-Shorthand Escapes">Table 9.20</a>.
  707. </p><p>
  708. A <em class="firstterm">constraint escape</em> is a constraint,
  709. matching the empty string if specific conditions are met,
  710. written as an escape. They are
  711. shown in <a class="xref" href="functions-matching.html#POSIX-CONSTRAINT-ESCAPES-TABLE" title="Table 9.21. Regular Expression Constraint Escapes">Table 9.21</a>.
  712. </p><p>
  713. A <em class="firstterm">back reference</em> (<code class="literal">\</code><em class="replaceable"><code>n</code></em>) matches the
  714. same string matched by the previous parenthesized subexpression specified
  715. by the number <em class="replaceable"><code>n</code></em>
  716. (see <a class="xref" href="functions-matching.html#POSIX-CONSTRAINT-BACKREF-TABLE" title="Table 9.22. Regular Expression Back References">Table 9.22</a>). For example,
  717. <code class="literal">([bc])\1</code> matches <code class="literal">bb</code> or <code class="literal">cc</code>
  718. but not <code class="literal">bc</code> or <code class="literal">cb</code>.
  719. The subexpression must entirely precede the back reference in the RE.
  720. Subexpressions are numbered in the order of their leading parentheses.
  721. Non-capturing parentheses do not define subexpressions.
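</p><p>
The back-reference rules can be sketched as follows, assuming
<code class="varname">standard_conforming_strings</code> is on (the sample strings
are arbitrary):
</p><pre class="programlisting">
SELECT 'bb' ~ '([bc])\1';    -- true
SELECT 'bc' ~ '([bc])\1';    -- false: \1 must repeat the same character

-- \1 repeats whatever the first parenthesized subexpression matched.
SELECT regexp_match('hello hello world', '(\w+) \1');   -- {hello}

-- (?:a) defines no subexpression, so \1 refers to ([bc]).
SELECT 'abb' ~ '^(?:a)([bc])\1$';    -- true
</pre><p>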
  722. </p><div class="table" id="POSIX-CHARACTER-ENTRY-ESCAPES-TABLE"><p class="title"><strong>Table 9.19. Regular Expression Character-Entry Escapes</strong></p><div class="table-contents"><table class="table" summary="Regular Expression Character-Entry Escapes" border="1"><colgroup><col /><col /></colgroup><thead><tr><th>Escape</th><th>Description</th></tr></thead><tbody><tr><td> <code class="literal">\a</code> </td><td> alert (bell) character, as in C </td></tr><tr><td> <code class="literal">\b</code> </td><td> backspace, as in C </td></tr><tr><td> <code class="literal">\B</code> </td><td> synonym for backslash (<code class="literal">\</code>) to help reduce the need for backslash
  723. doubling </td></tr><tr><td> <code class="literal">\c</code><em class="replaceable"><code>X</code></em> </td><td> (where <em class="replaceable"><code>X</code></em> is any character) the character whose
  724. low-order 5 bits are the same as those of
  725. <em class="replaceable"><code>X</code></em>, and whose other bits are all zero </td></tr><tr><td> <code class="literal">\e</code> </td><td> the character whose collating-sequence name
  726. is <code class="literal">ESC</code>,
  727. or failing that, the character with octal value <code class="literal">033</code> </td></tr><tr><td> <code class="literal">\f</code> </td><td> form feed, as in C </td></tr><tr><td> <code class="literal">\n</code> </td><td> newline, as in C </td></tr><tr><td> <code class="literal">\r</code> </td><td> carriage return, as in C </td></tr><tr><td> <code class="literal">\t</code> </td><td> horizontal tab, as in C </td></tr><tr><td> <code class="literal">\u</code><em class="replaceable"><code>wxyz</code></em> </td><td> (where <em class="replaceable"><code>wxyz</code></em> is exactly four hexadecimal digits)
  728. the character whose hexadecimal value is
  729. <code class="literal">0x</code><em class="replaceable"><code>wxyz</code></em>
  730. </td></tr><tr><td> <code class="literal">\U</code><em class="replaceable"><code>stuvwxyz</code></em> </td><td> (where <em class="replaceable"><code>stuvwxyz</code></em> is exactly eight hexadecimal
  731. digits)
  732. the character whose hexadecimal value is
  733. <code class="literal">0x</code><em class="replaceable"><code>stuvwxyz</code></em>
  734. </td></tr><tr><td> <code class="literal">\v</code> </td><td> vertical tab, as in C </td></tr><tr><td> <code class="literal">\x</code><em class="replaceable"><code>hhh</code></em> </td><td> (where <em class="replaceable"><code>hhh</code></em> is any sequence of hexadecimal
  735. digits)
  736. the character whose hexadecimal value is
  737. <code class="literal">0x</code><em class="replaceable"><code>hhh</code></em>
  738. (a single character no matter how many hexadecimal digits are used)
  739. </td></tr><tr><td> <code class="literal">\0</code> </td><td> the character whose value is <code class="literal">0</code> (the null byte)</td></tr><tr><td> <code class="literal">\</code><em class="replaceable"><code>xy</code></em> </td><td> (where <em class="replaceable"><code>xy</code></em> is exactly two octal digits,
  740. and is not a <em class="firstterm">back reference</em>)
  741. the character whose octal value is
  742. <code class="literal">0</code><em class="replaceable"><code>xy</code></em> </td></tr><tr><td> <code class="literal">\</code><em class="replaceable"><code>xyz</code></em> </td><td> (where <em class="replaceable"><code>xyz</code></em> is exactly three octal digits,
  743. and is not a <em class="firstterm">back reference</em>)
  744. the character whose octal value is
  745. <code class="literal">0</code><em class="replaceable"><code>xyz</code></em> </td></tr></tbody></table></div></div><br class="table-break" /><p>
  746. Hexadecimal digits are <code class="literal">0</code>-<code class="literal">9</code>,
  747. <code class="literal">a</code>-<code class="literal">f</code>, and <code class="literal">A</code>-<code class="literal">F</code>.
  748. Octal digits are <code class="literal">0</code>-<code class="literal">7</code>.
  749. </p><p>
  750. Numeric character-entry escapes specifying values outside the ASCII range
  751. (0-127) have meanings dependent on the database encoding. When the
  752. encoding is UTF-8, escape values are equivalent to Unicode code points,
  753. for example <code class="literal">\u1234</code> means the character <code class="literal">U+1234</code>.
  754. For other multibyte encodings, character-entry escapes usually just
  755. specify the concatenation of the byte values for the character. If the
  756. escape value does not correspond to any legal character in the database
  757. encoding, no error will be raised, but it will never match any data.
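</p><p>
A sketch of the numeric and C-style character-entry escapes, kept within the
ASCII range so that the result does not depend on the database encoding
(<code class="varname">standard_conforming_strings</code> is assumed to be on):
</p><pre class="programlisting">
SELECT 'A' ~ '^\u0041$';     -- true: four hexadecimal digits
SELECT 'A' ~ '^\x41$';       -- true: hexadecimal 41
SELECT 'A' ~ '^\101$';       -- true: octal 101 is decimal 65

-- The E'' string on the left contains a real tab; \t in the pattern matches it.
SELECT E'a\tb' ~ '^a\tb$';   -- true
</pre><p>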
  758. </p><p>
  759. The character-entry escapes are always taken as ordinary characters.
  760. For example, <code class="literal">\135</code> is <code class="literal">]</code> in ASCII, but
  761. <code class="literal">\135</code> does not terminate a bracket expression.
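</p><p>
As a sketch of this rule under ARE rules, the escape below adds
<code class="literal">]</code> to the list instead of closing it:
</p><pre class="programlisting">
SELECT ']' ~ '^[\135]$';     -- true: \135 is an ordinary ] inside the brackets
SELECT 'a' ~ '^[\135a]$';    -- true: the list continues past \135 and includes a
</pre><p>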
  762. </p><div class="table" id="POSIX-CLASS-SHORTHAND-ESCAPES-TABLE"><p class="title"><strong>Table 9.20. Regular Expression Class-Shorthand Escapes</strong></p><div class="table-contents"><table class="table" summary="Regular Expression Class-Shorthand Escapes" border="1"><colgroup><col /><col /></colgroup><thead><tr><th>Escape</th><th>Description</th></tr></thead><tbody><tr><td> <code class="literal">\d</code> </td><td> <code class="literal">[[:digit:]]</code> </td></tr><tr><td> <code class="literal">\s</code> </td><td> <code class="literal">[[:space:]]</code> </td></tr><tr><td> <code class="literal">\w</code> </td><td> <code class="literal">[[:alnum:]_]</code>
  763. (note underscore is included) </td></tr><tr><td> <code class="literal">\D</code> </td><td> <code class="literal">[^[:digit:]]</code> </td></tr><tr><td> <code class="literal">\S</code> </td><td> <code class="literal">[^[:space:]]</code> </td></tr><tr><td> <code class="literal">\W</code> </td><td> <code class="literal">[^[:alnum:]_]</code>
(note that the underscore is part of the excluded set) </td></tr></tbody></table></div></div><br class="table-break" /><p>
  765. Within bracket expressions, <code class="literal">\d</code>, <code class="literal">\s</code>,
  766. and <code class="literal">\w</code> lose their outer brackets,
  767. and <code class="literal">\D</code>, <code class="literal">\S</code>, and <code class="literal">\W</code> are illegal.
  768. (So, for example, <code class="literal">[a-c\d]</code> is equivalent to
  769. <code class="literal">[a-c[:digit:]]</code>.
Also, <code class="literal">[a-c\D]</code>, which would be equivalent to
<code class="literal">[a-c^[:digit:]]</code>, is illegal.)
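</p><p>
As a rough illustration of the rule above, the following sketch assumes the default <code class="literal">standard_conforming_strings = on</code> setting, so that backslashes in the pattern literal reach the regex engine unchanged. The sample string is made up for the example.
</p><pre class="programlisting">
-- [a-c\d] behaves like [a-c[:digit:]]: letters a-c or any digit.
SELECT regexp_matches('a1-b2_c3', '[a-c\d]+', 'g');
-- Expected rows: {a1}, {b2}, {c3}

-- \D is not accepted inside a bracket expression; per the text above,
-- this statement is expected to fail with an error rather than match.
-- SELECT 'x' ~ '[a-c\D]';
</pre><p>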
  772. </p><div class="table" id="POSIX-CONSTRAINT-ESCAPES-TABLE"><p class="title"><strong>Table 9.21. Regular Expression Constraint Escapes</strong></p><div class="table-contents"><table class="table" summary="Regular Expression Constraint Escapes" border="1"><colgroup><col /><col /></colgroup><thead><tr><th>Escape</th><th>Description</th></tr></thead><tbody><tr><td> <code class="literal">\A</code> </td><td> matches only at the beginning of the string
  773. (see <a class="xref" href="functions-matching.html#POSIX-MATCHING-RULES" title="9.7.3.5. Regular Expression Matching Rules">Section 9.7.3.5</a> for how this differs from
  774. <code class="literal">^</code>) </td></tr><tr><td> <code class="literal">\m</code> </td><td> matches only at the beginning of a word </td></tr><tr><td> <code class="literal">\M</code> </td><td> matches only at the end of a word </td></tr><tr><td> <code class="literal">\y</code> </td><td> matches only at the beginning or end of a word </td></tr><tr><td> <code class="literal">\Y</code> </td><td> matches only at a point that is not the beginning or end of a
  775. word </td></tr><tr><td> <code class="literal">\Z</code> </td><td> matches only at the end of the string
  776. (see <a class="xref" href="functions-matching.html#POSIX-MATCHING-RULES" title="9.7.3.5. Regular Expression Matching Rules">Section 9.7.3.5</a> for how this differs from
  777. <code class="literal">$</code>) </td></tr></tbody></table></div></div><br class="table-break" /><p>
  778. A word is defined as in the specification of
  779. <code class="literal">[[:&lt;:]]</code> and <code class="literal">[[:&gt;:]]</code> above.
  780. Constraint escapes are illegal within bracket expressions.
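</p><p>
A minimal sketch of the word-boundary escapes from the table above, assuming <code class="literal">standard_conforming_strings = on</code>. The sample strings are illustrative only.
</p><pre class="programlisting">
-- \m and \M anchor the match to the start and end of a word.
SELECT 'the cat sat' ~ '\mcat\M';   -- true: "cat" is a whole word
SELECT 'concatenate' ~ '\mcat\M';   -- false: "cat" is inside a word

-- \y matches at either word boundary; \Y only where there is none.
SELECT 'concatenate' ~ '\ycat';     -- false: no boundary before "cat"
SELECT 'concatenate' ~ '\Ycat\Y';   -- true: "cat" lies strictly inside a word
</pre><p>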
  781. </p><div class="table" id="POSIX-CONSTRAINT-BACKREF-TABLE"><p class="title"><strong>Table 9.22. Regular Expression Back References</strong></p><div class="table-contents"><table class="table" summary="Regular Expression Back References" border="1"><colgroup><col /><col /></colgroup><thead><tr><th>Escape</th><th>Description</th></tr></thead><tbody><tr><td> <code class="literal">\</code><em class="replaceable"><code>m</code></em> </td><td> (where <em class="replaceable"><code>m</code></em> is a nonzero digit)
  782. a back reference to the <em class="replaceable"><code>m</code></em>'th subexpression </td></tr><tr><td> <code class="literal">\</code><em class="replaceable"><code>mnn</code></em> </td><td> (where <em class="replaceable"><code>m</code></em> is a nonzero digit, and
  783. <em class="replaceable"><code>nn</code></em> is some more digits, and the decimal value
  784. <em class="replaceable"><code>mnn</code></em> is not greater than the number of closing capturing
  785. parentheses seen so far)
  786. a back reference to the <em class="replaceable"><code>mnn</code></em>'th subexpression </td></tr></tbody></table></div></div><br class="table-break" /><div class="note"><h3 class="title">Note</h3><p>
  787. There is an inherent ambiguity between octal character-entry
  788. escapes and back references, which is resolved by the following heuristics,
  789. as hinted at above.
  790. A leading zero always indicates an octal escape.
  791. A single non-zero digit, not followed by another digit,
  792. is always taken as a back reference.
  793. A multi-digit sequence not starting with a zero is taken as a back
  794. reference if it comes after a suitable subexpression
  795. (i.e., the number is in the legal range for a back reference),
  796. and otherwise is taken as octal.
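</p><p>
To make these heuristics more concrete, here is a small sketch (assuming <code class="literal">standard_conforming_strings = on</code>; the test strings are invented for illustration):
</p><pre class="programlisting">
-- A single nonzero digit is a back reference: \1 repeats whatever
-- the first parenthesized subexpression matched.
SELECT 'abab' ~ '^(ab)\1$';   -- true
SELECT 'abcd' ~ '^(ab)\1$';   -- false

-- A leading zero indicates an octal escape: by the rule above,
-- \011 is read as octal 11, which is the tab character.
SELECT E'\t' ~ '^\011$';

-- A multi-digit sequence with no subexpression to refer to: by the
-- rule above, \101 is read as octal 101, which is "A" in ASCII.
SELECT 'A' ~ '^\101$';
</pre><p>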
  797. </p></div></div><div class="sect3" id="POSIX-METASYNTAX"><div class="titlepage"><div><div><h4 class="title">9.7.3.4. Regular Expression Metasyntax</h4></div></div></div><p>
  798. In addition to the main syntax described above, there are some special
  799. forms and miscellaneous syntactic facilities available.
  800. </p><p>
  801. An RE can begin with one of two special <em class="firstterm">director</em> prefixes.
  802. If an RE begins with <code class="literal">***:</code>,
  803. the rest of the RE is taken as an ARE. (This normally has no effect in
  804. <span class="productname">PostgreSQL</span>, since REs are assumed to be AREs;
  805. but it does have an effect if ERE or BRE mode had been specified by
  806. the <em class="replaceable"><code>flags</code></em> parameter to a regex function.)
  807. If an RE begins with <code class="literal">***=</code>,
  808. the rest of the RE is taken to be a literal string,
  809. with all characters considered ordinary characters.
  810. </p><p>
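As a quick sketch of the <code class="literal">***=</code> director (the strings are made up for illustration), compare how the same pattern text behaves as an RE and as a literal:
</p><pre class="programlisting">
-- As an ordinary ARE, "." matches any character.
SELECT 'axb' ~ 'a.b';       -- true

-- With ***=, the rest is literal text, so "." only matches a period.
SELECT 'axb' ~ '***=a.b';   -- false
SELECT 'a.b' ~ '***=a.b';   -- true
</pre><p>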
  811. An ARE can begin with <em class="firstterm">embedded options</em>:
  812. a sequence <code class="literal">(?</code><em class="replaceable"><code>xyz</code></em><code class="literal">)</code>
  813. (where <em class="replaceable"><code>xyz</code></em> is one or more alphabetic characters)
  814. specifies options affecting the rest of the RE.
  815. These options override any previously determined options —
  816. in particular, they can override the case-sensitivity behavior implied by
  817. a regex operator, or the <em class="replaceable"><code>flags</code></em> parameter to a regex
  818. function.
  819. The available option letters are
  820. shown in <a class="xref" href="functions-matching.html#POSIX-EMBEDDED-OPTIONS-TABLE" title="Table 9.23. ARE Embedded-Option Letters">Table 9.23</a>.
  821. Note that these same option letters are used in the <em class="replaceable"><code>flags</code></em>
  822. parameters of regex functions.
  823. </p><div class="table" id="POSIX-EMBEDDED-OPTIONS-TABLE"><p class="title"><strong>Table 9.23. ARE Embedded-Option Letters</strong></p><div class="table-contents"><table class="table" summary="ARE Embedded-Option Letters" border="1"><colgroup><col /><col /></colgroup><thead><tr><th>Option</th><th>Description</th></tr></thead><tbody><tr><td> <code class="literal">b</code> </td><td> rest of RE is a BRE </td></tr><tr><td> <code class="literal">c</code> </td><td> case-sensitive matching (overrides operator type) </td></tr><tr><td> <code class="literal">e</code> </td><td> rest of RE is an ERE </td></tr><tr><td> <code class="literal">i</code> </td><td> case-insensitive matching (see
  824. <a class="xref" href="functions-matching.html#POSIX-MATCHING-RULES" title="9.7.3.5. Regular Expression Matching Rules">Section 9.7.3.5</a>) (overrides operator type) </td></tr><tr><td> <code class="literal">m</code> </td><td> historical synonym for <code class="literal">n</code> </td></tr><tr><td> <code class="literal">n</code> </td><td> newline-sensitive matching (see
  825. <a class="xref" href="functions-matching.html#POSIX-MATCHING-RULES" title="9.7.3.5. Regular Expression Matching Rules">Section 9.7.3.5</a>) </td></tr><tr><td> <code class="literal">p</code> </td><td> partial newline-sensitive matching (see
  826. <a class="xref" href="functions-matching.html#POSIX-MATCHING-RULES" title="9.7.3.5. Regular Expression Matching Rules">Section 9.7.3.5</a>) </td></tr><tr><td> <code class="literal">q</code> </td><td> rest of RE is a literal (<span class="quote">“<span class="quote">quoted</span>”</span>) string, all ordinary
  827. characters </td></tr><tr><td> <code class="literal">s</code> </td><td> non-newline-sensitive matching (default) </td></tr><tr><td> <code class="literal">t</code> </td><td> tight syntax (default; see below) </td></tr><tr><td> <code class="literal">w</code> </td><td> inverse partial newline-sensitive (<span class="quote">“<span class="quote">weird</span>”</span>) matching
  828. (see <a class="xref" href="functions-matching.html#POSIX-MATCHING-RULES" title="9.7.3.5. Regular Expression Matching Rules">Section 9.7.3.5</a>) </td></tr><tr><td> <code class="literal">x</code> </td><td> expanded syntax (see below) </td></tr></tbody></table></div></div><br class="table-break" /><p>
  829. Embedded options take effect at the <code class="literal">)</code> terminating the sequence.
  830. They can appear only at the start of an ARE (after the
  831. <code class="literal">***:</code> director if any).
  832. </p><p>
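For example, an embedded option can override the case sensitivity implied by the operator. This is only a sketch with invented sample strings:
</p><pre class="programlisting">
-- (?i) makes the case-sensitive ~ operator match case-insensitively.
SELECT 'FOO' ~ '(?i)foo';    -- true

-- (?c) makes the case-insensitive ~* operator case-sensitive again.
SELECT 'FOO' ~* '(?c)foo';   -- false

-- The same letters work in the flags argument of regex functions.
SELECT regexp_match('FOO', 'foo', 'i');   -- {FOO}
</pre><p>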
  833. In addition to the usual (<em class="firstterm">tight</em>) RE syntax, in which all
  834. characters are significant, there is an <em class="firstterm">expanded</em> syntax,
  835. available by specifying the embedded <code class="literal">x</code> option.
  836. In the expanded syntax,
  837. white-space characters in the RE are ignored, as are
  838. all characters between a <code class="literal">#</code>
  839. and the following newline (or the end of the RE). This
  840. permits paragraphing and commenting a complex RE.
  841. There are three exceptions to that basic rule:
  842. </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
  843. a white-space character or <code class="literal">#</code> preceded by <code class="literal">\</code> is
  844. retained
  845. </p></li><li class="listitem"><p>
  846. white space or <code class="literal">#</code> within a bracket expression is retained
  847. </p></li><li class="listitem"><p>
  848. white space and comments cannot appear within multi-character symbols,
  849. such as <code class="literal">(?:</code>
  850. </p></li></ul></div><p>
  851. For this purpose, white-space characters are blank, tab, newline, and
  852. any character that belongs to the <em class="replaceable"><code>space</code></em> character class.
  853. </p><p>
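The following sketch shows what the expanded syntax might look like in practice. The date string and the layout of the pattern are illustrative assumptions, and <code class="literal">standard_conforming_strings = on</code> is assumed.
</p><pre class="programlisting">
-- (?x) turns on expanded syntax: white space and "# ..." comments
-- inside the pattern are ignored.
SELECT regexp_match('2020-08-13', '(?x)
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
');
-- Expected result: {2020,08,13}
</pre><p>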
  854. Finally, in an ARE, outside bracket expressions, the sequence
  855. <code class="literal">(?#</code><em class="replaceable"><code>ttt</code></em><code class="literal">)</code>
  856. (where <em class="replaceable"><code>ttt</code></em> is any text not containing a <code class="literal">)</code>)
  857. is a comment, completely ignored.
  858. Again, this is not allowed between the characters of
  859. multi-character symbols, like <code class="literal">(?:</code>.
  860. Such comments are more a historical artifact than a useful facility,
  861. and their use is deprecated; use the expanded syntax instead.
  862. </p><p>
  863. <span class="emphasis"><em>None</em></span> of these metasyntax extensions is available if
  864. an initial <code class="literal">***=</code> director
  865. has specified that the user's input be treated as a literal string
  866. rather than as an RE.
  867. </p></div><div class="sect3" id="POSIX-MATCHING-RULES"><div class="titlepage"><div><div><h4 class="title">9.7.3.5. Regular Expression Matching Rules</h4></div></div></div><p>
  868. In the event that an RE could match more than one substring of a given
  869. string, the RE matches the one starting earliest in the string.
  870. If the RE could match more than one substring starting at that point,
  871. either the longest possible match or the shortest possible match will
  872. be taken, depending on whether the RE is <em class="firstterm">greedy</em> or
  873. <em class="firstterm">non-greedy</em>.
  874. </p><p>
  875. Whether an RE is greedy or not is determined by the following rules:
  876. </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
  877. Most atoms, and all constraints, have no greediness attribute (because
  878. they cannot match variable amounts of text anyway).
  879. </p></li><li class="listitem"><p>
  880. Adding parentheses around an RE does not change its greediness.
  881. </p></li><li class="listitem"><p>
  882. A quantified atom with a fixed-repetition quantifier
  883. (<code class="literal">{</code><em class="replaceable"><code>m</code></em><code class="literal">}</code>
  884. or
  885. <code class="literal">{</code><em class="replaceable"><code>m</code></em><code class="literal">}?</code>)
  886. has the same greediness (possibly none) as the atom itself.
  887. </p></li><li class="listitem"><p>
  888. A quantified atom with other normal quantifiers (including
  889. <code class="literal">{</code><em class="replaceable"><code>m</code></em><code class="literal">,</code><em class="replaceable"><code>n</code></em><code class="literal">}</code>
  890. with <em class="replaceable"><code>m</code></em> equal to <em class="replaceable"><code>n</code></em>)
  891. is greedy (prefers longest match).
  892. </p></li><li class="listitem"><p>
  893. A quantified atom with a non-greedy quantifier (including
  894. <code class="literal">{</code><em class="replaceable"><code>m</code></em><code class="literal">,</code><em class="replaceable"><code>n</code></em><code class="literal">}?</code>
  895. with <em class="replaceable"><code>m</code></em> equal to <em class="replaceable"><code>n</code></em>)
  896. is non-greedy (prefers shortest match).
  897. </p></li><li class="listitem"><p>
  898. A branch — that is, an RE that has no top-level
  899. <code class="literal">|</code> operator — has the same greediness as the first
  900. quantified atom in it that has a greediness attribute.
  901. </p></li><li class="listitem"><p>
  902. An RE consisting of two or more branches connected by the
  903. <code class="literal">|</code> operator is always greedy.
  904. </p></li></ul></div><p>
  905. </p><p>
  906. The above rules associate greediness attributes not only with individual
  907. quantified atoms, but with branches and entire REs that contain quantified
  908. atoms. What that means is that the matching is done in such a way that
  909. the branch, or whole RE, matches the longest or shortest possible
  910. substring <span class="emphasis"><em>as a whole</em></span>. Once the length of the entire match
  911. is determined, the part of it that matches any particular subexpression
  912. is determined on the basis of the greediness attribute of that
  913. subexpression, with subexpressions starting earlier in the RE taking
  914. priority over ones starting later.
  915. </p><p>
  916. An example of what this means:
</p><pre class="screen">
SELECT SUBSTRING('XY1234Z', 'Y*([0-9]{1,3})');
<em class="lineannotation"><span class="lineannotation">Result: </span></em><code class="computeroutput">123</code>
SELECT SUBSTRING('XY1234Z', 'Y*?([0-9]{1,3})');
<em class="lineannotation"><span class="lineannotation">Result: </span></em><code class="computeroutput">1</code>
</pre><p>
  923. In the first case, the RE as a whole is greedy because <code class="literal">Y*</code>
  924. is greedy. It can match beginning at the <code class="literal">Y</code>, and it matches
  925. the longest possible string starting there, i.e., <code class="literal">Y123</code>.
  926. The output is the parenthesized part of that, or <code class="literal">123</code>.
  927. In the second case, the RE as a whole is non-greedy because <code class="literal">Y*?</code>
  928. is non-greedy. It can match beginning at the <code class="literal">Y</code>, and it matches
  929. the shortest possible string starting there, i.e., <code class="literal">Y1</code>.
  930. The subexpression <code class="literal">[0-9]{1,3}</code> is greedy but it cannot change
  931. the decision as to the overall match length; so it is forced to match
  932. just <code class="literal">1</code>.
  933. </p><p>
  934. In short, when an RE contains both greedy and non-greedy subexpressions,
  935. the total match length is either as long as possible or as short as
  936. possible, according to the attribute assigned to the whole RE. The
  937. attributes assigned to the subexpressions only affect how much of that
  938. match they are allowed to <span class="quote">“<span class="quote">eat</span>”</span> relative to each other.
  939. </p><p>
  940. The quantifiers <code class="literal">{1,1}</code> and <code class="literal">{1,1}?</code>
  941. can be used to force greediness or non-greediness, respectively,
  942. on a subexpression or a whole RE.
  943. This is useful when you need the whole RE to have a greediness attribute
  944. different from what's deduced from its elements. As an example,
  945. suppose that we are trying to separate a string containing some digits
  946. into the digits and the parts before and after them. We might try to
  947. do that like this:
</p><pre class="screen">
SELECT regexp_match('abc01234xyz', '(.*)(\d+)(.*)');
<em class="lineannotation"><span class="lineannotation">Result: </span></em><code class="computeroutput">{abc0123,4,xyz}</code>
</pre><p>
  952. That didn't work: the first <code class="literal">.*</code> is greedy so
  953. it <span class="quote">“<span class="quote">eats</span>”</span> as much as it can, leaving the <code class="literal">\d+</code> to
  954. match at the last possible place, the last digit. We might try to fix
  955. that by making it non-greedy:
</p><pre class="screen">
SELECT regexp_match('abc01234xyz', '(.*?)(\d+)(.*)');
<em class="lineannotation"><span class="lineannotation">Result: </span></em><code class="computeroutput">{abc,0,""}</code>
</pre><p>
  960. That didn't work either, because now the RE as a whole is non-greedy
  961. and so it ends the overall match as soon as possible. We can get what
  962. we want by forcing the RE as a whole to be greedy:
</p><pre class="screen">
SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
<em class="lineannotation"><span class="lineannotation">Result: </span></em><code class="computeroutput">{abc,01234,xyz}</code>
</pre><p>
  967. Controlling the RE's overall greediness separately from its components'
  968. greediness allows great flexibility in handling variable-length patterns.
  969. </p><p>
  970. When deciding what is a longer or shorter match,
  971. match lengths are measured in characters, not collating elements.
  972. An empty string is considered longer than no match at all.
  973. For example:
  974. <code class="literal">bb*</code>
  975. matches the three middle characters of <code class="literal">abbbc</code>;
  976. <code class="literal">(week|wee)(night|knights)</code>
  977. matches all ten characters of <code class="literal">weeknights</code>;
  978. when <code class="literal">(.*).*</code>
  979. is matched against <code class="literal">abc</code> the parenthesized subexpression
  980. matches all three characters; and when
  981. <code class="literal">(a*)*</code> is matched against <code class="literal">bc</code>
  982. both the whole RE and the parenthesized
  983. subexpression match an empty string.
  984. </p><p>
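The examples in the preceding paragraph can be checked directly. This sketch simply restates them as queries:
</p><pre class="programlisting">
-- bb* takes the longest run of b's it can find.
SELECT substring('abbbc' from 'bb*');                            -- bbb

-- Only wee + knights covers all ten characters, so that split wins.
SELECT regexp_match('weeknights', '(week|wee)(night|knights)');  -- {wee,knights}

-- The earlier subexpression takes priority and consumes everything.
SELECT regexp_match('abc', '(.*).*');                            -- {abc}
</pre><p>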
  985. If case-independent matching is specified,
  986. the effect is much as if all case distinctions had vanished from the
  987. alphabet.
When an alphabetic character that exists in multiple cases appears as an
  989. ordinary character outside a bracket expression, it is effectively
  990. transformed into a bracket expression containing both cases,
  991. e.g., <code class="literal">x</code> becomes <code class="literal">[xX]</code>.
  992. When it appears inside a bracket expression, all case counterparts
  993. of it are added to the bracket expression, e.g.,
  994. <code class="literal">[x]</code> becomes <code class="literal">[xX]</code>
  995. and <code class="literal">[^x]</code> becomes <code class="literal">[^xX]</code>.
  996. </p><p>
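A brief sketch of the bracket-expression rule just described (sample strings invented):
</p><pre class="programlisting">
-- Under case-insensitive matching, [^x] effectively becomes [^xX],
-- so an upper-case X is excluded as well.
SELECT 'X' ~* '[^x]';   -- false
SELECT 'X' ~  '[^x]';   -- true: case-sensitive, X is not x
SELECT 'X' ~* '[x]';    -- true: [x] effectively becomes [xX]
</pre><p>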
  997. If newline-sensitive matching is specified, <code class="literal">.</code>
  998. and bracket expressions using <code class="literal">^</code>
  999. will never match the newline character
  1000. (so that matches will never cross newlines unless the RE
  1001. explicitly arranges it)
and <code class="literal">^</code> and <code class="literal">$</code>
will match the empty string after and before a newline,
respectively, in addition to matching at the beginning and end of the
string.
  1006. But the ARE escapes <code class="literal">\A</code> and <code class="literal">\Z</code>
  1007. continue to match beginning or end of string <span class="emphasis"><em>only</em></span>.
  1008. </p><p>
  1009. If partial newline-sensitive matching is specified,
  1010. this affects <code class="literal">.</code> and bracket expressions
  1011. as with newline-sensitive matching, but not <code class="literal">^</code>
  1012. and <code class="literal">$</code>.
  1013. </p><p>
  1014. If inverse partial newline-sensitive matching is specified,
  1015. this affects <code class="literal">^</code> and <code class="literal">$</code>
  1016. as with newline-sensitive matching, but not <code class="literal">.</code>
  1017. and bracket expressions.
  1018. This isn't very useful but is provided for symmetry.
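</p><p>
The three newline modes can be compared side by side. This sketch uses the embedded options <code class="literal">n</code>, <code class="literal">p</code>, and <code class="literal">w</code> and an escape-string literal (<code class="literal">E'...'</code>) to put a newline into the made-up subject string:
</p><pre class="programlisting">
-- Default (non-newline-sensitive): "." matches newline, ^ only at string start.
SELECT E'foo\nbar' ~ 'foo.bar';       -- true
SELECT E'foo\nbar' ~ '^bar';          -- false

-- n: "." stops matching newline, ^ also matches after a newline.
SELECT E'foo\nbar' ~ '(?n)foo.bar';   -- false
SELECT E'foo\nbar' ~ '(?n)^bar';      -- true
SELECT E'foo\nbar' ~ '(?n)\Abar';     -- false: \A is string start only

-- p: affects "." and bracket expressions but not ^ and $.
SELECT E'foo\nbar' ~ '(?p)^bar';      -- false

-- w: affects ^ and $ but not "." and bracket expressions.
SELECT E'foo\nbar' ~ '(?w)foo.bar';   -- true
</pre><p>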
  1019. </p></div><div class="sect3" id="POSIX-LIMITS-COMPATIBILITY"><div class="titlepage"><div><div><h4 class="title">9.7.3.6. Limits and Compatibility</h4></div></div></div><p>
  1020. No particular limit is imposed on the length of REs in this
  1021. implementation. However,
  1022. programs intended to be highly portable should not employ REs longer
  1023. than 256 bytes,
  1024. as a POSIX-compliant implementation can refuse to accept such REs.
  1025. </p><p>
  1026. The only feature of AREs that is actually incompatible with
  1027. POSIX EREs is that <code class="literal">\</code> does not lose its special
  1028. significance inside bracket expressions.
  1029. All other ARE features use syntax which is illegal or has
  1030. undefined or unspecified effects in POSIX EREs;
  1031. the <code class="literal">***</code> syntax of directors likewise is outside the POSIX
  1032. syntax for both BREs and EREs.
  1033. </p><p>
  1034. Many of the ARE extensions are borrowed from Perl, but some have
  1035. been changed to clean them up, and a few Perl extensions are not present.
  1036. Incompatibilities of note include <code class="literal">\b</code>, <code class="literal">\B</code>,
  1037. the lack of special treatment for a trailing newline,
  1038. the addition of complemented bracket expressions to the things
  1039. affected by newline-sensitive matching,
  1040. the restrictions on parentheses and back references in lookahead/lookbehind
  1041. constraints, and the longest/shortest-match (rather than first-match)
  1042. matching semantics.
  1043. </p><p>
  1044. Two significant incompatibilities exist between AREs and the ERE syntax
  1045. recognized by pre-7.4 releases of <span class="productname">PostgreSQL</span>:
  1046. </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
  1047. In AREs, <code class="literal">\</code> followed by an alphanumeric character is either
  1048. an escape or an error, while in previous releases, it was just another
  1049. way of writing the alphanumeric.
  1050. This should not be much of a problem because there was no reason to
  1051. write such a sequence in earlier releases.
  1052. </p></li><li class="listitem"><p>
  1053. In AREs, <code class="literal">\</code> remains a special character within
  1054. <code class="literal">[]</code>, so a literal <code class="literal">\</code> within a bracket
  1055. expression must be written <code class="literal">\\</code>.
  1056. </p></li></ul></div><p>
  1057. </p></div><div class="sect3" id="POSIX-BASIC-REGEXES"><div class="titlepage"><div><div><h4 class="title">9.7.3.7. Basic Regular Expressions</h4></div></div></div><p>
  1058. BREs differ from EREs in several respects.
  1059. In BREs, <code class="literal">|</code>, <code class="literal">+</code>, and <code class="literal">?</code>
  1060. are ordinary characters and there is no equivalent
  1061. for their functionality.
  1062. The delimiters for bounds are
  1063. <code class="literal">\{</code> and <code class="literal">\}</code>,
  1064. with <code class="literal">{</code> and <code class="literal">}</code>
  1065. by themselves ordinary characters.
  1066. The parentheses for nested subexpressions are
  1067. <code class="literal">\(</code> and <code class="literal">\)</code>,
  1068. with <code class="literal">(</code> and <code class="literal">)</code> by themselves ordinary characters.
  1069. <code class="literal">^</code> is an ordinary character except at the beginning of the
  1070. RE or the beginning of a parenthesized subexpression,
  1071. <code class="literal">$</code> is an ordinary character except at the end of the
  1072. RE or the end of a parenthesized subexpression,
  1073. and <code class="literal">*</code> is an ordinary character if it appears at the beginning
  1074. of the RE or the beginning of a parenthesized subexpression
  1075. (after a possible leading <code class="literal">^</code>).
  1076. Finally, single-digit back references are available, and
  1077. <code class="literal">\&lt;</code> and <code class="literal">\&gt;</code>
  1078. are synonyms for
  1079. <code class="literal">[[:&lt;:]]</code> and <code class="literal">[[:&gt;:]]</code>
  1080. respectively; no other escapes are available in BREs.
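</p><p>
A short sketch of BRE mode, selected here with the <code class="literal">b</code> flag (sample strings invented; <code class="literal">standard_conforming_strings = on</code> assumed):
</p><pre class="programlisting">
-- In a BRE, + is an ordinary character.
SELECT regexp_match('a+b', 'a+b', 'b');          -- {a+b}
SELECT regexp_match('aab', 'a+b', 'b');          -- NULL (no match)

-- Bounds are written \{ \} and subexpressions \( \).
SELECT regexp_match('caaat', 'a\{2\}', 'b');     -- {aa}
SELECT regexp_match('abab', '\(ab\)\1', 'b');    -- {ab}
</pre><p>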
  1081. </p></div><div class="sect3" id="POSIX-VS-XQUERY"><div class="titlepage"><div><div><h4 class="title">9.7.3.8. Differences From XQuery (<code class="literal">LIKE_REGEX</code>)</h4></div></div></div><a id="id-1.5.8.12.9.35.2" class="indexterm"></a><a id="id-1.5.8.12.9.35.3" class="indexterm"></a><p>
  1082. Since SQL:2008, the SQL standard includes
  1083. a <code class="literal">LIKE_REGEX</code> operator that performs pattern
  1084. matching according to the XQuery regular expression
  1085. standard. <span class="productname">PostgreSQL</span> does not yet
  1086. implement this operator, but you can get very similar behavior using
  1087. the <code class="function">regexp_match()</code> function, since XQuery
  1088. regular expressions are quite close to the ARE syntax described above.
  1089. </p><p>
  1090. Notable differences between the existing POSIX-based
  1091. regular-expression feature and XQuery regular expressions include:
  1092. </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
  1093. XQuery character class subtraction is not supported. An example of
  1094. this feature is using the following to match only English
  1095. consonants: <code class="literal">[a-z-[aeiou]]</code>.
  1096. </p></li><li class="listitem"><p>
  1097. XQuery character class shorthands <code class="literal">\c</code>,
  1098. <code class="literal">\C</code>, <code class="literal">\i</code>,
  1099. and <code class="literal">\I</code> are not supported.
  1100. </p></li><li class="listitem"><p>
  1101. XQuery character class elements
  1102. using <code class="literal">\p{UnicodeProperty}</code> or the
  1103. inverse <code class="literal">\P{UnicodeProperty}</code> are not supported.
  1104. </p></li><li class="listitem"><p>
  1105. POSIX interprets character classes such as <code class="literal">\w</code>
  1106. (see <a class="xref" href="functions-matching.html#POSIX-CLASS-SHORTHAND-ESCAPES-TABLE" title="Table 9.20. Regular Expression Class-Shorthand Escapes">Table 9.20</a>)
  1107. according to the prevailing locale (which you can control by
  1108. attaching a <code class="literal">COLLATE</code> clause to the operator or
  1109. function). XQuery specifies these classes by reference to Unicode
  1110. character properties, so equivalent behavior is obtained only with
  1111. a locale that follows the Unicode rules.
  1112. </p></li><li class="listitem"><p>
  1113. The SQL standard (not XQuery itself) attempts to cater for more
  1114. variants of <span class="quote">“<span class="quote">newline</span>”</span> than POSIX does. The
  1115. newline-sensitive matching options described above consider only
  1116. ASCII NL (<code class="literal">\n</code>) to be a newline, but SQL would have
  1117. us treat CR (<code class="literal">\r</code>), CRLF (<code class="literal">\r\n</code>)
  1118. (a Windows-style newline), and some Unicode-only characters like
  1119. LINE SEPARATOR (U+2028) as newlines as well.
  1120. Notably, <code class="literal">.</code> and <code class="literal">\s</code> should
  1121. count <code class="literal">\r\n</code> as one character not two according to
  1122. SQL.
  1123. </p></li><li class="listitem"><p>
  1124. Of the character-entry escapes described in
  1125. <a class="xref" href="functions-matching.html#POSIX-CHARACTER-ENTRY-ESCAPES-TABLE" title="Table 9.19. Regular Expression Character-Entry Escapes">Table 9.19</a>,
  1126. XQuery supports only <code class="literal">\n</code>, <code class="literal">\r</code>,
  1127. and <code class="literal">\t</code>.
  1128. </p></li><li class="listitem"><p>
  1129. XQuery does not support
  1130. the <code class="literal">[:<em class="replaceable"><code>name</code></em>:]</code> syntax
  1131. for character classes within bracket expressions.
  1132. </p></li><li class="listitem"><p>
  1133. XQuery does not have lookahead or lookbehind constraints,
  1134. nor any of the constraint escapes described in
  1135. <a class="xref" href="functions-matching.html#POSIX-CONSTRAINT-ESCAPES-TABLE" title="Table 9.21. Regular Expression Constraint Escapes">Table 9.21</a>.
  1136. </p></li><li class="listitem"><p>
  1137. The metasyntax forms described in <a class="xref" href="functions-matching.html#POSIX-METASYNTAX" title="9.7.3.4. Regular Expression Metasyntax">Section 9.7.3.4</a>
  1138. do not exist in XQuery.
  1139. </p></li><li class="listitem"><p>
  1140. The regular expression flag letters defined by XQuery are
  1141. related to but not the same as the option letters for POSIX
  1142. (<a class="xref" href="functions-matching.html#POSIX-EMBEDDED-OPTIONS-TABLE" title="Table 9.23. ARE Embedded-Option Letters">Table 9.23</a>). While the
  1143. <code class="literal">i</code> and <code class="literal">q</code> options behave the
  1144. same, others do not:
  1145. </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: circle; "><li class="listitem"><p>
  1146. XQuery's <code class="literal">s</code> (allow dot to match newline)
  1147. and <code class="literal">m</code> (allow <code class="literal">^</code>
  1148. and <code class="literal">$</code> to match at newlines) flags provide
  1149. access to the same behaviors as
  1150. POSIX's <code class="literal">n</code>, <code class="literal">p</code>
  1151. and <code class="literal">w</code> flags, but they
  1152. do <span class="emphasis"><em>not</em></span> match the behavior of
  1153. POSIX's <code class="literal">s</code> and <code class="literal">m</code> flags.
  1154. Note in particular that dot-matches-newline is the default
  1155. behavior in POSIX but not XQuery.
  1156. </p></li><li class="listitem"><p>
  1157. XQuery's <code class="literal">x</code> (ignore whitespace in pattern) flag
  1158. is noticeably different from POSIX's expanded-mode flag.
  1159. POSIX's <code class="literal">x</code> flag also
  1160. allows <code class="literal">#</code> to begin a comment in the pattern,
  1161. and POSIX will not ignore a whitespace character after a
  1162. backslash.
  1163. </p></li></ul></div><p>
  1164. </p></li></ul></div><p>
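Although character class subtraction is not supported, a similar effect can often be had with a negative lookahead constraint, which AREs do provide. The following is one possible workaround, not an equivalent of the XQuery feature in every case:
</p><pre class="programlisting">
-- Emulate XQuery's [a-z-[aeiou]] (lower-case consonants):
-- reject a vowel with (?!...), then accept any letter in a-z.
SELECT regexp_matches('banana', '(?![aeiou])[a-z]', 'g');
-- Expected rows: {b}, {n}, {n}
</pre><p>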
  1165. </p></div></div></div><div class="navfooter"><hr /><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="functions-bitstring.html">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="functions.html">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="functions-formatting.html">Next</a></td></tr><tr><td width="40%" align="left" valign="top">9.6. Bit String Functions and Operators </td><td width="20%" align="center"><a accesskey="h" href="index.html">Home</a></td><td width="40%" align="right" valign="top"> 9.8. Data Type Formatting Functions</td></tr></table></div></body></html>