gooderp18绿色标准版
您最多选择25个主题 主题必须以字母或数字开头,可以包含连字符 (-),并且长度不得超过35个字符

210 行
14KB

  1. <?xml version="1.0" encoding="UTF-8" standalone="no"?>
  2. <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>70.2. Multivariate Statistics Examples</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets V1.79.1" /><link rel="prev" href="row-estimation-examples.html" title="70.1. Row Estimation Examples" /><link rel="next" href="planner-stats-security.html" title="70.3. Planner Statistics and Security" /></head><body><div xmlns="http://www.w3.org/TR/xhtml1/transitional" class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">70.2. Multivariate Statistics Examples</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="row-estimation-examples.html" title="70.1. Row Estimation Examples">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="planner-stats-details.html" title="Chapter 70. How the Planner Uses Statistics">Up</a></td><th width="60%" align="center">Chapter 70. How the Planner Uses Statistics</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 12.4 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="planner-stats-security.html" title="70.3. Planner Statistics and Security">Next</a></td></tr></table><hr></hr></div><div class="sect1" id="MULTIVARIATE-STATISTICS-EXAMPLES"><div class="titlepage"><div><div><h2 class="title" style="clear: both">70.2. Multivariate Statistics Examples</h2></div></div></div><div class="toc"><dl class="toc"><dt><span class="sect2"><a href="multivariate-statistics-examples.html#FUNCTIONAL-DEPENDENCIES">70.2.1. Functional Dependencies</a></span></dt><dt><span class="sect2"><a href="multivariate-statistics-examples.html#MULTIVARIATE-NDISTINCT-COUNTS">70.2.2. Multivariate N-Distinct Counts</a></span></dt><dt><span class="sect2"><a href="multivariate-statistics-examples.html#MCV-LISTS">70.2.3. MCV Lists</a></span></dt></dl></div><a id="id-1.10.23.5.2" class="indexterm"></a><div class="sect2" id="FUNCTIONAL-DEPENDENCIES"><div class="titlepage"><div><div><h3 class="title">70.2.1. Functional Dependencies</h3></div></div></div><p>
  3. Multivariate correlation can be demonstrated with a very simple data set
  4. — a table with two columns, both containing the same values:
  5. </p><pre class="programlisting">
  6. CREATE TABLE t (a INT, b INT);
  7. INSERT INTO t SELECT i % 100, i % 100 FROM generate_series(1, 10000) s(i);
  8. ANALYZE t;
  9. </pre><p>
  10. As explained in <a class="xref" href="planner-stats.html" title="14.2. Statistics Used by the Planner">Section 14.2</a>, the planner can determine
  11. cardinality of <code class="structname">t</code> using the number of pages and
  12. rows obtained from <code class="structname">pg_class</code>:
  13. </p><pre class="programlisting">
  14. SELECT relpages, reltuples FROM pg_class WHERE relname = 't';
  15. relpages | reltuples
  16. ----------+-----------
  17. 45 | 10000
  18. </pre><p>
  19. The data distribution is very simple; there are only 100 distinct values
  20. in each column, uniformly distributed.
  21. </p><p>
  22. The following example shows the result of estimating a <code class="literal">WHERE</code>
  23. condition on the <code class="structfield">a</code> column:
  24. </p><pre class="programlisting">
  25. EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1;
  26. QUERY PLAN
  27. -------------------------------------------------------------------------------
  28. Seq Scan on t (cost=0.00..170.00 rows=100 width=8) (actual rows=100 loops=1)
  29. Filter: (a = 1)
  30. Rows Removed by Filter: 9900
  31. </pre><p>
  32. The planner examines the condition and determines the selectivity
  33. of this clause to be 1%. By comparing this estimate and the actual
  34. number of rows, we see that the estimate is very accurate
  35. (in fact exact, as the table is very small). Changing the
  36. <code class="literal">WHERE</code> condition to use the <code class="structfield">b</code> column, an
  37. identical plan is generated. But observe what happens if we apply the same
  38. condition on both columns, combining them with <code class="literal">AND</code>:
  39. </p><pre class="programlisting">
  40. EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
  41. QUERY PLAN
  42. -----------------------------------------------------------------------------
  43. Seq Scan on t (cost=0.00..195.00 rows=1 width=8) (actual rows=100 loops=1)
  44. Filter: ((a = 1) AND (b = 1))
  45. Rows Removed by Filter: 9900
  46. </pre><p>
  47. The planner estimates the selectivity for each condition individually,
  48. arriving at the same 1% estimates as above. Then it assumes that the
  49. conditions are independent, and so it multiplies their selectivities,
  50. producing a final selectivity estimate of just 0.01%.
  51. This is a significant underestimate, as the actual number of rows
  52. matching the conditions (100) is two orders of magnitude higher.
  53. </p><p>
  54. This problem can be fixed by creating a statistics object that
  55. directs <code class="command">ANALYZE</code> to calculate functional-dependency
  56. multivariate statistics on the two columns:
  57. </p><pre class="programlisting">
  58. CREATE STATISTICS stts (dependencies) ON a, b FROM t;
  59. ANALYZE t;
  60. EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
  61. QUERY PLAN
  62. -------------------------------------------------------------------------------
  63. Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual rows=100 loops=1)
  64. Filter: ((a = 1) AND (b = 1))
  65. Rows Removed by Filter: 9900
  66. </pre><p>
  67. </p></div><div class="sect2" id="MULTIVARIATE-NDISTINCT-COUNTS"><div class="titlepage"><div><div><h3 class="title">70.2.2. Multivariate N-Distinct Counts</h3></div></div></div><p>
  68. A similar problem occurs with estimation of the cardinality of sets of
  69. multiple columns, such as the number of groups that would be generated by
  70. a <code class="command">GROUP BY</code> clause. When <code class="command">GROUP BY</code>
  71. lists a single column, the n-distinct estimate (which is visible as the
  72. estimated number of rows returned by the HashAggregate node) is very
  73. accurate:
  74. </p><pre class="programlisting">
  75. EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a;
  76. QUERY PLAN
  77. -----------------------------------------------------------------------------------------
  78. HashAggregate (cost=195.00..196.00 rows=100 width=12) (actual rows=100 loops=1)
  79. Group Key: a
  80. -&gt; Seq Scan on t (cost=0.00..145.00 rows=10000 width=4) (actual rows=10000 loops=1)
  81. </pre><p>
  82. But without multivariate statistics, the estimate for the number of
  83. groups in a query with two columns in <code class="command">GROUP BY</code>, as
  84. in the following example, is off by an order of magnitude:
  85. </p><pre class="programlisting">
  86. EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
  87. QUERY PLAN
  88. --------------------------------------------------------------------------------------------
  89. HashAggregate (cost=220.00..230.00 rows=1000 width=16) (actual rows=100 loops=1)
  90. Group Key: a, b
  91. -&gt; Seq Scan on t (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000 loops=1)
  92. </pre><p>
  93. By redefining the statistics object to include n-distinct counts for the
  94. two columns, the estimate is much improved:
  95. </p><pre class="programlisting">
  96. DROP STATISTICS stts;
  97. CREATE STATISTICS stts (dependencies, ndistinct) ON a, b FROM t;
  98. ANALYZE t;
  99. EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
  100. QUERY PLAN
  101. --------------------------------------------------------------------------------------------
  102. HashAggregate (cost=220.00..221.00 rows=100 width=16) (actual rows=100 loops=1)
  103. Group Key: a, b
  104. -&gt; Seq Scan on t (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000 loops=1)
  105. </pre><p>
  106. </p></div><div class="sect2" id="MCV-LISTS"><div class="titlepage"><div><div><h3 class="title">70.2.3. MCV Lists</h3></div></div></div><p>
  107. As explained in <a class="xref" href="multivariate-statistics-examples.html#FUNCTIONAL-DEPENDENCIES" title="70.2.1. Functional Dependencies">Section 70.2.1</a>, functional
  108. dependencies are very cheap and efficient type of statistics, but their
  109. main limitation is their global nature (only tracking dependencies at
  110. the column level, not between individual column values).
  111. </p><p>
  112. This section introduces multivariate variant of <acronym class="acronym">MCV</acronym>
  113. (most-common values) lists, a straightforward extension of the per-column
  114. statistics described in <a class="xref" href="row-estimation-examples.html" title="70.1. Row Estimation Examples">Section 70.1</a>. These
  115. statistics address the limitation by storing individual values, but it is
  116. naturally more expensive, both in terms of building the statistics in
  117. <code class="command">ANALYZE</code>, storage and planning time.
  118. </p><p>
  119. Let's look at the query from <a class="xref" href="multivariate-statistics-examples.html#FUNCTIONAL-DEPENDENCIES" title="70.2.1. Functional Dependencies">Section 70.2.1</a>
  120. again, but this time with a <acronym class="acronym">MCV</acronym> list created on the
  121. same set of columns (be sure to drop the functional dependencies, to
  122. make sure the planner uses the newly created statistics).
  123. </p><pre class="programlisting">
  124. DROP STATISTICS stts;
  125. CREATE STATISTICS stts2 (mcv) ON a, b FROM t;
  126. ANALYZE t;
  127. EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
  128. QUERY PLAN
  129. -------------------------------------------------------------------------------
  130. Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual rows=100 loops=1)
  131. Filter: ((a = 1) AND (b = 1))
  132. Rows Removed by Filter: 9900
  133. </pre><p>
  134. The estimate is as accurate as with the functional dependencies, mostly
  135. thanks to the table being fairly small and having a simple distribution
  136. with a low number of distinct values. Before looking at the second query,
  137. which was not handled by functional dependencies particularly well,
  138. let's inspect the <acronym class="acronym">MCV</acronym> list a bit.
  139. </p><p>
  140. Inspecting the <acronym class="acronym">MCV</acronym> list is possible using
  141. <code class="function">pg_mcv_list_items</code> set-returning function.
  142. </p><pre class="programlisting">
  143. SELECT m.* FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid),
  144. pg_mcv_list_items(stxdmcv) m WHERE stxname = 'stts2';
  145. index | values | nulls | frequency | base_frequency
  146. -------+----------+-------+-----------+----------------
  147. 0 | {0, 0} | {f,f} | 0.01 | 0.0001
  148. 1 | {1, 1} | {f,f} | 0.01 | 0.0001
  149. ...
  150. 49 | {49, 49} | {f,f} | 0.01 | 0.0001
  151. 50 | {50, 50} | {f,f} | 0.01 | 0.0001
  152. ...
  153. 97 | {97, 97} | {f,f} | 0.01 | 0.0001
  154. 98 | {98, 98} | {f,f} | 0.01 | 0.0001
  155. 99 | {99, 99} | {f,f} | 0.01 | 0.0001
  156. (100 rows)
  157. </pre><p>
  158. This confirms there are 100 distinct combinations in the two columns, and
  159. all of them are about equally likely (1% frequency for each one). The
  160. base frequency is the frequency computed from per-column statistics, as if
  161. there were no multi-column statistics. Had there been any null values in
  162. either of the columns, this would be identified in the
  163. <code class="structfield">nulls</code> column.
  164. </p><p>
  165. When estimating the selectivity, the planner applies all the conditions
  166. on items in the <acronym class="acronym">MCV</acronym> list, and then sums the frequencies
  167. of the matching ones. See <code class="function">mcv_clauselist_selectivity</code>
  168. in <code class="filename">src/backend/statistics/mcv.c</code> for details.
  169. </p><p>
  170. Compared to functional dependencies, <acronym class="acronym">MCV</acronym> lists have two
  171. major advantages. Firstly, the list stores actual values, making it possible
  172. to decide which combinations are compatible.
  173. </p><pre class="programlisting">
  174. EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 10;
  175. QUERY PLAN
  176. ---------------------------------------------------------------------------
  177. Seq Scan on t (cost=0.00..195.00 rows=1 width=8) (actual rows=0 loops=1)
  178. Filter: ((a = 1) AND (b = 10))
  179. Rows Removed by Filter: 10000
  180. </pre><p>
  181. Secondly, <acronym class="acronym">MCV</acronym> lists handle a wider range of clause types,
  182. not just equality clauses like functional dependencies. For example,
  183. consider the following range query for the same table:
  184. </p><pre class="programlisting">
  185. EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a &lt;= 49 AND b &gt; 49;
  186. QUERY PLAN
  187. ---------------------------------------------------------------------------
  188. Seq Scan on t (cost=0.00..195.00 rows=1 width=8) (actual rows=0 loops=1)
  189. Filter: ((a &lt;= 49) AND (b &gt; 49))
  190. Rows Removed by Filter: 10000
  191. </pre><p>
  192. </p></div></div><div class="navfooter"><hr /><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="row-estimation-examples.html">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="planner-stats-details.html">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="planner-stats-security.html">Next</a></td></tr><tr><td width="40%" align="left" valign="top">70.1. Row Estimation Examples </td><td width="20%" align="center"><a accesskey="h" href="index.html">Home</a></td><td width="40%" align="right" valign="top"> 70.3. Planner Statistics and Security</td></tr></table></div></body></html>
上海开阖软件有限公司 沪ICP备12045867号-1