gooderp18绿色标准版
您最多选择25个主题 主题必须以字母或数字开头,可以包含连字符 (-),并且长度不得超过35个字符

399 行
20KB

  1. <?xml version="1.0" encoding="UTF-8" standalone="no"?>
  2. <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>70.1. Row Estimation Examples</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets V1.79.1" /><link rel="prev" href="planner-stats-details.html" title="Chapter 70. How the Planner Uses Statistics" /><link rel="next" href="multivariate-statistics-examples.html" title="70.2. Multivariate Statistics Examples" /></head><body><div xmlns="http://www.w3.org/TR/xhtml1/transitional" class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">70.1. Row Estimation Examples</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="planner-stats-details.html" title="Chapter 70. How the Planner Uses Statistics">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="planner-stats-details.html" title="Chapter 70. How the Planner Uses Statistics">Up</a></td><th width="60%" align="center">Chapter 70. How the Planner Uses Statistics</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 12.4 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="multivariate-statistics-examples.html" title="70.2. Multivariate Statistics Examples">Next</a></td></tr></table><hr></hr></div><div class="sect1" id="ROW-ESTIMATION-EXAMPLES"><div class="titlepage"><div><div><h2 class="title" style="clear: both">70.1. Row Estimation Examples</h2></div></div></div><a id="id-1.10.23.4.2" class="indexterm"></a><p>
  3. The examples shown below use tables in the <span class="productname">PostgreSQL</span>
  4. regression test database.
  5. The outputs shown are taken from version 8.3.
  6. The behavior of earlier (or later) versions might vary.
  7. Note also that since <code class="command">ANALYZE</code> uses random sampling
  8. while producing statistics, the results will change slightly after
  9. any new <code class="command">ANALYZE</code>.
  10. </p><p>
  11. Let's start with a very simple query:
  12. </p><pre class="programlisting">
  13. EXPLAIN SELECT * FROM tenk1;
  14. QUERY PLAN
  15. -------------------------------------------------------------
  16. Seq Scan on tenk1 (cost=0.00..458.00 rows=10000 width=244)
  17. </pre><p>
  18. How the planner determines the cardinality of <code class="structname">tenk1</code>
  19. is covered in <a class="xref" href="planner-stats.html" title="14.2. Statistics Used by the Planner">Section 14.2</a>, but is repeated here for
  20. completeness. The number of pages and rows is looked up in
  21. <code class="structname">pg_class</code>:
  22. </p><pre class="programlisting">
  23. SELECT relpages, reltuples FROM pg_class WHERE relname = 'tenk1';
  24. relpages | reltuples
  25. ----------+-----------
  26. 358 | 10000
  27. </pre><p>
  28. These numbers are current as of the last <code class="command">VACUUM</code> or
  29. <code class="command">ANALYZE</code> on the table. The planner then fetches the
  30. actual current number of pages in the table (this is a cheap operation,
  31. not requiring a table scan). If that is different from
  32. <code class="structfield">relpages</code> then
  33. <code class="structfield">reltuples</code> is scaled accordingly to
  34. arrive at a current number-of-rows estimate. In the example above, the value of
  35. <code class="structfield">relpages</code> is up-to-date so the rows estimate is
  36. the same as <code class="structfield">reltuples</code>.
  37. </p><p>
  38. Let's move on to an example with a range condition in its
  39. <code class="literal">WHERE</code> clause:
  40. </p><pre class="programlisting">
  41. EXPLAIN SELECT * FROM tenk1 WHERE unique1 &lt; 1000;
  42. QUERY PLAN
  43. --------------------------------------------------------------------------------
  44. Bitmap Heap Scan on tenk1 (cost=24.06..394.64 rows=1007 width=244)
  45. Recheck Cond: (unique1 &lt; 1000)
  46. -&gt; Bitmap Index Scan on tenk1_unique1 (cost=0.00..23.80 rows=1007 width=0)
  47. Index Cond: (unique1 &lt; 1000)
  48. </pre><p>
  49. The planner examines the <code class="literal">WHERE</code> clause condition
  50. and looks up the selectivity function for the operator
  51. <code class="literal">&lt;</code> in <code class="structname">pg_operator</code>.
  52. This is held in the column <code class="structfield">oprrest</code>,
  53. and the entry in this case is <code class="function">scalarltsel</code>.
  54. The <code class="function">scalarltsel</code> function retrieves the histogram for
  55. <code class="structfield">unique1</code> from
  56. <code class="structname">pg_statistic</code>. For manual queries it is more
  57. convenient to look in the simpler <code class="structname">pg_stats</code>
  58. view:
  59. </p><pre class="programlisting">
  60. SELECT histogram_bounds FROM pg_stats
  61. WHERE tablename='tenk1' AND attname='unique1';
  62. histogram_bounds
  63. ------------------------------------------------------
  64. {0,993,1997,3050,4040,5036,5957,7057,8029,9016,9995}
  65. </pre><p>
  66. Next the fraction of the histogram occupied by <span class="quote">“<span class="quote">&lt; 1000</span>”</span>
  67. is worked out. This is the selectivity. The histogram divides the range
  68. into equal frequency buckets, so all we have to do is locate the bucket
  69. that our value is in and count <span class="emphasis"><em>part</em></span> of it and
  70. <span class="emphasis"><em>all</em></span> of the ones before. The value 1000 is clearly in
  71. the second bucket (993-1997). Assuming a linear distribution of
  72. values inside each bucket, we can calculate the selectivity as:
  73. </p><pre class="programlisting">
  74. selectivity = (1 + (1000 - bucket[2].min)/(bucket[2].max - bucket[2].min))/num_buckets
  75. = (1 + (1000 - 993)/(1997 - 993))/10
  76. = 0.100697
  77. </pre><p>
  78. that is, one whole bucket plus a linear fraction of the second, divided by
  79. the number of buckets. The estimated number of rows can now be calculated as
  80. the product of the selectivity and the cardinality of
  81. <code class="structname">tenk1</code>:
  82. </p><pre class="programlisting">
  83. rows = rel_cardinality * selectivity
  84. = 10000 * 0.100697
  85. = 1007 (rounding off)
  86. </pre><p>
  87. </p><p>
  88. Next let's consider an example with an equality condition in its
  89. <code class="literal">WHERE</code> clause:
  90. </p><pre class="programlisting">
  91. EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'CRAAAA';
  92. QUERY PLAN
  93. ----------------------------------------------------------
  94. Seq Scan on tenk1 (cost=0.00..483.00 rows=30 width=244)
  95. Filter: (stringu1 = 'CRAAAA'::name)
  96. </pre><p>
  97. Again the planner examines the <code class="literal">WHERE</code> clause condition
  98. and looks up the selectivity function for <code class="literal">=</code>, which is
  99. <code class="function">eqsel</code>. For equality estimation the histogram is
  100. not useful; instead the list of <em class="firstterm">most
  101. common values</em> (<acronym class="acronym">MCV</acronym>s) is used to determine the
  102. selectivity. Let's have a look at the MCVs, with some additional columns
  103. that will be useful later:
  104. </p><pre class="programlisting">
  105. SELECT null_frac, n_distinct, most_common_vals, most_common_freqs FROM pg_stats
  106. WHERE tablename='tenk1' AND attname='stringu1';
  107. null_frac | 0
  108. n_distinct | 676
  109. most_common_vals | {EJAAAA,BBAAAA,CRAAAA,FCAAAA,FEAAAA,GSAAAA,JOAAAA,MCAAAA,NAAAAA,WGAAAA}
  110. most_common_freqs | {0.00333333,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003}
  111. </pre><p>
  112. Since <code class="literal">CRAAAA</code> appears in the list of MCVs, the selectivity is
  113. merely the corresponding entry in the list of most common frequencies
  114. (<acronym class="acronym">MCF</acronym>s):
  115. </p><pre class="programlisting">
  116. selectivity = mcf[3]
  117. = 0.003
  118. </pre><p>
  119. As before, the estimated number of rows is just the product of this with the
  120. cardinality of <code class="structname">tenk1</code>:
  121. </p><pre class="programlisting">
  122. rows = 10000 * 0.003
  123. = 30
  124. </pre><p>
  125. </p><p>
  126. Now consider the same query, but with a constant that is not in the
  127. <acronym class="acronym">MCV</acronym> list:
  128. </p><pre class="programlisting">
  129. EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'xxx';
  130. QUERY PLAN
  131. ----------------------------------------------------------
  132. Seq Scan on tenk1 (cost=0.00..483.00 rows=15 width=244)
  133. Filter: (stringu1 = 'xxx'::name)
  134. </pre><p>
  135. This is quite a different problem: how to estimate the selectivity when the
  136. value is <span class="emphasis"><em>not</em></span> in the <acronym class="acronym">MCV</acronym> list.
  137. The approach is to use the fact that the value is not in the list,
  138. combined with the knowledge of the frequencies for all of the
  139. <acronym class="acronym">MCV</acronym>s:
  140. </p><pre class="programlisting">
  141. selectivity = (1 - sum(mvf))/(num_distinct - num_mcv)
  142. = (1 - (0.00333333 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003 +
  143. 0.003 + 0.003 + 0.003 + 0.003))/(676 - 10)
  144. = 0.0014559
  145. </pre><p>
  146. That is, add up all the frequencies for the <acronym class="acronym">MCV</acronym>s and
  147. subtract them from one, then
  148. divide by the number of <span class="emphasis"><em>other</em></span> distinct values.
  149. This amounts to assuming that the fraction of the column that is not any
  150. of the MCVs is evenly distributed among all the other distinct values.
  151. Notice that there are no null values so we don't have to worry about those
  152. (otherwise we'd subtract the null fraction from the numerator as well).
  153. The estimated number of rows is then calculated as usual:
  154. </p><pre class="programlisting">
  155. rows = 10000 * 0.0014559
  156. = 15 (rounding off)
  157. </pre><p>
  158. </p><p>
  159. The previous example with <code class="literal">unique1 &lt; 1000</code> was an
  160. oversimplification of what <code class="function">scalarltsel</code> really does;
  161. now that we have seen an example of the use of MCVs, we can fill in some
  162. more detail. The example was correct as far as it went, because since
  163. <code class="structfield">unique1</code> is a unique column it has no MCVs (obviously, no
  164. value is any more common than any other value). For a non-unique
  165. column, there will normally be both a histogram and an MCV list, and
  166. <span class="emphasis"><em>the histogram does not include the portion of the column
  167. population represented by the MCVs</em></span>. We do things this way because
  168. it allows more precise estimation. In this situation
  169. <code class="function">scalarltsel</code> directly applies the condition (e.g.,
  170. <span class="quote">“<span class="quote">&lt; 1000</span>”</span>) to each value of the MCV list, and adds up the
  171. frequencies of the MCVs for which the condition is true. This gives
  172. an exact estimate of the selectivity within the portion of the table
  173. that is MCVs. The histogram is then used in the same way as above
  174. to estimate the selectivity in the portion of the table that is not
  175. MCVs, and then the two numbers are combined to estimate the overall
  176. selectivity. For example, consider
  177. </p><pre class="programlisting">
  178. EXPLAIN SELECT * FROM tenk1 WHERE stringu1 &lt; 'IAAAAA';
  179. QUERY PLAN
  180. ------------------------------------------------------------
  181. Seq Scan on tenk1 (cost=0.00..483.00 rows=3077 width=244)
  182. Filter: (stringu1 &lt; 'IAAAAA'::name)
  183. </pre><p>
  184. We already saw the MCV information for <code class="structfield">stringu1</code>,
  185. and here is its histogram:
  186. </p><pre class="programlisting">
  187. SELECT histogram_bounds FROM pg_stats
  188. WHERE tablename='tenk1' AND attname='stringu1';
  189. histogram_bounds
  190. --------------------------------------------------------------------------------
  191. {AAAAAA,CQAAAA,FRAAAA,IBAAAA,KRAAAA,NFAAAA,PSAAAA,SGAAAA,VAAAAA,XLAAAA,ZZAAAA}
  192. </pre><p>
  193. Checking the MCV list, we find that the condition <code class="literal">stringu1 &lt;
  194. 'IAAAAA'</code> is satisfied by the first six entries and not the last four,
  195. so the selectivity within the MCV part of the population is
  196. </p><pre class="programlisting">
  197. selectivity = sum(relevant mvfs)
  198. = 0.00333333 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003
  199. = 0.01833333
  200. </pre><p>
  201. Summing all the MCFs also tells us that the total fraction of the
  202. population represented by MCVs is 0.03033333, and therefore the
  203. fraction represented by the histogram is 0.96966667 (again, there
  204. are no nulls, else we'd have to exclude them here). We can see
  205. that the value <code class="literal">IAAAAA</code> falls nearly at the end of the
  206. third histogram bucket. Using some rather cheesy assumptions
  207. about the frequency of different characters, the planner arrives
  208. at the estimate 0.298387 for the portion of the histogram population
  209. that is less than <code class="literal">IAAAAA</code>. We then combine the estimates
  210. for the MCV and non-MCV populations:
  211. </p><pre class="programlisting">
  212. selectivity = mcv_selectivity + histogram_selectivity * histogram_fraction
  213. = 0.01833333 + 0.298387 * 0.96966667
  214. = 0.307669
  215. rows = 10000 * 0.307669
  216. = 3077 (rounding off)
  217. </pre><p>
  218. In this particular example, the correction from the MCV list is fairly
  219. small, because the column distribution is actually quite flat (the
  220. statistics showing these particular values as being more common than
  221. others are mostly due to sampling error). In a more typical case where
  222. some values are significantly more common than others, this complicated
  223. process gives a useful improvement in accuracy because the selectivity
  224. for the most common values is found exactly.
  225. </p><p>
  226. Now let's consider a case with more than one
  227. condition in the <code class="literal">WHERE</code> clause:
  228. </p><pre class="programlisting">
  229. EXPLAIN SELECT * FROM tenk1 WHERE unique1 &lt; 1000 AND stringu1 = 'xxx';
  230. QUERY PLAN
  231. --------------------------------------------------------------------------------
  232. Bitmap Heap Scan on tenk1 (cost=23.80..396.91 rows=1 width=244)
  233. Recheck Cond: (unique1 &lt; 1000)
  234. Filter: (stringu1 = 'xxx'::name)
  235. -&gt; Bitmap Index Scan on tenk1_unique1 (cost=0.00..23.80 rows=1007 width=0)
  236. Index Cond: (unique1 &lt; 1000)
  237. </pre><p>
  238. The planner assumes that the two conditions are independent, so that
  239. the individual selectivities of the clauses can be multiplied together:
  240. </p><pre class="programlisting">
  241. selectivity = selectivity(unique1 &lt; 1000) * selectivity(stringu1 = 'xxx')
  242. = 0.100697 * 0.0014559
  243. = 0.0001466
  244. rows = 10000 * 0.0001466
  245. = 1 (rounding off)
  246. </pre><p>
  247. Notice that the number of rows estimated to be returned from the bitmap
  248. index scan reflects only the condition used with the index; this is
  249. important since it affects the cost estimate for the subsequent heap
  250. fetches.
  251. </p><p>
  252. Finally we will examine a query that involves a join:
  253. </p><pre class="programlisting">
  254. EXPLAIN SELECT * FROM tenk1 t1, tenk2 t2
  255. WHERE t1.unique1 &lt; 50 AND t1.unique2 = t2.unique2;
  256. QUERY PLAN
  257. --------------------------------------------------------------------------------------
  258. Nested Loop (cost=4.64..456.23 rows=50 width=488)
  259. -&gt; Bitmap Heap Scan on tenk1 t1 (cost=4.64..142.17 rows=50 width=244)
  260. Recheck Cond: (unique1 &lt; 50)
  261. -&gt; Bitmap Index Scan on tenk1_unique1 (cost=0.00..4.63 rows=50 width=0)
  262. Index Cond: (unique1 &lt; 50)
  263. -&gt; Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.00..6.27 rows=1 width=244)
  264. Index Cond: (unique2 = t1.unique2)
  265. </pre><p>
  266. The restriction on <code class="structname">tenk1</code>,
  267. <code class="literal">unique1 &lt; 50</code>,
  268. is evaluated before the nested-loop join.
  269. This is handled analogously to the previous range example. This time the
  270. value 50 falls into the first bucket of the
  271. <code class="structfield">unique1</code> histogram:
  272. </p><pre class="programlisting">
  273. selectivity = (0 + (50 - bucket[1].min)/(bucket[1].max - bucket[1].min))/num_buckets
  274. = (0 + (50 - 0)/(993 - 0))/10
  275. = 0.005035
  276. rows = 10000 * 0.005035
  277. = 50 (rounding off)
  278. </pre><p>
  279. The restriction for the join is <code class="literal">t2.unique2 = t1.unique2</code>.
  280. The operator is just
  281. our familiar <code class="literal">=</code>, however the selectivity function is
  282. obtained from the <code class="structfield">oprjoin</code> column of
  283. <code class="structname">pg_operator</code>, and is <code class="function">eqjoinsel</code>.
  284. <code class="function">eqjoinsel</code> looks up the statistical information for both
  285. <code class="structname">tenk2</code> and <code class="structname">tenk1</code>:
  286. </p><pre class="programlisting">
  287. SELECT tablename, null_frac,n_distinct, most_common_vals FROM pg_stats
  288. WHERE tablename IN ('tenk1', 'tenk2') AND attname='unique2';
  289. tablename | null_frac | n_distinct | most_common_vals
  290. -----------+-----------+------------+------------------
  291. tenk1 | 0 | -1 |
  292. tenk2 | 0 | -1 |
  293. </pre><p>
  294. In this case there is no <acronym class="acronym">MCV</acronym> information for
  295. <code class="structfield">unique2</code> because all the values appear to be
  296. unique, so we use an algorithm that relies only on the number of
  297. distinct values for both relations together with their null fractions:
  298. </p><pre class="programlisting">
  299. selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/num_distinct1, 1/num_distinct2)
  300. = (1 - 0) * (1 - 0) / max(10000, 10000)
  301. = 0.0001
  302. </pre><p>
  303. This is, subtract the null fraction from one for each of the relations,
  304. and divide by the maximum of the numbers of distinct values.
  305. The number of rows
  306. that the join is likely to emit is calculated as the cardinality of the
  307. Cartesian product of the two inputs, multiplied by the
  308. selectivity:
  309. </p><pre class="programlisting">
  310. rows = (outer_cardinality * inner_cardinality) * selectivity
  311. = (50 * 10000) * 0.0001
  312. = 50
  313. </pre><p>
  314. </p><p>
  315. Had there been MCV lists for the two columns,
  316. <code class="function">eqjoinsel</code> would have used direct comparison of the MCV
  317. lists to determine the join selectivity within the part of the column
  318. populations represented by the MCVs. The estimate for the remainder of the
  319. populations follows the same approach shown here.
  320. </p><p>
  321. Notice that we showed <code class="literal">inner_cardinality</code> as 10000, that is,
  322. the unmodified size of <code class="structname">tenk2</code>. It might appear from
  323. inspection of the <code class="command">EXPLAIN</code> output that the estimate of
  324. join rows comes from 50 * 1, that is, the number of outer rows times
  325. the estimated number of rows obtained by each inner index scan on
  326. <code class="structname">tenk2</code>. But this is not the case: the join relation size
  327. is estimated before any particular join plan has been considered. If
  328. everything is working well then the two ways of estimating the join
  329. size will produce about the same answer, but due to round-off error and
  330. other factors they sometimes diverge significantly.
  331. </p><p>
  332. For those interested in further details, estimation of the size of
  333. a table (before any <code class="literal">WHERE</code> clauses) is done in
  334. <code class="filename">src/backend/optimizer/util/plancat.c</code>. The generic
  335. logic for clause selectivities is in
  336. <code class="filename">src/backend/optimizer/path/clausesel.c</code>. The
  337. operator-specific selectivity functions are mostly found
  338. in <code class="filename">src/backend/utils/adt/selfuncs.c</code>.
  339. </p></div><div class="navfooter"><hr /><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="planner-stats-details.html">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="planner-stats-details.html">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="multivariate-statistics-examples.html">Next</a></td></tr><tr><td width="40%" align="left" valign="top">Chapter 70. How the Planner Uses Statistics </td><td width="20%" align="center"><a accesskey="h" href="index.html">Home</a></td><td width="40%" align="right" valign="top"> 70.2. Multivariate Statistics Examples</td></tr></table></div></body></html>
上海开阖软件有限公司 沪ICP备12045867号-1