gooderp18绿色标准版
You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

335 line
22KB

  1. <?xml version="1.0" encoding="UTF-8" standalone="no"?>
  2. <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>14.2. Statistics Used by the Planner</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets V1.79.1" /><link rel="prev" href="using-explain.html" title="14.1. Using EXPLAIN" /><link rel="next" href="explicit-joins.html" title="14.3. Controlling the Planner with Explicit JOIN Clauses" /></head><body><div xmlns="http://www.w3.org/TR/xhtml1/transitional" class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">14.2. Statistics Used by the Planner</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="using-explain.html" title="14.1. Using EXPLAIN">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="performance-tips.html" title="Chapter 14. Performance Tips">Up</a></td><th width="60%" align="center">Chapter 14. Performance Tips</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 12.4 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="explicit-joins.html" title="14.3. Controlling the Planner with Explicit JOIN Clauses">Next</a></td></tr></table><hr></hr></div><div class="sect1" id="PLANNER-STATS"><div class="titlepage"><div><div><h2 class="title" style="clear: both">14.2. Statistics Used by the Planner</h2></div></div></div><div class="toc"><dl class="toc"><dt><span class="sect2"><a href="planner-stats.html#id-1.5.13.5.3">14.2.1. Single-Column Statistics</a></span></dt><dt><span class="sect2"><a href="planner-stats.html#PLANNER-STATS-EXTENDED">14.2.2. Extended Statistics</a></span></dt></dl></div><a id="id-1.5.13.5.2" class="indexterm"></a><div class="sect2" id="id-1.5.13.5.3"><div class="titlepage"><div><div><h3 class="title">14.2.1. Single-Column Statistics</h3></div></div></div><p>
  3. As we saw in the previous section, the query planner needs to estimate
  4. the number of rows retrieved by a query in order to make good choices
  5. of query plans. This section provides a quick look at the statistics
  6. that the system uses for these estimates.
  7. </p><p>
  8. One component of the statistics is the total number of entries in
  9. each table and index, as well as the number of disk blocks occupied
  10. by each table and index. This information is kept in the table
  11. <a class="link" href="catalog-pg-class.html" title="51.11. pg_class"><code class="structname">pg_class</code></a>,
  12. in the columns <code class="structfield">reltuples</code> and
  13. <code class="structfield">relpages</code>. We can look at it with
  14. queries similar to this one:
  15. </p><pre class="screen">
  16. SELECT relname, relkind, reltuples, relpages
  17. FROM pg_class
  18. WHERE relname LIKE 'tenk1%';
  19. relname | relkind | reltuples | relpages
  20. ----------------------+---------+-----------+----------
  21. tenk1 | r | 10000 | 358
  22. tenk1_hundred | i | 10000 | 30
  23. tenk1_thous_tenthous | i | 10000 | 30
  24. tenk1_unique1 | i | 10000 | 30
  25. tenk1_unique2 | i | 10000 | 30
  26. (5 rows)
  27. </pre><p>
  28. Here we can see that <code class="structname">tenk1</code> contains 10000
  29. rows, as do its indexes, but the indexes are (unsurprisingly) much
  30. smaller than the table.
  31. </p><p>
  32. For efficiency reasons, <code class="structfield">reltuples</code>
  33. and <code class="structfield">relpages</code> are not updated on-the-fly,
  34. and so they usually contain somewhat out-of-date values.
  35. They are updated by <code class="command">VACUUM</code>, <code class="command">ANALYZE</code>, and a
  36. few DDL commands such as <code class="command">CREATE INDEX</code>. A <code class="command">VACUUM</code>
  37. or <code class="command">ANALYZE</code> operation that does not scan the entire table
  38. (which is commonly the case) will incrementally update the
  39. <code class="structfield">reltuples</code> count on the basis of the part
  40. of the table it did scan, resulting in an approximate value.
  41. In any case, the planner
  42. will scale the values it finds in <code class="structname">pg_class</code>
  43. to match the current physical table size, thus obtaining a closer
  44. approximation.
  45. </p><a id="id-1.5.13.5.3.5" class="indexterm"></a><p>
  46. Most queries retrieve only a fraction of the rows in a table, due
  47. to <code class="literal">WHERE</code> clauses that restrict the rows to be
  48. examined. The planner thus needs to make an estimate of the
  49. <em class="firstterm">selectivity</em> of <code class="literal">WHERE</code> clauses, that is,
  50. the fraction of rows that match each condition in the
  51. <code class="literal">WHERE</code> clause. The information used for this task is
  52. stored in the
  53. <a class="link" href="catalog-pg-statistic.html" title="51.50. pg_statistic"><code class="structname">pg_statistic</code></a>
  54. system catalog. Entries in <code class="structname">pg_statistic</code>
  55. are updated by the <code class="command">ANALYZE</code> and <code class="command">VACUUM
  56. ANALYZE</code> commands, and are always approximate even when freshly
  57. updated.
  58. </p><a id="id-1.5.13.5.3.7" class="indexterm"></a><p>
  59. Rather than look at <code class="structname">pg_statistic</code> directly,
  60. it's better to look at its view
  61. <a class="link" href="view-pg-stats.html" title="51.88. pg_stats"><code class="structname">pg_stats</code></a>
  62. when examining the statistics manually. <code class="structname">pg_stats</code>
  63. is designed to be more easily readable. Furthermore,
  64. <code class="structname">pg_stats</code> is readable by all, whereas
  65. <code class="structname">pg_statistic</code> is only readable by a superuser.
  66. (This prevents unprivileged users from learning something about
  67. the contents of other people's tables from the statistics. The
  68. <code class="structname">pg_stats</code> view is restricted to show only
  69. rows about tables that the current user can read.)
  70. For example, we might do:
  71. </p><pre class="screen">
  72. SELECT attname, inherited, n_distinct,
  73. array_to_string(most_common_vals, E'\n') as most_common_vals
  74. FROM pg_stats
  75. WHERE tablename = 'road';
  76. attname | inherited | n_distinct | most_common_vals
  77. ---------+-----------+------------+------------------------------------
  78. name | f | -0.363388 | I- 580 Ramp+
  79. | | | I- 880 Ramp+
  80. | | | Sp Railroad +
  81. | | | I- 580 +
  82. | | | I- 680 Ramp
  83. name | t | -0.284859 | I- 880 Ramp+
  84. | | | I- 580 Ramp+
  85. | | | I- 680 Ramp+
  86. | | | I- 580 +
  87. | | | State Hwy 13 Ramp
  88. (2 rows)
  89. </pre><p>
  90. Note that two rows are displayed for the same column, one corresponding
  91. to the complete inheritance hierarchy starting at the
  92. <code class="literal">road</code> table (<code class="literal">inherited</code>=<code class="literal">t</code>),
  93. and another one including only the <code class="literal">road</code> table itself
  94. (<code class="literal">inherited</code>=<code class="literal">f</code>).
  95. </p><p>
  96. The amount of information stored in <code class="structname">pg_statistic</code>
  97. by <code class="command">ANALYZE</code>, in particular the maximum number of entries in the
  98. <code class="structfield">most_common_vals</code> and <code class="structfield">histogram_bounds</code>
  99. arrays for each column, can be set on a
  100. column-by-column basis using the <code class="command">ALTER TABLE SET STATISTICS</code>
  101. command, or globally by setting the
  102. <a class="xref" href="runtime-config-query.html#GUC-DEFAULT-STATISTICS-TARGET">default_statistics_target</a> configuration variable.
  103. The default limit is presently 100 entries. Raising the limit
  104. might allow more accurate planner estimates to be made, particularly for
  105. columns with irregular data distributions, at the price of consuming
  106. more space in <code class="structname">pg_statistic</code> and slightly more
  107. time to compute the estimates. Conversely, a lower limit might be
  108. sufficient for columns with simple data distributions.
  109. </p><p>
  110. Further details about the planner's use of statistics can be found in
  111. <a class="xref" href="planner-stats-details.html" title="Chapter 70. How the Planner Uses Statistics">Chapter 70</a>.
  112. </p></div><div class="sect2" id="PLANNER-STATS-EXTENDED"><div class="titlepage"><div><div><h3 class="title">14.2.2. Extended Statistics</h3></div></div></div><a id="id-1.5.13.5.4.2" class="indexterm"></a><a id="id-1.5.13.5.4.3" class="indexterm"></a><a id="id-1.5.13.5.4.4" class="indexterm"></a><a id="id-1.5.13.5.4.5" class="indexterm"></a><p>
  113. It is common to see slow queries running bad execution plans because
  114. multiple columns used in the query clauses are correlated.
  115. The planner normally assumes that multiple conditions
  116. are independent of each other,
  117. an assumption that does not hold when column values are correlated.
  118. Regular statistics, because of their per-individual-column nature,
  119. cannot capture any knowledge about cross-column correlation.
  120. However, <span class="productname">PostgreSQL</span> has the ability to compute
  121. <em class="firstterm">multivariate statistics</em>, which can capture
  122. such information.
  123. </p><p>
  124. Because the number of possible column combinations is very large,
  125. it's impractical to compute multivariate statistics automatically.
  126. Instead, <em class="firstterm">extended statistics objects</em>, more often
  127. called just <em class="firstterm">statistics objects</em>, can be created to instruct
  128. the server to obtain statistics across interesting sets of columns.
  129. </p><p>
  130. Statistics objects are created using the
  131. <a class="xref" href="sql-createstatistics.html" title="CREATE STATISTICS"><span class="refentrytitle">CREATE STATISTICS</span></a> command.
  132. Creation of such an object merely creates a catalog entry expressing
  133. interest in the statistics. Actual data collection is performed
  134. by <code class="command">ANALYZE</code> (either a manual command, or background
  135. auto-analyze). The collected values can be examined in the
  136. <a class="link" href="catalog-pg-statistic-ext-data.html" title="51.52. pg_statistic_ext_data"><code class="structname">pg_statistic_ext_data</code></a>
  137. catalog.
  138. </p><p>
  139. <code class="command">ANALYZE</code> computes extended statistics based on the same
  140. sample of table rows that it takes for computing regular single-column
  141. statistics. Since the sample size is increased by increasing the
  142. statistics target for the table or any of its columns (as described in
  143. the previous section), a larger statistics target will normally result in
  144. more accurate extended statistics, as well as more time spent calculating
  145. them.
  146. </p><p>
  147. The following subsections describe the kinds of extended statistics
  148. that are currently supported.
  149. </p><div class="sect3" id="id-1.5.13.5.4.11"><div class="titlepage"><div><div><h4 class="title">14.2.2.1. Functional Dependencies</h4></div></div></div><p>
  150. The simplest kind of extended statistics tracks <em class="firstterm">functional
  151. dependencies</em>, a concept used in definitions of database normal forms.
  152. We say that column <code class="structfield">b</code> is functionally dependent on
  153. column <code class="structfield">a</code> if knowledge of the value of
  154. <code class="structfield">a</code> is sufficient to determine the value
  155. of <code class="structfield">b</code>, that is there are no two rows having the same value
  156. of <code class="structfield">a</code> but different values of <code class="structfield">b</code>.
  157. In a fully normalized database, functional dependencies should exist
  158. only on primary keys and superkeys. However, in practice many data sets
  159. are not fully normalized for various reasons; intentional
  160. denormalization for performance reasons is a common example.
  161. Even in a fully normalized database, there may be partial correlation
  162. between some columns, which can be expressed as partial functional
  163. dependency.
  164. </p><p>
  165. The existence of functional dependencies directly affects the accuracy
  166. of estimates in certain queries. If a query contains conditions on
  167. both the independent and the dependent column(s), the
  168. conditions on the dependent columns do not further reduce the result
  169. size; but without knowledge of the functional dependency, the query
  170. planner will assume that the conditions are independent, resulting
  171. in underestimating the result size.
  172. </p><p>
  173. To inform the planner about functional dependencies, <code class="command">ANALYZE</code>
  174. can collect measurements of cross-column dependency. Assessing the
  175. degree of dependency between all sets of columns would be prohibitively
  176. expensive, so data collection is limited to those groups of columns
  177. appearing together in a statistics object defined with
  178. the <code class="literal">dependencies</code> option. It is advisable to create
  179. <code class="literal">dependencies</code> statistics only for column groups that are
  180. strongly correlated, to avoid unnecessary overhead in both
  181. <code class="command">ANALYZE</code> and later query planning.
  182. </p><p>
  183. Here is an example of collecting functional-dependency statistics:
  184. </p><pre class="programlisting">
  185. CREATE STATISTICS stts (dependencies) ON city, zip FROM zipcodes;
  186. ANALYZE zipcodes;
  187. SELECT stxname, stxkeys, stxddependencies
  188. FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
  189. WHERE stxname = 'stts';
  190. stxname | stxkeys | stxddependencies
  191. ---------+---------+------------------------------------------
  192. stts | 1 5 | {"1 =&gt; 5": 1.000000, "5 =&gt; 1": 0.423130}
  193. (1 row)
  194. </pre><p>
  195. Here it can be seen that column 1 (zip code) fully determines column
  196. 5 (city) so the coefficient is 1.0, while city only determines zip code
  197. about 42% of the time, meaning that there are many cities (58%) that are
  198. represented by more than a single ZIP code.
  199. </p><p>
  200. When computing the selectivity for a query involving functionally
  201. dependent columns, the planner adjusts the per-condition selectivity
  202. estimates using the dependency coefficients so as not to produce
  203. an underestimate.
  204. </p><div class="sect4" id="id-1.5.13.5.4.11.7"><div class="titlepage"><div><div><h5 class="title">14.2.2.1.1. Limitations of Functional Dependencies</h5></div></div></div><p>
  205. Functional dependencies are currently only applied when considering
  206. simple equality conditions that compare columns to constant values.
  207. They are not used to improve estimates for equality conditions
  208. comparing two columns or comparing a column to an expression, nor for
  209. range clauses, <code class="literal">LIKE</code> or any other type of condition.
  210. </p><p>
  211. When estimating with functional dependencies, the planner assumes that
  212. conditions on the involved columns are compatible and hence redundant.
  213. If they are incompatible, the correct estimate would be zero rows, but
  214. that possibility is not considered. For example, given a query like
  215. </p><pre class="programlisting">
  216. SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '94105';
  217. </pre><p>
  218. the planner will disregard the <code class="structfield">city</code> clause as not
  219. changing the selectivity, which is correct. However, it will make
  220. the same assumption about
  221. </p><pre class="programlisting">
  222. SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '90210';
  223. </pre><p>
  224. even though there will really be zero rows satisfying this query.
  225. Functional dependency statistics do not provide enough information
  226. to conclude that, however.
  227. </p><p>
  228. In many practical situations, this assumption is usually satisfied;
  229. for example, there might be a GUI in the application that only allows
  230. selecting compatible city and ZIP code values to use in a query.
  231. But if that's not the case, functional dependencies may not be a viable
  232. option.
  233. </p></div></div><div class="sect3" id="id-1.5.13.5.4.12"><div class="titlepage"><div><div><h4 class="title">14.2.2.2. Multivariate N-Distinct Counts</h4></div></div></div><p>
  234. Single-column statistics store the number of distinct values in each
  235. column. Estimates of the number of distinct values when combining more
  236. than one column (for example, for <code class="literal">GROUP BY a, b</code>) are
  237. frequently wrong when the planner only has single-column statistical
  238. data, causing it to select bad plans.
  239. </p><p>
  240. To improve such estimates, <code class="command">ANALYZE</code> can collect n-distinct
  241. statistics for groups of columns. As before, it's impractical to do
  242. this for every possible column grouping, so data is collected only for
  243. those groups of columns appearing together in a statistics object
  244. defined with the <code class="literal">ndistinct</code> option. Data will be collected
  245. for each possible combination of two or more columns from the set of
  246. listed columns.
  247. </p><p>
  248. Continuing the previous example, the n-distinct counts in a
  249. table of ZIP codes might look like the following:
  250. </p><pre class="programlisting">
  251. CREATE STATISTICS stts2 (ndistinct) ON city, state, zip FROM zipcodes;
  252. ANALYZE zipcodes;
  253. SELECT stxkeys AS k, stxdndistinct AS nd
  254. FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
  255. WHERE stxname = 'stts2';
  256. -[ RECORD 1 ]--------------------------------------------------------
  257. k | 1 2 5
  258. nd | {"1, 2": 33178, "1, 5": 33178, "2, 5": 27435, "1, 2, 5": 33178}
  259. (1 row)
  260. </pre><p>
  261. This indicates that there are three combinations of columns that
  262. have 33178 distinct values: ZIP code and state; ZIP code and city;
  263. and ZIP code, city and state (the fact that they are all equal is
  264. expected given that ZIP code alone is unique in this table). On the
  265. other hand, the combination of city and state has only 27435 distinct
  266. values.
  267. </p><p>
  268. It's advisable to create <code class="literal">ndistinct</code> statistics objects only
  269. on combinations of columns that are actually used for grouping, and
  270. for which misestimation of the number of groups is resulting in bad
  271. plans. Otherwise, the <code class="command">ANALYZE</code> cycles are just wasted.
  272. </p></div><div class="sect3" id="id-1.5.13.5.4.13"><div class="titlepage"><div><div><h4 class="title">14.2.2.3. Multivariate MCV Lists</h4></div></div></div><p>
  273. Another type of statistics stored for each column are most-common value
  274. lists. This allows very accurate estimates for individual columns, but
  275. may result in significant misestimates for queries with conditions on
  276. multiple columns.
  277. </p><p>
  278. To improve such estimates, <code class="command">ANALYZE</code> can collect MCV
  279. lists on combinations of columns. Similarly to functional dependencies
  280. and n-distinct coefficients, it's impractical to do this for every
  281. possible column grouping. Even more so in this case, as the MCV list
  282. (unlike functional dependencies and n-distinct coefficients) does store
  283. the common column values. So data is collected only for those groups
  284. of columns appearing together in a statistics object defined with the
  285. <code class="literal">mcv</code> option.
  286. </p><p>
  287. Continuing the previous example, the MCV list for a table of ZIP codes
  288. might look like the following (unlike for simpler types of statistics,
  289. a function is required for inspection of MCV contents):
  290. </p><pre class="programlisting">
  291. CREATE STATISTICS stts3 (mcv) ON city, state FROM zipcodes;
  292. ANALYZE zipcodes;
  293. SELECT m.* FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid),
  294. pg_mcv_list_items(stxdmcv) m WHERE stxname = 'stts3';
  295. index | values | nulls | frequency | base_frequency
  296. -------+------------------------+-------+-----------+----------------
  297. 0 | {Washington, DC} | {f,f} | 0.003467 | 2.7e-05
  298. 1 | {Apo, AE} | {f,f} | 0.003067 | 1.9e-05
  299. 2 | {Houston, TX} | {f,f} | 0.002167 | 0.000133
  300. 3 | {El Paso, TX} | {f,f} | 0.002 | 0.000113
  301. 4 | {New York, NY} | {f,f} | 0.001967 | 0.000114
  302. 5 | {Atlanta, GA} | {f,f} | 0.001633 | 3.3e-05
  303. 6 | {Sacramento, CA} | {f,f} | 0.001433 | 7.8e-05
  304. 7 | {Miami, FL} | {f,f} | 0.0014 | 6e-05
  305. 8 | {Dallas, TX} | {f,f} | 0.001367 | 8.8e-05
  306. 9 | {Chicago, IL} | {f,f} | 0.001333 | 5.1e-05
  307. ...
  308. (99 rows)
  309. </pre><p>
  310. This indicates that the most common combination of city and state is
  311. Washington in DC, with actual frequency (in the sample) about 0.35%.
  312. The base frequency of the combination (as computed from the simple
  313. per-column frequencies) is only 0.0027%, resulting in two orders of
  314. magnitude under-estimates.
  315. </p><p>
  316. It's advisable to create <acronym class="acronym">MCV</acronym> statistics objects only
  317. on combinations of columns that are actually used in conditions together,
  318. and for which misestimation of the number of groups is resulting in bad
  319. plans. Otherwise, the <code class="command">ANALYZE</code> and planning cycles
  320. are just wasted.
  321. </p></div></div></div><div class="navfooter"><hr /><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="using-explain.html">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="performance-tips.html">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="explicit-joins.html">Next</a></td></tr><tr><td width="40%" align="left" valign="top">14.1. Using <code class="command">EXPLAIN</code> </td><td width="20%" align="center"><a accesskey="h" href="index.html">Home</a></td><td width="40%" align="right" valign="top"> 14.3. Controlling the Planner with Explicit <code class="literal">JOIN</code> Clauses</td></tr></table></div></body></html>
上海开阖软件有限公司 沪ICP备12045867号-1