|
- <?xml version="1.0" encoding="UTF-8" standalone="no"?>
- <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>70.1. Row Estimation Examples</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets V1.79.1" /><link rel="prev" href="planner-stats-details.html" title="Chapter 70. How the Planner Uses Statistics" /><link rel="next" href="multivariate-statistics-examples.html" title="70.2. Multivariate Statistics Examples" /></head><body><div xmlns="http://www.w3.org/TR/xhtml1/transitional" class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">70.1. Row Estimation Examples</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="planner-stats-details.html" title="Chapter 70. How the Planner Uses Statistics">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="planner-stats-details.html" title="Chapter 70. How the Planner Uses Statistics">Up</a></td><th width="60%" align="center">Chapter 70. How the Planner Uses Statistics</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 12.4 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="multivariate-statistics-examples.html" title="70.2. Multivariate Statistics Examples">Next</a></td></tr></table><hr></hr></div><div class="sect1" id="ROW-ESTIMATION-EXAMPLES"><div class="titlepage"><div><div><h2 class="title" style="clear: both">70.1. Row Estimation Examples</h2></div></div></div><a id="id-1.10.23.4.2" class="indexterm"></a><p>
- The examples shown below use tables in the <span class="productname">PostgreSQL</span>
- regression test database.
- The outputs shown are taken from version 8.3.
- The behavior of earlier (or later) versions might vary.
- Note also that since <code class="command">ANALYZE</code> uses random sampling
- while producing statistics, the results will change slightly after
- any new <code class="command">ANALYZE</code>.
- </p><p>
- Let's start with a very simple query:
-
- </p><pre class="programlisting">
- EXPLAIN SELECT * FROM tenk1;
-
- QUERY PLAN
- -------------------------------------------------------------
- Seq Scan on tenk1 (cost=0.00..458.00 rows=10000 width=244)
- </pre><p>
-
- How the planner determines the cardinality of <code class="structname">tenk1</code>
- is covered in <a class="xref" href="planner-stats.html" title="14.2. Statistics Used by the Planner">Section 14.2</a>, but is repeated here for
- completeness. The number of pages and rows is looked up in
- <code class="structname">pg_class</code>:
-
- </p><pre class="programlisting">
- SELECT relpages, reltuples FROM pg_class WHERE relname = 'tenk1';
-
- relpages | reltuples
- ----------+-----------
- 358 | 10000
- </pre><p>
-
- These numbers are current as of the last <code class="command">VACUUM</code> or
- <code class="command">ANALYZE</code> on the table. The planner then fetches the
- actual current number of pages in the table (this is a cheap operation,
- not requiring a table scan). If that is different from
- <code class="structfield">relpages</code> then
- <code class="structfield">reltuples</code> is scaled accordingly to
- arrive at a current number-of-rows estimate. In the example above, the value of
- <code class="structfield">relpages</code> is up-to-date so the rows estimate is
- the same as <code class="structfield">reltuples</code>.
- </p><p>
- Let's move on to an example with a range condition in its
- <code class="literal">WHERE</code> clause:
-
- </p><pre class="programlisting">
- EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 1000;
-
- QUERY PLAN
- --------------------------------------------------------------------------------
- Bitmap Heap Scan on tenk1 (cost=24.06..394.64 rows=1007 width=244)
- Recheck Cond: (unique1 < 1000)
- -> Bitmap Index Scan on tenk1_unique1 (cost=0.00..23.80 rows=1007 width=0)
- Index Cond: (unique1 < 1000)
- </pre><p>
-
- The planner examines the <code class="literal">WHERE</code> clause condition
- and looks up the selectivity function for the operator
- <code class="literal"><</code> in <code class="structname">pg_operator</code>.
- This is held in the column <code class="structfield">oprrest</code>,
- and the entry in this case is <code class="function">scalarltsel</code>.
- The <code class="function">scalarltsel</code> function retrieves the histogram for
- <code class="structfield">unique1</code> from
- <code class="structname">pg_statistic</code>. For manual queries it is more
- convenient to look in the simpler <code class="structname">pg_stats</code>
- view:
-
- </p><pre class="programlisting">
- SELECT histogram_bounds FROM pg_stats
- WHERE tablename='tenk1' AND attname='unique1';
-
- histogram_bounds
- ------------------------------------------------------
- {0,993,1997,3050,4040,5036,5957,7057,8029,9016,9995}
- </pre><p>
-
- Next the fraction of the histogram occupied by <span class="quote">“<span class="quote">< 1000</span>”</span>
- is worked out. This is the selectivity. The histogram divides the range
- into equal frequency buckets, so all we have to do is locate the bucket
- that our value is in and count <span class="emphasis"><em>part</em></span> of it and
- <span class="emphasis"><em>all</em></span> of the ones before. The value 1000 is clearly in
- the second bucket (993-1997). Assuming a linear distribution of
- values inside each bucket, we can calculate the selectivity as:
-
- </p><pre class="programlisting">
- selectivity = (1 + (1000 - bucket[2].min)/(bucket[2].max - bucket[2].min))/num_buckets
- = (1 + (1000 - 993)/(1997 - 993))/10
- = 0.100697
- </pre><p>
-
- that is, one whole bucket plus a linear fraction of the second, divided by
- the number of buckets. The estimated number of rows can now be calculated as
- the product of the selectivity and the cardinality of
- <code class="structname">tenk1</code>:
-
- </p><pre class="programlisting">
- rows = rel_cardinality * selectivity
- = 10000 * 0.100697
- = 1007 (rounding off)
- </pre><p>
- </p><p>
- Next let's consider an example with an equality condition in its
- <code class="literal">WHERE</code> clause:
-
- </p><pre class="programlisting">
- EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'CRAAAA';
-
- QUERY PLAN
- ----------------------------------------------------------
- Seq Scan on tenk1 (cost=0.00..483.00 rows=30 width=244)
- Filter: (stringu1 = 'CRAAAA'::name)
- </pre><p>
-
- Again the planner examines the <code class="literal">WHERE</code> clause condition
- and looks up the selectivity function for <code class="literal">=</code>, which is
- <code class="function">eqsel</code>. For equality estimation the histogram is
- not useful; instead the list of <em class="firstterm">most
- common values</em> (<acronym class="acronym">MCV</acronym>s) is used to determine the
- selectivity. Let's have a look at the MCVs, with some additional columns
- that will be useful later:
-
- </p><pre class="programlisting">
- SELECT null_frac, n_distinct, most_common_vals, most_common_freqs FROM pg_stats
- WHERE tablename='tenk1' AND attname='stringu1';
-
- null_frac | 0
- n_distinct | 676
- most_common_vals | {EJAAAA,BBAAAA,CRAAAA,FCAAAA,FEAAAA,GSAAAA,JOAAAA,MCAAAA,NAAAAA,WGAAAA}
- most_common_freqs | {0.00333333,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003}
-
- </pre><p>
-
- Since <code class="literal">CRAAAA</code> appears in the list of MCVs, the selectivity is
- merely the corresponding entry in the list of most common frequencies
- (<acronym class="acronym">MCF</acronym>s):
-
- </p><pre class="programlisting">
- selectivity = mcf[3]
- = 0.003
- </pre><p>
-
- As before, the estimated number of rows is just the product of this with the
- cardinality of <code class="structname">tenk1</code>:
-
- </p><pre class="programlisting">
- rows = 10000 * 0.003
- = 30
- </pre><p>
- </p><p>
- Now consider the same query, but with a constant that is not in the
- <acronym class="acronym">MCV</acronym> list:
-
- </p><pre class="programlisting">
- EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'xxx';
-
- QUERY PLAN
- ----------------------------------------------------------
- Seq Scan on tenk1 (cost=0.00..483.00 rows=15 width=244)
- Filter: (stringu1 = 'xxx'::name)
- </pre><p>
-
- This is quite a different problem: how to estimate the selectivity when the
- value is <span class="emphasis"><em>not</em></span> in the <acronym class="acronym">MCV</acronym> list.
- The approach is to use the fact that the value is not in the list,
- combined with the knowledge of the frequencies for all of the
- <acronym class="acronym">MCV</acronym>s:
-
- </p><pre class="programlisting">
- selectivity = (1 - sum(mvf))/(num_distinct - num_mcv)
- = (1 - (0.00333333 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003 +
- 0.003 + 0.003 + 0.003 + 0.003))/(676 - 10)
- = 0.0014559
- </pre><p>
-
- That is, add up all the frequencies for the <acronym class="acronym">MCV</acronym>s and
- subtract them from one, then
- divide by the number of <span class="emphasis"><em>other</em></span> distinct values.
- This amounts to assuming that the fraction of the column that is not any
- of the MCVs is evenly distributed among all the other distinct values.
- Notice that there are no null values so we don't have to worry about those
- (otherwise we'd subtract the null fraction from the numerator as well).
- The estimated number of rows is then calculated as usual:
-
- </p><pre class="programlisting">
- rows = 10000 * 0.0014559
- = 15 (rounding off)
- </pre><p>
- </p><p>
- The previous example with <code class="literal">unique1 < 1000</code> was an
- oversimplification of what <code class="function">scalarltsel</code> really does;
- now that we have seen an example of the use of MCVs, we can fill in some
- more detail. The example was correct as far as it went, because since
- <code class="structfield">unique1</code> is a unique column it has no MCVs (obviously, no
- value is any more common than any other value). For a non-unique
- column, there will normally be both a histogram and an MCV list, and
- <span class="emphasis"><em>the histogram does not include the portion of the column
- population represented by the MCVs</em></span>. We do things this way because
- it allows more precise estimation. In this situation
- <code class="function">scalarltsel</code> directly applies the condition (e.g.,
- <span class="quote">“<span class="quote">< 1000</span>”</span>) to each value of the MCV list, and adds up the
- frequencies of the MCVs for which the condition is true. This gives
- an exact estimate of the selectivity within the portion of the table
- that is MCVs. The histogram is then used in the same way as above
- to estimate the selectivity in the portion of the table that is not
- MCVs, and then the two numbers are combined to estimate the overall
- selectivity. For example, consider
-
- </p><pre class="programlisting">
- EXPLAIN SELECT * FROM tenk1 WHERE stringu1 < 'IAAAAA';
-
- QUERY PLAN
- ------------------------------------------------------------
- Seq Scan on tenk1 (cost=0.00..483.00 rows=3077 width=244)
- Filter: (stringu1 < 'IAAAAA'::name)
- </pre><p>
-
- We already saw the MCV information for <code class="structfield">stringu1</code>,
- and here is its histogram:
-
- </p><pre class="programlisting">
- SELECT histogram_bounds FROM pg_stats
- WHERE tablename='tenk1' AND attname='stringu1';
-
- histogram_bounds
- --------------------------------------------------------------------------------
- {AAAAAA,CQAAAA,FRAAAA,IBAAAA,KRAAAA,NFAAAA,PSAAAA,SGAAAA,VAAAAA,XLAAAA,ZZAAAA}
- </pre><p>
-
- Checking the MCV list, we find that the condition <code class="literal">stringu1 <
- 'IAAAAA'</code> is satisfied by the first six entries and not the last four,
- so the selectivity within the MCV part of the population is
-
- </p><pre class="programlisting">
- selectivity = sum(relevant mvfs)
- = 0.00333333 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003
- = 0.01833333
- </pre><p>
-
- Summing all the MCFs also tells us that the total fraction of the
- population represented by MCVs is 0.03033333, and therefore the
- fraction represented by the histogram is 0.96966667 (again, there
- are no nulls, else we'd have to exclude them here). We can see
- that the value <code class="literal">IAAAAA</code> falls nearly at the end of the
- third histogram bucket. Using some rather cheesy assumptions
- about the frequency of different characters, the planner arrives
- at the estimate 0.298387 for the portion of the histogram population
- that is less than <code class="literal">IAAAAA</code>. We then combine the estimates
- for the MCV and non-MCV populations:
-
- </p><pre class="programlisting">
- selectivity = mcv_selectivity + histogram_selectivity * histogram_fraction
- = 0.01833333 + 0.298387 * 0.96966667
- = 0.307669
-
- rows = 10000 * 0.307669
- = 3077 (rounding off)
- </pre><p>
-
- In this particular example, the correction from the MCV list is fairly
- small, because the column distribution is actually quite flat (the
- statistics showing these particular values as being more common than
- others are mostly due to sampling error). In a more typical case where
- some values are significantly more common than others, this complicated
- process gives a useful improvement in accuracy because the selectivity
- for the most common values is found exactly.
- </p><p>
- Now let's consider a case with more than one
- condition in the <code class="literal">WHERE</code> clause:
-
- </p><pre class="programlisting">
- EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 1000 AND stringu1 = 'xxx';
-
- QUERY PLAN
- --------------------------------------------------------------------------------
- Bitmap Heap Scan on tenk1 (cost=23.80..396.91 rows=1 width=244)
- Recheck Cond: (unique1 < 1000)
- Filter: (stringu1 = 'xxx'::name)
- -> Bitmap Index Scan on tenk1_unique1 (cost=0.00..23.80 rows=1007 width=0)
- Index Cond: (unique1 < 1000)
- </pre><p>
-
- The planner assumes that the two conditions are independent, so that
- the individual selectivities of the clauses can be multiplied together:
-
- </p><pre class="programlisting">
- selectivity = selectivity(unique1 < 1000) * selectivity(stringu1 = 'xxx')
- = 0.100697 * 0.0014559
- = 0.0001466
-
- rows = 10000 * 0.0001466
- = 1 (rounding off)
- </pre><p>
-
- Notice that the number of rows estimated to be returned from the bitmap
- index scan reflects only the condition used with the index; this is
- important since it affects the cost estimate for the subsequent heap
- fetches.
- </p><p>
- Finally we will examine a query that involves a join:
-
- </p><pre class="programlisting">
- EXPLAIN SELECT * FROM tenk1 t1, tenk2 t2
- WHERE t1.unique1 < 50 AND t1.unique2 = t2.unique2;
-
- QUERY PLAN
- --------------------------------------------------------------------------------------
- Nested Loop (cost=4.64..456.23 rows=50 width=488)
- -> Bitmap Heap Scan on tenk1 t1 (cost=4.64..142.17 rows=50 width=244)
- Recheck Cond: (unique1 < 50)
- -> Bitmap Index Scan on tenk1_unique1 (cost=0.00..4.63 rows=50 width=0)
- Index Cond: (unique1 < 50)
- -> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.00..6.27 rows=1 width=244)
- Index Cond: (unique2 = t1.unique2)
- </pre><p>
-
- The restriction on <code class="structname">tenk1</code>,
- <code class="literal">unique1 < 50</code>,
- is evaluated before the nested-loop join.
- This is handled analogously to the previous range example. This time the
- value 50 falls into the first bucket of the
- <code class="structfield">unique1</code> histogram:
-
- </p><pre class="programlisting">
- selectivity = (0 + (50 - bucket[1].min)/(bucket[1].max - bucket[1].min))/num_buckets
- = (0 + (50 - 0)/(993 - 0))/10
- = 0.005035
-
- rows = 10000 * 0.005035
- = 50 (rounding off)
- </pre><p>
-
- The restriction for the join is <code class="literal">t2.unique2 = t1.unique2</code>.
- The operator is just
- our familiar <code class="literal">=</code>, however the selectivity function is
- obtained from the <code class="structfield">oprjoin</code> column of
- <code class="structname">pg_operator</code>, and is <code class="function">eqjoinsel</code>.
- <code class="function">eqjoinsel</code> looks up the statistical information for both
- <code class="structname">tenk2</code> and <code class="structname">tenk1</code>:
-
- </p><pre class="programlisting">
- SELECT tablename, null_frac,n_distinct, most_common_vals FROM pg_stats
- WHERE tablename IN ('tenk1', 'tenk2') AND attname='unique2';
-
- tablename | null_frac | n_distinct | most_common_vals
- -----------+-----------+------------+------------------
- tenk1 | 0 | -1 |
- tenk2 | 0 | -1 |
- </pre><p>
-
- In this case there is no <acronym class="acronym">MCV</acronym> information for
- <code class="structfield">unique2</code> because all the values appear to be
- unique, so we use an algorithm that relies only on the number of
- distinct values for both relations together with their null fractions:
-
- </p><pre class="programlisting">
- selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/num_distinct1, 1/num_distinct2)
- = (1 - 0) * (1 - 0) / max(10000, 10000)
- = 0.0001
- </pre><p>
-
- This is, subtract the null fraction from one for each of the relations,
- and divide by the maximum of the numbers of distinct values.
- The number of rows
- that the join is likely to emit is calculated as the cardinality of the
- Cartesian product of the two inputs, multiplied by the
- selectivity:
-
- </p><pre class="programlisting">
- rows = (outer_cardinality * inner_cardinality) * selectivity
- = (50 * 10000) * 0.0001
- = 50
- </pre><p>
- </p><p>
- Had there been MCV lists for the two columns,
- <code class="function">eqjoinsel</code> would have used direct comparison of the MCV
- lists to determine the join selectivity within the part of the column
- populations represented by the MCVs. The estimate for the remainder of the
- populations follows the same approach shown here.
- </p><p>
- Notice that we showed <code class="literal">inner_cardinality</code> as 10000, that is,
- the unmodified size of <code class="structname">tenk2</code>. It might appear from
- inspection of the <code class="command">EXPLAIN</code> output that the estimate of
- join rows comes from 50 * 1, that is, the number of outer rows times
- the estimated number of rows obtained by each inner index scan on
- <code class="structname">tenk2</code>. But this is not the case: the join relation size
- is estimated before any particular join plan has been considered. If
- everything is working well then the two ways of estimating the join
- size will produce about the same answer, but due to round-off error and
- other factors they sometimes diverge significantly.
- </p><p>
- For those interested in further details, estimation of the size of
- a table (before any <code class="literal">WHERE</code> clauses) is done in
- <code class="filename">src/backend/optimizer/util/plancat.c</code>. The generic
- logic for clause selectivities is in
- <code class="filename">src/backend/optimizer/path/clausesel.c</code>. The
- operator-specific selectivity functions are mostly found
- in <code class="filename">src/backend/utils/adt/selfuncs.c</code>.
- </p></div><div class="navfooter"><hr /><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="planner-stats-details.html">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="planner-stats-details.html">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="multivariate-statistics-examples.html">Next</a></td></tr><tr><td width="40%" align="left" valign="top">Chapter 70. How the Planner Uses Statistics </td><td width="20%" align="center"><a accesskey="h" href="index.html">Home</a></td><td width="40%" align="right" valign="top"> 70.2. Multivariate Statistics Examples</td></tr></table></div></body></html>
|