Commit 52b60530 authored by Tom Lane

Fix tsmatchsel() to account properly for null rows.

ts_typanalyze.c computes MCE statistics as fractions of the non-null rows,
which seems fairly reasonable, and anyway changing it in released versions
wouldn't be a good idea.  But then ts_selfuncs.c has to account for that.
Failure to do so results in overestimates in columns with a significant
fraction of null documents.  Back-patch to 8.4 where this stuff was
introduced.

Jesper Krogh
parent de623f33
@@ -189,11 +189,17 @@ tsquerysel(VariableStatData *vardata, Datum constval)
             /* No most-common-elements info, so do without */
             selec = tsquery_opr_selec_no_stats(query);
         }
+        /*
+         * MCE stats count only non-null rows, so adjust for null rows.
+         */
+        selec *= (1.0 - stats->stanullfrac);
     }
     else
     {
         /* No stats at all, so do without */
         selec = tsquery_opr_selec_no_stats(query);
+        /* we assume no nulls here, so no stanullfrac correction */
     }
     return selec;
...
@@ -246,6 +246,8 @@ typedef FormData_pg_statistic *Form_pg_statistic;
  * type with identifiable elements (for instance, tsvector). staop contains
  * the equality operator appropriate to the element type. stavalues contains
  * the most common element values, and stanumbers their frequencies. Unlike
+ * MCV slots, frequencies are measured as the fraction of non-null rows the
+ * element value appears in, not the frequency of all rows. Also unlike
  * MCV slots, the values are sorted into order (to support binary search
  * for a particular value). Since this puts the minimum and maximum
  * frequencies at unpredictable spots in stanumbers, there are two extra
...
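The correction in this commit can be read as a change of denominator: MCE frequencies are fractions of the non-null rows, while a selectivity estimate is a fraction of all rows. The following is a minimal standalone sketch of that arithmetic, not the PostgreSQL code itself. The function name scale_by_nonnull_fraction is hypothetical, and the sketch assumes that a null document never satisfies the match.

```c
#include <assert.h>

/*
 * Hypothetical helper, not part of PostgreSQL: convert a selectivity
 * expressed as a fraction of the non-null rows into a fraction of all rows.
 *
 * selec_nonnull: estimated fraction of non-null rows that match
 * stanullfrac:   fraction of rows whose column value is null
 */
double
scale_by_nonnull_fraction(double selec_nonnull, double stanullfrac)
{
    /* Both inputs are fractions, so they must lie in [0, 1]. */
    assert(selec_nonnull >= 0.0 && selec_nonnull <= 1.0);
    assert(stanullfrac >= 0.0 && stanullfrac <= 1.0);

    /* Assumption: null rows never match, so only non-null rows contribute. */
    return selec_nonnull * (1.0 - stanullfrac);
}
```

For example, if a lexeme appears in 20% of the non-null documents and half of the rows are null, the sketch yields roughly 0.2 * (1.0 - 0.5) = 0.1 of all rows. Without the scaling the estimate would stay at 0.2, which is the kind of overestimate the commit message describes.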