Commit 0c882e52 authored by Tom Lane's avatar Tom Lane

Improve ineq_histogram_selectivity's behavior for non-default orderings.

ineq_histogram_selectivity() can be invoked in situations where the
ordering we care about is not that of the column's histogram.  We could
be considering some other collation, or even more drastically, the
query operator might not agree at all with what was used to construct
the histogram.  (We'll get here for anything using scalarineqsel-based
estimators, so that's quite likely to happen for extension operators.)

Up to now we just ignored this issue and assumed we were dealing with
an operator/collation whose sort order exactly matches the histogram,
possibly resulting in junk estimates if the binary search gets confused.
It's past time to improve that, since the use of nondefault collations
is increasing.  What we can do is verify that the given operator and
collation match what's recorded in pg_statistic, and use the existing
code only if so.  When they don't match, instead execute the operator
against each histogram entry, and take the fraction of successes as our
selectivity estimate.  This gives an estimate that is probably good to
about 1/histogram_size, with no assumptions about ordering.  (The quality
of the estimate is likely to degrade near the ends of the value range,
since the two orderings probably don't agree on what is an extremal value;
but this is surely going to be more reliable than what we did before.)

At some point we might further improve matters by storing more than one
histogram calculated according to different orderings.  But this code
would still be good fallback logic when no matches exist, so that is
not an argument for not doing this.

While here, also improve get_variable_range() to deal more honestly
with non-default collations.

This isn't back-patchable, because it requires adding another argument
to ineq_histogram_selectivity, and because it might have significant
impact on the estimation results for extension operators relying on
scalarineqsel --- mostly for the better, one hopes, but in any case
destabilizing plan choices in back branches is best avoided.

Per investigation of a report from James Lucas.

Discussion: https://postgr.es/m/CAAFmbbOvfi=wMM=3qRsPunBSLb8BFREno2oOzSBS=mzfLPKABw@mail.gmail.com
parent 87fb04af
......@@ -1217,7 +1217,7 @@ prefix_selectivity(PlannerInfo *root, VariableStatData *vardata,
fmgr_info(get_opcode(geopr), &opproc);
prefixsel = ineq_histogram_selectivity(root, vardata,
&opproc, true, true,
geopr, &opproc, true, true,
collation,
prefixcon->constvalue,
prefixcon->consttype);
......@@ -1238,7 +1238,7 @@ prefix_selectivity(PlannerInfo *root, VariableStatData *vardata,
Selectivity topsel;
topsel = ineq_histogram_selectivity(root, vardata,
&opproc, false, false,
ltopr, &opproc, false, false,
collation,
greaterstrcon->constvalue,
greaterstrcon->consttype);
......
This diff is collapsed.
......@@ -731,6 +731,55 @@ equality_ops_are_compatible(Oid opno1, Oid opno2)
return result;
}
/*
* comparison_ops_are_compatible
* Return true if the two given comparison operators have compatible
* semantics.
*
* This is trivially true if they are the same operator. Otherwise,
* we look to see if they can be found in the same btree opfamily.
* For example, '<' and '>=' ops match if they belong to the same family.
*
* (This is identical to equality_ops_are_compatible(), except that we
* don't bother to examine hash opclasses.)
*/
bool
comparison_ops_are_compatible(Oid opno1, Oid opno2)
{
bool result;
CatCList *catlist;
int i;
/* Easy if they're the same operator */
if (opno1 == opno2)
return true;
/*
* We search through all the pg_amop entries for opno1.
*/
catlist = SearchSysCacheList1(AMOPOPID, ObjectIdGetDatum(opno1));
result = false;
for (i = 0; i < catlist->n_members; i++)
{
HeapTuple op_tuple = &catlist->members[i]->tuple;
Form_pg_amop op_form = (Form_pg_amop) GETSTRUCT(op_tuple);
if (op_form->amopmethod == BTREE_AM_OID)
{
if (op_in_opfamily(opno2, op_form->amopfamily))
{
result = true;
break;
}
}
}
ReleaseSysCacheList(catlist);
return result;
}
/* ---------- AMPROC CACHES ---------- */
......@@ -3028,19 +3077,6 @@ get_attstatsslot(AttStatsSlot *sslot, HeapTuple statstuple,
sslot->staop = (&stats->staop1)[i];
sslot->stacoll = (&stats->stacoll1)[i];
/*
* XXX Hopefully-temporary hack: if stacoll isn't set, inject the default
* collation. This won't matter for non-collation-aware datatypes. For
* those that are, this covers cases where stacoll has not been set. In
* the short term we need this because some code paths involving type NAME
* do not pass any collation to prefix_selectivity and related functions.
* Even when that's been fixed, it's likely that some add-on typanalyze
* functions won't get the word right away about filling stacoll during
* ANALYZE, so we'll probably need this for awhile.
*/
if (sslot->stacoll == InvalidOid)
sslot->stacoll = DEFAULT_COLLATION_OID;
if (flags & ATTSTATSSLOT_VALUES)
{
val = SysCacheGetAttr(STATRELATTINH, statstuple,
......
......@@ -82,6 +82,7 @@ extern bool get_op_hash_functions(Oid opno,
RegProcedure *lhs_procno, RegProcedure *rhs_procno);
extern List *get_op_btree_interpretation(Oid opno);
extern bool equality_ops_are_compatible(Oid opno1, Oid opno2);
extern bool comparison_ops_are_compatible(Oid opno1, Oid opno2);
extern Oid get_opfamily_proc(Oid opfamily, Oid lefttype, Oid righttype,
int16 procnum);
extern char *get_attname(Oid relid, AttrNumber attnum, bool missing_ok);
......
......@@ -159,7 +159,8 @@ extern double generic_restriction_selectivity(PlannerInfo *root,
double default_selectivity);
extern double ineq_histogram_selectivity(PlannerInfo *root,
VariableStatData *vardata,
FmgrInfo *opproc, bool isgt, bool iseq,
Oid opoid, FmgrInfo *opproc,
bool isgt, bool iseq,
Oid collation,
Datum constval, Oid consttype);
extern double var_eq_const(VariableStatData *vardata,
......
......@@ -191,7 +191,10 @@ CREATE TABLE atest12 as
SELECT x AS a, 10001 - x AS b FROM generate_series(1,10000) x;
CREATE INDEX ON atest12 (a);
CREATE INDEX ON atest12 (abs(a));
-- results below depend on having quite accurate stats for atest12
SET default_statistics_target = 10000;
VACUUM ANALYZE atest12;
RESET default_statistics_target;
CREATE FUNCTION leak(integer,integer) RETURNS boolean
AS $$begin return $1 < $2; end$$
LANGUAGE plpgsql immutable;
......
......@@ -136,7 +136,10 @@ CREATE TABLE atest12 as
SELECT x AS a, 10001 - x AS b FROM generate_series(1,10000) x;
CREATE INDEX ON atest12 (a);
CREATE INDEX ON atest12 (abs(a));
-- results below depend on having quite accurate stats for atest12
SET default_statistics_target = 10000;
VACUUM ANALYZE atest12;
RESET default_statistics_target;
CREATE FUNCTION leak(integer,integer) RETURNS boolean
AS $$begin return $1 < $2; end$$
......
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment