• Tom Lane's avatar
    Improve planner's cost estimation in the presence of semijoins. · b5572269
    Tom Lane authored
    If we have a semijoin, say
    	SELECT * FROM x WHERE x1 IN (SELECT y1 FROM y)
    and we're estimating the cost of a parameterized indexscan on x, the number
    of repetitions of the indexscan should not be taken as the size of y; it'll
    really only be the number of distinct values of y1, because the only valid
    plan with y on the outside of a nestloop would require y to be unique-ified
    before joining it to x.  Most of the time this doesn't make that much
    difference, but sometimes it can lead to drastically underestimating the
    cost of the indexscan and hence choosing a bad plan, as pointed out by
    David Kubečka.
    
    Fixing this is a bit difficult because parameterized indexscans are costed
    out quite early in the planning process, before we have the information
    that would be needed to call estimate_num_groups() and thereby estimate the
    number of distinct values of the join column(s).  However we can move the
    code that extracts a semijoin RHS's unique-ification columns, so that it's
    done in initsplan.c rather than on-the-fly in create_unique_path().  That
    shouldn't make any difference speed-wise and it's really a bit cleaner too.
    
    The other bit of information we need is the size of the semijoin RHS,
    which is easy if it's a single relation (we make those estimates before
    considering indexscan costs) but problematic if it's a join relation.
    The solution adopted here is just to use the product of the sizes of the
    join component rels.  That will generally be an overestimate, but since
    estimate_num_groups() only uses this input as a clamp, an overestimate
    shouldn't hurt us too badly.  In any case we don't allow this new logic
    to produce a value larger than we would have chosen before, so that at
    worst an overestimate leaves us no wiser than we were before.
    b5572269
outfuncs.c 76.8 KB