Commit 07118037 authored by Robert Haas

Use quicksort, not replacement selection, for external sorting.

We still use replacement selection for the first run of the sort only
and only when the number of tuples is relatively small.  Otherwise,
the first run, and subsequent runs in all cases, are produced using
quicksort.  This tends to be faster except perhaps for very small
amounts of working memory.

Peter Geoghegan, reviewed by Tomas Vondra, Jeff Janes, Mithun Cy,
Greg Stark, and me.
parent 719c84c1
@@ -1472,6 +1472,45 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
<varlistentry id="guc-replacement-sort-tuples" xreflabel="replacement_sort_tuples">
<term><varname>replacement_sort_tuples</varname> (<type>integer</type>)
<indexterm>
<primary><varname>replacement_sort_tuples</> configuration parameter</primary>
</indexterm>
</term>
<listitem>
<para>
When the number of tuples to be sorted is smaller than this number,
a sort will produce its first output run using replacement selection
rather than quicksort. This may be useful in memory-constrained
environments where tuples that are input into larger sort operations
have a strong physical-to-logical correlation. Note that this does
not include input tuples with an <emphasis>inverse</emphasis>
correlation. It is possible for the replacement selection algorithm
to generate one long run that requires no merging, where use of the
default strategy would result in many runs that must be merged
to produce a final sorted output. This may allow sort
operations to complete sooner.
</para>
<para>
The default is 150,000 tuples. Note that higher values are typically
not much more effective, and may be counter-productive, since the
priority queue is sensitive to the size of available CPU cache, whereas
the default strategy sorts runs using a <firstterm>cache
oblivious</firstterm> algorithm. This property allows the default sort
strategy to automatically and transparently make effective use
of available CPU cache.
</para>
<para>
Setting <varname>maintenance_work_mem</varname> to its default
value usually prevents utility command external sorts (e.g.,
sorts used by <command>CREATE INDEX</> to build B-Tree
indexes) from ever using replacement selection sort, unless the
input tuples are quite wide.
</para>
</listitem>
</varlistentry>
<varlistentry id="guc-autovacuum-work-mem" xreflabel="autovacuum_work_mem">
<term><varname>autovacuum_work_mem</varname> (<type>integer</type>)
<indexterm>
......
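The documentation text above says that replacement selection can emit one long run for input that is already well correlated, but gains nothing for inversely correlated input. The following is a minimal sketch of that behaviour, not the tuplesort.c implementation (which is not shown here). All names (`HeapItem`, `count_runs_replacement_selection`, `count_runs_quicksort`) are hypothetical, keys are plain integers, and `cap` stands in for the number of tuples that fit in sort memory.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical illustration only; these names do not exist in PostgreSQL. */
typedef struct
{
    int run;    /* run number this key must be written to */
    int key;    /* sort key */
} HeapItem;

/* Order by run number first, then by key, so the current run drains first. */
static int item_less(HeapItem a, HeapItem b)
{
    if (a.run != b.run)
        return a.run < b.run;
    return a.key < b.key;
}

static void sift_down(HeapItem *h, size_t n, size_t i)
{
    for (;;)
    {
        size_t l = 2 * i + 1, r = l + 1, m = i;

        if (l < n && item_less(h[l], h[m]))
            m = l;
        if (r < n && item_less(h[r], h[m]))
            m = r;
        if (m == i)
            break;
        HeapItem t = h[i];
        h[i] = h[m];
        h[m] = t;
        i = m;
    }
}

/* Number of runs replacement selection emits with a heap of `cap` keys. */
int count_runs_replacement_selection(const int *in, size_t n, size_t cap)
{
    if (n == 0 || cap == 0)
        return 0;

    HeapItem *h = malloc(cap * sizeof *h);
    if (h == NULL)
        return -1;

    size_t hn = 0, next = 0;
    int runs = 0;

    /* Fill the heap with the first `cap` keys, all belonging to run 1. */
    while (hn < cap && next < n)
    {
        h[hn].run = 1;
        h[hn].key = in[next++];
        hn++;
    }
    for (size_t i = hn / 2; i-- > 0;)
        sift_down(h, hn, i);

    while (hn > 0)
    {
        HeapItem top = h[0];    /* smallest key of the earliest run */

        if (top.run > runs)
            runs = top.run;     /* first key written to a new run */

        if (next < n)
        {
            /* A key below the one just written must wait for the next run. */
            int key = in[next++];
            h[0].key = key;
            h[0].run = (key >= top.key) ? top.run : top.run + 1;
        }
        else
            h[0] = h[--hn];     /* input exhausted: shrink the heap */
        sift_down(h, hn, 0);
    }
    free(h);
    return runs;
}

/* Sorting memory-sized batches yields ceil(n / cap) runs for any input order. */
int count_runs_quicksort(size_t n, size_t cap)
{
    return cap == 0 ? 0 : (int) ((n + cap - 1) / cap);
}
```

With ten ascending keys and room for three, the replacement selection sketch reports a single run while the batch strategy reports four. With the same keys in descending order both report four, which matches the note that inverse correlation does not benefit.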
@@ -1432,8 +1432,8 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * total, but we will also need to write and read each tuple once per
  * merge pass. We expect about ceil(logM(r)) merge passes where r is the
  * number of initial runs formed and M is the merge order used by tuplesort.c.
- * Since the average initial run should be about twice sort_mem, we have
- * disk traffic = 2 * relsize * ceil(logM(p / (2*sort_mem)))
+ * Since the average initial run should be about sort_mem, we have
+ * disk traffic = 2 * relsize * ceil(logM(p / sort_mem))
  * cpu = comparison_cost * t * log2(t)
  *
  * If the sort is bounded (i.e., only the first k result tuples are needed)
@@ -1509,7 +1509,7 @@ cost_sort(Path *path, PlannerInfo *root,
  * We'll have to use a disk-based sort of all the tuples
  */
 double npages = ceil(input_bytes / BLCKSZ);
-double nruns = (input_bytes / sort_mem_bytes) * 0.5;
+double nruns = input_bytes / sort_mem_bytes;
 double mergeorder = tuplesort_merge_order(sort_mem_bytes);
 double log_runs;
 double npageaccesses;
......
@@ -109,6 +109,7 @@ bool enableFsync = true;
bool allowSystemTableMods = false;
int work_mem = 1024;
int maintenance_work_mem = 16384;
int replacement_sort_tuples = 150000;
/*
 * Primary determinants of sizes of shared-memory structures.
......
@@ -1928,6 +1928,16 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
{
{"replacement_sort_tuples", PGC_USERSET, RESOURCES_MEM,
gettext_noop("Sets the maximum number of tuples to be sorted using replacement selection."),
gettext_noop("When more tuples than this are present, quicksort will be used.")
},
&replacement_sort_tuples,
150000, 0, INT_MAX,
NULL, NULL, NULL
},
/*
 * We use the hopefully-safely-small value of 100kB as the compiled-in
 * default for max_stack_depth. InitializeGUCOptions will increase it if
......
@@ -125,6 +125,7 @@
# actively intend to use prepared transactions.
#work_mem = 4MB # min 64kB
#maintenance_work_mem = 64MB # min 1MB
#replacement_sort_tuples = 150000 # limits use of replacement selection sort
#autovacuum_work_mem = -1 # min 1MB, or -1 to use maintenance_work_mem
#max_stack_depth = 2MB # min 100kB
#dynamic_shared_memory_type = posix # the default is the first option
......
This diff is collapsed.
@@ -239,6 +239,7 @@ extern bool enableFsync;
extern bool allowSystemTableMods;
extern PGDLLIMPORT int work_mem;
extern PGDLLIMPORT int maintenance_work_mem;
extern PGDLLIMPORT int replacement_sort_tuples;
extern int VacuumCostPageHit;
extern int VacuumCostPageMiss;
......