Pattern-defeating Quicksort
Orson Peters, Leiden University, The Netherlands
orsonpeters@gmail.com
March 19, 2017
(Work in progress. Please do not cite.)

Years of study have shown quicksort to be the preferred generic in-memory sort: an asymptotically optimal average case runtime of Θ(n log n) with great locality of reference and small memory overhead. The drawback is a glaring worst case runtime of Θ(n²). This has led the best sorting algorithms towards hybrid approaches, combining the fast average runtime of quicksort with other properties, such as the guaranteed Θ(n log n) worst case of heapsort. In this paper a novel hybrid sort is introduced that aims to improve the runtime of quicksort on certain patterns conjectured to be common in real world data, giving a new best-case runtime of Θ(n). Empirical study shows that for many inputs pattern-defeating quicksort is faster, and is never significantly slower.
Introduction
Arguably the most used hybrid sorting algorithm at the time of writing is introsort [2]. A combination of insertion sort, heapsort [3] and quicksort [4], it is very fast and can be seen as a truly hybrid algorithm.

[2] David Musser. Introspective sorting and selection algorithms. Software: Practice and Experience, 27:983–993, 1997.
[3] J. W. J. Williams. Algorithm 232: Heapsort. Communications of the ACM, 7(6):347–348, 1964.
[4] Charles A. R. Hoare. Quicksort. The Computer Journal, 5(1):10–16, 1962.
The algorithm performs introspection and decides when to change strategy using some very simple heuristics. If the recursion depth becomes too deep, it switches to heapsort, and if the partition size becomes too small it switches to insertion sort.
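As an illustration of these heuristics only, a dispatch skeleton might look like the sketch below. This is not Musser's or any standard library's actual code: the function name, the cutoffs and the simple three-way partitioning step are my own simplifications, chosen for brevity.

#include <algorithm>
#include <utility>

// Illustrative introsort-style dispatch: heapsort when the depth budget runs
// out, insertion sort for small partitions, quicksort-style partitioning
// otherwise. Callers would pass a depth budget of roughly 2*log2(n).
void introsort_like(int* l, int* r, int depth_allowed) {
    while (r - l > 24) {                       // small partitions handled below
        if (depth_allowed == 0) {              // recursion too deep: switch to heapsort
            std::make_heap(l, r);
            std::sort_heap(l, r);
            return;
        }
        --depth_allowed;

        // A simple three-way split around a middle element (chosen here only
        // for brevity; real introsort uses a different partitioning scheme).
        int pivot = l[(r - l) / 2];
        int* mid1 = std::partition(l, r, [&](int x) { return x < pivot; });
        int* mid2 = std::partition(mid1, r, [&](int x) { return !(pivot < x); });

        introsort_like(l, mid1, depth_allowed); // recurse on the smaller-than-pivot part
        l = mid2;                               // loop on the greater-than-pivot part
    }

    // Insertion sort for the small remainder.
    if (r - l < 2) return;
    for (int* cur = l + 1; cur < r; ++cur)
        for (int* s = cur; s > l && *s < *(s - 1); --s)
            std::swap(*s, *(s - 1));
}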
The goal of pattern-defeating quicksort (or pdqsort) is to improve on introsort's heuristics to create a hybrid sorting algorithm with several desirable properties. It maintains quicksort's constant [5] memory usage and fast average case, effectively recognizes and combats worst case behavior (deterministically), and runs in linear time for a few common patterns. It also unavoidably inherits quicksort's instability, so pdqsort can not be used in situations where stability is needed.

[5] Quicksort and pdqsort use log n memory to maintain the call stack, but this is effectively constant for any practical size.
I have created a state of the art C++ implementation, and explore the design techniques used to achieve a practical high-speed implementation. The implementation is fully compatible with std::sort and is released under a permissive license. Standard library writers are invited to evaluate and adopt the implementation as their generic unstable sorting algorithm.
A faster solution to the Dutch national flag problem
Faster in this context means that pdqsort augments quicksort with a capability to handle few distinct elements efficiently with lower overhead than previous solutions. It does not mean that it runs faster than dedicated solutions to the Dutch national flag problem.
A naive quicksort implementation might trigger the Θ(n²) worst case on the all-equal input distribution by placing equal comparing elements in the same partition. A smarter implementation never swaps equal elements, resulting in average case performance as equal elements will be distributed evenly across the partitions. However, an input with many equal comparing elements is rather common, and we can do better. Handling equal elements efficiently requires tripartite partitioning, which is equivalent to Dijkstra's Dutch national flag problem [6].

[6] Edsger Wybe Dijkstra. A Discipline of Programming. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1st edition, 1997.
Pattern-defeating quicksort uses the fast 'approaching pointers' method [7]. Two indices are initialized, i at the start and j at the end of the sequence. i is incremented and j is decremented, each while maintaining its invariant; when both invariants are invalidated the elements at the pointers are swapped, restoring the invariants. The algorithm ends when the pointers cross. Implementers must take great care, as this algorithm is conceptually simple, but is very easy to get wrong.

[7] Jon L. Bentley and M. Douglas McIlroy. Engineering a sort function. Software: Practice and Experience, 23(11):1249–1265, 1993.
[Figure 1: The invariant used by Bentley-McIlroy, with the array laid out as  = | < | ? | > | = . After partitioning, the equal elements stored at the beginning and at the end are swapped to the middle.]
Bentley and McIlroy describe an invariant for partitioning that
swaps equal elements to the edges of the partition, and swaps them
back into the middle after partitioning. This is efficient when there
are many equal elements, but has a significant drawback. Every
element needs to be explicitly checked for equality to the pivot before
swapping, costing another comparison. This happens regardless of
whether there are many equal elements, costing performance in the
average case.
[Figure 2: The invariant used by partition_right of pdqsort, shown at respectively the initial state  p | ? , the halfway state  p | < | ? | ≥ , and the finished state  p | < | ≥ . When the loop is done the pivot gets swapped into its correct position, giving  < | p | ≥ . p is the single pivot element. r is the pointer returned by the partition routine indicating the pivot position. i and j move towards each other as the algorithm progresses. This is a simplified representation, e.g. i is actually off by one.]
Unlike previous algorithms, pdqsort's partitioning scheme is not self contained. It uses two separate partition functions, one that groups elements equal to the pivot in the left partition (partition_left), and one that groups elements equal to the pivot in the right partition (partition_right). Note that both partition functions can always be implemented using a single comparison per element, as a < b ⇔ a ≱ b and a ≮ b ⇔ a ≥ b.
For brevity I will be using a simplified C++ implementation to illustrate pdqsort. It only supports int and compares using comparison operators. It is however trivial to extend this to arbitrary types and custom comparator functions. To pass subsequences around, the C++ convention is used of one pointer at the start, and one pointer at one-past-the-end.

The partition functions assume the pivot is the first element, and that it has been selected as a median of at least three elements in the subsequence. This saves a bound check in the first iteration.
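The paper does not list its pivot selection at this point. Purely as an illustration of this precondition, a median-of-3 selection that leaves the pivot at *l might look as follows; the helper names and the sampled positions are my assumptions, not the paper's code.

#include <utility>

// Sketch: sort three values so that *a <= *b <= *c.
inline void sort3(int* a, int* b, int* c) {
    if (*b < *a) std::swap(*a, *b);
    if (*c < *b) std::swap(*b, *c);
    if (*b < *a) std::swap(*a, *b);
}

// Sketch: pick the median of the first, middle and last element of [l, r) and
// move it to the front, establishing the precondition of the partition
// functions listed below.
inline void place_median_of_3_first(int* l, int* r) {
    int* mid = l + (r - l) / 2;
    sort3(l, mid, r - 1);     // the median of the three samples ends up at mid
    std::swap(*l, *mid);      // move it to the front
}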
#include <utility>  // std::swap

// partition_left: groups elements equal to the pivot (*l) in the left
// partition. Elements <= pivot go left, elements > pivot go right.
int* partition_left(int* l, int* r) {
    int* i = l; int* j = r;

    // Find the first element from the right that belongs left (<= pivot).
    // No bound check needed: the pivot itself stops this loop at the latest.
    while (*--j > *l);

    // If j stopped on the last element, no element greater than the pivot is
    // known yet, so guard the i loop; otherwise the element just right of j
    // is greater than the pivot and acts as a sentinel.
    if (j + 1 == r) while (i < j && *++i <= *l);
    else            while (         *++i <= *l);

    // Swap misplaced pairs, using the previously swapped elements as sentinels.
    while (i < j) {
        std::swap(*i, *j);
        while (*--j > *l);
        while (*++i <= *l);
    }

    // Move the pivot to its final position and report it.
    std::swap(*l, *j);
    return j;
}
// partition_right: groups elements equal to the pivot (*l) in the right
// partition. Elements < pivot go left, elements >= pivot go right.
int* partition_right(int* l, int* r) {
    int* i = l; int* j = r;

    // Find the first element from the left that belongs right (>= pivot). The
    // pivot is a median of at least three elements, so such an element exists
    // further right and stops this loop without a bound check.
    while (*++i < *l);

    // If the i loop stopped immediately, no element smaller than the pivot is
    // known yet, so guard the j loop; otherwise *(i - 1) < pivot acts as a
    // sentinel.
    if (i - 1 == l) while (i < j && *--j >= *l);
    else            while (         *--j >= *l);

    // bool no_swaps = (i >= j);   // see note [8]: detects a pre-partitioned input

    // Swap misplaced pairs, using the previously swapped elements as sentinels.
    while (i < j) {
        std::swap(*i, *j);
        while (*++i < *l);
        while (*--j >= *l);
    }

    // Move the pivot to its final position and report it.
    std::swap(*l, *(i - 1));
    return i - 1;
}
[8] A pre-partitioned subsequence will perform no swaps. It's possible to detect this with a single comparison of pointers, no_swaps. This is used for a heuristic later.

Given a subsequence α, let us partition it using partition_right, then select a pivot q from the resulting right partition and call its remaining elements b:

    α:   <  |  p  |  ≥
         <  |  p  |  q, b

If p ≠ q we have q > p, and apply partition_right on q, b. Rename the left partition of that step as q, b [9]:

    <  |  p  |  q, b  |  >

[9] Even though partition_right puts equal elements in the right partition, from the perspective of α there will be no elements equal to p in the right partition, as q > p.

We apply the above step recursively as long as p ≠ q. If at some point q, b is empty, we can conclude there were no elements equal to p and the tripartite partitioning was done when we initially partitioned α. Otherwise, consider p = q. Because of how α was partitioned we know that ∀x ∈ b : x ≥ p. It is easy to see that ∄x ∈ b : x < q. If we were to partition q, b using partition_left, any element smaller than or equal to q would be partitioned left. However, we just concluded that b can not contain elements smaller than q. Thus, b's left partition only contains elements equal to q, and its right partition only contains elements bigger than q:

    <  |  p  |  =  |  q  |  >

This leads to the partitioning algorithm used by pdqsort. If a subsequence has a predecessor [10] p that compares equal to the chosen pivot q, apply partition_left, otherwise apply partition_right. No recursion on the left partition of partition_left is needed, as it contains only equivalent elements.

[10] The element directly preceding it in the original sequence. This is the pivot p for an ancestor α. A subsequence that is leftmost has no predecessor.
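To make this dispatch concrete, here is a minimal sketch of the recursion the rule implies. It is not the paper's implementation: pivot selection is delegated to the place_median_of_3_first helper sketched earlier, the heapsort fallback and the no-swaps heuristic are omitted, and the name pdqsort_sketch is my own.

int* partition_left(int* l, int* r);            // listed above
int* partition_right(int* l, int* r);           // listed above
void insertion_sort(int* l, int* r);            // listed later in the paper
void place_median_of_3_first(int* l, int* r);   // assumed helper from the earlier sketch

// Sketch of the recursion implied by the rule above. Initial call:
// pdqsort_sketch(a, a + n, true).
void pdqsort_sketch(int* l, int* r, bool leftmost) {
    if (r - l <= 24) {               // small subsequences: insertion sort (cutoff from the paper)
        insertion_sort(l, r);
        return;
    }

    place_median_of_3_first(l, r);   // the pivot ends up at *l

    // Predecessors are former pivots, so *(l - 1) <= *l always holds; if the
    // predecessor is not smaller, it compares equal to the chosen pivot.
    if (!leftmost && !(*(l - 1) < *l)) {
        int* m = partition_left(l, r);
        pdqsort_sketch(m + 1, r, false);    // the left partition holds only equal elements
        return;
    }

    int* m = partition_right(l, r);
    pdqsort_sketch(l, m, leftmost);         // keeps the same predecessor (if any)
    pdqsort_sketch(m + 1, r, false);        // its predecessor is the pivot just placed
}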
A linear time best case of pdqsort
Lemma 1. Any predecessor of a subsequence [11] was the pivot of an ancestor.

[11] Assuming only subsequences passed into recursive calls of pdqsort.

Proof. If the subsequence is a direct right child of its parent partition, its predecessor is obviously the pivot of the parent. However, if the subsequence is the left child of its parent partition, its predecessor is the predecessor of its parent. Since the base subsequence has a predecessor it is not leftmost and there must exist some ancestor of which the subsequence is a right child.
I find it important to take a moment and defend the concept of a best-case complexity. People have argued that any algorithm can trivially add a best-case scenario. (A 'cheat' that pdqsort uses as well, for linear time on ascending sequences.) I would like to argue that pdqsort has a 'true' best-case complexity for sequences with many equivalent elements. It converges on linearity smoothly as k goes down, and even if only parts of the sequence have many equivalent elements, recursive calls on those parts will run in linear time. Finally, I argue that sequences with many equivalent elements are common, and the overhead pdqsort uses to achieve linear time on those sequences is extraordinarily low (a single comparison of the pivot to the predecessor per partition operation).
Lemma 2. The first time a distinct value v is selected as a pivot, it can't be equal to its predecessor.

Proof. Assume v is equal to its predecessor. By lemma 1 this predecessor was the pivot of an ancestor partition, so a value equal to v was already selected as a pivot before, contradicting that this is the first time. Thus v is not equal to its predecessor.
Corollary 2.1. The first time v is selected as a pivot, it is always used to
partition with partition_right, and all elements x such that x = v end
up in the right partition.
Lemma 3. Until an element equal to v is selected as a pivot again, for all x = v, x must be in the partition directly to the right of v.

Proof. By corollary 2.1, x can not be in a partition to the left of v. There also can't be a pivot p ≠ v such that v < p < x.
Lemma 4. The second time a value equal to v is selected as a pivot, all x = v are in the correct position and are not recursed upon any further.

Proof. The second time an element x = v is selected as a pivot, lemma 3 shows that v must be its predecessor, and thus it is used to partition with partition_left. In this partitioning step all elements equal to x (and thus equal to v) end up in the left partition, and are not further recursed upon. Lemma 3 also provides the conclusion that we have just finished processing all elements equal to v passed to pdqsort.
Theorem 1. pdqsort has complexity O(nk) when the input distribution has k distinct values. [12]

[12] Note that this is an upper bound. When k is big O(n log n) still applies.

Proof. Lemma 4 proves that every distinct value can be selected as a pivot at most twice, after which every element equal to that value is sorted correctly. Each partition operation has complexity O(n). The worst-case runtime is thus O(nk).
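As a concrete illustration (my example, not from the paper): sorting an array of n booleans means k = 2, so each of the two distinct values can be chosen as a pivot at most twice, only a constant number of O(n) partition passes are performed, and the sort finishes in O(n) time regardless of the initial order of the input.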
An optimistic best case
As a note on the previous page I mentioned the true best-case of pdqsort, and a 'cheat'. The cheat is specifically aimed at a set of inputs that I argue are vastly overrepresented in the domain of inputs to sorting functions: ascending, descending, and ascending with an arbitrary unsorted element appended. This optimization is due to Howard Hinnant's std::sort [13].

[13] Howard Hinnant et al. libc++ C++ standard library. http://libcxx.llvm.org/, 2016. [Online; accessed 2015]
During the computation of partition_right, it can be detected if no swaps (a perfect partition) were performed using a single pointer comparison operation. If this is detected, and the partition was reasonably balanced [14], we will do a partial insertion sort on both entire partitions, regardless of their size.

[14] I will elaborate on reasonably balanced partitions in the worst case section below.

A partial insertion sort works exactly like regular insertion sort, except it keeps track of how many elements it moves. After placing an element in its correct position it will check if the number of moves is less than some limit (a safe estimate I used in my implementation is 8, to make sure there is minimum overhead in the case of a misdiagnosed best case). If not, it will terminate the sort and report that the sort failed. If the sort completes successfully this is also reported.
// Limit on the total number of moves before the partial sort gives up; the
// value 8 is the estimate mentioned above.
const int partial_insertion_sort_limit = 8;

// Attempts to sort [l, r). Returns false if more than
// partial_insertion_sort_limit elements were moved, aborting the sort.
bool partial_insertion_sort(int* l, int* r) {
    if (l == r) return true;

    int limit = 0;
    for (int* cur = l + 1; cur != r; ++cur) {
        if (limit > partial_insertion_sort_limit) return false;

        int* sift = cur; int* sift_1 = cur - 1;
        if (*sift < *sift_1) {
            int tmp = *sift;
            do { *sift-- = *sift_1; }
            while (sift != l && tmp < *--sift_1);
            *sift = tmp; limit += cur - sift;
        }
    }
    return true;
}
The effect is that small disturbances in a near-sorted array can be fixed and a single element at the end can be moved to its correct position in O(n) time. A descending array will go through one regular partitioning operation, followed by a perfect partition for each half, triggering the partial insertion sort [15], giving a total runtime of O(n) operations.

[15] This is more or less a happy coincidence, as the behavior for a descending array depends on the 'approaching pointers' partitioning method pdqsort uses. Furthermore, this also requires the median-of-3 pivot selection to actually sort the values, rather than just selecting the median.
The overhead of this optimization is too small to measure, as perfect partitions become exceedingly unlikely to happen by chance for large arrays. A hostile attacker crafting worst-case inputs is also no better off. The maximum overhead per partition is n operations, so at worst it doubles the runtime, but this can already be done to pdqsort by crafting a worst case that degenerates to heapsort. Additionally, the partial insertion sort is only triggered for a reasonably balanced partition, forcing an attacker to generate good quicksort progress if she intends to trigger repeated misdiagnosed partial insertion sorts.
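To show where this heuristic sits in the algorithm, here is a sketch of how it might be wired into the recursion. It is not the paper's code: partition_right_with_flag stands in for a variant of partition_right that also reports the no_swaps flag from note [8], pdqsort_recurse stands in for the main recursion, and the n/8 balance threshold is an assumption, since the paper defines "reasonably balanced" only in its later worst-case section.

struct pr_result {
    int* pivot_pos;   // final position of the pivot
    bool no_swaps;    // true if partitioning performed no swaps (note [8])
};

pr_result partition_right_with_flag(int* l, int* r);   // assumed variant of partition_right
bool partial_insertion_sort(int* l, int* r);            // listed above
void pdqsort_recurse(int* l, int* r);                   // assumed recursion entry point

void handle_partition(int* l, int* r) {
    pr_result res = partition_right_with_flag(l, r);
    int* m = res.pivot_pos;

    // Assumed reading of "reasonably balanced": neither side is smaller than
    // an eighth of the subsequence.
    bool balanced = (m - l) >= (r - l) / 8 && (r - (m + 1)) >= (r - l) / 8;

    // On a perfect, balanced partition, optimistically try to finish both
    // halves with the partial insertion sort; fall back to recursion if it
    // bails out.
    if (res.no_swaps && balanced &&
        partial_insertion_sort(l, m) && partial_insertion_sort(m + 1, r))
        return;

    pdqsort_recurse(l, m);
    pdqsort_recurse(m + 1, r);
}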
Maintaining a fast average case
Quicksort has a reputation for a fast average case. However, the
difference between an optimized implementation and a naive one
can result in a big reduction in runtime. Unless noted otherwise,
pattern-defeating quicksort implements a median-of-3 quicksort with
all standard optimization techniques.
One of the most well-known tricks is to switch to insertion sort when recursing on a small [16] subarray. But even the insertion sort is subject to optimization:

[16] Benchmarking gave 24 elements as a good general cut-off for insertion sort on various desktop machines. This is the default value in pattern-defeating quicksort, but this value is heavily dependent on cache effects and the cost of a comparison.
void insertion_sort(int* l, int* r) {
    if (l == r) return;

    for (int* cur = l + 1; cur != r; ++cur) {
        int* sift = cur; int* sift_1 = cur - 1;

        // Compare first to avoid two moves for an element already in place.
        if (*sift < *sift_1) {
            int tmp = *sift;
            do { *sift-- = *sift_1; }
            while (sift != l && tmp < *--sift_1);
            *sift = tmp;
        }
    }
}
This is not the usual way of implementing insertion sort. We don't use swaps, but a series of moves. We move the first check out of the loop to prevent two moves for an element already positioned correctly. And we explicitly use sift and sift_1 to ensure that a compiler will not do needless decrements, even with minimal or no optimization turned on.

A big [17] improvement can be made by eliminating the bounds check in while (sift != l && tmp < *--sift_1). This can be done for any subsequence that is not leftmost, as there must exist an element before l that is smaller than or equal to tmp, breaking the loop. Although this change is tiny, for optimal performance in a programming language like C (that offers very little metaprogramming) this requires an entirely new function, often called unguarded_insertion_sort.

[17] 5-15% from my benchmarks, for sorting small integers.
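For completeness, here is what such a function looks like when derived from the insertion sort listed above by dropping the bounds check. This is my rendering of the described change, not a listing from the paper.

// Sketch: insertion sort for subsequences that are not leftmost. Because some
// element before l compares <= tmp, the inner loop needs no "sift != l" check.
void unguarded_insertion_sort(int* l, int* r) {
    if (l == r) return;

    for (int* cur = l + 1; cur != r; ++cur) {
        int* sift = cur; int* sift_1 = cur - 1;

        if (*sift < *sift_1) {
            int tmp = *sift;
            do { *sift-- = *sift_1; }
            while (tmp < *--sift_1);   // an element before l acts as a sentinel
            *sift = tmp;
        }
    }
}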
The majority of the time spent in pattern-defeating quicksort is in partition_right. We already listed its implementation above [18], and it is of note that great care was taken to ensure that the inner loop does as little work as possible. Again this is done by not doing bounds checks in the inner loop, instead separately handling the first iteration and using elements from previous iterations as sentinel elements to break the inner loop.

[18] One optimization that is omitted from the listed code above is moving *l (the pivot) into a local variable. This is done to prevent the compiler from repeatedly loading it from memory, as it otherwise might not be able to prove that it does not change while partitioning.
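As an illustration of the optimization mentioned in note [18], a version of the listed partition_right with the pivot held in a local variable might look as follows. This is a sketch of the stated change only, not the paper's production code, and the function name is my own.

#include <utility>

// Sketch: partition_right as listed earlier, but with the pivot copied into a
// local so the compiler never has to reload it from memory while partitioning.
int* partition_right_local_pivot(int* l, int* r) {
    int pivot = *l;                 // hoisted pivot (see note [18])
    int* i = l; int* j = r;

    while (*++i < pivot);

    if (i - 1 == l) while (i < j && *--j >= pivot);
    else            while (         *--j >= pivot);

    while (i < j) {
        std::swap(*i, *j);
        while (*++i < pivot);
        while (*--j >= pivot);
    }

    std::swap(*l, *(i - 1));
    return i - 1;
}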