#### Jay Taylor's notes

back to listing index### pdqsort.pdf

[web search]*drive.google.com*)

## Page 1 of 12

Pattern-defeating Quicksort

Orson Peters 1 1 Leiden University, The Netherlands

orsonpeters@gmail.com

March 19, 2017

Years of study have shown quicksort to be the preferred generic in- Work in progress. Please do not cite.

memory sort. An asymptotically optimal average case runtime of

Θ(n log n) with great locality of reference and small memory overhead.

The drawback is a glaring worst case runtime of Θ(n

2

). This has lead

the best sorting algorithms towards hybrid approaches, combining

the fast average runtime of quicksort with other properties, such as

the guaranteed Θ(n log n) worst case of heapsort. In this paper a

novel hybrid sort is introduced that aims to improve the runtime of

quicksort on certain patterns conjectured to be common in real world

data, giving a new best-case runtime of Θ(n). Empirical study shows

that for many inputs pattern-defeating quicksort is faster, and is never

significantly slower.

Introduction

Arguably the most used hybrid sorting algorithm at the time of

writing is introsort2

. A combination of insertion sort, heapsort3 and 2 David Musser. Introspective sorting

and selection algorithms. Software

Practice and Experience, 27:983–993, 1997

3

J. W. J. Williams. Algorithm 232:

Heapsort. Communications of the ACM,

7(6):347–348, 1964

quicksort4

, it is very fast and can be seen as a truly hybrid algorithm.

4 Charles AR Hoare. Quicksort. The

Computer Journal, 5(1):10–16, 1962

The algorithm performs introspection and decides when to change

strategy using some very simple heuristics. If the recursion depth

becomes to deep, it switches to heapsort, and if the partition size

becomes too small it switches to insertion sort.

The goal of pattern-defeating quicksort (or pdqsort) is to improve

on introsort’s heuristics to create a hybrid sorting algorithm with sev- eral desirable properties. It maintains quicksort’s constant5 memory 5 Quicksort and pdqsort use log n

memory to maintain the call stack,

but this is effectively constant for any

practical size.

usage and fast average case, effectively recognizes and combats worst

case behavior (deterministically), and runs in linear time for a few

common patterns. It also unavoidably inherits quicksort’s instability,

so pdqsort can not be used in situations where stability is needed.

I have created a state of the art C++ implementation, and explore

the design techniques used to achieve a practical high speed imple- mentation. The implementation is fully compatible with std::sort

and is released under a permissive license. Standard library writers

are invited to evaluate and adopt the implementation as their generic

unstable sorting algorithm.

## Page 2 of 12

pattern-defeating quicksort 2

A faster solution to the Dutch national flag problem

Faster in this context means that

pdqsort augments quicksort with a

capability to handle few distinct ele- ments efficiently with lower overhead

than previous solutions. It does not

mean that it runs faster than dedicated

solutions to the Dutch nation flag

problem.

A naive quicksort implementation might trigger the Θ(n

2

) worst

case on the all-equal input distribution by placing equal comparing

elements in the same partition. A smarter implementation never

swaps equal elements, resulting in average case performance as equal

elements will be distributed evenly across the partitions. However, an

input with many equal comparing elements is rather common, and

we can do better. Handling equal elements efficiently requires tripar- tite partitioning, which is equivalent to Dijkstra’s Dutch national flag

problem6

.

6 Edsger Wybe Dijkstra. A Discipline of

Programming. Prentice Hall PTR, Upper

Saddle River, NJ, USA, 1st edition, 1997

Pattern-defeating quicksort uses the fast ’approaching pointers’

method7

. Two indices are initialized, i at the start and j at the end 7

Jon L Bentley and M Douglas McIlroy.

Engineering a sort function. Software:

Practice and Experience, 23(11):1249–1265,

1993

of the sequence. i is incremented and j is decremented while main- taining an invariant, and when both invariants are invalidated the

elements at the pointers are swapped, restoring the invariant. The

algorithm ends when the pointers cross. Implementers must take

great care, as this algorithm is conceptually simple, but is very easy

to get wrong.

= < ? > =

Figure 1: The invariant used by Bentley- McIlroy. After partitioning the equal

elements stored at the beginning and at

the end are swapped to the middle.

Bentley and McIlroy describe an invariant for partitioning that

swaps equal elements to the edges of the partition, and swaps them

back into the middle after partitioning. This is efficient when there

are many equal elements, but has a significant drawback. Every

element needs to be explicitly checked for equality to the pivot before

swapping, costing another comparison. This happens regardless of

whether there are many equal elements, costing performance in the

average case.

i j

p ?

p < ? >=

p < >=

< p >=

r

Figure 2: The invariant used by

partition_right of pdqsort, shown

at respectively the initial, halfway and

finished state. When the loop is done

the pivot gets swapped into its correct

position. p is the single pivot element. r

is the pointer returned by the partition

routine indicating the pivot position.

The dotted lines indicate how i and

j change as the algorithm progresses.

This is a simplified representation, e.g. i

is actually off by one.

Unlike previous algorithms, pdqsort’s partitioning scheme is

not self contained. It uses two separate partition functions, one that

groups elements equal to the pivot in the left partition (partition_left),

and one that groups elements equal to the pivot in the right par- tition (partition_right). Note that both partition functions can

always be implemented using a single comparison per element as

a < b ⇔ a b and a ≮ b ⇔ a ≥ b.

For brevity I will be using a simplified C++ implementation to

illustrate pdqsort. It only supports int and compares using compar- ison operators. It is however trivial to extend this to arbitrary types

and custom comparator functions. To pass subsequences around, the

C++ convention is used of one pointer at the start, and one pointer at

one-past-the-end.

The partition functions assume the pivot is the first element, and

that it has been selected as a median of at least three elements in the

subsequence. This saves a bound check in the first iteration.

## Page 3 of 12

pattern-defeating quicksort 3

int* partition_left(int* l, int* r) {

int* i = l; int* j = r;

while (*--j > *l);

if (j + 1 == r) while (i < j && *++i <= *l);

else while ( *++i <= *l);

while (i < j) {

std::swap(*i, *j);

while (*--j > *l);

while (*++i <= *l);

}

std::swap(*l, *j);

return j;

}

int* partition_right(int* l, int* r) {

int* i = l; int* j = r;

while (*++i < *l);

if (i - 1 == l) while (i < j && *--j >= *l);

else while ( *--j >= *l);

// bool no_swaps = (i >= j); (8)

while (i < j) {

std::swap(*i, *j);

while (*++i < *l);

while (*--j >= *l);

}

std::swap(*l, *(i - 1));

return i - 1;

}

Given a subsequence α let us partition it using partition_right:

8 A pre-partitioned subsequence will

perform no swaps. It’s possible to

detect this with a single comparison of

pointers, no_swaps. This is used for a

heuristic later.

α

< p >=

< p q b

If p 6= q we have q > p, and apply partition_right on q, b.

Rename q and b as the left partition: 9 Even though partition_right puts

equal elements in the right partition,

from the perspective of α there will

be no equal elements to p in the right

partition as q > p.

< p q b >9

We apply the above step recursively as long as p 6= q. If at some

point q, b is empty, we can conclude there were no elements equal

to p and the tripartite partitioning was done when we initially par- titioned α. Otherwise, consider p = q. Because of α we know that

∀x ∈ b : x ≥ p. It is easy to see that @x ∈ b : x < q. If we were

to partition q, b using partition_left, any element smaller than or

equal to q would be partitioned left. However, we just concluded that

b can not contain elements smaller than q. Thus, b’s left partition only

contains elements equal to q, and its right partition only contains

elements bigger than q:

< p = q >

This leads to the partitioning algorithm used by pdqsort. If a

subsequence has a predecessor10 p that compares equal to the chosen 10 The element directly preceding it in

the original sequence. This is the pivot

p for an ancestor α. A subsequence that

is leftmost has no predecessor.

pivot q, apply partition_left, otherwise apply partition_right.

No recursion on the left partition of partition_left is needed, as it

contains only equivalent elements.

## Page 4 of 12

pattern-defeating quicksort 4

A linear time best case of pdqsort

Lemma 1. Any predecessor of a subsequence11 was the pivot of an ancestor. 11 Assuming only subsequences passed

into recursive calls of pdqsort.

Proof. If the subsequence is a direct right child of its parent partition,

its predecessor is obviously the pivot of the parent. However, if the

subsequence is the left child of its parent partition, its predecessor

is the predecessor of its parent. Since the base subsequence has a

predecessor it is not leftmost and there must exist some ancestor of

which the subsequence is a right child.

I find it important to take a moment

and defend the concept of a best- case complexity. People have argued

that any algorithm can trivially add

a best-case scenario. (A ’cheat’ that

pdqsort uses as well, for linear time on

ascending sequences.) I would like to

argue that pdqsort has a ’true’ best-case

complexity for sequences with many

equivalent elements. It converges on

linearity smoothly as k goes down, and

even if only parts of sequence have

many equivalent elements, recursive

calls on those parts will run in linear

time. Finally, I argue that sequences

with many equivalent elements are

common, and the overhead pdqsort

uses to achieve linear time on those

sequences is extraordinarily low (a

single comparison of the pivot to the

predecessor per partition operation).

Lemma 2. The first time a distinct value v is selected as a pivot, it can’t be

equal to its predecessor.

Proof. Assume v is equal to its predecessor. By lemma 1 this prede- cessor was the pivot of an ancestor partition. This is a contradiction,

thus v is not equal to its predecessor.

Corollary 2.1. The first time v is selected as a pivot, it is always used to

partition with partition_right, and all elements x such that x = v end

up in the right partition.

Lemma 3. Until an element equal to v is selected as a pivot again, for all

x = v, x must be in the partition directly to the right of v.

Proof. By corollary 2.1, x can not be in a partition to the left of v.

There also can’t be a pivot p 6= v such that v < p < x.

Lemma 4. The second time a value equal to v is selected as a pivot, all x = v

are in the correct position and are not recursed upon any further.

Proof. The second time another element x = v is selected as a pivot,

lemma 3 shows that v must be its predecessor, and thus it is used to

partition with partition_left. In this partitioning step all elements

equal to x (and thus equal to v) end up in the left partition, and are

not further recursed upon. Lemma 3 also provides the conclusion

we have just finished processing all elements equal to v passed to

pdqsort.

Theorem 1. pdqsort has complexity O(nk) when the input distribution has

k distinct values.12 12 Note that this is an upper bound.

When k is big O(n log n) still applies.

Proof. Lemma 4 proves that every distinct value can be selected as a

pivot at most twice, after which every element equal to that value is

sorted correctly. Each partition operation has complexity O(n). The

worst-case runtime is thus O(nk).

## Page 5 of 12

pattern-defeating quicksort 5

An optimistic best case

As a note on the previous page I mentioned the true best-case of

pdqsort, and a ’cheat’. The cheat is specifically aimed at a set of

inputs that I argue are vastly overrepresented in the domain of inputs

to sorting functions: ascending, descending, and ascending with an

arbitrary unsorted element appended. This optimization is due to

Howard Hinnant13s std::sort.

13 Howard Hinnant et al. libc++ C++

standard library. http://libcxx.llvm.

org/, 2016. [Online; accessed 2015]

During the computation of partition_right, it can be detected if

no swaps (a perfect partition) were performed using a single pointer

comparison operation. If this is detected, and the partition was

reasonably balanced14, we will do a partial insertion sort on both 14 I will elaborate on reasonably bal- anced partitions in the worst case

section below.

entire partitions, regardless of their size.

A partial insertion sort works exactly

like regular insertion sort, except it

keeps track of how many elements it

moves. After placing an element in

its correct position it will check if the

number of moves is less than some

limit (a safe estimate I used in my

implementation is 8, to make sure there

is minimum overhead in the case of a

misdiagnosed best case). If not, it will

terminate the sort and report the sort

failed. If the sort completes successfully

this is also reported.

bool partial_insertion_sort(int* l, int* r) {

if (l == r) return true;

int limit = 0;

for (int* cur = l + 1; cur != r; ++cur) {

if (limit > partial_insertion_sort_limit) return false;

int* sift = cur; int* sift_1 = cur - 1;

if (*sift < *sift_1) {

int tmp = *sift;

do { *sift-- = *sift_1; }

while (sift != l && tmp < *--sift_1);

*sift = tmp; limit += cur - sift;

}

}

return true;

}

The effect is that small disturbances in a near-sorted array can

be fixed and a single element at the end can be moved to its correct

position in O(n) time. A descending array will go through one

regular partitioning operation, followed by a perfect partition for

each half, triggering the partial insertion sort15, giving a total runtime 15 This is more or less a happy coinci- dence, as the behavior for a descending

array depends on the "approaching

pointers" partitioning method pdqsort

uses. Furthermore, this also requires

the median-of-3 pivot selection to ac- tually sort the values, rather than just

selecting the median

of O(n) operations.

The overhead for this optimization is beyond measure, as perfect

partitions become exceedingly unlikely for large arrays to happen by

chance. A hostile attacker crafting worst-case inputs is also no better

off. The maximum overhead per partition is n operations, so it dou- bles performance at worst, but this can already be done to pdqsort

by crafting a worst case that degenerates to heapsort. Additionally,

the partial insertion sort is only triggered for a reasonably balanced

partition, forcing an attacker to generate good quicksort progress if

she intends to trigger repeated misdiagnosed partial insertion sorts.

## Page 6 of 12

pattern-defeating quicksort 6

Maintaining a fast average case

Quicksort has a reputation for a fast average case. However, the

difference between an optimized implementation and a naive one

can result in a big reduction in runtime. Unless noted otherwise,

pattern-defeating quicksort implements a median-of-3 quicksort with

all standard optimization techniques.

One of the most well-known tricks is to switch to insertion sort

when recursing on a small16 subarray. But even the insertion sort is 16 Benchmarking gave 24 elements as a

good general cut-off for insertion sort

on various desktop machines. This is

the default value in pattern-defeating

quicksort, but this value is heavily

dependent on cache effects and the cost

of a comparison.

subject to optimization:

void insertion_sort(int* l, int* r) {

if (l == r) return;

for (int* cur = l + 1; cur != r; ++cur) {

int* sift = cur; int* sift_1 = cur - 1;

if (*sift < *sift_1) {

int tmp = *sift;

do { *sift-- = *sift_1; }

while (sift != l && tmp < *--sift_1);

*sift = tmp;

}

}

}

This is not the usual way of implementing insertion sort. We don’t

use swaps, but a series of moves. We move the first check out of

the loop to prevent two moves for an element already positioned

correctly. And we explicitly use sift and sift_1 to ensure that a

compiler will not do needless decrements, even with minimal or no

optimization turned on.

A big17 improvement can be made by eliminating the bounds 17 5-15% from my benchmarks, for

sorting small integers. check in while (sift != l && tmp < *--sift_1). This can be

done for any subsequence that is not leftmost, as there must exist

an element before l that is smaller than or equal to tmp, breaking

the loop. Although this change is tiny, for optimal performance

in a programming language like C (that offers very little metapro- gramming) this requires an entirely new function, often called

unguarded_insertion_sort.

The majority of the time spent in pattern-defeating quicksort is

in partition_right. We already listed its implementation above18

,

18 One optimization that is omitted from

the listed code above is moving *l (the

pivot) into a local variable. This is done

to prevent the compiler from repeatedly

loading it from memory, as it otherwise

might not be able to prove that it does

not change while partitioning.

and it’s of note that great care was taken to ensure that the inner loop

does the least amount of work as possible. Again this is done by not

doing bounds checks in the inner loop, instead separately handling

the first iteration and using elements from previous iterations as

sentinel elements to break the inner loop.