**Cooking with Python, Part 1**

Pages: 1, **2**

## Recipe 5.10: Selecting the nth Smallest Element of a Sequence

*Credit: Raymond Hettinger, David Eppstein, Shane Holloway,
Chris Perkins*

### Problem

You need to get from a sequence the
*n*th item in rank order (e.g., the middle
item, known as the *median*). If the sequence
was sorted, you would just use
`seq[`

*n*`]`

.
But the sequence *isn't* sorted,
and you wonder if you can do better than just sorting it first.

### Solution

Perhaps you can do better, if the sequence is big, has been shuffled
enough, and comparisons between its items are costly. Sort is very
fast, but in the end (when applied to a thoroughly shuffled sequence
of length *n*) it always takes
`O(`

*n*`log`

*n*`)`

time,
while there exist algorithms that can be used to get the
*n*th smallest element in time
`O(`

*n*`)`

.
Here is a function with a solid implementation of such an algorithm:

```
import random
def select(data, n):
" Find the nth rank ordered element (the least value has rank 0). "
# make a new list, deal with <0 indices, check for valid index
data = list(data)
if n<0:
n += len(data)
if not 0 <= n < len(data):
raise ValueError, "can't get rank %d out of %d" % (n, len(data))
# main loop, quicksort-like but with no need for recursion
while True:
pivot = random.choice(data)
pcount = 0
under, over = [ ], [ ]
uappend, oappend = under.append, over.append
for elem in data:
if elem < pivot:
uappend(elem)
elif elem > pivot:
oappend(elem)
else:
pcount += 1
numunder = len(under)
if n < numunder:
data = under
elif n < numunder + pcount:
return pivot
else:
data = over
n -= numunder + pcount
```

Related Reading Python Cookbook |

### Discussion

This recipe is meant for cases in which repetitions
*count*. For example, the median of the list
`[1, 1, 1, 2, 3]`

is `1`

because
that is the third one of the five items in rank order. If, for some
strange reason, you want to discount duplications, you need to reduce
the list to its unique items first (e.g., by applying the Recipe 18.1), after which you may
want to come back to this recipe.

Input argument
*data* can be any bounded iterable; the
recipe starts by calling `list`

on it to ensure
that. The algorithm then loops, implementing at each leg a few key
ideas: randomly choosing a *pivot element*;
slicing up the list into two parts, made up of the items that are
"under" and
"over" the pivot respectively;
continuing work for the next leg on just one of the two parts, since
we can tell which one of them the
*n*^{th} element
will be in, and the other part can safely be ignored. The ideas are
very close to that in the classic algorithm known as
*quicksort* (except that quicksort cannot ignore
either part, and thus must use recursion, or recursion-removal
techniques such as keeping an explicit stack, to make sure it deals
with both parts).

The random choice of pivot makes the algorithm
robust against unfavorable data orderings (the kind that wreak havoc
with naive quicksort); this implementation decision costs about
`log`

_{2}N calls to
`random.choice`

. Another implementation issue worth
pointing out is that the recipe counts the number of occurrences of
the pivot: this precaution ensures good performance even in the
anomalous case where *data* contains a high
number of repetitions of identical values.

Extracting the bound methods `.append`

of lists
under and over as local variables
uappend and oappend may look like a
pointless, if tiny, complication, but it is, in fact, a very
important optimization technique in Python. To keep the compiler
simple, straightforward, unsurprising, and robust, Python does not
*hoist* constant computations out of loops, nor
does it "cache" the results of
method lookup. If you call under.append and
over.append in the inner loop, you pay the cost of
lookup each and every time. If you want something hoisted, hoist it
yourself. When you're considering an optimization,
you should always measure the code's performance
*with* and *without* that
optimization, to check that the optimization does indeed make an
important difference. According to my measurements, removing this
single optimization slows performance down by about 50% for the
typical task of picking the 5000th item of
`range(10000)`

. Considering the tiny amount of
complication involved, a difference of 50% is well worth it.

A natural idea for optimization, which just didn't
make the grade once carefully measured, is to call ```
cmp(elem,
pivot)
```

in the loop body, rather than making separate tests
for `elem < pivot`

and ```
elem >
pivot
```

. Unfortunately, measurement shows that
`cmp`

doesn't speed things up; in
fact, it slows them down, at least when the items of
*data* are of elementary types such as
numbers and strings.

So, how does select's performance compare with the simpler alternative of:

```
def selsor(data, n):
data = list(data)
data.sort( )
return data[n]
```

On thoroughly shuffled lists of 3,001 integers on my machine, this
recipe's select takes about 16
milliseconds to find the median, while selsor takes
about 13 milliseconds; considering that `sort`

could
take advantage of any partial sortedness in the data, for this kind
of length, and on elementary data whose comparisons are fast,
it's not to your advantage to use
select. For a length of 30,001, performance becomes
very close between the two approaches—around 170 milliseconds
either way. When you push the length all the way to 300,001,
select provides an advantage, finding the median in
about 2.2 seconds, while selsor takes about
2.5.

The break-even point will be smaller if the items in the sequence
have costly comparison methods, since the key difference between the
two approaches is in the number of comparisons
performed--select takes
`O(n)`

, selsor takes ```
O(n
log n)
```

. For example, say we need to compare instances of a
class designed for somewhat costly comparisons (simulating
four-dimensional points that will often coincide on the first few
dimensions):

```
class X(object):
def _ _init_ _(self):
self.a = self.b = self.c = 23.51
self.d = random.random( )
def _dats(self):
return self.a, self.b, self.c, self.d
def _ _cmp_ _(self, oth):
return cmp(self._dats, oth._dats)
```

Here, select already becomes faster than selsor when what we're computing is the median of vectors of 201 such instances.

In other words, although select has more general
overhead, when compared to the wondrously efficient coding of
lists' `sort`

method, nevertheless,
if *n* is large enough and each comparison
is costly enough, select is still well worth
considering.

### See Also

*Library Reference* and *Python in a
Nutshell* docs about method `sort`

of
type `list`

, and about module
`random`

.

*
*

*
*

View catalog information for Python Cookbook, Second Edition

Return to Python DevCenter.