I gained some experience when I rewrote some of the recipes from the itertools documentation as C extensions, and I think I have some insights that could help you.
Generator vs. iterator class
When you write pure Python code it's a tradeoff between speed (generator) and features (iterator class).
Functions that use yield (known as generators) are for speed, and generally they can be written without worrying about internal state. So they are less effort to write, and they are fast because Python just manages all the "state" for you.
The reason generators are faster (or at least not slower) is mostly this: they implement the tp_iternext slot directly, in addition to the __next__ method. In that case Python doesn't have to look up the __next__ method, and that's essentially what makes them faster in the following example:
from itertools import islice
def test():
while True:
yield 1
class Test(object):
def __iter__(self):
return self
def __next__(self):
return 1
%timeit list(islice(test(), 1000))
# 173 µs ± 2.15 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit list(islice(Test(), 1000))
# 499 µs ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So it's almost 3 times faster just because generators directly populate the tp_iternext slot.
Both a yield function and a class can carry state, but the yield function saves and restores its state much faster than you can with a class and attribute access:
def test():
i = 0
while True:
yield i
i += 1
class Test(object):
def __init__(self):
self.val = 0
def __iter__(self):
return self
def __next__(self):
current = self.val
self.val += 1
return current
%timeit list(islice(test(), 1000))
# 296 µs ± 1.73 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit list(islice(Test(), 1000))
# 1.22 ms ± 3.12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This time the class is already 4 times slower (compared to almost 3 times when no state was involved). That's a cumulative effect: the more "state" you have, the slower the class variant will be.
So much for the yield vs. class approach. Note that the actual timing will depend on the kind of operations involved. For example, if the code that runs when next is called is slow (e.g. time.sleep(1)), then there's almost no difference between generator and class!
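To make that concrete, here is a small sketch (my own example, timings deliberately left out) where the per-item work dominates, so the dispatch overhead no longer matters:
import time
from itertools import islice

def slow_gen():
    while True:
        time.sleep(0.001)  # the per-item work dwarfs the iteration overhead
        yield 1

class SlowIter(object):
    def __iter__(self):
        return self
    def __next__(self):
        time.sleep(0.001)
        return 1

# %timeit list(islice(slow_gen(), 100))   # both are dominated by the sleeps
# %timeit list(islice(SlowIter(), 100))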
Cython
If you want a fast Cython iterator class, it has to be a cdef class. Otherwise you don't get a really fast class. The reason is that only a cdef class creates an extension type that directly implements the tp_iternext field! I'll use IPython's %%cython magic to compile the code (so I don't have to include the setup):
%%cython
def test():
while True:
yield 1
class Test(object):
def __iter__(self):
return self
def __next__(self):
return 1
cdef class Test_cdef(object):
def __iter__(self):
return self
def __next__(self):
return 1
%timeit list(islice(test(), 1000))
# 113 µs ± 4.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit list(islice(Test(), 1000))
# 407 µs ± 16.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit list(islice(Test_cdef(), 1000))
# 62.8 µs ± 2.46 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The timings already show that the generator and the plain class are faster than their pure Python equivalents, but their relative performance roughly stayed the same. However, the cdef class variant beats both of them, and that's mainly because the tp_iternext slot is used instead of just implementing a __next__ method. (Inspect the Cython-generated C code if you don't trust me :) )
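If you don't want to read the generated .c file directly, the %%cython magic also accepts the standard --annotate option, which renders an annotated view showing which lines still go through Python machinery (just a pointer; the exact output depends on your Cython version):
%%cython --annotate

cdef class Test_cdef(object):
    def __iter__(self):
        return self
    def __next__(self):
        return 1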
However, it's just 2 times faster than the Python generator; that's not bad, but it's not exactly overwhelming. To get really amazing speedups, you'll need to find a way to express your program without Python objects (the fewer Python objects, the more speedup). For example, if you use a dictionary for storing each item and its multiplicity, you still store Python objects, and any lookup has to go through Python dictionary methods, even if Cython can call them via C API functions instead of having to look up the actual methods:
%%cython
cpdef cython_count(items):
cdef dict res = dict()
for item in items:
if item in res:
res[item] += 1
else:
res[item] = 1
return res
import random
def count(items):
res = {}
for item in items:
if item in res:
res[item] += 1
else:
res[item] = 1
return res
l = [random.randint(0, 100) for _ in range(10000)]
%timeit cython_count(l)
# 2.06 ms ± 13 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit count(l)
# 3.63 ms ± 21.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
There's one catch here: you didn't use collections.Counter, which has optimized C code (at least in Python 3) for exactly this kind of operation:
from collections import Counter
%timeit Counter(l)
# 1.17 ms ± 41.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
A quick note here: don't use something in some_dict.keys(), because keys() is list-like in Python 2 and only implements O(n) containment checks, while something in some_dict is typically O(1) (in both Pythons)! Avoiding it makes things faster in both versions, but especially in Python 2:
def count2(items):
res = {}
for item in items:
if item in res.keys(): # with "keys()"
res[item] += 1
else:
res[item] = 1
return res
# Python3
l = [random.randint(0, 100) for _ in range(10000)]
%timeit count(l)
# 3.63 ms ± 29 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit count2(l)
# 5.9 ms ± 20 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Python2
l = [random.randint(0, 10000) for _ in range(10000)]
%timeit count(l)
# 100 loops, best of 3: 4.59 ms per loop
%timeit count2(l)
# 1 loop, best of 3: 2.65 s per loop <--- WHOOPS!!!
That shows you can only hope for something like a 3-4x speedup with Cython (and C extensions) as long as you stick to Python data structures, while even minor mistakes like misusing ".keys()" can cost you much more in performance.
Optimizing Cython
So what can you do if you want it faster? The answer is relatively easy: create your own data structure based on C types instead of Python types.
That means you have to think about the design:
- What types do you want to support in your uniqComb** functions? Do you want only integers (the examples say so, but I suppose you want arbitrary Python objects)?
- Do you want introspection from Python (like the current state)? If you do, it would make sense to keep the multiplicities as Python objects, but if you don't care, you can save them as integer-like values instead of Python objects.
- Do you need the objects passed to the uniqComb** functions to be sortable? You used sorted, but you could also use an OrderedDict and keep the keys in order of appearance instead of numerical order.
The answers to these questions (and these are only the questions I immediately asked myself; there are probably many more!) can help you decide which structure to use internally. For example, with Cython you can interface with C++, so you could use a std::map with integer keys and integer values instead of a dictionary. It's sorted by default, so you don't need to sort the keys yourself, and you operate on native integers instead of Python objects. But you lose the ability to process arbitrary Python objects in your uniqComb, and you need to know how to work with C++ types in Cython. It could be amazingly fast though!
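Just to illustrate that direction, here is a rough sketch of the counting step only (not the combination logic) using Cython's C++ wrappers; cpp_count is a made-up name and it assumes the items fit into a C long:
%%cython --cplus
from libcpp.map cimport map as cpp_map

def cpp_count(items):
    # Count integer items into a C++ std::map; std::map keeps its keys sorted,
    # so no separate sorting step is needed afterwards.
    cdef cpp_map[long, long] counts
    cdef long item
    for item in items:                    # raises if an item doesn't fit in a C long
        counts[item] = counts[item] + 1   # operator[] value-initializes missing keys to 0
    return counts                         # auto-converted to a Python dict at the boundary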
I don't go down that path, because I assume you want to support arbitrary orderable Python types, so I stick with the Counter as a starting point, but I'll store the multiplicities as integer array.arrays instead of lists. Let's call it the "least invasive" optimization. It actually doesn't matter much in terms of performance whether you use a list or an array for lstCntRpts and multiplicities, because they aren't the bottleneck, but it's a bit faster, saves a bit of memory, and more importantly it shows how you can include homogeneous arrays with Cython:
%%cython
from cpython.list cimport PyList_Size # (most) C API functions can be used with cython!
from array import array
from collections import Counter
cdef class uniqCmboClassIter:
cdef list lstUniqs
cdef Py_ssize_t lenUniqs
cdef int[:] lstCntRpts # memoryview
cdef Py_ssize_t lenCmbo
cdef list cmboAsIdxUniqs
cdef int[:] multiplicities # memoryview
cdef Py_ssize_t idxIntoCmbo
cdef Py_ssize_t idxIntoUniqs
cdef bint stopIteration
cdef Py_ssize_t x
cdef Py_ssize_t y
def __init__(self, lstItems, lenCmbo):
dctCounter = Counter(lstItems)
self.lstUniqs = sorted(dctCounter)
self.lenUniqs = PyList_Size(self.lstUniqs)
self.lstCntRpts = array('i', [dctCounter[item] for item in self.lstUniqs])
self.lenCmbo = lenCmbo
self.cmboAsIdxUniqs = [None] * lenCmbo
self.multiplicities = array('i', [0] * self.lenUniqs)
self.idxIntoCmbo, self.idxIntoUniqs = 0, 0
while self.idxIntoCmbo != self.lenCmbo and self.idxIntoUniqs != self.lenUniqs:
count = min(self.lstCntRpts[self.idxIntoUniqs], self.lenCmbo-self.idxIntoCmbo)
self.cmboAsIdxUniqs[self.idxIntoCmbo : self.idxIntoCmbo + count] = [self.idxIntoUniqs] * count
self.multiplicities[self.idxIntoUniqs] = count
self.idxIntoCmbo += count
self.idxIntoUniqs += 1
# print("self.multiplicities:", self.multiplicities)
# print("self.cmboAsIdxUniqs:", self.cmboAsIdxUniqs)
if self.idxIntoCmbo != self.lenCmbo:
return
self.stopIteration = False
self.x = 0
self.y = 0
return
def __iter__(self):
return self
def __next__(self):
if self.stopIteration is True:
raise StopIteration
nextCmbo = tuple(self.lstUniqs[idxUniqs] for idxUniqs in self.cmboAsIdxUniqs)
for self.idxIntoCmbo in reversed(range(self.lenCmbo)):
self.x = self.cmboAsIdxUniqs[self.idxIntoCmbo]
self.y = self.x + 1
if self.y < self.lenUniqs and self.multiplicities[self.y] < self.lstCntRpts[self.y]:
break
else:
self.stopIteration = True
return nextCmbo
for self.idxIntoCmbo in range(self.idxIntoCmbo, self.lenCmbo):
self.x = self.cmboAsIdxUniqs[self.idxIntoCmbo]
self.cmboAsIdxUniqs[self.idxIntoCmbo] = self.y
self.multiplicities[self.x] -= 1
self.multiplicities[self.y] += 1
# print("# multiplicities:", multiplicities)
while self.y != self.lenUniqs and self.multiplicities[self.y] == self.lstCntRpts[self.y]:
self.y += 1
if self.y == self.lenUniqs:
break
return nextCmbo
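As a quick usage check (my own toy input, not one of the timings below), the iterator should produce each multiset combination exactly once:
list(uniqCmboClassIter([1, 1, 2], 2))
# expected: [(1, 1), (1, 2)] -- the duplicate 1 only contributes one (1, 2) pair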
You didn't actually share the parameters you used for your timings, so I tried it with some of mine:
from itertools import combinations
import random
import time
def create_values(maximum):
vals = [random.randint(0, maximum) for _ in range(48)]
print('length: ', len(vals))
print('sorted values: ', sorted(vals))
print('uniques: ', len(set(vals)))
print('uniques in percent: {:%}'.format(len(set(vals)) / len(vals)))
return vals
class Timer(object):
def __init__(self):
pass
def __enter__(self):
self._time = time.time()
def __exit__(self, *args, **kwargs):
print(time.time() - self._time)
vals = create_values(maximum=50) # and 22 and 75 and 120
n = 6
with Timer():
list(combinations(vals, n))
with Timer():
list(uniqCmboClassIter(vals, n))
with Timer():
list(uniqCmboClassIterOriginal(vals, n))
with Timer():
list(uniqCmboYieldIterOriginal(vals, n))
length: 48
sorted values: [0, 0, 0, 1, 2, 2, 4, 5, 5, 6, 6, 6, 7, 7, 7, 8, 8, 8, 8, 9, 9, 10, 11, 11, 12, 12, 12, 13, 13, 14, 14, 14, 15, 15, 15, 17, 18, 19, 19, 19, 19, 20, 20, 20, 21, 21, 22, 22]
uniques: 21
uniques in percent: 43.750000%
6.250450611114502
0.4217393398284912
4.250436305999756
2.7186365127563477
length: 48
sorted values: [1, 1, 2, 5, 6, 7, 7, 8, 8, 9, 11, 13, 13, 15, 16, 16, 16, 16, 17, 19, 19, 21, 21, 23, 24, 26, 27, 28, 28, 29, 31, 31, 34, 34, 36, 36, 38, 39, 39, 40, 41, 42, 44, 46, 47, 47, 49, 50]
uniques: 33
uniques in percent: 68.750000%
6.2034173011779785
4.343803882598877
42.39261245727539
26.65750527381897
length: 48
sorted values: [4, 4, 7, 9, 10, 14, 14, 17, 19, 21, 23, 24, 24, 26, 34, 36, 40, 42, 43, 43, 45, 46, 46, 52, 53, 58, 59, 59, 61, 63, 66, 68, 71, 72, 72, 75, 76, 80, 82, 82, 83, 84, 86, 86, 89, 92, 97, 99]
uniques: 39
uniques in percent: 81.250000%
6.859697341918945
10.437987327575684
104.12988543510437
65.25306582450867
length: 48
sorted values: [4, 7, 11, 19, 24, 29, 32, 36, 49, 49, 54, 57, 58, 60, 62, 65, 67, 70, 70, 72, 72, 79, 82, 83, 86, 89, 89, 90, 91, 94, 96, 99, 102, 111, 112, 118, 120, 120, 128, 129, 129, 134, 138, 141, 141, 144, 146, 147]
uniques: 41
uniques in percent: 85.416667%
6.484673023223877
13.610010623931885
136.28764533996582
84.73834943771362
It definitely performs much better than the original approaches; it's actually several times faster with just type declarations. There's probably a lot more that could be optimized (disabling bounds checking, using Python C API function calls, using unsigned or smaller integers if you know the "maximum" and "minimum" of your multiplicities, ...), but the fact that it's not much slower than itertools.combinations even with 80% unique items, and much faster than either original implementation, is good enough for me. :-)
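As an illustration of the bounds-checking point, the standard Cython directives would be applied roughly like this (a sketch on a made-up helper, not something I benchmarked here):
%%cython
cimport cython

@cython.boundscheck(False)  # drop the per-access index checks on memoryviews
@cython.wraparound(False)   # disallow negative indices, removing another branch
def bump_all(int[:] multiplicities):
    # hypothetical helper just to show where the directives go; each indexing
    # operation now compiles down to a plain C array access
    cdef Py_ssize_t i
    for i in range(multiplicities.shape[0]):
        multiplicities[i] += 1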
Best Answer
In your code, myIter(n) actually does work -- it loops 100 times. myGen(n), on the other hand, simply builds the generator -- and that's it. It doesn't count to 100. All you're doing is timing how long it takes to build the object, and you're timing it in an unreliable way. If we use timeit instead (here using IPython to make things simpler), we see that the myGen(n) time is independent of n, because it's not doing anything. In fact, we can see in another way that your code was never executed, because actually consuming the generator trips over a typo. If we fix this typo and then try a fast way to consume the generator, the generator version turns out to be slower, as is often the case.
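The code blocks from that answer aren't reproduced above, but a minimal stand-in (my own hypothetical myGen, not the one from the question) illustrates the point that calling a generator function only builds the generator object:
def myGen(n):
    # stand-in generator: the body does not run until the generator is iterated
    for i in range(n):
        yield i

g = myGen(100)    # builds the generator object only; nothing has been counted yet
result = list(g)  # only now do the 100 iterations actually happen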