I gained some experience when I rewrote some of the recipes from the itertools documentation as C extensions, and I think I have some insights that could help you.
Generator vs. iterator class
When you write pure Python code it's a tradeoff between speed (generator) and features (iterator class).
Functions that use yield (known as generators) are for speed, and generally they can be written without worrying about internal state. So they are less effort to write, and they are fast because Python just manages all the "state" for you.
The reason generators are faster (or at least not slower) is mostly this: they implement the tp_iternext slot directly, in addition to the __next__ method. In that case Python doesn't have to look up the __next__ method, and that's essentially what makes them faster in the following example:
from itertools import islice
def test():
while True:
yield 1
class Test(object):
def __iter__(self):
return self
def __next__(self):
return 1
%timeit list(islice(test(), 1000))
# 173 µs ± 2.15 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit list(islice(Test(), 1000))
# 499 µs ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So it's almost 3 times faster just because generators directly populate the tp_iternext slot.
Both a yield function and a class can carry state, but the yield function saves and restores its state much faster than you can with a class and attribute access:
def test():
i = 0
while True:
yield i
i += 1
class Test(object):
def __init__(self):
self.val = 0
def __iter__(self):
return self
def __next__(self):
current = self.val
self.val += 1
return current
%timeit list(islice(test(), 1000))
# 296 µs ± 1.73 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit list(islice(Test(), 1000))
# 1.22 ms ± 3.12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This time the class is already 4 times slower (compared to almost 3 times when no state was involved). That's a cumulative effect: the more "state" you have, the slower the class variant will be.
So much for the yield vs. class approach. Note that the actual timing will depend on the kind of operations involved. For example, if the code that runs when next is called is slow (e.g. time.sleep(1)), then there's almost no difference between generator and class!
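To make that concrete, here is a small sketch (my own example, timings deliberately left out) where the per-item work dominates, so the dispatch overhead no longer matters:
import time
from itertools import islice

def slow_gen():
    while True:
        time.sleep(0.001)  # the per-item work dwarfs the iteration overhead
        yield 1

class SlowIter(object):
    def __iter__(self):
        return self
    def __next__(self):
        time.sleep(0.001)
        return 1

# %timeit list(islice(slow_gen(), 100))   # both are dominated by the sleeps
# %timeit list(islice(SlowIter(), 100))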
Cython
If you want a fast Cython iterator class, it has to be a cdef class. Otherwise you don't get a really fast class. The reason is that only a cdef class creates an extension type that directly implements the tp_iternext field! I'll use IPython's %%cython magic to compile the code (so I don't have to include the setup):
%%cython
def test():
while True:
yield 1
class Test(object):
def __iter__(self):
return self
def __next__(self):
return 1
cdef class Test_cdef(object):
def __iter__(self):
return self
def __next__(self):
return 1
%timeit list(islice(test(), 1000))
# 113 µs ± 4.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit list(islice(Test(), 1000))
# 407 µs ± 16.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit list(islice(Test_cdef(), 1000))
# 62.8 µs ± 2.46 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The timings already show that the generator and the plain class are faster than their pure Python equivalents, but their relative performance roughly stayed the same. However, the cdef class variant beats both of them, and that's mainly because the tp_iternext slot is used instead of just implementing a __next__ method. (Inspect the Cython-generated C code if you don't trust me :) )
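If you don't want to read the generated .c file directly, the %%cython magic also accepts the standard --annotate option, which renders an annotated view showing which lines still go through Python machinery (just a pointer; the exact output depends on your Cython version):
%%cython --annotate

cdef class Test_cdef(object):
    def __iter__(self):
        return self
    def __next__(self):
        return 1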
However, it's just 2 times faster than the Python generator; that's not bad, but it's not exactly overwhelming. To get really amazing speedups, you'll need to find a way to express your program without Python objects (the fewer Python objects, the more speedup). For example, if you use a dictionary for storing each item and its multiplicity, you still store Python objects, and any lookup has to go through Python dictionary methods, even if Cython can call them via C API functions instead of having to look up the actual methods:
%%cython
cpdef cython_count(items):
cdef dict res = dict()
for item in items:
if item in res:
res[item] += 1
else:
res[item] = 1
return res
import random
def count(items):
res = {}
for item in items:
if item in res:
res[item] += 1
else:
res[item] = 1
return res
l = [random.randint(0, 100) for _ in range(10000)]
%timeit cython_count(l)
# 2.06 ms ± 13 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit count(l)
# 3.63 ms ± 21.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
There's one catch here: you didn't use collections.Counter, which has optimized C code (at least in Python 3) for exactly this kind of operation:
from collections import Counter
%timeit Counter(l)
# 1.17 ms ± 41.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
A quick note here: don't use something in some_dict.keys(), because keys() is list-like in Python 2 and only implements O(n) containment checks, while something in some_dict is typically O(1) (in both Pythons)! Avoiding it makes things faster in both versions, but especially in Python 2:
def count2(items):
res = {}
for item in items:
if item in res.keys(): # with "keys()"
res[item] += 1
else:
res[item] = 1
return res
# Python3
l = [random.randint(0, 100) for _ in range(10000)]
%timeit count(l)
# 3.63 ms ± 29 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit count2(l)
# 5.9 ms ± 20 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Python2
l = [random.randint(0, 10000) for _ in range(10000)]
%timeit count(l)
# 100 loops, best of 3: 4.59 ms per loop
%timeit count2(l)
# 1 loop, best of 3: 2.65 s per loop <--- WHOOPS!!!
That shows you can only hope for something like a 3-4x speedup with Cython (and C extensions) as long as you stick to Python data structures, while even minor mistakes like misusing ".keys()" can cost you much more in performance.
Optimizing Cython
So what can you do if you want it faster? The answer is relatively easy: create your own data structure based on C types instead of Python types.
That means you have to think about the design:
- What types do you want to support in your uniqComb** functions? Do you want only integers (the examples say so, but I suppose you want arbitrary Python objects)?
- Do you want introspection from Python (like the current state)? If you do, it would make sense to keep the multiplicities as Python objects, but if you don't care, you can save them as integer-like values instead of Python objects.
- Do you need the objects passed to the uniqComb** functions to be sortable? You used sorted, but you could also use an OrderedDict and keep the keys in order of appearance instead of numerical order.
The answers to these questions (and these are only the questions I immediately asked myself; there are probably many more!) can help you decide which structure to use internally. For example, with Cython you can interface with C++, so you could use a std::map with integer keys and integer values instead of a dictionary. It's sorted by default, so you don't need to sort the keys yourself, and you operate on native integers instead of Python objects. But you lose the ability to process arbitrary Python objects in your uniqComb, and you need to know how to work with C++ types in Cython. It could be amazingly fast though!
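Just to illustrate that direction, here is a rough sketch of the counting step only (not the combination logic) using Cython's C++ wrappers; cpp_count is a made-up name and it assumes the items fit into a C long:
%%cython --cplus
from libcpp.map cimport map as cpp_map

def cpp_count(items):
    # Count integer items into a C++ std::map; std::map keeps its keys sorted,
    # so no separate sorting step is needed afterwards.
    cdef cpp_map[long, long] counts
    cdef long item
    for item in items:                    # raises if an item doesn't fit in a C long
        counts[item] = counts[item] + 1   # operator[] value-initializes missing keys to 0
    return counts                         # auto-converted to a Python dict at the boundary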
I don't go down that path, because I assume you want to support arbitrary orderable Python types, so I stick with the Counter as a starting point, but I'll store the multiplicities as integer array.arrays instead of lists. Let's call it the "least invasive" optimization. It actually doesn't matter much in terms of performance whether you use a list or an array for lstCntRpts and multiplicities, because they aren't the bottleneck, but it's a bit faster, saves a bit of memory, and more importantly it shows how you can include homogeneous arrays with Cython:
%%cython
from cpython.list cimport PyList_Size # (most) C API functions can be used with cython!
from array import array
from collections import Counter
cdef class uniqCmboClassIter:
cdef list lstUniqs
cdef Py_ssize_t lenUniqs
cdef int[:] lstCntRpts # memoryview
cdef Py_ssize_t lenCmbo
cdef list cmboAsIdxUniqs
cdef int[:] multiplicities # memoryview
cdef Py_ssize_t idxIntoCmbo
cdef Py_ssize_t idxIntoUniqs
cdef bint stopIteration
cdef Py_ssize_t x
cdef Py_ssize_t y
def __init__(self, lstItems, lenCmbo):
dctCounter = Counter(lstItems)
self.lstUniqs = sorted(dctCounter)
self.lenUniqs = PyList_Size(self.lstUniqs)
self.lstCntRpts = array('i', [dctCounter[item] for item in self.lstUniqs])
self.lenCmbo = lenCmbo
self.cmboAsIdxUniqs = [None] * lenCmbo
self.multiplicities = array('i', [0] * self.lenUniqs)
self.idxIntoCmbo, self.idxIntoUniqs = 0, 0
while self.idxIntoCmbo != self.lenCmbo and self.idxIntoUniqs != self.lenUniqs:
count = min(self.lstCntRpts[self.idxIntoUniqs], self.lenCmbo-self.idxIntoCmbo)
self.cmboAsIdxUniqs[self.idxIntoCmbo : self.idxIntoCmbo + count] = [self.idxIntoUniqs] * count
self.multiplicities[self.idxIntoUniqs] = count
self.idxIntoCmbo += count
self.idxIntoUniqs += 1
# print("self.multiplicities:", self.multiplicities)
# print("self.cmboAsIdxUniqs:", self.cmboAsIdxUniqs)
if self.idxIntoCmbo != self.lenCmbo:
return
self.stopIteration = False
self.x = 0
self.y = 0
return
def __iter__(self):
return self
def __next__(self):
if self.stopIteration is True:
raise StopIteration
nextCmbo = tuple(self.lstUniqs[idxUniqs] for idxUniqs in self.cmboAsIdxUniqs)
for self.idxIntoCmbo in reversed(range(self.lenCmbo)):
self.x = self.cmboAsIdxUniqs[self.idxIntoCmbo]
self.y = self.x + 1
if self.y < self.lenUniqs and self.multiplicities[self.y] < self.lstCntRpts[self.y]:
break
else:
self.stopIteration = True
return nextCmbo
for self.idxIntoCmbo in range(self.idxIntoCmbo, self.lenCmbo):
self.x = self.cmboAsIdxUniqs[self.idxIntoCmbo]
self.cmboAsIdxUniqs[self.idxIntoCmbo] = self.y
self.multiplicities[self.x] -= 1
self.multiplicities[self.y] += 1
# print("# multiplicities:", multiplicities)
while self.y != self.lenUniqs and self.multiplicities[self.y] == self.lstCntRpts[self.y]:
self.y += 1
if self.y == self.lenUniqs:
break
return nextCmbo
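As a quick usage check (my own toy input, not one of the timings below), the iterator should produce each multiset combination exactly once:
list(uniqCmboClassIter([1, 1, 2], 2))
# expected: [(1, 1), (1, 2)] -- the duplicate 1 only contributes one (1, 2) pair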
You didn't actually share the parameters you used for your timings, so I tried it with some of mine:
from itertools import combinations
import random
import time
def create_values(maximum):
vals = [random.randint(0, maximum) for _ in range(48)]
print('length: ', len(vals))
print('sorted values: ', sorted(vals))
print('uniques: ', len(set(vals)))
print('uniques in percent: {:%}'.format(len(set(vals)) / len(vals)))
return vals
class Timer(object):
def __init__(self):
pass
def __enter__(self):
self._time = time.time()
def __exit__(self, *args, **kwargs):
print(time.time() - self._time)
vals = create_values(maximum=50) # and 22 and 75 and 120
n = 6
with Timer():
list(combinations(vals, n))
with Timer():
list(uniqCmboClassIter(vals, n))
with Timer():
list(uniqCmboClassIterOriginal(vals, n))
with Timer():
list(uniqCmboYieldIterOriginal(vals, n))
length: 48
sorted values: [0, 0, 0, 1, 2, 2, 4, 5, 5, 6, 6, 6, 7, 7, 7, 8, 8, 8, 8, 9, 9, 10, 11, 11, 12, 12, 12, 13, 13, 14, 14, 14, 15, 15, 15, 17, 18, 19, 19, 19, 19, 20, 20, 20, 21, 21, 22, 22]
uniques: 21
uniques in percent: 43.750000%
6.250450611114502
0.4217393398284912
4.250436305999756
2.7186365127563477
length: 48
sorted values: [1, 1, 2, 5, 6, 7, 7, 8, 8, 9, 11, 13, 13, 15, 16, 16, 16, 16, 17, 19, 19, 21, 21, 23, 24, 26, 27, 28, 28, 29, 31, 31, 34, 34, 36, 36, 38, 39, 39, 40, 41, 42, 44, 46, 47, 47, 49, 50]
uniques: 33
uniques in percent: 68.750000%
6.2034173011779785
4.343803882598877
42.39261245727539
26.65750527381897
length: 48
sorted values: [4, 4, 7, 9, 10, 14, 14, 17, 19, 21, 23, 24, 24, 26, 34, 36, 40, 42, 43, 43, 45, 46, 46, 52, 53, 58, 59, 59, 61, 63, 66, 68, 71, 72, 72, 75, 76, 80, 82, 82, 83, 84, 86, 86, 89, 92, 97, 99]
uniques: 39
uniques in percent: 81.250000%
6.859697341918945
10.437987327575684
104.12988543510437
65.25306582450867
length: 48
sorted values: [4, 7, 11, 19, 24, 29, 32, 36, 49, 49, 54, 57, 58, 60, 62, 65, 67, 70, 70, 72, 72, 79, 82, 83, 86, 89, 89, 90, 91, 94, 96, 99, 102, 111, 112, 118, 120, 120, 128, 129, 129, 134, 138, 141, 141, 144, 146, 147]
uniques: 41
uniques in percent: 85.416667%
6.484673023223877
13.610010623931885
136.28764533996582
84.73834943771362
It definitely performs much better than the original approaches; it's actually several times faster with just type declarations. There's probably a lot more that could be optimized (disabling bounds checking, using Python C API function calls, using unsigned or smaller integers if you know the "maximum" and "minimum" of your multiplicities, ...), but the fact that it's not much slower than itertools.combinations even with 80% unique items, and much faster than either original implementation, is good enough for me. :-)
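As an illustration of the bounds-checking point, the standard Cython directives would be applied roughly like this (a sketch on a made-up helper, not something I benchmarked here):
%%cython
cimport cython

@cython.boundscheck(False)  # drop the per-access index checks on memoryviews
@cython.wraparound(False)   # disallow negative indices, removing another branch
def bump_all(int[:] multiplicities):
    # hypothetical helper just to show where the directives go; each indexing
    # operation now compiles down to a plain C array access
    cdef Py_ssize_t i
    for i in range(multiplicities.shape[0]):
        multiplicities[i] += 1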
Best Answer
In your code, myIter(n) actually does work -- it loops 100 times. myGen(n), on the other hand, simply builds the generator -- and that's it. It doesn't count to 100. All you're doing is timing how long it takes to build the object, and you're timing it in an unreliable way. If we use timeit instead (here using IPython to make things simpler), we see that the myGen(n) time is independent of n, because it's not doing anything. In fact, we can see in another way that your code was never executed, because actually consuming the generator trips over a typo. If we fix this typo and then try a fast way to consume the generator, the generator version turns out to be slower, as is often the case.
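The code blocks from that answer aren't reproduced above, but a minimal stand-in (my own hypothetical myGen, not the one from the question) illustrates the point that calling a generator function only builds the generator object:
def myGen(n):
    # stand-in generator: the body does not run until the generator is iterated
    for i in range(n):
        yield i

g = myGen(100)    # builds the generator object only; nothing has been counted yet
result = list(g)  # only now do the 100 iterations actually happen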