Python Generators

First published in Micro Mart #1423, July 2016

In this article we'll see what generators are, why Python 3 favours them, and how to create our own. Incidentally, Python 2 methods that returned sequences (e.g., dict.keys()), generally returned lists, while in Python 3 they return generators. Let's start with a couple of Python 3 examples:

N = 1000000 # 1_000_000 allowed in Python 3.6
lst = list(range(N))
gen = range(N)
print(sys.getsizeof(lst)) # 9000112
print(sys.getsizeof(gen)) #      48

So the list occupies almost 9MB while the generator consumes a mere 48 bytes! Clearly generators can save a lot of memory. And it gets better. If we change N to, say, 20 million, the list will occupy 180MB, while the generator still consumes just 48 bytes.

A list stores object references, so if we want to store a million numbers, the list will contain a million objects. But a generator stores a tiny bit of code—the code needed to generate each item on demand. When I timed the code with twenty million numbers, it took 0.36 seconds to create the list, and 1.1 seconds to iterate over it using a for loop, for a total of 1.4 seconds. For the generator it took no time to create (well, less than 0.000004 seconds), and 1.3 seconds to iterate.

For lists, we pay an up-front cost in processing and memory to create the list, and get fast iteration. For generators there's effectively no setup cost and no memory overhead, but there is a tiny processing cost per item we iterate over. In practice, we normally do some per-item processing which is usually so much more expensive than the tiny cost of generating each item, that generators are just as fast in practical use as lists. For these reasons, Python 3 always provides generators rather than lists when sequences of items are required. And, of course, we can always get a list (or tuple) simply by wrapping the generator in list() or tuple()—as shown in the example.

Like lists, generators have a literal syntax, for example:

odd_lst = [x for x in range(1000) if x % 2]
odd_gen = (x for x in range(1000) if x % 2)

The odd_lst list's size varies proportionally (e.g., 528 bytes for odd numbers less than 100, 4272 bytes for less than 1000, and so on), while the generator is a fixed size of 72 bytes. This generator is slightly bigger than the previous one because the generating code is a bit more complex.

Using the built-in range() function, or the literal syntax with parentheses is straightforward and useful. Furthermore, many of Python's built in types return a generator rather than a list: for example, dict.keys(), dict.values(), and dict.items(). One other distinction between lists and generators is worth pointing out: lists are always either empty or of finite length, but a generator may be of any length—even infinite. Naturally, we would like to be able to make our own generators when appropriate. Let's start with a simple generator function that returns odd numbers:

def odds():
    start = 1
    while True:
        yield start
        start += 2

Although tiny, this function is quite sophisticated. When it is called it creates a local variable called start with value 1. It then immediately begins an infinite loop whose first statement is yield start. When a yield statement is reached, the function returns the value of the yield statement (or None if there isn't a value given), and then suspends execution of the function. This means that all the function's state (in this case the value of start), is stored. The second time the function is called it resumes from the statement following the yield, so in this case start is incremented to value 3, then the loop continues, and again the yield is reached, so this time 3 is returned (yielded), and the state of start is stored as 3 and the function is again suspended. Here's how we can use the function:

for x in odds():
    print(x, end=" ")
    if x > 20:
        break
print()
# 1 3 5 7 9 11 13 15 17 19 21

The odds() generator function contains an infinite loop, so we must only ever call it a finite number of times or else it will never stop. Here's an improved version:

def odds(start=1, end=None):
    while True:
        if end is not None and start >= end:
            break
        yield start
        start += 2

We can call this with optional start and end values. If we give neither it will behave like the original version producing 1, 3, 5, .... If we give a start and end, or just an end (e.g., odds(end=40)), then we will get a finite generator:

for x in odds(20, 40):
    print(x, end=" ")
print()
# 20 22 24 26 28 30 32 34 36 38

The two odds() functions are sufficient to show how to create generator functions. But what if we want to use a generator in a class? We might, for instance, have a class that contains multiple items for which we want to provide an iterator (like dict.keys()). Here is a simple Polygon class whose instances store a sequence of integers representing pairs of (x, y) coordinates:

class Polygon:
    def __init__(self, *points):
        self._points = array.array("i", points)
    def x(self, index=0):
        return self._points[index * 2]
    def setX(self, index, value):
        self._points[index * 2] = value
    def y(self, index=0):
        return self._points[1 + (index * 2)]
    def setY(self, index, value):
        self._points[1 + (index * 2)] = value

This class uses the array module's array class to store the points in a very compact form (e.g., 56 bytes for 20 points vs. 472 bytes for a list of 40 integers), but at the price of some tiny processing overhead to get and set points. The *points syntax means that the constructor will accept zero or more arguments, so if none are given the polygon will be created with an empty array. What if we want to iterate over all the points in a polygon? We could easily add a points() method:

    def points(self):
        points = []
        x = None
        for coord in self._points:
            if x is None:
                x = coord
            else:
                points.append((x, coord))
                x = None
        return points

The points() method can be used as follows:

polygon = Polygon(range(40))
for point in polygon.points():
    print(point)
    # (0, 1)\n(2, 3) etc.
for x, y in polygon.points():
    print(x, y)
    # 0 1\n2 3 etc.

The points() method has three disadvantages. First, users of our Polygon class must learn it, when really they expect to be able to iterate the same as with any other class, e.g., for point in polygon. Second, the class creates a (possibly big) list of integers rather than returning each pair on demand. Third, although correct, the code is inelegant.

Integrating our Polygon class so that it supports iteration is really easy. We simply rename the points() method to __iter__(), and make it return an iterator—and while we're at it, we'll improve the code too:

    def __iter__(self):
        points = []
        for x, y in zip(self._points[::2], self._points[1::2]):
            points.append((x, y))
        return iter(points)

Python's built-in zip() function takes two or more iterables and returns a two-(or more)-tuple of the first items, then the second items, and so on, until one of the iterators runs out of items. Here, we've used slicing syntax [::2] which means step from the first (zeroth) to the last in steps of two (i.e., [0:len(_self.points):2]), and [1::2] which means step from the second to the last in steps of two (i.e., [1:len(self_points):2]). Once we have this code in place we can write for point in polygon: or for x, y in polygon:. Here's our final refinement:

    def __iter__(self):
        for x, y in zip(self._points[::2], self._points[1::2]):
            yield (x, y)

This version of the method doesn't store any data at all, simply yielding each x, y coordinate pair in turn. The built-in zip() function is also a generator, so that doesn't create any overhead either. And this new __iter__() can be used with exactly the same for point in polygon: or for x, y in polygon: syntax as before. And if we ever needed a tuple or list of a polygon's points we would simply write tuple(polygon) or list(polygon) and under the hood Python would automatically use the Polygon.__iter__() method.

In general, it is better to provide functions or methods that return a generator than ones which return lists. Generators consume hardly any memory, and users of our functions or methods have the choice of using the generator as-is, or if they really need a tuple or list, forcing the generator to generate all its values.

For more see Python Programming Tips

Top