Generators in Python – Your code 1000+ times more efficient


Do you know the difference between a “normal” function and a generator function in Python? What is the difference between the return of a usual function and the yield of a generator? In this article we will answer these questions and also delve into some aspects of the language.

Let’s start by declaring a simple function that receives a sequence of numbers, squares them, and returns these squares in a list. If you need to review the concept of iterable, see this article, in which we use this same function as an example.

def square_of_numbers(iterable_of_numbers):
    result = []
    for number in iterable_of_numbers:
        result.append(number**2)
    return result

Now, let’s create a tuple of numbers and pass it to this function:

numbers = (1, 2, 3)

square_of_numbers(numbers)
[1, 4, 9]

OK, expected behavior. We can check the type of square_of_numbers and square_of_numbers(numbers):

type(square_of_numbers)
function
type(square_of_numbers(numbers))
list

Note that the object square_of_numbers is of type function, as expected. Calling it with the argument numbers produces an object of type list, since that is what the function returns. The return statement ends the function and hands back the finished list.

yield – Generators

Let’s start by seeing how the language’s documentation defines generators:

A function which returns a generator iterator. It looks like a normal function, except that it contains yield expressions for producing a series of values that can be used in a for loop or that can be retrieved one at a time with the next() function. Usually refers to a generator function, but may refer to a generator iterator in some contexts. In some cases where the intended meaning is not clear, using the full term avoids ambiguity.

I believe it was not very enlightening, but let’s improve that. First, to better understand the idea of an iterator, read this article where I explain it in detail. It is important for what we will see here. The documentation also defines the term “generator iterator”:

An object created by a generator function. Each yield temporarily suspends processing, remembering the location execution state (including local variables and pending try statements). When the generator iterator resumes, it picks up where it left off (in contrast to functions which start fresh on every invocation).

In this last definition there are critical aspects, especially the idea of temporary suspension. Let’s create our generator, modifying our function. Note that only the creation of the list was removed, and the return was replaced by yield number**2:

def squares_generator(iterable_of_numbers):
    for number in iterable_of_numbers:
        yield number**2

Before interacting with this generator, let’s define the term yield. Searching in various dictionaries, you will find the following definitions:

  • to produce
  • to provide
  • amount of production

So, we can think of a generator as a construction that produces values on demand with each call to next. We will now see this idea. Let’s start by evaluating the types:

type(squares_generator)
function
squares_generator(numbers)
type(squares_generator(numbers))
generator
g = squares_generator(numbers)
g

Note that the object squares_generator itself is of type function. However, calling it with an argument returns an object of type generator, which is displayed as a reference to a location in memory. Note that no value has been produced yet. According to the documentation, the values will be generated one at a time, on demand:

next(g)
1
next(g)
4
next(g)
9

This is the same behavior we saw in the article about iterators. Thus, a StopIteration is expected once the entire iterable passed in has been consumed:

next(g)
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
Cell In[13], line 1
----> 1 next(g)

StopIteration: 
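
To make the idea of temporary suspension more concrete, here is a minimal sketch. The name traced_squares and the log list are illustrative assumptions, not part of the original example; they only record when each part of the generator body actually runs.

```python
def traced_squares(iterable_of_numbers, log):
    """Like squares_generator, but records each step in `log` (hypothetical helper)."""
    log.append("started")
    for number in iterable_of_numbers:
        log.append(f"yielding {number**2}")
        yield number**2
        # Execution continues here only when the next value is requested
        log.append(f"resumed after {number**2}")
    log.append("finished")


log = []
g = traced_squares((1, 2, 3), log)
print(log)      # []: calling the generator function runs none of its body
print(next(g))  # 1
print(log)      # ['started', 'yielding 1']
print(next(g))  # 4
print(log)      # ['started', 'yielding 1', 'resumed after 1', 'yielding 4']
```

The body only advances up to the next yield each time next is called, and local state (here, the loop position) is preserved between calls.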

Generator expressions

The same generator function could be written in just one line using a generator expression:

(number**2 for number in numbers)
<generator object <genexpr> at 0x7f91981a7d30>

It is the same syntax as a list comprehension, but with parentheses instead of brackets. It has the same behavior as the generator created earlier:

for squared in (number**2 for number in numbers):
    print(squared)
1
4
9
g = (number**2 for number in numbers)
next(g)
1
next(g)
4
next(g)
9
next(g)
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
Cell In[20], line 1
----> 1 next(g)

StopIteration: 
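
Two practical details of generator expressions are worth a quick sketch: they can be passed directly to a function without a second pair of parentheses, and, like any generator, they can be consumed only once. The variable names below are illustrative.

```python
numbers = (1, 2, 3)

# As the sole argument of a call, a generator expression needs no extra parentheses
total = sum(number**2 for number in numbers)
print(total)  # 14

# A generator is exhausted after one pass; a second pass finds nothing left
squares = (number**2 for number in numbers)
first_pass = list(squares)
second_pass = list(squares)
print(first_pass)   # [1, 4, 9]
print(second_pass)  # []
```

If the values are needed more than once, either store them in a list or create a new generator.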

Practical applications of generators

More important than knowing that generators exist is knowing how to apply them in real cases. PEP 255, which introduced generators to the Python language, presents the rationale behind them, but it can be somewhat abstract for beginners as it deals with callbacks and coroutines. Here, I will present more accessible uses than those in the PEP.

Let’s start by creating a list of names and a list of courses for students at a given institution:

names = [
    "Americium",
    "Californium",
    "Copernicium",
    "Dysprosium",
    "Einsteinium",
]  # yes, these are all chemical elements but consider them students

courses = ["Geography", "Cinema", "Astronomy", "Greek", "Physics"]

Imagine that you want to create a registration, unifying in the same data structure the names of the students and their courses, in addition to associating a sequential registration number to each student.

The next cell contains two implementations. The first is a traditional function that stores each student as a dictionary in a list and returns that list. The second is a generator that yields each student dictionary without storing it in any structure. If storage is desired, it must be done by the consumer of the generator:

def student_registration(names, courses):
    registration = []
    for i, (name, course) in enumerate(zip(names, courses)):
        student = {
            'registration': i,
            'name': name,
            'course': course,
        }
        registration.append(student)
    return registration


def student_generator(names, courses):
    for i, (name, course) in enumerate(zip(names, courses)):
        student = {
            'registration': i,
            'name': name,
            'course': course,
        }
        yield student

Let’s start with the function, seeing the result of passing the variables names and courses as arguments and seeing how long this processing takes. IPython and Jupyter Notebook have their own functions (“magics” in the documentation) that allow for some interesting analyses. One of them is %timeit, which allows you to measure the time it takes for a given command. Since this article is being written in a Jupyter Notebook, we can use this function.

student_registration(names, courses)
[{'registration': 0, 'name': 'Americium', 'course': 'Geography'},
 {'registration': 1, 'name': 'Californium', 'course': 'Cinema'},
 {'registration': 2, 'name': 'Copernicium', 'course': 'Astronomy'},
 {'registration': 3, 'name': 'Dysprosium', 'course': 'Greek'},
 {'registration': 4, 'name': 'Einsteinium', 'course': 'Physics'}]
%timeit student_registration(names, courses)
2.08 µs ± 72.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

That is very little time, on the order of microseconds. Let’s look at the generator now. First, let’s check that calling the generator function does not produce any values, only a generator object in memory:

student_generator(names, courses)

And if we want all the results at once, we need to consume the generator by putting it, for example, in a list:

list(student_generator(names, courses))
[{'registration': 0, 'name': 'Americium', 'course': 'Geography'},
 {'registration': 1, 'name': 'Californium', 'course': 'Cinema'},
 {'registration': 2, 'name': 'Copernicium', 'course': 'Astronomy'},
 {'registration': 3, 'name': 'Dysprosium', 'course': 'Greek'},
 {'registration': 4, 'name': 'Einsteinium', 'course': 'Physics'}]

But, if the generator only creates a reference in memory, without actually creating a data structure, how long does it take?

%timeit student_generator(names, courses)
137 ns ± 2.44 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

Behold, we have dropped to the order of nanoseconds! After all, only a generator object was created; each student record will be built on demand:

%timeit next(student_generator(names, courses))
1.49 µs ± 36.7 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

If only a generator object was created and not a data structure, this should also be reflected in the space occupied, and here it gets even more interesting. To analyze the space occupied by each object, we will use the pympler library:

from pympler.asizeof import asizeof
asizeof(student_registration(names, courses))
1888
asizeof(student_generator(names, courses))
1280

The results are in bytes. Since the function returns a list with the records, the first call measures the space occupied by that list, while the second measures only the generator object.
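
If pympler is not available, a rough version of the same comparison can be sketched with sys.getsizeof from the standard library. Keep in mind that this is a swapped-in, shallow measurement: unlike asizeof, it counts only the object itself and not the objects it refers to, so the numbers are not comparable to the ones above. The sketch reuses squares_generator from earlier in the article.

```python
import sys


def squares_generator(iterable_of_numbers):
    for number in iterable_of_numbers:
        yield number**2


small = squares_generator(range(10))
large = squares_generator(range(1_000_000))

# The generator object has the same size no matter how many values it will produce
print(sys.getsizeof(small) == sys.getsizeof(large))  # True

# A list holding all the squares needs room for a million references
squares_list = [number**2 for number in range(1_000_000)]
print(sys.getsizeof(squares_list) > sys.getsizeof(large))  # True
```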

If it is not clear yet, let’s look at a more resource-demanding example. Modifying the logic of the previous function and generator a little, we can make them receive an integer that represents how many students should be stored or generated, respectively. In our example, these students will be generated randomly from the two lists created earlier.

import random


def random_student_registration(quantity):
    registration = []
    for i in range(quantity):
        student = {
            'registration': i,
            'name': random.choice(names),
            'course': random.choice(courses),
        }
        registration.append(student)
    return registration


def random_student_generator(quantity):
    for i in range(quantity):
        student = {
            'registration': i,
            'name': random.choice(names),
            'course': random.choice(courses),
        }
        yield student

Let’s see an example with 5 random students:

random_student_registration(5)
[{'registration': 0, 'name': 'Einsteinium', 'course': 'Physics'},
 {'registration': 1, 'name': 'Copernicium', 'course': 'Astronomy'},
 {'registration': 2, 'name': 'Einsteinium', 'course': 'Greek'},
 {'registration': 3, 'name': 'Americium', 'course': 'Astronomy'},
 {'registration': 4, 'name': 'Einsteinium', 'course': 'Greek'}]
random_student_generator(5)
list(random_student_generator(5))
[{'registration': 0, 'name': 'Einsteinium', 'course': 'Greek'},
 {'registration': 1, 'name': 'Einsteinium', 'course': 'Geography'},
 {'registration': 2, 'name': 'Dysprosium', 'course': 'Geography'},
 {'registration': 3, 'name': 'Dysprosium', 'course': 'Cinema'},
 {'registration': 4, 'name': 'Californium', 'course': 'Greek'}]

Note that the results are different and there may be repetition, as they are now random.

But what happens if we ask for, like… 1 million students?

%timeit random_student_registration(1_000_000)
1.24 s ± 21.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit random_student_generator(1_000_000)
129 ns ± 1.57 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

The function went from the order of microseconds to the order of seconds. After all, it now has to build the entire list of records. The generator, however, remains in the order of nanoseconds, because no record was actually created or stored; only the generator object was created, which takes negligible time.

And what about the space occupied by each object?

asizeof(random_student_registration(1_000_000))
224449416

It is hard to make sense of this number in bytes, so let’s convert it to megabytes:

_ / 2**20  # convert bytes to MB (strictly MiB: 1 MiB = 2**20 bytes)
214.05164337158203

A list with 1 million entries, each one a dictionary, occupies more than 200 MB. Would you have guessed this number?

Let’s see the generator:

asizeof(random_student_generator(1_000_000))
504

That is a negligible amount of space. After all, nothing was actually stored besides the generator object. Now, if we actually consume this generator, storing each result in, for example, a list, we get back to the previous values:

%timeit list(random_student_generator(1_000_000))
1.25 s ± 50.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
asizeof(list(random_student_generator(1_000_000))) / 2**20
214.05164337158203

Now, we don’t necessarily need to consume the entire generator at once. That’s the beauty of generators: we can consume values on demand and therefore manage computational resources better, both in terms of execution time and storage.
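
As a small sketch of on-demand consumption, itertools.islice from the standard library can take just a few records from the generator. The names and courses lists and random_student_generator are repeated from above (slightly condensed) so that the example runs on its own.

```python
import random
from itertools import islice

names = ["Americium", "Californium", "Copernicium", "Dysprosium", "Einsteinium"]
courses = ["Geography", "Cinema", "Astronomy", "Greek", "Physics"]


def random_student_generator(quantity):
    for i in range(quantity):
        yield {
            'registration': i,
            'name': random.choice(names),
            'course': random.choice(courses),
        }


students = random_student_generator(1_000_000)

# Take only the first three records; the remaining ones are not created yet
first_three = list(islice(students, 3))
print([student['registration'] for student in first_three])  # [0, 1, 2]

# The generator remembers its position, so the next record is number 3
print(next(students)['registration'])  # 3
```

Only the records that are actually requested cost time and memory, which is exactly the trade-off measured above.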

I will show such uses in other articles here on the website, but you can already imagine some: consuming a large database in small parts, reading large files in parts, creating virtually infinite series of values, and so on.
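
Two of these ideas can be sketched in a few lines. The functions naturals and in_batches are hypothetical helpers written for this illustration, not library functions: the first produces a virtually infinite series, and the second delivers any iterable in small parts, as one might do with rows from a large database or lines from a large file.

```python
from itertools import islice


def naturals(start=0):
    """Yield start, start + 1, start + 2, ... without end."""
    number = start
    while True:
        yield number
        number += 1


def in_batches(iterable, size):
    """Yield lists of up to `size` items taken from `iterable`."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return  # nothing left: this ends the generator
        yield batch


# The series is infinite, so we must limit how much of it we consume
print(list(islice(naturals(), 5)))    # [0, 1, 2, 3, 4]
print(list(in_batches(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```

Since a file object opened with open() is itself an iterator over lines, in principle it could be passed to in_batches to process a large file a few lines at a time.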

Conclusion

That’s it for this article. I hope you have understood and learned something new. We went through several concepts, such as iterables, iterators, functions, generators and we did a little analysis of time and space occupied. A lot of added value, I’m sure.

Follow Chemistry Programming on social media.

Did you like this article? It is part of the Python Drops, a set of shorter posts focused on fundamentals talking about some aspects of the Python language and programming in general. You can read more of these articles by searching for the tag “drops” here on the website.

Until next time!

Note: the times reported by %timeit vary from machine to machine, so you may not get the same values. However, the difference in orders of magnitude should remain.
