How to append a trailing newline using a generator in Python
A neat thing about file objects in Python is that they are iterable, so you can do this:
for line in file:
    sys.stdout.write(line)
In the background the lines are read from the file on an as-needed basis (lazily).
Let’s assume that we have a tool that works with such an iterable of lines, but we want to abstract away the age-old problem of a missing trailing newline character, i.e. we want every line to end with a newline character ('\n') whether the last line originally had one or not. (By the way, this would be really easy in Haskell, which is functional, lazy, and has pattern matching. But now I’m playing with Python.)
Let’s add one more important assumption: we need to be able to stream through large files, so we want to do it in the same lazy way, i.e. to have an iterable (iterator) that would read from the file as needed and add a trailing newline to the last line if necessary when it is reached.
Here’s my first try (0) (in Python 3):
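The class below is a minimal sketch of such a first version, reconstructed from the description in the text rather than copied from an original listing. Only the class name SanitizeNl and the use of __iter__ and __next__ come from the text; the attribute names (lines, it, ahead, done) are assumptions.

```python
class SanitizeNl:
    """Explicit iterator that stays one line ahead of what it returns."""

    def __init__(self, lines):
        self.lines = lines  # any iterable of strings, e.g. an open file

    def __iter__(self):
        self.it = iter(self.lines)
        try:
            # Read the first line up front so __next__ can look ahead.
            self.ahead = next(self.it)
            self.done = False
        except StopIteration:
            self.done = True  # empty input: nothing to return
        return self

    def __next__(self):
        if self.done:
            raise StopIteration
        current = self.ahead
        try:
            self.ahead = next(self.it)
        except StopIteration:
            # No line follows, so current is the last one: fix it if needed.
            self.done = True
            if not current.endswith("\n"):
                current += "\n"
        return current
```

With this sketch, list(SanitizeNl(["a\n", "b"])) gives ['a\n', 'b\n'], and an empty input gives an empty list.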
This works fine. Essentially the object is always one line ahead of what it has returned via __next__ in order to know if the file ended and thus if we are on the last line. But it is ugly as hell and actually adds quite a lot of overhead (more on that later).
So I thought, what about using a generator to do exactly the same thing? (1)
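A sketch of how the generator-based class might look, assuming __iter__ itself is written as a generator function. The nested loops follow the description in the text; the variable names (it, prev, line) are assumptions.

```python
class SanitizeNl:
    """Same behaviour as the explicit iterator, with __iter__ as a generator."""

    def __init__(self, lines):
        self.lines = lines  # any iterable of strings, e.g. an open file

    def __iter__(self):
        it = iter(self.lines)
        # Outer loop: only there to fetch the first line (or to do
        # nothing at all when the input is empty).
        for prev in it:
            # Inner loop: consumes the rest, always one line ahead.
            for line in it:
                yield prev
                prev = line
            # The inner loop ran out, so prev is the last line.
            yield prev if prev.endswith("\n") else prev + "\n"
```

Because the inner loop exhausts the shared iterator, the outer loop finds nothing left when it asks for a second item and stops.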
Here we are spared 10 lines of cruft, but now there’s also a slightly cryptic part: the body of the outer for loop runs at most once, to read the first line (like we do in the __iter__ method of the initial version). But once you understand that looping over an iterable is just a way of calling __next__ and having the StopIteration exception handled by jumping out of the loop, it’s no big mystery.
As a side effect, we no longer need any instance variables to track the state, because the stack frame of a generator is preserved between yields, so we can simplify further by getting rid of the class (2):
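A sketch of the class-free version as a plain generator function. The function name sanitize_nl and the variable names are assumptions made for illustration.

```python
def sanitize_nl(lines):
    """Yield lines lazily, adding '\n' to the last one if it lacks it."""
    it = iter(lines)
    for prev in it:          # fetches the first line, if there is one
        for line in it:      # stays one line ahead of what is yielded
            yield prev
            prev = line
        # Input exhausted: prev is the last line.
        yield prev if prev.endswith("\n") else prev + "\n"
```

It can wrap an open file directly, for example: for line in sanitize_nl(file): sys.stdout.write(line).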
You may wonder if just checking for a trailing newline on each line wouldn’t actually be fast enough. That would look like this (3):
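A sketch of the check-every-line approach, assuming it is also written as a generator so that it stays lazy. The function name sanitize_nl_each is hypothetical.

```python
def sanitize_nl_each(lines):
    """Yield lines lazily, testing each one for a trailing newline."""
    for line in lines:
        yield line if line.endswith("\n") else line + "\n"
```

This needs no look-ahead at all, at the price of one endswith call per line.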
I ran a test with timeit that sanitizes 1000 lines 1000 times on all four. The first 999 lines had newlines, the last one did not. Here are the results (in seconds on a 1.7 GHz Intel Core i7 MacBook Air):
- NOP 0.003
- (0) 0.444
- (1) 0.055
- (2) 0.055
- (3) 0.114
So the plain generator (2) and the generator wrapped in a class (1) are on par, while checking each line (3) roughly doubles the overhead. The original version (0), which does the same thing as the generator in an explicit way, is by far the worst. In all four cases I timed this code: list(SanitizeNl(lines)) (to actually execute the generator). For NOP I timed just list(lines) for a baseline. (lines is actually already a list, in order to avoid executing and timing another unrelated generator generating/reading the lines.)
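A measurement along these lines could be set up roughly as follows. This is a sketch, not the harness behind the numbers above: all names (ExplicitIterator, GeneratorInClass, plain_generator, check_each_line, run_benchmark) are hypothetical, the four variants are compact stand-ins for the versions described in the text, and the timings will differ by machine and Python version.

```python
import timeit


class ExplicitIterator:
    """Variant (0): hand-written iterator that stays one line ahead."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        self.it = iter(self.lines)
        self.ahead = next(self.it, None)  # None marks exhausted input
        return self

    def __next__(self):
        if self.ahead is None:
            raise StopIteration
        current, self.ahead = self.ahead, next(self.it, None)
        if self.ahead is None and not current.endswith("\n"):
            current += "\n"
        return current


def plain_generator(lines):
    """Variant (2): generator that only inspects the last line."""
    it = iter(lines)
    for prev in it:
        for line in it:
            yield prev
            prev = line
        yield prev if prev.endswith("\n") else prev + "\n"


class GeneratorInClass:
    """Variant (1): the same generator wrapped in a class."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        return plain_generator(self.lines)


def check_each_line(lines):
    """Variant (3): test every line for a trailing newline."""
    for line in lines:
        yield line if line.endswith("\n") else line + "\n"


def run_benchmark(n_lines=1000, number=1000):
    """Return a dict mapping each variant's label to its total time in seconds."""
    # All lines but the last end with a newline, as in the text.
    lines = ["line\n"] * (n_lines - 1) + ["line"]
    variants = {
        "NOP": lambda seq: seq,  # baseline: just list(lines)
        "(0)": ExplicitIterator,
        "(1)": GeneratorInClass,
        "(2)": plain_generator,
        "(3)": check_each_line,
    }
    return {
        label: timeit.timeit(lambda f=f: list(f(lines)), number=number)
        for label, f in variants.items()
    }
```

Calling run_benchmark() and printing the returned dictionary gives one figure per variant, comparable in spirit to the list above.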
That’s it. Not really as clean as Haskell would be, but pretty neat, huh? Actually, the takeaway has nothing to do with files and lines, but rather with the efficiency of generators versus explicitly implemented iterables in Python.
If you want to learn more, there’s a great practical intro to generators by David Beazley. (It’s for Python 2, so it uses next instead of __next__.)