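Python Cookbook (04): Iterators and Generators

Iteration is one of Python's most powerful features. This article covers coding techniques related to Python iterators and generators.

Manually Consuming an Iterator

Problem

You want to step through an iterator manually, without using a for loop.

Solution

To consume an iterator manually, call the next() function and catch the StopIteration exception in your code.
next() also accepts a default value that it returns instead of raising StopIteration, which can be used to mark the end of the iteration.

Example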

def manual_iter():
    with open('/etc/passwd') as f:
        try:
            while True:
                line = next(f)
                print(line, end='')
        except StopIteration:
            pass


def manual_iter_2():
    with open('/etc/passwd') as f:
        while True:
            line = next(f, None)
            if line is None:
                break
            print(line, end='')

>>> a = [1, 2, 3, 4]
>>> it = iter(a)
>>> next(it)
1
>>> next(it)
2
>>> next(it)
3
>>> next(it)
4
>>> next(it)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

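Delegating Iteration

Problem

You have built a custom container that holds a list, tuple, or other iterable internally, and you want iteration to work on the new container object.

Solution

Python's iterator protocol requires that calling iter() on an iterable returns an iterator.
iter(s) essentially calls s.__iter__(), so an iterable needs to implement an __iter__() method that returns an iterator.
An iterator implements the __next__() method, which is why next() can be called on it.
next(i) essentially calls i.__next__().

Example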

class Node:
    def __init__(self, value):
        self._value = value
        self._children = []

    def __repr__(self):
        return 'Node({!r})'.format(self._value)

    def add_child(self, node):
        self._children.append(node)

    def __iter__(self):
        return iter(self._children)


if __name__ == '__main__':
    root = Node(0)
    child1 = Node(1)
    child2 = Node(2)
    root.add_child(child1)
    root.add_child(child2)
    for ch in root:
        print(ch)
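
As a small illustration of the equivalences listed in the solution, the sketch below drives a plain list by hand, once through the built-in functions and once through the special methods directly. The variable names are chosen only for this illustration.

```python
items = [10, 20, 30]

# iter() obtains an iterator by calling the iterable's __iter__() method.
it_a = iter(items)
it_b = items.__iter__()

# next() advances an iterator by calling its __next__() method.
first_a = next(it_a)
first_b = it_b.__next__()
print(first_a, first_b)  # 10 10

# An iterator returns itself from __iter__(), so it also works in a for loop.
same_object = iter(it_a) is it_a
print(same_object)  # True

# Looping continues from where the manual next() call stopped.
rest = [x for x in it_a]
print(rest)  # [20, 30]
```

This is also why Node.__iter__() in the example above can simply hand back iter(self._children): the list already knows how to produce a proper iterator.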

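Creating New Iteration Patterns with Generators

Problem

You want to implement a custom iteration pattern that differs from built-ins such as range() and reversed().

Solution

When you want a new iteration pattern, you can define it with a generator function:

In Python, any function that contains a yield statement is a generator function. Calling a generator function returns a generator.
A generator is an iterator and supports the iteration protocol, so next() can be called on it. The call runs the function body until it reaches a yield statement, returns the yielded value, and pauses at that point. The next call to next() resumes execution from where the function paused, and so on. When the function finally finishes and returns, StopIteration is raised.

Example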

def frange(start, stop, increment):
    x = start
    while x < stop:
        yield x
        x += increment


for n in frange(1, 5, 2):
    print(n)


print(list(frange(1, 6, 0.8)))

>>> def countdown(n):
...     print('Starting to count from', n)
...     while n > 0:
...             yield n
...             n -= 1
...     print('Done')
...
>>> c = countdown(3)
>>>
>>> next(c)
Starting to count from 3
3
>>> next(c)
2
>>> next(c)
1
>>> next(c)
Done
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
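
One consequence of the behaviour described above is that a generator can only be consumed once: after StopIteration has been raised, the same object produces nothing more. The following minimal sketch reuses the countdown() function from the session above (without the print calls) to show this.

```python
def countdown(n):
    while n > 0:
        yield n
        n -= 1


c = countdown(3)
first_pass = list(c)    # runs the generator to completion
second_pass = list(c)   # the same generator object is already exhausted
print(first_pass)   # [3, 2, 1]
print(second_pass)  # []

# Calling the generator function again creates a fresh generator.
fresh = list(countdown(3))
print(fresh)  # [3, 2, 1]
```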

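Implementing the Iterator Protocol

Problem

You want a custom type to support iteration, and you want the simplest way to implement the iterator protocol.

Solution

Python's iterator protocol requires that a type provide an __iter__() method that returns an iterator. That iterator implements __next__() and signals the end of iteration by raising StopIteration.
The simplest way to implement iteration on a custom type is to use a generator function: when the type's __iter__() method is written as a generator function, the type supports the iterator protocol automatically.
The more involved way is to implement the protocol yourself. In that case you may have to maintain a lot of state while iterating.

Example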

class Node:
    def __init__(self, value):
        self._value = value
        self._children = []

    def __repr__(self):
        return 'Node({!r})'.format(self._value)

    def __iter__(self):
        return iter(self._children)

    def add_child(self, node):
        self._children.append(node)

    def dfs(self):
        yield self
        for c in self._children:
            yield from c.dfs()


if __name__ == '__main__':
    root = Node(0)
    child1 = Node(1)
    child2 = Node(2)
    root.add_child(child1)
    root.add_child(child2)
    child1.add_child(Node(3))
    child1.add_child(Node(4))
    child2.add_child(Node(5))

    for ch in root.dfs():
        print(ch)

class Node:
    def __init__(self, value):
        self._value = value
        self._children = []

    def __repr__(self):
        return 'Node({!r})'.format(self._value)

    def add_child(self, node):
        self._children.append(node)

    def __iter__(self):
        return iter(self._children)

    def dfs(self):
        return NodeDfsIterator(self)


class NodeDfsIterator:
    def __init__(self, node):
        self._node = node
        self._children_iter = None
        self._child_iter = None

    def __iter__(self):
        return self

    def __next__(self):
        if self._children_iter is None:
            self._children_iter = iter(self._node)
            return self._node
        elif self._child_iter:
            try:
                next_child = next(self._child_iter)
                return next_child
            except StopIteration:
                self._child_iter = None
                return next(self)
        else:
            self._child_iter = next(self._children_iter).dfs()
            return next(self)
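
The solution mentions writing __iter__() itself as a generator function, while both examples above expose the traversal through a separate dfs() method. As a hedged variation (not the original design, which iterates over direct children only), the sketch below makes iterating a Node perform the depth-first traversal directly.

```python
class Node:
    def __init__(self, value):
        self._value = value
        self._children = []

    def __repr__(self):
        return 'Node({!r})'.format(self._value)

    def add_child(self, node):
        self._children.append(node)

    def __iter__(self):
        # __iter__ is a generator function here: calling it returns a
        # generator, which already provides __next__() and StopIteration.
        yield self
        for child in self._children:
            # "yield from child" iterates the child, i.e. recurses into
            # the child's own __iter__().
            yield from child


root = Node(0)
child1 = Node(1)
child2 = Node(2)
root.add_child(child1)
root.add_child(child2)
child1.add_child(Node(3))

visited = [repr(n) for n in root]
print(visited)  # ['Node(0)', 'Node(1)', 'Node(3)', 'Node(2)']
```

Compared with NodeDfsIterator, all traversal state lives in the suspended generator frames, so no bookkeeping attributes are needed.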

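Iterating in Reverse

Problem

You want to iterate over a sequence in reverse order.

Solution

You can use the built-in reversed() function.
Reverse iteration only works if the object's size can be determined in advance or if the object implements the __reversed__() method.
If neither condition holds, the object must first be converted to a list, for example when iterating over a file object.
Consequently, a custom type supports reverse iteration as long as it implements __reversed__().
Defining a reversed iterator makes the code much more efficient, because the data no longer has to be copied into a list first and then iterated in reverse.

Example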

>>> a = [1, 2, 3, 4]
>>> for i in reversed(a):
...     print(i)
...
4
3
2
1

>>> f = open('somefile')
>>> for line in reversed(list(f)):
...     print(line)

class CountDown:
    def __init__(self, start):
        self._start = start

    def __iter__(self):
        n = self._start
        while n > 0:
            yield n
            n -= 1

    def __reversed__(self):
        n = 1
        while n <= self._start:
            yield n
            n += 1


for rr in reversed(CountDown(5)):
    print(rr)


for r in CountDown(5):
    print(r)
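
To make the precondition concrete, the following sketch shows what happens when reversed() is applied to an object that has neither a known size nor a __reversed__() method, using a plain generator as the test case. The countdown() function is only an illustration.

```python
def countdown(n):
    while n > 0:
        yield n
        n -= 1


gen = countdown(3)
try:
    reversed(gen)
    supported = True
except TypeError:
    # A generator has no __reversed__() and no __len__()/__getitem__().
    supported = False
print(supported)  # False

# Fallback: build a list first, at the cost of holding all items in memory.
result = list(reversed(list(countdown(3))))
print(result)  # [1, 2, 3]
```

The CountDown class above avoids the intermediate list because its __reversed__() method generates the values directly in ascending order.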

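Generator Functions with Extra State

Problem

You want a generator to expose some extra state to the user while it is being iterated.

Solution

Generator logic does not have to be written as a function. It can also be implemented as a class whose __iter__() method is a generator function.
Objects of that class then support the iterator protocol, and the state can be read through the object's attributes.

Example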

from collections import deque


class linehistory:
    def __init__(self, lines, histlen=3):
        self.lines = lines
        self.history = deque(maxlen=histlen)

    def __iter__(self):
        for lineno, line in enumerate(self.lines):
            self.history.append((lineno, line))
            yield line

    def clear(self):
        self.history.clear()

with open('somefile') as f:
    lh = linehistory(f)
    for line in lh:
        if 'python' in line:
            for lineno, hline in lh.history:
                print('{}:{}'.format(lineno, hline), end='')
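
The example above needs a file on disk. As a self-contained sketch of the same idea, the version below feeds the class a hypothetical list of lines and also shows that, when driving the object by hand, iter() has to be called first, because the object itself has no __next__() method.

```python
from collections import deque


class linehistory:
    def __init__(self, lines, histlen=3):
        self.lines = lines
        self.history = deque(maxlen=histlen)

    def __iter__(self):
        for lineno, line in enumerate(self.lines):
            self.history.append((lineno, line))
            yield line


# Hypothetical sample data standing in for the lines of a file.
sample = ['alpha\n', 'beta\n', 'python rocks\n', 'delta\n']

lh = linehistory(sample, histlen=2)
it = iter(lh)        # next(lh) would raise TypeError
first = next(it)

snapshot = None
for line in it:
    if 'python' in line:
        # The history is ordinary attribute state on the object.
        snapshot = list(lh.history)
        break

print(first)     # alpha (followed by its newline)
print(snapshot)  # [(1, 'beta\n'), (2, 'python rocks\n')]
```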

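Slicing an Iterator

Problem

You want to take a slice of the data produced by an iterator.

Solution

Standard slicing cannot be applied to iterators, because their length is not known in advance.
itertools.islice() can be used to slice iterators and generators.
itertools.islice() consumes data from the iterator passed to it: it first discards every item from the first one up to the start index, and then yields items one by one until it reaches the stop index.

Example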

>>> def count(n):
...     while True:
...             yield n
...             n += 1
...

>>> import itertools
>>> c = count(0)
>>> for x in itertools.islice(c, 0, 5):
...     print(x)
...
0
1
2
3
4
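
The point about islice() consuming the underlying iterator is easy to miss, so here is a minimal sketch that checks it. It reuses the count() generator from the session above; the variable names are illustrative.

```python
import itertools


def count(n):
    while True:
        yield n
        n += 1


c = count(0)

# Ordinary slicing is not available on a generator.
try:
    c[10:15]
    sliceable = True
except TypeError:
    sliceable = False
print(sliceable)  # False

# islice() discards items 0..9, then yields items 10..14.
window = list(itertools.islice(c, 10, 15))
print(window)  # [10, 11, 12, 13, 14]

# The generator has been advanced past everything islice() read.
after = next(c)
print(after)  # 15
```

If the skipped items are needed again later, they have to be saved first, for example by converting the data to a list.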

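Skipping the First Part of an Iterable

Problem

You want to iterate over an iterable, but you are not interested in the first few items and want to skip them.

Solution

itertools.dropwhile() handles this task. It takes a function and an iterable and returns an iterator that discards the leading items for as long as the function returns True; once the function returns False for the first time, every remaining item is returned normally.
If you know exactly how many items to skip, you can also use itertools.islice() and set the stop index to None.

Example

The following code skips the lines starting with # at the beginning of a file: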

>>> from itertools import dropwhile
>>> with open('somefile') as f:
...     for line in dropwhile(lambda line: line.startswith('#'), f):
...             print(line, end='')
...

Without dropwhile(), the code is somewhat more complicated:

>>> with open('somefile') as f:
...     while True:
...             line = next(f, '')
...             if not line.startswith('#'):
...                     break
...     while line:
...             print(line, end='')
...             line = next(f, None)
...

>>> from itertools import islice
>>> l = ['a', 'b', 'c', 1, 2, 3]
>>> for i in islice(l, 3, None):
...     print(i)
...
1
2
3
>>> for i in islice(l, None, 3):
...     print(i)
...
a
b
c
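
dropwhile() only discards leading items while the predicate holds; matching items that appear later are kept. The sketch below demonstrates this with a hypothetical list of lines in place of a real file.

```python
from itertools import dropwhile

# Hypothetical file contents, given as a list of lines.
lines = [
    '# header comment\n',
    '# another comment\n',
    'first = 1\n',
    '# comment in the middle\n',
    'second = 2\n',
]

# Drop lines while they start with '#'; stop dropping at the first one
# that does not, and pass everything after it through unchanged.
kept = list(dropwhile(lambda line: line.startswith('#'), lines))
print(kept)
# ['first = 1\n', '# comment in the middle\n', 'second = 2\n']
```

To filter out every comment line rather than only the leading ones, a generator expression with an if clause would be the more fitting tool.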

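Iterating over Permutations and Combinations

Problem

You want to iterate over all possible permutations or combinations of the items in a collection.

Solution

itertools.permutations() takes a collection and produces a sequence of tuples, each of which is one possible ordering of all the items in the collection.
To get all permutations of a smaller length, pass the optional length argument.
itertools.combinations() produces all combinations of the items in the input collection.
itertools.combinations_with_replacement() allows the same item to be chosen more than once.
When you face a complicated iteration problem, look at the itertools module first to see whether it already offers a solution.

Example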

>>> items = ['a', 'b', 'c']
>>> from itertools import permutations
>>> for p in permutations(items):
...     print(p)
...
('a', 'b', 'c')
('a', 'c', 'b')
('b', 'a', 'c')
('b', 'c', 'a')
('c', 'a', 'b')
('c', 'b', 'a')

>>> for p in permutations(items, 2):
...     print(p)
...
('a', 'b')
('a', 'c')
('b', 'a')
('b', 'c')
('c', 'a')
('c', 'b')

>>> from itertools import combinations
>>> for c in combinations(items, 3):
...     print(c)
...
('a', 'b', 'c')

>>> for c in combinations(items, 2):
...     print(c)
...
('a', 'b')
('a', 'c')
('b', 'c')

>>> for c in combinations(items, 1):
...     print(c)
...
('a',)
('b',)
('c',)

>>> from itertools import combinations_with_replacement
>>> for c in combinations_with_replacement(items, 3):
...     print(c)
...
('a', 'a', 'a')
('a', 'a', 'b')
('a', 'a', 'c')
('a', 'b', 'b')
('a', 'b', 'c')
('a', 'c', 'c')
('b', 'b', 'b')
('b', 'b', 'c')
('b', 'c', 'c')
('c', 'c', 'c')
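
As a quick sanity check on the listings above, the number of tuples each function produces can be compared with the usual counting formulas. This is only a sketch; math.comb() is assumed to be available (Python 3.8 or later).

```python
from itertools import combinations, combinations_with_replacement, permutations
from math import comb, factorial

items = ['a', 'b', 'c']
n = len(items)

n_perms = len(list(permutations(items)))        # n!
n_perms_2 = len(list(permutations(items, 2)))   # n! / (n - 2)!
n_combs_2 = len(list(combinations(items, 2)))   # C(n, 2)
n_cwr_3 = len(list(combinations_with_replacement(items, 3)))  # C(n + 3 - 1, 3)

print(n_perms, factorial(n))                           # 6 6
print(n_perms_2, factorial(n) // factorial(n - 2))     # 6 6
print(n_combs_2, comb(n, 2))                           # 3 3
print(n_cwr_3, comb(n + 3 - 1, 3))                     # 10 10
```

These counts grow very quickly with the input size, which is one reason the functions return lazy iterators instead of lists.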

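Iterating over a Sequence with Index Values

Problem

You want to iterate over a sequence while keeping track of the index of the item currently being processed.

Solution

The enumerate() function solves this problem.
enumerate() returns an enumerate object, an iterator that yields successive tuples, each holding a count and the corresponding value.
enumerate() also accepts a start argument that sets the initial value of the counter.

Example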

>>> l = ['a', 'b', 'c', 'd']
>>> for i, v in enumerate(l):
...     print(i, v)
...
0 a
1 b
2 c
3 d

>>> data = [(1, 2), (3, 4), (5, 6)]
>>> for n, (x, y) in enumerate(data):
...     print(n, x, y)
...
0 1 2
1 3 4
2 5 6

def parse_data(filename):
    with open(filename, 'rt') as f:
        for lineno, line in enumerate(f, 1):
            fields = line.split()
            try:
                count = int(fields[1])
                print(count)
            except ValueError as e:
                print('Line {}: parse error: {}'.format(lineno, e))
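
parse_data() needs a real file to run. The following self-contained sketch applies the same idea to a hypothetical list of lines, so the effect of start=1 on the reported line numbers can be checked directly. The function name parse_lines and the additional IndexError handling (for lines with fewer than two fields) are assumptions made for this illustration.

```python
def parse_lines(lines):
    """Return (counts, errors) parsed from an iterable of text lines."""
    counts = []
    errors = []
    # start=1 makes lineno match the way people number lines in a file.
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        try:
            counts.append(int(fields[1]))
        except (ValueError, IndexError) as e:
            errors.append('Line {}: parse error: {}'.format(lineno, e))
    return counts, errors


sample = ['foo 10\n', 'bar N/A\n', 'baz 30\n']
counts, errors = parse_lines(sample)
print(counts)  # [10, 30]
print(errors)  # one message, reported for line 2
```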

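Iterating over Multiple Sequences Simultaneously

Problem

You want to iterate over several sequences at the same time, taking one item from each sequence on every step.

Solution

To iterate over several sequences at once, use the zip() function. zip(a, b) creates an iterator that produces tuples (x, y), where x comes from a and y comes from b. Iteration stops as soon as one of the sequences is exhausted, so the number of iterations equals the length of the shortest input. To iterate up to the length of the longest input instead, use itertools.zip_longest().

Example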

>>> xpts = [1, 2, 3, 4, 5]
>>> ypts = [10, 20, 30, 40, 50]
>>> for x, y in zip(xpts, ypts):
...     print(x, y)
...
1 10
2 20
3 30
4 40
5 50

>>> a = [1, 2, 3]
>>> b = ['x', 'y', 'z', 'o']
>>> [(x, y) for x, y in zip(a, b)]
[(1, 'x'), (2, 'y'), (3, 'z')]
>>> dict(zip(a, b))
{1: 'x', 2: 'y', 3: 'z'}

>>> from itertools import zip_longest
>>> [(x, y) for x, y in zip_longest(a, b)]
[(1, 'x'), (2, 'y'), (3, 'z'), (None, 'o')]
>>> [(x, y) for x, y in zip_longest(a, b, fillvalue=0)]
[(1, 'x'), (2, 'y'), (3, 'z'), (0, 'o')]
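
Two further details follow from the description above: zip() is not limited to two inputs, and because it returns an iterator, its result can be consumed only once unless it is stored in a list. A minimal sketch with made-up data:

```python
names = ['x', 'y', 'z']
lows = [1, 2, 3]
highs = [10, 20, 30]

triples = zip(names, lows, highs)   # an iterator, not a list
rows = list(triples)
print(rows)   # [('x', 1, 10), ('y', 2, 20), ('z', 3, 30)]

# The iterator is exhausted now; a second pass yields nothing.
again = list(triples)
print(again)  # []
```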

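Iterating over Items in Separate Containers

Problem

You want to perform the same operation on many objects, but the objects live in different containers, and you would like to avoid writing the loop more than once.

Solution

itertools.chain() accepts one or more iterables as arguments and returns an iterator that yields the items of each iterable in turn, one after another.
itertools.chain() is well suited to performing an operation on all items of several collections (possibly of different iterable types), and it is more elegant than writing a separate loop for each one.

Example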

>>> from itertools import chain
>>> a = [1, 2, 3, 4]
>>> b = ['x', 'y', 'z']
>>> for x in chain(a, b):
...     print(x)
...
1
2
3
4
x
y
z

>>> atuple = (1, 2, 3)
>>> blist = [4, 5, 6]
>>> c = list(chain(atuple, blist))
>>> c
[1, 2, 3, 4, 5, 6]
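
To illustrate the remark about different iterable types, the sketch below chains a set and a list. Concatenating them with + is not possible, while chain() handles the mix without building a combined copy first. The names are hypothetical sample data.

```python
from itertools import chain

active = {'alice'}            # a set
inactive = ['bob', 'carol']   # a list

# The + operator is not defined between a set and a list.
try:
    combined = active + inactive
    concatenated = True
except TypeError:
    concatenated = False
print(concatenated)  # False

# chain() only needs each argument to be iterable.
names = [name.upper() for name in chain(active, inactive)]
print(names)  # ['ALICE', 'BOB', 'CAROL']
```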

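Creating Data Processing Pipelines

Problem

You want to process data iteratively in the style of a data pipeline (similar to Unix pipes).

Solution

Processing data as a pipeline is a good way to handle amounts of data that are too large to process in one go.
Generator functions are a good way to implement a pipeline. The key point to understand is that the yield statement acts as the data producer and the for loop acts as the data consumer. When generators are chained together, each yielded value is passed as a single data item to the next stage of the pipeline.
The advantage of this approach is that each generator function is small and self-contained, which makes the code easy to maintain and extend.
yield from it delegates to the iterator it and simply yields every value that it produces.

Example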

import os
import fnmatch
import gzip
import bz2
import re


def gen_find(filepat, top):
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist, filepat):
            yield os.path.join(path, name)


def gen_opener(filenames):
    for filename in filenames:
        if filename.endswith('.gz'):
            f = gzip.open(filename, 'rt')
        elif filename.endswith('.bz2'):
            f = bz2.open(filename, 'rt')
        else:
            f = open(filename, 'rt')
        yield f
        f.close()


def gen_concatenate(iterators):
    for it in iterators:
        yield from it


def gen_grep(pattern, lines):
    pat = re.compile(pattern)
    for line in lines:
        if pat.search(line):
            yield line


lognames = gen_find('access-log*', 'www')
files = gen_opener(lognames)
lines = gen_concatenate(files)
pylines = gen_grep('(?i)python', lines)
for line in pylines:
    print(line)
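
The pipeline above depends on log files under a www directory. As a sketch that runs without any files, the version below keeps gen_concatenate() and gen_grep() and feeds them hypothetical in-memory "logs". gen_strip() is an extra stage invented for this illustration to show how a new step slots into the chain.

```python
import re


def gen_concatenate(iterators):
    for it in iterators:
        yield from it


def gen_grep(pattern, lines):
    pat = re.compile(pattern)
    for line in lines:
        if pat.search(line):
            yield line


def gen_strip(lines):
    # Hypothetical extra stage: remove the trailing newline of each line.
    for line in lines:
        yield line.rstrip('\n')


# Hypothetical log contents: two in-memory "files" instead of real ones.
log_a = ['GET /index.html\n', 'GET /python/tutorial\n']
log_b = ['GET /Python/faq\n', 'GET /robots.txt\n']

lines = gen_concatenate([log_a, log_b])
pylines = gen_grep('(?i)python', lines)
cleaned = gen_strip(pylines)

# Nothing has been processed yet; work happens only as items are pulled.
result = list(cleaned)
print(result)  # ['GET /python/tutorial', 'GET /Python/faq']
```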

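Flattening a Nested Sequence

Problem

You want to flatten a sequence with several levels of nesting into a single flat list.

Solution

The yield from statement is very useful when you want a generator to call other generators as subroutines.
Without yield from, you have to write an extra for loop.
yield from also plays an important role in concurrent programming based on coroutines and generators.

Example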

#!/usr/bin/env python3

from collections.abc import Iterable


def flatten(items, ignore_types=(str, bytes)):
    for x in items:
        if isinstance(x, Iterable) and not isinstance(x, ignore_types):
            yield from flatten(x, ignore_types)
        else:
            yield x


items = [1, 2, [3, 4, 5, [6]], [7, 8]]
print(list(flatten(items)))
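
For comparison, this is roughly what the same function looks like without yield from, with the extra for loop mentioned in the solution. The second call shows why str and bytes are listed in ignore_types: strings stay whole instead of being split into characters.

```python
from collections.abc import Iterable


def flatten(items, ignore_types=(str, bytes)):
    for x in items:
        if isinstance(x, Iterable) and not isinstance(x, ignore_types):
            # Without "yield from", each value of the nested generator
            # has to be re-yielded by an explicit inner loop.
            for i in flatten(x, ignore_types):
                yield i
        else:
            yield x


items = [1, 2, [3, 4, 5, [6]], [7, 8]]
flat = list(flatten(items))
print(flat)   # [1, 2, 3, 4, 5, 6, 7, 8]

names = list(flatten(['Dave', 'Paula', ['Thomas', 'Lewis']]))
print(names)  # ['Dave', 'Paula', 'Thomas', 'Lewis']
```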

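Iterating in Sorted Order over Merged Sorted Iterables

Problem

You have several sorted sequences and want to merge them into one sorted sequence and iterate over it.

Solution

heapq.merge() merges several already sorted sequences. It returns a generator, which means it does not read all the sequences into memory at once, so using it on very long sequences carries little overhead.
The inputs to heapq.merge() must already be sorted. The function only looks at the front item of each input and emits the smallest one, repeating this until every item of every input has been consumed.

Example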

>>> import heapq
>>> a = [1, 2, 5, 6]
>>> b = [3, 4, 7, 8]
>>> heapq.merge(a, b)
<generator object merge at 0x7f9cf132c270>
>>> list(_)
[1, 2, 3, 4, 5, 6, 7, 8]
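
Both properties mentioned in the solution can be checked with a short sketch: the merge is lazy enough to work on endless inputs, and unsorted input is not detected. itertools.count() and islice() are used here only to build the demonstration.

```python
import heapq
from itertools import count, islice

evens = count(0, 2)   # 0, 2, 4, ... (never ends)
odds = count(1, 2)    # 1, 3, 5, ... (never ends)

# Only the front item of each input is examined at any time.
merged = heapq.merge(evens, odds)
head = list(islice(merged, 6))
print(head)  # [0, 1, 2, 3, 4, 5]

# Unsorted input raises no error; the output is simply not sorted.
bad = list(heapq.merge([3, 1, 2], [0, 5]))
print(bad)   # [0, 3, 1, 2, 5]
```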

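Replacing Infinite while Loops with an Iterator

Problem

Your code uses a while loop to process data and exits the loop when some condition is met. Can this pattern be written with an iterator instead?

Solution

A little-known feature of iter() is that it optionally accepts a zero-argument callable and a sentinel value as inputs. Used this way, it creates an iterator that calls the callable repeatedly until the return value equals the sentinel, at which point the iteration ends.
This approach works well with certain kinds of functions that are called repeatedly.

Example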

>>> CHUNKSIZE = 8192
>>> def reader(s):
...     while True:
...             data = s.recv(CHUNKSIZE)
...             if data == b'':
...                     break
...             pass
...

>>> def reader2(s):
...     for data in iter(lambda: s.recv(CHUNKSIZE), b''):
...             pass
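
The readers above need a connected socket. A sketch of the same pattern that can be run directly uses io.BytesIO as a stand-in for the socket (an assumption made purely for this demonstration) and functools.partial in place of the lambda.

```python
import io
from functools import partial

CHUNKSIZE = 4

# io.BytesIO stands in for a socket or file in this demonstration.
stream = io.BytesIO(b'abcdefghij')

chunks = []
# read(CHUNKSIZE) is called repeatedly until it returns the sentinel b''.
for data in iter(partial(stream.read, CHUNKSIZE), b''):
    chunks.append(data)
print(chunks)  # [b'abcd', b'efgh', b'ij']
```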