How to work with generators from a file for tokenization rather than materializing a list of strings?

I have 2 files:

hyp.txt

It is a guide to action which ensures that the military always obeys the commands of the party
he read the book because he was interested in world history

ref.txt

It is a guide to action that ensures that the military will forever heed Party commands
he was interested in world history because he read the book

I have a function that does some computation to compare lines of the two texts, e.g. line 1 of hyp.txt against line 1 of ref.txt:

def scorer(list_of_tokenized_hyp, list_of_tokenized_ref):
    """
    :type list_of_tokenized_hyp: iter(iter(str))
    :type list_of_tokenized_ref: iter(iter(str))
    """
    for hypline, refline in zip(list_of_tokenized_hyp, list_of_tokenized_ref):
        ...  # do something with the iter(str)
    return score

And this function cannot be changed. However, I can manipulate what I feed into it. So currently I am feeding the files into the function like this:

with open('hyp.txt', 'r') as hypfin, open('ref.txt', 'r') as reffin:
    hyp = [line.split() for line in hypfin]
    ref = [line.split() for line in reffin]
    scorer(hyp, ref)

But by doing this, I have already loaded the whole files and the split strings into memory before they are fed into scorer().

Given that scorer() processes the files line by line, is there a way to avoid materializing the split strings before they go into the function, without changing the scorer() function?

Is there a way to use some kind of generator instead?

I have tried this:

with open('hyp.txt', 'r') as hypfin, open('ref.txt', 'r') as reffin:
    hyp = (h.split() for h in hypfin)
    ref = (r.split() for r in reffin)
    scorer(hyp, ref)

But I am not sure whether the h.split() is already materialized. If it is, why? If not, why not?

If I could change the scorer() function, I could easily add these lines right after the for:

def scorer(list_of_tokenized_hyp, list_of_tokenized_ref):
    for hypline, refline in zip(list_of_tokenized_hyp, list_of_tokenized_ref):
        hypline = hypline.split()
        refline = refline.split()
        # do something with the iter(str)
    return score

But that is not possible for me, since I cannot change the function.

Yes, your example defines two generators:

with open('hyp.txt', 'r') as hypfin, open('ref.txt', 'r') as reffin:
    hyp = (h.split() for h in hypfin)
    ref = (r.split() for r in reffin)
    scorer(hyp, ref)

The split, and the corresponding read of the next line, is done once per for-loop iteration.

Your generator expressions, together with Python 3's zip() (use itertools.izip() instead in Python 2), behave the way you want: they do not read the whole files and build the lists of splits in one go.
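
For Python 2, here is a minimal sketch of the same lazy pairing (assuming the hyp.txt/ref.txt files from above): itertools.izip yields the pairs lazily, whereas Python 2's built-in zip() would consume both generators and build a full list of pairs up front.

# Python 2 sketch: izip yields one (hypline, refline) pair per iteration,
# so each line is read from disk and split only when it is needed
from itertools import izip

with open('hyp.txt', 'r') as hypfin, open('ref.txt', 'r') as reffin:
    hyp = (h.split() for h in hypfin)
    ref = (r.split() for r in reffin)
    for hypline, refline in izip(hyp, ref):
        pass  # one line from each file is split per iteration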

You can see what is going on by substituting a logging version of str.split():

def my_split(s):
    print('my_split(): {!r}'.format(s))
    return s.split()

>>> hypfin = open('hyp.txt', 'r')
>>> reffin = open('ref.txt', 'r')
>>> hyp = (my_split(h) for h in hypfin)    # N.B. my_split() not called here
>>> hyp
<generator object <genexpr> at 0x7fa89ad16b40>
>>> ref = (my_split(r) for r in reffin)    # N.B. my_split() not called here
>>> ref
<generator object <genexpr> at 0x7fa89ad16bd0>

>>> z = zip(hyp, ref)    # N.B. my_split() not called here
>>> z
<zip object at 0x7fa89ad15cc8>

>>> hypline, refline = next(z)
my_split(): 'It is a guide to action which ensures that the military always obeys the commands of the party\n'
my_split(): 'It is a guide to action that ensures that the military will forever heed Party commands\n'
>>> hypline, refline = next(z)
my_split(): 'he read the book because he was interested in world history\n'
my_split(): 'he was interested in world history because he read the book\n'
>>> hypline, refline = next(z)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

From the output of my_split() you can see that hyp and ref really are generators, consuming their input only when needed. z is a zip object, which likewise does not consume any input until it is accessed. The for loop is simulated here with next() to show that each iteration consumes only one line of input from each file.
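
As an aside, the same lazy pipeline can also be written with map(), which in Python 3 returns an iterator as well (in Python 2 it would build a list); this is just an equivalent sketch of the generator-expression version above:

with open('hyp.txt', 'r') as hypfin, open('ref.txt', 'r') as reffin:
    # map() in Python 3 is lazy: str.split is applied to each line
    # only when scorer() iterates over hyp and ref
    hyp = map(str.split, hypfin)
    ref = map(str.split, reffin)
    scorer(hyp, ref)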