How to work with generators from a file for tokenization rather than materializing a list of strings?
I have 2 files:
hyp.txt
It is a guide to action which ensures that the military always obeys the commands of the party
he read the book because he was interested in world history
ref.txt
It is a guide to action that ensures that the military will forever heed Party commands
he was interested in world history because he read the book
I have a function that does some computation to compare lines of the two texts, e.g. line 1 of hyp.txt against line 1 of ref.txt:
def scorer(list_of_tokenized_hyp, list_of_tokenized_ref):
    """
    :type list_of_tokenized_hyp: iter(iter(str))
    :type list_of_tokenized_ref: iter(iter(str))
    """
    for hypline, refline in zip(list_of_tokenized_hyp, list_of_tokenized_ref):
        # do something with the iter(str)
    return score
This function cannot be changed. However, I can manipulate what I feed to it. So currently I'm feeding the files into the function like this:
with open('hyp.txt', 'r') as hypfin, open('ref.txt', 'r') as reffin:
    # Both list comprehensions read and split every line up front.
    hyp = [line.split() for line in hypfin]
    ref = [line.split() for line in reffin]
    scorer(hyp, ref)
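But by doing so, I have already loaded the whole files, and all the split strings, into memory before feeding them into scorer().
Knowing that scorer() processes the files line by line, is there a way to avoid materializing the split strings before feeding them in, without changing scorer()?
Is there a way to feed in some sort of generator instead?
I've tried this: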
with open('hyp.txt', 'r') as hypfin, open('ref.txt', 'r') as reffin:
    hyp = (h.split() for h in hypfin)
    ref = (r.split() for r in reffin)
    scorer(hyp, ref)
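But I'm not sure whether h.split() has already been materialized at that point. If it has, why? If not, why not?
If I could change scorer(), I could easily add these two lines right after the for statement: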
def scorer(list_of_tokenized_hyp, list_of_tokenized_ref):
    for hypline, refline in zip(list_of_tokenized_hyp, list_of_tokenized_ref):
        hypline = hypline.split()
        refline = refline.split()
        # do something with the iter(str)
    return score
But that isn't possible in my case, since I can't change the function.
Yes, your example defines two generators:
with open('hyp.txt', 'r') as hypfin, open('ref.txt', 'r') as reffin:
    hyp = (h.split() for h in hypfin)
    ref = (r.split() for r in reffin)
    scorer(hyp, ref)
and the split, as well as the corresponding read of the next line, is performed once per iteration of the for loop.
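If the generator expressions need to be reused, the same lazy behavior can be wrapped in a generator function. What follows is only a minimal sketch, not something from your code: the helper name tokenized_lines is hypothetical, and it assumes plain whitespace tokenization with str.split() as above. The file is opened when the first item is requested and closed when the generator is exhausted or closed.

```python
import os
import tempfile


def tokenized_lines(path):
    """Yield one list of tokens per line, reading the file lazily.

    Hypothetical helper for illustration; tokenization is str.split().
    """
    with open(path, 'r') as fin:
        for line in fin:
            yield line.split()


# Small demonstration with a throwaway file (one line borrowed from hyp.txt).
with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, 'hyp.txt')
    with open(demo_path, 'w') as fout:
        fout.write('he read the book because he was interested in world history\n')

    gen = tokenized_lines(demo_path)  # nothing has been opened or read yet
    first_line_tokens = next(gen)     # reads and splits exactly one line
    gen.close()                       # leaves the with block, closing the file
```

With such a helper, the call would presumably become scorer(tokenized_lines('hyp.txt'), tokenized_lines('ref.txt')), with no surrounding with block needed.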
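Your generator expressions, together with Python 3's zip() (use itertools.izip() instead in Python 2), behave the way you want: they do not read the whole file and build the lists of split tokens in one go.
You can see what is happening by substituting a logging version of str.split():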
def my_split(s):
    print('my_split(): {!r}'.format(s))
    return s.split()
>>> hypfin = open('hyp.txt', 'r')
>>> reffin = open('ref.txt', 'r')
>>> hyp = (my_split(h) for h in hypfin) # N.B. my_split() not called here
>>> hyp
<generator object <genexpr> at 0x7fa89ad16b40>
>>> ref = (my_split(r) for r in reffin) # N.B. my_split() not called here
>>> ref
<generator object <genexpr> at 0x7fa89ad16bd0>
>>> z = zip(hyp, ref) # N.B. my_split() not called here
>>> z
<zip object at 0x7fa89ad15cc8>
>>> hypline, refline = next(z)
my_split(): 'It is a guide to action which ensures that the military always obeys the commands of the party\n'
my_split(): 'It is a guide to action that ensures that the military will forever heed Party commands\n'
>>> hypline, refline = next(z)
my_split(): 'he read the book because he was interested in world history\n'
my_split(): 'he was interested in world history because he read the book\n'
>>> hypline, refline = next(z)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
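From the output of my_split() you can see that hyp and ref are indeed generators, consuming input only when it is needed. z is a zip object, which likewise consumes no input until it is accessed. The for loop is simulated with next() to demonstrate that each iteration consumes just one line of input from each file.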
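To check the same thing end to end without a REPL, here is a self-contained sketch for Python 3. The scorer() body is a stand-in I made up (it counts tokens that match position by position), since the real function isn't shown; the counting wrapper counted_lines and the bookkeeping variables are hypothetical too. It records how many lines had been pulled from each file iterator at the moment each pair reaches the scorer.

```python
import os
import tempfile

HYP = ('It is a guide to action which ensures that the military always '
       'obeys the commands of the party\n'
       'he read the book because he was interested in world history\n')
REF = ('It is a guide to action that ensures that the military will '
       'forever heed Party commands\n'
       'he was interested in world history because he read the book\n')

lines_read = {'hyp': 0, 'ref': 0}
reads_seen_by_scorer = []


def counted_lines(fin, name):
    """Yield lines from fin, counting how many have been pulled so far."""
    for line in fin:
        lines_read[name] += 1
        yield line


def scorer(list_of_tokenized_hyp, list_of_tokenized_ref):
    """Stand-in scorer (an assumption): counts position-wise equal tokens."""
    score = 0
    for hypline, refline in zip(list_of_tokenized_hyp, list_of_tokenized_ref):
        # Record how many lines had been pulled when this pair arrived.
        reads_seen_by_scorer.append((lines_read['hyp'], lines_read['ref']))
        score += sum(h == r for h, r in zip(hypline, refline))
    return score


with tempfile.TemporaryDirectory() as tmpdir:
    hyp_path = os.path.join(tmpdir, 'hyp.txt')
    ref_path = os.path.join(tmpdir, 'ref.txt')
    with open(hyp_path, 'w') as fout:
        fout.write(HYP)
    with open(ref_path, 'w') as fout:
        fout.write(REF)

    # The generators must be consumed while the files are still open,
    # so scorer() is called inside the with block.
    with open(hyp_path, 'r') as hypfin, open(ref_path, 'r') as reffin:
        hyp = (h.split() for h in counted_lines(hypfin, 'hyp'))
        ref = (r.split() for r in counted_lines(reffin, 'ref'))
        score = scorer(hyp, ref)

print(reads_seen_by_scorer)  # [(1, 1), (2, 2)] on Python 3
```

If the list-comprehension version were used instead, every pair would be recorded as (2, 2), because both files would have been read completely before scorer() was even called.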