How to parse a file sentence by sentence in Python
I need to read a large number of big text files.
For each file, I need to open it and read the text sentence by sentence.
Most of the approaches I've found read the file line by line.
How can this be done in Python?
If the file has a large number of lines, you can build a generator using the yield statement:
def read(filename):
    # Iterate over the file object directly so the whole file
    # is never loaded into memory at once.
    with open(filename, "r") as file:
        for line in file:
            for word in line.split():
                yield word

for word in read("sample.txt"):
    print(word)
This returns all the words of every line in the file, one at a time.
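Note that the question asks for sentences rather than words. As a minimal sketch, assuming a sentence ends with ., ! or ? followed by whitespace (a naive rule that will mis-split abbreviations such as "Mr."), you can buffer lines and split with a regular expression:

import re

def read_sentences(filename):
    # Naive splitter: assumes a sentence ends with ., ! or ?
    # followed by whitespace. Abbreviations like "Mr." will be
    # split incorrectly; see the NLTK answer below for a more
    # robust tokenizer.
    buffer = ""
    with open(filename, "r") as file:
        for line in file:
            buffer += line
            parts = re.split(r'(?<=[.!?])\s+', buffer)
            # The last part may be an unfinished sentence; keep it buffered.
            buffer = parts.pop()
            for sentence in parts:
                yield sentence
    if buffer.strip():
        yield buffer.strip()

for sentence in read_sentences("sample.txt"):
    print(sentence)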
If you want sentence tokenization, nltk is probably the quickest way. http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.punkt will get you quite far.
E.g., the code from the docs:
>>> import nltk.data
>>> text = '''
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries. And sometimes sentences
... can start with non-capitalized words. i is a good variable
... name.
... '''
>>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
>>> print('\n-----\n'.join(sent_detector.tokenize(text.strip())))
Punkt knows that the periods in Mr. Smith and Johann S. Bach
do not mark sentence boundaries.
-----
And sometimes sentences
can start with non-capitalized words.
-----
i is a good variable
name.
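To tie this back to the original problem of many large files, here is a sketch, assuming the Punkt model has already been downloaded (e.g. with nltk.download('punkt')), that loads the tokenizer once and yields sentences from each file:

import nltk.data

# Load the Punkt model once and reuse it for every file.
# Assumes the model is installed, e.g. via nltk.download('punkt').
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

def punkt_sentences(filename):
    # Punkt uses surrounding context to resolve abbreviations,
    # so read the whole file before tokenizing.
    with open(filename, "r") as file:
        text = file.read()
    for sentence in sent_detector.tokenize(text.strip()):
        yield sentence

for sentence in punkt_sentences("sample.txt"):
    print(sentence)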