List creation produces the wrong number of items

Question

I am building a list of documents made up of tuples, where each tuple consists of a list of tuples and a string, so it looks like this:

[([('NOUN', 'ADP'), ('ADP', 'NOUN'), ('NOUN', 'PROPN'), ('PROPN', 'ADJ'), ('ADJ', 'DET')], 'M'),
 ([('NOUN', 'ADP'), ('ADP', 'NOUN'), ('NOUN', 'PROPN'), ('PROPN', 'ADJ'), ('ADJ', 'DET')], 'F'), ...]

I am using nltk to generate the list:

from nltk.corpus import PlaintextCorpusReader
corpus = PlaintextCorpusReader('C:\CorpusData\Polit_Speeches_by_Gender_POS', '.*\.txt')
documents = [(list(ngrams(corpus.words(fileid), 2)), gender)
    for gender in [f[47] for f in corpus.fileids()]
    for fileid in corpus.fileids()]

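The problem is that len(corpus.fileids()) is 84 (which is correct), but len(documents) is 7056. So somehow I have managed to square the number of documents. I want the list to have only 84 items.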

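I noticed that documents[0] and documents[84] are the same (and likewise documents[1] and documents[85], and so on). I could of course slice the full 7056-item list, but that would not explain anything... I am new to Python and to programming, so any help would be greatly appreciated.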

Answer 1

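If I am reading your program correctly, you are trying to store each document's bigram list in a tuple together with the document's 'gender', which is the character at index 47 of the file ID.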

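The list comprehension you use to build documents iterates first over the inner list comprehension (the genders) and then over corpus.fileids(). When a Python list comprehension has two for clauses, it iterates over the whole second iterable for each value of the first. With 84 genders and 84 file IDs, that gives 84 × 84 = 7056 items. We can see this with an example: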

>>> print([(a, b) for a in [1, 2] for b in [1, 2]])
[(1, 1), (1, 2), (2, 1), (2, 2)]
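
To make that behaviour explicit, here is a minimal sketch showing that a comprehension with two for clauses gives the same result as two nested for loops, so the output has len(first) * len(second) items. The names first, second, pairs_comprehension and pairs_loops are only illustrative.

```python
first = [1, 2]
second = ["a", "b", "c"]

# A comprehension with two for clauses.
pairs_comprehension = [(a, b) for a in first for b in second]

# The equivalent nested loops: the inner loop runs in full
# for every value produced by the outer loop.
pairs_loops = []
for a in first:
    for b in second:
        pairs_loops.append((a, b))

print(pairs_comprehension == pairs_loops)  # True
print(len(pairs_comprehension))            # 6, i.e. len(first) * len(second)
```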

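Instead, in this case it looks like we can avoid the double iteration by applying the [47] index directly to each file ID we pull from corpus.fileids(). That way each fileid is considered only once.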

documents = [(list(ngrams(corpus.words(fileid), 2)), fileid[47]) for fileid in corpus.fileids()]
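
The difference between the two patterns can be checked without nltk or the corpus. The sketch below uses made-up file IDs and assumes, purely for illustration, that the gender letter sits at index 0 instead of index 47; fileids, GENDER_INDEX, buggy and fixed are hypothetical names.

```python
# Hypothetical file IDs; the gender letter is assumed to be at index 0 here,
# not at index 47 as in the original corpus.
fileids = ["M_speech1.txt", "F_speech2.txt", "M_speech3.txt"]
GENDER_INDEX = 0

# Original pattern: two for clauses, so every gender is paired with every file.
buggy = [(fileid, gender)
         for gender in [f[GENDER_INDEX] for f in fileids]
         for fileid in fileids]

# Fixed pattern: a single for clause, with the gender read from the same fileid.
fixed = [(fileid, fileid[GENDER_INDEX]) for fileid in fileids]

print(len(buggy))  # 9, i.e. 3 * 3
print(len(fixed))  # 3
# The same file shows up again after len(fileids) entries.
print(buggy[0][0] == buggy[len(fileids)][0])  # True
```

Note that in the buggy list the first len(fileids) entries all carry the gender of the first file, so slicing the long list down to 84 items would not give correctly labelled documents either.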

The whole program then becomes:

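from nltk.corpus import PlaintextCorpusReader
from nltk.util import ngrams
# Raw strings keep the backslashes in the Windows path and in the regex literal.
corpus = PlaintextCorpusReader(r'C:\CorpusData\Polit_Speeches_by_Gender_POS', r'.*\.txt')
# One (bigram list, gender) tuple per file; the gender is the character at index 47 of the file ID.
documents = [(list(ngrams(corpus.words(fileid), 2)), fileid[47]) for fileid in corpus.fileids()]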
