Textual analysis with python - program stalls after 8 runs

Question

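I want to run a textual analysis on a large number of text files (>50,000 files), some of which are in HTML script. My program (below) iterates over these files, opens each one in turn, analyzes the content with the NLTK module, writes the output to a CSV file, and then moves on to the next file.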

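The program runs fine for individual files, but the loop almost stalls after the 8th run, even though the 9th file to analyze is no larger than the 8th. For example, the first eight iterations took 10 minutes in total, while the 9th iteration took 45 minutes. The 10th took even longer than 45 minutes (and that file is much smaller than the first).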

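I am sure the program can be optimized further, as I am still relatively new to Python, but I don't understand why it becomes so slow after the 8th run. Any help would be appreciated. Thanks!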

#import necessary modules
import urllib, csv, re, nltk
from string import punctuation
from bs4 import BeautifulSoup
import glob

#Define bags of words (There are more variable, ie word counts, that are calculated)
adaptability=['adaptability', 'flexibility']

csvfile=open("test.csv", "w", newline='', encoding='cp850', errors='replace')
writer=csv.writer(csvfile)
for filename in glob.glob('*.txt'):

    ###Open files and arrange them so that they are ready for pre-processing
    review=open(filename, encoding='utf-8', errors='ignore').read()
    soup=BeautifulSoup(review)
    text=soup.get_text()

    from nltk.stem import WordNetLemmatizer
    wnl=WordNetLemmatizer()

    adaptability_counts=[]
    adaptability_counter=0
    review_processed=text.lower().replace('\r',' ').replace('\t',' ').replace('\n',' ').replace('. ', ' ').replace(';',' ').replace(', ',' ')
    words=review_processed.split(' ')
    word_l1=[word for word in words if word not in stopset]
    word_l=[x for x in word_l1 if x != ""]
    word_count=len(word_l)
    for word in words:
       wnl.lemmatize(word)
       if word in adaptability:
         adaptability_counter=adaptability_counter+1
    adaptability_counts.append(adaptability_counter)

    #I then repeat the analysis with 2 subsections of the text files
    #(eg. calculate adaptability_counts for Part I only)

    output=zip(adaptability_counts)
    writer=csv.writer(open('test_10.csv','a',newline='', encoding='cp850', errors='replace'))
    writer.writerows(output)
    csvfile.flush()

Answer 1

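Once the files are opened, they are never closed. My guess is that you are running out of memory and that it takes so long because your machine has to swap data to and from the page file (on disk). Instead of just calling open(), you need to either close() the file when you are done with it or use the with open construct, which closes the file automatically when you are done. See this page for more details: http://effbot.org/zone/python-with-statement.htm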

If it were me, I would change this line:

review=open(filename, encoding='utf-8', errors='ignore').read()

to this:

with open(filename, encoding='utf-8', errors='ignore') as f:
    review = f.read()
    ...

and make sure to indent appropriately. The code that runs while the file is open needs to be indented inside the with block.
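As a rough sketch of the same idea applied to both the input files and the CSV output: the helper names `read_text` and `write_lengths` are hypothetical, and the character count is only a stand-in for the real analysis. The output file is opened once, outside the loop, so no new handle is created per iteration.

```python
import csv
import glob

def read_text(path):
    # The with block closes the file as soon as the block exits, even on error.
    with open(path, encoding="utf-8", errors="ignore") as f:
        return f.read()

def write_lengths(pattern, out_path):
    # Hypothetical helper: writes one row (filename, character count) per input file.
    # The CSV file is opened once and closed automatically at the end.
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for filename in sorted(glob.glob(pattern)):
            writer.writerow([filename, len(read_text(filename))])
```

Calling `write_lengths('*.txt', 'test.csv')` would then leave no file open after it returns.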

Answer 2

Since the accepted answer did not quite solve your problem, here is a follow-up:

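You have a list, adaptability, in which you look up every word of your input. Never look up words in a list! Replace the list with a set and you should see a huge improvement. (If you use the list to count individual words, replace it with collections.Counter or nltk's FreqDist.) If your adaptability list grows with every file you read (does it? should it?), that alone is definitely enough to cause your problem.
But there may be more than one culprit. You left out a lot of code, so there is no telling which other data structures grow with each file you see, or whether that is intended. What is clear is that your code is "quadratic" and gets slower as the data gets larger, not because of memory size but because it needs more steps.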
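A minimal sketch of the suggested replacement, assuming the word bag from the question. The function names `count_hits` and `count_each` are made up for illustration:

```python
from collections import Counter

# The bag of words from the question, stored as a set instead of a list.
ADAPTABILITY = frozenset(["adaptability", "flexibility"])

def count_hits(words, vocabulary=ADAPTABILITY):
    # Membership in a set is O(1) on average; in a list it is O(len(list)).
    return sum(1 for word in words if word in vocabulary)

def count_each(words, vocabulary=ADAPTABILITY):
    # Counter tallies every word in one pass; keep only the vocabulary terms seen.
    counts = Counter(words)
    return {term: counts[term] for term in vocabulary if counts[term]}
```

With only two terms the difference is negligible, but a set keeps each lookup cheap if the vocabulary grows to thousands of entries.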

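Don't bother switching to arrays and CountVectorizer; you would only postpone the problem a little. Figure out how to process each file in constant time. If your algorithm does not need to collect words from more than one file, the quickest solution is to run it on each file separately (that is not hard to automate).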
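One way to read "process each file separately" is to keep all per-file state inside a function, so nothing carries over from one file to the next. The sketch below is an assumption about how that could look, not the asker's actual pipeline: `analyze_file` and `analyze_all` are hypothetical names, the regex tokenizer is a simplification of the question's chain of replace() calls, and HTML stripping and lemmatization are left out.

```python
import glob
import re

TERMS = frozenset(["adaptability", "flexibility"])  # word bag from the question
TOKEN = re.compile(r"[a-z]+")  # simplified tokenizer (assumption)

def analyze_file(path, terms=TERMS):
    # Everything created here is local, so it can be freed when the function returns.
    with open(path, encoding="utf-8", errors="ignore") as f:
        words = TOKEN.findall(f.read().lower())
    hits = sum(1 for w in words if w in terms)
    return [path, len(words), hits]

def analyze_all(pattern):
    # Yield one row at a time instead of accumulating results across files.
    for path in sorted(glob.glob(pattern)):
        yield analyze_file(path)
```

Each row from `analyze_all('*.txt')` can be passed straight to a csv writer, so the cost per file depends only on that file's size.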

Tags: python, nltk