Using FreqDist and writing to CSV

Question

I am trying to use nltk and pandas to find the top 100 words in one CSV and list them in a new CSV. I can plot the words, but when I write to CSV I get

word | count
  52 |    7       <- This is current CSV output

I'm not sure where I went wrong and am looking for some guidance.

My code is

import csv
import nltk
import pandas as pd

words = []
with open('SECParse2.csv', encoding = 'utf-8') as csvfile:
    reader = csv.reader(csvfile)
    next(reader)
    freq_all = nltk.FreqDist()

    for row in reader:
        note = row[1]
        tokens = [t for t in note.split()] 

        freq = nltk.FreqDist(tokens) 
        fd_t100 = freq.most_common(100)
        freq_all.update(tokens)

    freq_all.plot(100, cumulative=False)

df3 = pd.DataFrame(freq_all,columns=['word','count'], index=[1])
df3.to_csv("./SECParse3.csv", sep=',',index=False)

I suspect the problem is my df3 line, but I can't get the CSV to show the correct distribution.

I also tried

df3 = pd.DataFrame(fd_t100,columns=['word','count'])

Some sample content from CSV2:


filename                    text                                     
AAL_0000004515_10Q_20200331 generally industry may affected 
AAL_0000004515_10Q_20200331 material decrease demand international air travel
AAPL_0000320193_10Q_2020032 february following initial outbreak virus china 
AAP_0001158449_10Q_20200418 restructuring cost cost primarily relating early

Answer 1

Here you go. The code is fairly compressed; feel free to expand it if you like.

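First, make sure the source file really is a CSV file (that is, comma-separated). I copied and pasted the sample text from the question into a text file and added commas (shown below).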
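
If the source really were whitespace-aligned like the sample in the question, adding the commas by hand could be replaced by something like the following. This is only a sketch under that assumption: the `raw` string is a hypothetical stand-in for the file, and it relies on the filename column containing no spaces.

```python
import csv
import io

# Hypothetical whitespace-aligned sample, laid out like the one in the question
raw = """filename                    text
AAL_0000004515_10Q_20200331 generally industry may affected
AAPL_0000320193_10Q_2020032 february following initial outbreak virus china
"""

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
for line in raw.splitlines():
    if not line.strip():
        continue  # ignore blank lines
    # Split on the first run of whitespace only: filename, then the rest as text
    filename, text = line.split(maxsplit=1)
    writer.writerow([filename, text.strip()])

csv_text = out.getvalue()
print(csv_text)
```

The resulting text can be saved to disk and then read with `pd.read_csv` as usual.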

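Breaking the code down line by line:

Read the CSV file into a DataFrame
Extract the text column, flatten it into a single string of words, and tokenize it
Extract the 100 most common words
Write the result to a new CSV file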

Code:

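import pandas as pd
from nltk import FreqDist, word_tokenize

# Read the source CSV from the question (expects 'filename' and 'text' columns)
df = pd.read_csv('./SECParse2.csv')
# Join every row of the text column into one string, then tokenize it
# (word_tokenize needs NLTK's tokenizer data to be downloaded beforehand)
words = word_tokenize(' '.join(df['text'].astype(str)))
# List of (word, count) tuples for the 100 most frequent words
common = FreqDist(words).most_common(100)
# One row per tuple, written without the index column
pd.DataFrame(common, columns=['word', 'count']).to_csv('words_out.csv', index=False)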

Sample input:

filename,text
AAL_0000004515_10Q_20200331,generally industry may affected
AAL_0000004515_10Q_20200331,material decrease demand international air travel
AAPL_0000320193_10Q_2020032,february following initial outbreak virus china
AAP_0001158449_10Q_20200418,restructuring cost cost primarily relating early

Output:

word,count
cost,2
generally,1
industry,1
may,1
affected,1
material,1
decrease,1
...
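
As for why the original df3 line produced a single row: FreqDist is a dict-like subclass of collections.Counter, and when pd.DataFrame receives a dict together with columns=['word', 'count'], it treats the dict keys as column names. It therefore most likely kept only the entries for the literal tokens "word" and "count", which would explain the 52 and 7. The second attempt used fd_t100, which is overwritten on every pass through the loop and so only reflects the last row. Below is a minimal sketch of the question's loop with the DataFrame built from most_common instead. Counter stands in for nltk.FreqDist so the snippet runs without NLTK installed, and the in-memory sample is a hypothetical stand-in for SECParse2.csv.

```python
import csv
import io
from collections import Counter

import pandas as pd

# Hypothetical stand-in for SECParse2.csv
sample = io.StringIO(
    "filename,text\n"
    "AAL_0000004515_10Q_20200331,generally industry may affected\n"
    "AAP_0001158449_10Q_20200418,restructuring cost cost primarily relating early\n"
)

freq_all = Counter()  # nltk.FreqDist() could be used here the same way
reader = csv.reader(sample)
next(reader)  # skip the header row
for row in reader:
    freq_all.update(row[1].split())  # accumulate counts across all rows

# most_common returns a list of (word, count) tuples: one DataFrame row each
top = freq_all.most_common(100)
df3 = pd.DataFrame(top, columns=["word", "count"])
print(df3.to_csv(index=False))
```

In the original script, the last line would become df3.to_csv("./SECParse3.csv", index=False), and the index=[1] argument is no longer needed.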
