Using FreqDist and writing to CSV
I am trying to use nltk and pandas to find the top 100 words in another CSV and list them in a new CSV. I am able to plot the words, but when I write them to the CSV I get
word | count
52 | 7 <- This is current CSV output
I'm not sure where I'm going wrong and am looking for some guidance.
My code is:
    import csv
    import nltk
    import pandas as pd

    words = []
    with open('SECParse2.csv', encoding='utf-8') as csvfile:
        reader = csv.reader(csvfile)
        next(reader)
        freq_all = nltk.FreqDist()
        for row in reader:
            note = row[1]
            tokens = [t for t in note.split()]
            freq = nltk.FreqDist(tokens)
            fd_t100 = freq.most_common(100)
            freq_all.update(tokens)
    freq_all.plot(100, cumulative=False)
    df3 = pd.DataFrame(freq_all, columns=['word', 'count'], index=[1])
    df3.to_csv("./SECParse3.csv", sep=',', index=False)
I suspect the problem is my df3 line, but I can't get it to show the correct distribution in the CSV.
I also tried
    df3 = pd.DataFrame(fd_t100, columns=['word', 'count'])
Some sample content from CSV2:
filename text
AAL_0000004515_10Q_20200331 generally industry may affected
AAL_0000004515_10Q_20200331 material decrease demand international air travel
AAPL_0000320193_10Q_2020032 february following initial outbreak virus china
AAP_0001158449_10Q_20200418 restructuring cost cost primarily relating early
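Here you go. The code is fairly compressed, so feel free to expand it if you like.
First, make sure the source file really is a CSV file (i.e. comma-separated). I copied and pasted the sample text from the question into a text file and added the commas (as shown in the sample input below).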
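If the source file turns out to be whitespace-separated like the sample in the question, the commas can be added with a short script instead of by hand. This is only a rough sketch: the helper name `whitespace_to_csv` is made up for illustration, and it assumes the first column (the filename) never contains spaces.

```python
import csv
import io


def whitespace_to_csv(src, dst):
    """Copy 'filename text...' lines from src to dst as two-column CSV rows."""
    writer = csv.writer(dst)
    for line in src:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first run of whitespace only, so the text keeps its spaces.
        parts = line.split(maxsplit=1)
        filename = parts[0]
        text = parts[1] if len(parts) > 1 else ""
        writer.writerow([filename, text])


# In-memory stand-ins for the real files, using rows from the question.
sample = io.StringIO(
    "filename text\n"
    "AAL_0000004515_10Q_20200331 generally industry may affected\n"
    "AAP_0001158449_10Q_20200418 restructuring cost cost primarily relating early\n"
)
out = io.StringIO()
whitespace_to_csv(sample, out)
print(out.getvalue())
```

With real files, pass two handles from `open(...)` instead of the `StringIO` objects, and open the output file with `newline=''` as the csv module recommends.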
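Breaking the code down line by line:
- Read the CSV file into a DataFrame
- Extract the text column, flatten it into a single string of words, and tokenize it
- Extract the 100 most common words
- Write the results to a new CSV file
Code: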
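    import pandas as pd
    from nltk import FreqDist, word_tokenize

    # word_tokenize needs NLTK's tokenizer data to be downloaded first.
    df = pd.read_csv('./SECParse3.csv')
    # Join every row of the text column into one string, then tokenize it.
    words = word_tokenize(' '.join([line for line in df['text'].to_numpy()]))
    # The 100 most frequent tokens as (word, count) pairs.
    common = FreqDist(words).most_common(100)
    pd.DataFrame(common, columns=['word', 'count']).to_csv('words_out.csv', index=False)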
Sample input:
filename,text
AAL_0000004515_10Q_20200331,generally industry may affected
AAL_0000004515_10Q_20200331,material decrease demand international air travel
AAPL_0000320193_10Q_2020032,february following initial outbreak virus china
AAP_0001158449_10Q_20200418,restructuring cost cost primarily relating early
Output:
word,count
cost,2
generally,1
industry,1
may,1
affected,1
material,1
decrease,1
...
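As for why the original attempt wrote a single `52 | 7` row: my best guess is that pandas treats a FreqDist like any other dict, so `columns=['word', 'count']` selects the entries whose keys are literally 'word' and 'count', i.e. how often those two tokens occur in the text. The second attempt, with `fd_t100`, has the right shape but, if it sits inside the loop, only covers the last row read. The sketch below reproduces the first effect. It uses `collections.Counter` as a stand-in for `nltk.FreqDist` (FreqDist is a Counter subclass) so that it runs without NLTK, and the third input row is invented purely so that the tokens 'word' and 'count' exist.

```python
from collections import Counter

import pandas as pd

# Counter stands in for nltk.FreqDist here (FreqDist subclasses Counter).
rows = [
    "generally industry may affected",
    "restructuring cost cost primarily relating early",
    "word count word",  # hypothetical row so the tokens 'word' and 'count' exist
]

freq_all = Counter()
for note in rows:
    freq_all.update(note.split())

# Passing the counter itself: pandas looks up the keys 'word' and 'count',
# so the single row holds the counts of those two tokens, not a distribution.
wrong = pd.DataFrame(freq_all, columns=["word", "count"], index=[1])
print(wrong)

# Passing the list of (word, count) pairs gives one row per word.
right = pd.DataFrame(freq_all.most_common(100), columns=["word", "count"])
print(right)
```

So keeping the original loop and changing only the df3 line to use `freq_all.most_common(100)` should also produce the expected CSV.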