How can I count word occurrences in an NLP corpus and plot them?

I am doing some NLP work.

My original dataframe is df_all:

Index    Text
1        Hi, Hello, this is mike, I saw your son playing in the garden...
2        Besides that, sometimes my son studies math for fun...
3        I cannot believe she said that. she always says such things...

I converted the text into a bag-of-words (BOW) dataframe.

So my dataframe df_BOW now looks like this:

Index    Hi   This   my   son   play   garden ...
1        3    6      3    0     2       4
2        0    2      4    4     3       1
3        0    2      0    7     3       0

I want to find out how many times each word occurs in the corpus:

cnt_pro = df_all['Text'].value_counts()
plt.figure(figsize=(12,4))
sns.barplot(cnt_pro.index, cnt_pro.values, alpha=0.8)
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Word', fontsize=12)
plt.xticks(rotation=90)
plt.show();

I expected a chart of the most frequent words, but the chart I get doesn't show any useful information.

How can I fix this?

You can use collections.Counter to count the words:

import pandas as pd
import seaborn as sns
from collections import Counter
import re
import matplotlib.pyplot as plt

data = ['Hi, Hello, this is mike, I saw your son playing in the garden', 'Besides that, sometimes my son studies math for fun', 'I cannot believe she said that. she always says such things']
df = pd.DataFrame(data, columns=['text'])

df['text_split'] = df['text'].apply(lambda x: re.findall(r'\w+', x)) #split sentences to words with regex
words = [item.lower() for sublist in df['text_split'].tolist() for item in sublist] # flattens the list of lists and lowers the words

counted_words = Counter(words)
counted_df = pd.DataFrame(counted_words.items(), columns=['word', 'count']).sort_values('count', ascending=False).reset_index(drop=True) #create new df from counter

plt.figure(figsize=(12,4))
sns.barplot(data=counted_df[:10], x='word', y='count', alpha=0.8) #plot only the top 10 by slicing the df
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Word', fontsize=12)
plt.xticks(rotation=90)
plt.show()

This produces a bar chart of the 10 most frequent words.
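As for why the original attempt shows nothing useful: `df_all['Text'].value_counts()` counts how often each whole row of text occurs, not each word, so every document ends up with a count of 1. A minimal sketch, assuming a `df_all` rebuilt from the sample sentences in the question:

```python
import pandas as pd

# Illustrative stand-in for the question's df_all (same three sentences).
df_all = pd.DataFrame(
    {
        "Text": [
            "Hi, Hello, this is mike, I saw your son playing in the garden",
            "Besides that, sometimes my son studies math for fun",
            "I cannot believe she said that. she always says such things",
        ]
    }
)

# value_counts() treats each full string as a single value,
# so each of the three distinct sentences is counted exactly once.
row_counts = df_all["Text"].value_counts()
print(row_counts.tolist())  # [1, 1, 1]
```

The text has to be split into words first, as in the code above, before counting gives per-word frequencies.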

I'm not sure how you created df_BOW, but it is not an ideal format for plotting. Starting again from the original text:

df_all = pd.DataFrame(
    {
        "text": [
            "Hi, Hello, this is mike, I saw your son playing in the garden",
            "Besides that, sometimes my son studies math for fun",
            "I cannot believe she said that. she always says such things",
        ]
    }
)

Similarly, we can use a regular expression to extract the words, but this time using only pandas methods:

counts = df_all["text"].str.findall(r"(\w+)").explode().value_counts()

counts is a Series whose index is the words and whose values are the counts:

son          2
she          2
I            2
...
says         1
garden       1
math         1
Name: text, dtype: int64
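Note that the output above is case-sensitive (`I` is counted as written). If case-insensitive counts are preferred, one possible variant is to lowercase the text before extracting words. This is a sketch, not part of the original answer; the name `counts_ci` is made up for illustration:

```python
import pandas as pd

df_all = pd.DataFrame(
    {
        "text": [
            "Hi, Hello, this is mike, I saw your son playing in the garden",
            "Besides that, sometimes my son studies math for fun",
            "I cannot believe she said that. she always says such things",
        ]
    }
)

# Lowercase first so that e.g. "She" and "she" fall into the same bucket,
# then extract word tokens, flatten to one word per row, and count.
counts_ci = (
    df_all["text"]
    .str.lower()
    .str.findall(r"\w+")
    .explode()
    .value_counts()
)

print(counts_ci["i"])    # 2
print(counts_ci["she"])  # 2
```

In this small sample no counts actually merge, but in a larger corpus sentence-initial capitals would otherwise be counted separately. The rest of the answer continues with the case-sensitive `counts`.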

Then plot:

fig, ax = plt.subplots(figsize=(6,5))
sns.barplot(x=counts.index, y=counts.values, ax=ax)
ax.set_ylabel('Number of Occurrences', fontsize=12)
ax.set_xlabel('Word', fontsize=12)
ax.xaxis.set_tick_params(rotation=90)

If you only want the N most frequent words, you can use nlargest like this:

top_10 = counts.nlargest(10)

Then plot it the same way.
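If you would rather keep working from df_BOW, the corpus-wide count of each word is simply the sum of its column. A minimal sketch, assuming df_BOW has one row per document and one column per word, with the values copied from the example in the question (`word_totals` and `top_3` are illustrative names):

```python
import pandas as pd

# Hypothetical reconstruction of the question's df_BOW example.
df_BOW = pd.DataFrame(
    {
        "Hi": [3, 0, 0],
        "This": [6, 2, 2],
        "my": [3, 4, 0],
        "son": [0, 4, 7],
        "play": [2, 3, 3],
        "garden": [4, 1, 0],
    },
    index=[1, 2, 3],
)

# Sum down the rows: one total per word (column), most frequent first.
word_totals = df_BOW.sum(axis=0).sort_values(ascending=False)

# Keep only the N most frequent words, as with nlargest above.
top_3 = word_totals.nlargest(3)
print(top_3.to_dict())  # {'son': 11, 'This': 10, 'play': 8}
```

`word_totals` is a Series indexed by word, so it can be passed to `sns.barplot(x=word_totals.index, y=word_totals.values, ax=ax)` in the same way as `counts`.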