How to find NLP word counts and plot them?
I am doing some NLP work.
My original dataframe is df_all:
Index Text
1 Hi, Hello, this is mike, I saw your son playing in the garden...
2 Besides that, sometimes my son studies math for fun...
3 I cannot believe she said that. she always says such things...
I converted the text into a bag-of-words (BOW) dataframe, so my dataframe df_BOW now looks like this:
Index Hi This my son play garden ...
1 3 6 3 0 2 4
2 0 2 4 4 3 1
3 0 2 0 7 3 0
I want to find out how many times each word appears in the corpus. This is what I tried:
cnt_pro = df_all['Text'].value_counts()
plt.figure(figsize=(12,4))
sns.barplot(cnt_pro.index, cnt_pro.values, alpha=0.8)
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Word', fontsize=12)
plt.xticks(rotation=90)
plt.show();
I was hoping to get a chart of the most frequent words, but the plot I get does not show any useful information.
How can I fix this?
You can use collections.Counter to count the words:
import pandas as pd
import seaborn as sns
from collections import Counter
import re
import matplotlib.pyplot as plt
data = ['Hi, Hello, this is mike, I saw your son playing in the garden', 'Besides that, sometimes my son studies math for fun', 'I cannot believe she said that. she always says such things']
df = pd.DataFrame(data, columns=['text'])
df['text_split'] = df['text'].apply(lambda x: re.findall(r'\w+', x)) #split sentences to words with regex
words = [item.lower() for sublist in df['text_split'].tolist() for item in sublist] # flattens the list of lists and lowers the words
counted_words = Counter(words)
counted_df = pd.DataFrame(counted_words.items(), columns=['word', 'count']).sort_values('count', ascending=False).reset_index(drop=True) #create new df from counter
plt.figure(figsize=(12,4))
sns.barplot(data=counted_df[:10], x='word', y='count', alpha=0.8) #plot only the top 10 by slicing the df
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Word', fontsize=12)
plt.xticks(rotation=90)
plt.show()
Result: [bar chart of the ten most frequent words]
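If the intermediate DataFrame is not needed for anything else, Counter can also hand back the top entries directly through its most_common method. A minimal sketch, reusing the sample sentences from above (the names texts and top_words are only illustrative):

```python
import re
from collections import Counter

texts = [
    "Hi, Hello, this is mike, I saw your son playing in the garden",
    "Besides that, sometimes my son studies math for fun",
    "I cannot believe she said that. she always says such things",
]

# Tokenize every sentence with a regex and lowercase each word
words = [w.lower() for t in texts for w in re.findall(r"\w+", t)]
counted_words = Counter(words)

# most_common(n) returns a list of (word, count) tuples, highest count first
top_words = counted_words.most_common(4)
print(top_words)
```

In this sample only four words ("i", "son", "that", "she") occur twice and everything else occurs once, so a larger corpus is needed before the ranking becomes interesting.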
I'm not sure how you created df_BOW, but it is not an ideal format for plotting.
df_all = pd.DataFrame(
    {
        "text": [
            "Hi, Hello, this is mike, I saw your son playing in the garden",
            "Besides that, sometimes my son studies math for fun",
            "I cannot believe she said that. she always says such things",
        ]
    }
)
As in the Counter answer, we can use a regular expression to extract the words, but here we only use pandas methods:
counts = df_all["text"].str.findall(r"(\w+)").explode().value_counts()
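Series.str.findall: applies the regular expression (\w+) to capture all words. This returns a Series of lists.
Series.explode: transforms each element of a list-like into its own row.
Series.value_counts: returns a Series containing the counts of unique values.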
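To see what each of the three steps produces, the chain can be split into intermediate variables. This is only an illustrative sketch on two short made-up sentences; the names texts, tokens and exploded are not part of the original answer:

```python
import pandas as pd

# Two short, made-up sentences so the intermediate results stay readable
texts = pd.Series(["she said that", "she says such things"], name="text")

# Step 1: one list of words per row
tokens = texts.str.findall(r"\w+")

# Step 2: one word per row; the original row index is repeated
exploded = tokens.explode()

# Step 3: count how often each distinct word occurs
word_counts = exploded.value_counts()

print(tokens)
print(exploded)
print(word_counts)
```

Note that the matching is case-sensitive, so "She" and "she" would be counted separately unless the text is lowercased first, for example with Series.str.lower.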
counts is a Series whose index holds the words and whose values are the counts:
son 2
she 2
I 2
...
says 1
garden 1
math 1
Name: text, dtype: int64
Then plot it:
fig, ax = plt.subplots(figsize=(6,5))
sns.barplot(x=counts.index, y=counts.values, ax=ax)
ax.set_ylabel('Number of Occurrences', fontsize=12)
ax.set_xlabel('Word', fontsize=12)
ax.xaxis.set_tick_params(rotation=90)
If you only want the N most frequent words, you can use nlargest like this:
top_10 = counts.nlargest(10)
and plot it the same way.
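If you already have a bag-of-words dataframe such as df_BOW, the totals can probably also be obtained by summing each column, since every column holds the per-document counts of one word. A small sketch, assuming df_BOW has one column per word and using only the six columns and values visible in the question (the real frame has more columns, and the name word_totals is made up here):

```python
import pandas as pd

# Stand-in for df_BOW, limited to the columns shown in the question
df_BOW = pd.DataFrame(
    {
        "Hi": [3, 0, 0],
        "This": [6, 2, 2],
        "my": [3, 4, 0],
        "son": [0, 4, 7],
        "play": [2, 3, 3],
        "garden": [4, 1, 0],
    },
    index=[1, 2, 3],
)

# Sum over the rows (documents) to get one total per word
word_totals = df_BOW.sum(axis=0).sort_values(ascending=False)

# Keep only the most frequent words, as with counts.nlargest above
top_3 = word_totals.nlargest(3)
print(top_3)
```

The resulting Series has the same shape as counts (words in the index, totals in the values), so it can be passed to sns.barplot in the same way.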