How to find out if there are stopwords and count them if they exist
I have a csv file containing a list of sentences in its rows, and I want to find out whether each row contains any stopwords: return 1 if it does, otherwise return 0. When it returns 1, I also want to count the stopwords. My code so far is below, but I can only get all the stopwords that exist in the whole csv, not per row.
import pandas as pd
import nltk

nltk.download('stopwords')
nltk.download('punkt')  # needed by word_tokenize

top_N = 10
news = pd.read_csv("split.csv", usecols=['STORY'])
# Joins every row into one big string, so per-row information is lost here
newss = news.STORY.str.lower().str.replace(r'\|', ' ', regex=True).str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(newss)
word_dist = nltk.FreqDist(words)
stopwords = nltk.corpus.stopwords.words('english')
words_except_stop_dist = nltk.FreqDist(w for w in words if w not in stopwords)
rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])
print(rslt)
Here is the truncated csv file:
id STORY
0 In the bag
1 What is your name
2 chips, bag
I want to save the output to a new csv file; the expected output should look like this:
id STORY exist How many
0 In the bag 1 2
1 What is your name 1 4
2 chips bag 0 0
df = pd.DataFrame({"story":['In the bag', 'what is your name', 'chips, bag']})
stopwords = nltk.corpus.stopwords.words('english')
df['clean'] = df['story'].apply(lambda x : nltk.tokenize.word_tokenize(x.lower().replace(',', ' ')))
df
story clean
0 In the bag [in, the, bag]
1 what is your name [what, is, your, name]
2 chips, bag [chips, bag]
df['clean'] = df.clean.apply(lambda x : [y for y in x if y in stopwords])
df['exist'] = df.clean.apply(lambda x : 1 if len(x) > 0 else 0)
df['how many'] = df.clean.apply(lambda x : len(x))
df
story clean exist how many
0 In the bag [in, the] 1 2
1 what is your name [what, is, your] 1 3
2 chips, bag [] 0 0
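As a side note, the intermediate clean column isn't strictly required: you can count stopword hits per row directly, and converting the stopword list to a set makes each membership test O(1). A minimal sketch, using a small hard-coded stopword set as a stand-in for set(nltk.corpus.stopwords.words('english')):

```python
import pandas as pd

# Stand-in stopword set; in practice build it with
# set(nltk.corpus.stopwords.words('english')) for O(1) lookups.
stop_set = {'in', 'the', 'what', 'is', 'your', 'a', 'of'}

df = pd.DataFrame({"story": ['In the bag', 'what is your name', 'chips, bag']})

# Tokenize with a plain split (word_tokenize works just as well),
# then count how many tokens are stopwords.
df['how many'] = df['story'].apply(
    lambda s: sum(w in stop_set for w in s.lower().replace(',', ' ').split()))
df['exist'] = (df['how many'] > 0).astype(int)
```

This gives the same exist/how many columns without materializing the filtered token lists.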
Note: you can change the replacement pattern as needed, and you can either drop the clean column afterwards or keep it in case you need it later.
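To produce the new csv file the question asks for, the same per-row counting can be wired between read_csv and to_csv. A sketch under stated assumptions: the id/STORY columns are as shown in the question (the sample data is inlined via io.StringIO here; replace it with pd.read_csv("split.csv") in practice), the stopword set is again a stand-in, and the output filename stopword_counts.csv is made up:

```python
import io
import pandas as pd

# Stand-in stopword set; use set(nltk.corpus.stopwords.words('english')) in practice.
stop_set = {'in', 'the', 'what', 'is', 'your'}

# Inlined sample standing in for pd.read_csv("split.csv", usecols=['id', 'STORY'])
sample = io.StringIO('id,STORY\n0,In the bag\n1,What is your name\n2,"chips, bag"\n')
df = pd.read_csv(sample)

# Lowercase, turn commas into spaces, split on whitespace, count stopword tokens
tokens = df['STORY'].str.lower().str.replace(',', ' ', regex=False).str.split()
df['How many'] = tokens.apply(lambda ws: sum(w in stop_set for w in ws))
df['exist'] = (df['How many'] > 0).astype(int)

# Write the columns in the expected order to a new file
df[['id', 'STORY', 'exist', 'How many']].to_csv("stopword_counts.csv", index=False)
```

The to_csv call with index=False keeps the row index out of the output, matching the expected layout.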