Counting bigram frequency
I have a 3-column CSV that looks like this:
Comment                  Comment Author  Location
As for the requirement   David           ON
The sky is blue          Martin          SK
As for the assignment    David           ON
As for the request       Eric            QC
As for the request       Eric            QC
Based on this CSV, I wrote code that splits the Comment column into bigrams and counts how often each one occurs. However, it does not group the counts by the Comment Author and Location columns.
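(For context, a bigram here is just a pair of adjacent tokens; a minimal sketch with nltk, assuming the punkt tokenizer data is available:)

import nltk
# nltk.download('punkt')  # may be needed on first run
tokens = nltk.tokenize.word_tokenize("as for the requirement")
print(list(nltk.bigrams(tokens)))
# [('as', 'for'), ('for', 'the'), ('the', 'requirement')]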
My current code produces an output CSV that looks like this:
Word            Frequency  Comment Author  Location
As for          4          David           ON
the request     2          Martin          SK
the assignment  1          David           ON
the sky         1          Eric            QC
is blue         1          Eric            QC
The output CSV I want should look like this instead:
Word             Frequency  Comment Author  Location
As for           2          David           ON
As for           2          Eric            QC
the request      2          Eric            QC
the requirement  1          David           ON
the sky          1          Martin          SK
is blue          1          Martin          SK
I have tried df.groupby, but it did not give me the output I want. I import stopwords in my code, but for the example above I have kept the stopwords in. My code looks like this:
import nltk
import csv
import string
import re
from nltk.util import everygrams
import pandas as pd
from collections import Counter
from itertools import combinations
df = pd.read_csv('modified.csv', encoding="utf8", index_col=False, header=None,
                 delimiter=",", names=['comment', 'Comment Author', 'Location'])
top_N = 100000
stopwords = nltk.corpus.stopwords.words('english')
# RegEx for stopwords
RE_stopwords = r'\b(?:{})\b'.format('|'.join(stopwords))
txt = df.comment.str.lower().str.replace(r'\|', ' ', regex=True).str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(txt)
words = [w for w in words if w not in stopwords]
bigrm = list(nltk.bigrams(words))
word_dist = nltk.FreqDist([' '.join(x) for x in bigrm])
rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])
# NOTE: this pairs the frequency table with the original rows purely by position,
# which is why the author/location pairing in the output is wrong
rslt['Comment Author'] = df['Comment Author']
rslt['Location'] = df['Location']
print(rslt)
rslt.to_csv('bigram3.csv',index=False)
Thanks!
import pandas as pd
from flashtext import KeywordProcessor
import nltk
from collections import Counter
# create the sample dataframe
df = pd.DataFrame([['As per the requirement', 'ON', 'David'],
                   ['The sky is blue', 'SK', 'Martin'],
                   ['As per the assignment', 'ON', 'David'],
                   ['As per the request', 'QC', 'Eric'],
                   ['As per the request', 'QC', 'Eric']],
                  columns=['comments', 'location', 'Author'])
# build bigram tokens from the full comment text
txt = df.comments.str.lower().str.replace(r'\|', ' ', regex=True).str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(txt)
bigram = list(nltk.bigrams(words))
bigram_token = [' '.join(x) for x in bigram]
# now use flashtext to extract the bigram tokens from each group's comments
kp = KeywordProcessor()
kp.add_keywords_from_list(bigram_token)
# groupby on author and location
data = []
for (author, location), group in df.groupby(['Author', 'location']):
    text = group['comments'].str.lower().str.replace(r'\|', ' ', regex=True).str.cat(sep=' ')
    data.append((author, location, text))
# grouped dataframe
groupby_df = pd.DataFrame(data, columns=['Author', 'location', 'text'])
groupby_df['bigram_token_count'] = groupby_df['text'].apply(lambda x: Counter(kp.extract_keywords(x)))
# output:
   Author location                                           text                               bigram_token_count
0   David       ON  as per the requirement as per the assignment  {'as per': 2, 'the requirement': 1, 'the assig...
1    Eric       QC          as per the request as per the request                  {'as per': 2, 'the request': 2}
2  Martin       SK                                the sky is blue                     {'the sky': 1, 'is blue': 1}
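If you then want the exact Word / Frequency / Comment Author / Location layout from the question, one way is to expand each row's Counter into long-format rows (a sketch; the output file name bigram_grouped.csv is just an example):

# expand each group's Counter into (Word, Frequency, Author, Location) rows
rows = []
for _, r in groupby_df.iterrows():
    for word, freq in r['bigram_token_count'].items():
        rows.append((word, freq, r['Author'], r['location']))
result = pd.DataFrame(rows, columns=['Word', 'Frequency', 'Comment Author', 'Location'])
result = result.sort_values('Frequency', ascending=False)
result.to_csv('bigram_grouped.csv', index=False)  # example output file name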
You can also use CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(ngram_range=(2, 2))
bigram_df = pd.DataFrame(vect.fit_transform(groupby_df['text']).toarray(),
                         columns=vect.get_feature_names_out())
final_df = pd.concat([groupby_df[['Author', 'location']], bigram_df], axis=1)
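final_df has one row per (Author, location) group and one column per bigram; if you prefer the long format from the question, you can melt it (again a sketch, with the question's column names assumed):

# reshape the wide bigram table into long Word/Frequency rows
long_df = final_df.melt(id_vars=['Author', 'location'],
                        var_name='Word', value_name='Frequency')
long_df = long_df[long_df['Frequency'] > 0]  # drop bigrams that never occur in a group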