Counting bigram frequency

I have a 3-column CSV that looks like this:

Comment                   Comment Author       Location
As for the requirement    David                ON
The sky is blue           Martin               SK
As for the assignment     David                ON
As for the request        Eric                 QC 
As for the request        Eric                 QC

Based on this CSV, I wrote code that splits the comment column into bigrams and counts how often each one occurs. However, it does not group the counts by the Comment Author and Location columns.
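For reference, a bigram here is just a pair of adjacent word tokens. A minimal illustration with NLTK (assuming the punkt tokenizer data has been downloaded):

import nltk

# tokenize one comment and pair up adjacent tokens
words = nltk.tokenize.word_tokenize("as for the requirement")
print(list(nltk.bigrams(words)))
# [('as', 'for'), ('for', 'the'), ('the', 'requirement')]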

My current code produces an output CSV that looks like this:

Word             Frequency    Comment Author    Location
As for           4            David             ON
the request      2            Martin            SK
the assignment   1            David             ON
the sky          1            Eric              QC
is blue          1            Eric              QC

The output CSV I want should look like this:

Word               Frequency    Comment Author    Location
As for             2            David             ON
As for             2            Eric              QC
the request        2            Eric              QC
the requirement    1            David             ON
the sky            1            Martin            SK
is blue            1            Martin            SK

I tried using df.groupby, but it did not give me the output I want. I do import stopwords in my code, but for the example above I have kept the stopwords in. My code looks like this:

import nltk
import csv
import string
import re
import pandas as pd
from nltk.util import everygrams
from collections import Counter
from itertools import combinations

df = pd.read_csv('modified.csv', encoding="utf8", index_col=False, header=None, delimiter=",",
                 names=['comment', 'Comment Author', 'Location'])

top_N = 100000
stopwords = nltk.corpus.stopwords.words('english')
# RegEx for stopwords
RE_stopwords = r'\b(?:{})\b'.format('|'.join(stopwords))

txt = df.comment.str.lower().str.replace(r'\|', ' ', regex=True).str.cat(sep=' ')

words = nltk.tokenize.word_tokenize(txt)
words = [w for w in words if not w in stopwords]

bigrm = list(nltk.bigrams(words))

word_dist = nltk.FreqDist([' '.join(x) for x in bigrm])
rslt = pd.DataFrame(word_dist.most_common(top_N),
                columns=['Word', 'Frequency'])
rslt['Comment Author'] = df['Comment Author']
rslt['Location'] = df['Location']
print(rslt)
rslt.to_csv('bigram3.csv',index=False)


Thank you!

import pandas as pd
from flashtext import KeywordProcessor
import nltk
from collections import Counter

# create the example dataframe
df = pd.DataFrame([['As per the requirement', 'ON', 'David'], ['The sky is blue', 'SK', 'Martin'],
                   ['As per the assignment', 'ON', 'David'], ['As per the request', 'QC', 'Eric'],
                   ['As per the request', 'QC', 'Eric']], columns=['comments', 'location', 'Author'])



# create bigram tokens from the concatenated comments
txt = df.comments.str.lower().str.replace(r'\|', ' ', regex=True).str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(txt)
bigram = list(nltk.bigrams(words))
bigram_token = [' '.join(x) for x in bigram]

# now use flashtext to extract bigram tokens from the comments
kp = KeywordProcessor()
kp.add_keywords_from_list(bigram_token)
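# Aside (illustrative, not part of the original answer): extract_keywords scans a
# string left to right and returns each registered bigram it finds, e.g.
print(kp.extract_keywords('as per the request'))   # ['as per', 'the request']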

# groupby on author and location, building one concatenated text per group
data = []
for (author, location), group in df.groupby(['Author', 'location']):
    text = group['comments'].str.lower().str.replace(r'\|', ' ', regex=True).str.cat(sep=' ')
    data.append((author, location, text))

#groupby dataframe 
groupby_df = pd.DataFrame(data, columns = ['Author','location','text'])
groupby_df['bigram_token_count'] = groupby_df['text'].apply(lambda x: Counter(kp.extract_keywords(x)))

# output
   Author location                                           text                                 bigram_token_count
0   David       ON  as per the requirement as per the assignment  {'as per': 2, 'the requirement': 1, 'the assig...
1    Eric       QC         as per the request as per the request                    {'as per': 2, 'the request': 2}
2  Martin       SK                               the sky is blue                       {'the sky': 1, 'is blue': 1}
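If you need the long "Word / Frequency / Comment Author / Location" layout from the question, one possible follow-up (just a sketch, not part of the original answer; the result name and output file are placeholders) is to expand each group's Counter into rows:

# expand every (bigram, count) pair into its own row, tagged with the group's author/location
rows = []
for _, r in groupby_df.iterrows():
    for word, freq in r['bigram_token_count'].items():
        rows.append((word, freq, r['Author'], r['location']))

result = pd.DataFrame(rows, columns=['Word', 'Frequency', 'Comment Author', 'Location'])
result = result.sort_values('Frequency', ascending=False)
result.to_csv('bigram_grouped.csv', index=False)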

You can also use CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(ngram_range=(2, 2))
# on scikit-learn < 1.0 use get_feature_names() instead of get_feature_names_out()
bigram_df = pd.DataFrame(vect.fit_transform(groupby_df['text']).toarray(),
                         columns=vect.get_feature_names_out())

final_df = pd.concat([groupby_df[['Author', 'location']], bigram_df], axis=1)
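To reshape this wide table into the long format asked for in the question, you could (again, just a sketch with a placeholder file name) melt it and drop the zero counts:

# one row per (group, bigram), keeping only bigrams that actually occur in that group
long_df = final_df.melt(id_vars=['Author', 'location'], var_name='Word', value_name='Frequency')
long_df = long_df[long_df['Frequency'] > 0].sort_values('Frequency', ascending=False)
long_df.to_csv('bigram_grouped_cv.csv', index=False)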