Tokenization by date using nltk

I have the following dataset:

   Date        D
0  01/18/2020  shares recipes ... - news updates · breaking news emails · lives to remem...
1  01/18/2020  both sides of the pineapple slices with olive oil. ... some of my other support go-to's i...
2  01/18/2020  honey, tea tree oil ...learn more from webmd about honey ...
3  01/18/2020  years of downtown arts | times leaderas the local community dealt with concerns, pet...
4  01/18/2020  brooklyn, ny | opentableblood orange, arugula, hazelnuts, on toast. charcuterie. .00. smoked ...
5  01/19/2020  santa maria di leuca - we the italiansthe sounds of the taranta, the smell of tomatoes, olive oil...
6  01/19/2020  abuse in amish communities : nprit's been a minute with sam sanders · code switch · throughline ...
7  01/19/2020  fast, healthy recipe ideas – cbs new ...toss the pork cubes with chili powder, oregano, cumin, c...
9  01/19/2020  100; 51-100 | csnyi have used oregano oil, coconut oil, famciclovir, an..

I apply CountVectorizer as follows to get the bigrams with the highest frequencies:

import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

stop_words = stopwords.words('english')

word_vectorizer = CountVectorizer(ngram_range=(2, 2), analyzer='word', stop_words=stop_words)
sparse_matrix = word_vectorizer.fit_transform(df['D'])
# sum over the rows to get the total count of each bigram
frequencies = sum(sparse_matrix).toarray()[0]
pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names(), columns=['Frequency']).sort_values(by='Frequency', ascending=False)

This gives me the most frequent bigrams across the whole dataset. However, since I am interested in this information per date (i.e., grouped by 01/18/2020 and 01/19/2020, so that I get the bigrams for each date separately), what I have done is not enough, because

pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names(), columns=['Frequency']).sort_values(by='Frequency', ascending=False)

creates a frequency table with no information about Date. How can I group the bigrams by date? If I were interested in unigrams, I would do something like this:

remove_words = list(stopwords.words('english'))

df.D = df.D.str.replace(r'\d+', '', regex=True)  # strip digits
df.D = df.D.apply(lambda x: [word for word in x.split() if word not in remove_words])

# one word per row, then count the words within each date
df.explode('D').groupby('Date')['D'].value_counts()

I do not know how to do something similar for bigrams with nltk or CountVectorizer. I hope you can help me.

Expected output:

     Date        Bi-gram          Frequency
0    2019-01-01  This is          1
1    2019-01-01  some sentence    1
...
n-m  2020-01-01  Stackoverlow is  1
...
n    2020-01-01  type now         1

Consider the sample dataframe:

         Date                 Sentence
0  2019-01-01    This is some sentence
1  2019-01-01  Another random sentence
2  2020-01-01    Stackoverlow is great
3  2020-01-01   What should I type now

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
# fit on the entire dataset so that every date shares one vocabulary
# (and therefore the same column index for each word)
vec.fit(df["Sentence"])

df.groupby("Date").apply(lambda x: vec.transform(x["Sentence"]).toarray())

This gives you the count of each word in each sentence for a given date. As mentioned in the comments, you can use get_feature_names() to map a position in the array back to its word:
print(vec.get_feature_names())  # use get_feature_names_out() on newer scikit-learn
['another', 'great', 'is', 'now', 'random', 'sentence', 'should', 'some', 'stackoverlow', 'this', 'type', 'what']
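
For example, to see the counts for one date labeled with the words themselves, you can wrap the array in a DataFrame whose columns are the feature names (a small sketch, reusing the vec and df above and assuming Date is a plain string column):

import pandas as pd

one_day = df[df["Date"] == "2019-01-01"]
pd.DataFrame(vec.transform(one_day["Sentence"]).toarray(),
             columns=vec.get_feature_names())
#    another  great  is  now  random  sentence  should  some  stackoverlow  this  type  what
# 0        0      0   1    0       0         1       0     1             0     1     0     0
# 1        1      0   0    0       1         1       0     0             0     0     0     0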

Output of the groupby:

Date
2019-01-01    [[0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0], [1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0]]
2020-01-01    [[0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1]]
dtype: object

Here, [0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0] corresponds to the first sentence of date 2019-01-01 ("This is some sentence"). The 1 at index 2 means that the word at position 2 of the feature names, 'is', occurs once in that sentence.
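
To go from these raw arrays to the Date / Bi-gram / Frequency layout requested in the question, one option (a minimal sketch, assuming the sample df above and switching the vectorizer to ngram_range=(2, 2) as in the question) is to sum the counts within each date and keep the non-zero entries:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(ngram_range=(2, 2))
vec.fit(df["Sentence"])  # one shared bigram vocabulary for all dates

def bigram_counts(group):
    # total count of each bigram over all sentences of one date
    counts = vec.transform(group["Sentence"]).toarray().sum(axis=0)
    s = pd.Series(counts, index=vec.get_feature_names())
    return s[s > 0]  # keep only the bigrams that actually occur

out = df.groupby("Date").apply(bigram_counts).reset_index()
out.columns = ["Date", "Bi-gram", "Frequency"]
print(out)
#          Date          Bi-gram  Frequency
# 0  2019-01-01   another random          1
# 1  2019-01-01          is some          1
# ...
# 9  2020-01-01      what should          1

Fitting once on the whole column keeps the vocabulary shared, so the same bigram always maps to the same column index for every date. Note that CountVectorizer's default tokenizer drops single-character tokens, so "What should I type now" produces the bigram "should type" rather than "should i" or "i type".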