Tokenization by date using nltk
I have the following dataset:
        Date        D
0 01/18/2020 shares recipes ... - news updates · breaking news emails · lives to remem...
1 01/18/2020 both sides of the pineapple slices with olive oil. ... some of my other support go-to's i...
2 01/18/2020 honey, tea tree oil ...learn more from webmd about honey ...
3 01/18/2020 years of downtown arts | times leaderas the local community dealt with concerns, pet...
4 01/18/2020 brooklyn, ny | opentableblood orange, arugula, hazelnuts, on toast. charcuterie. .00. smoked ...
5 01/19/2020 santa maria di leuca - we the italiansthe sounds of the taranta, the smell of tomatoes, olive oil...
6 01/19/2020 abuse in amish communities : nprit's been a minute with sam sanders · code switch · throughline ...
7 01/19/2020 fast, healthy recipe ideas – cbs new ...toss the pork cubes with chili powder, oregano, cumin, c...
9 01/19/2020 100; 51-100 | csnyi have used oregano oil, coconut oil, famciclovir, an..
I apply CountVectorizer as follows:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

stop_words = stopwords.words('english')
word_vectorizer = CountVectorizer(ngram_range=(2, 2), analyzer='word', stop_words=stop_words)
sparse_matrix = word_vectorizer.fit_transform(df['D'])
frequencies = sum(sparse_matrix).toarray()[0]
pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names(), columns=['Frequency']).sort_values(by='Frequency', ascending=False)
to get the bigrams with the highest frequency. Since I am interested in getting this information per date (i.e. grouping by 01/18/2020 and 01/19/2020 to get the bigrams for each date), what I have done is not enough, because
pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names(), columns=['Frequency']).sort_values(by='Frequency', ascending=False)
creates a dataframe with no Date information. How can I group the bigrams by date? If I were interested in unigrams, I would do something like this:
remove_words = list(stopwords.words('english'))
# Strip digits, then drop stopwords from each document
df.D = df.D.str.replace(r'\d+', '', regex=True)
df.D = df.D.apply(lambda x: [word for word in x.split() if word not in remove_words])
# One row per word, so value_counts can count word frequency per date
df.explode('D').groupby('Date')['D'].value_counts()
I do not know how to do something similar with nltk and CountVectorizer. I hope you can help me.
Expected output:
Date Bi-gram Frequency
0 2019-01-01 This is 1
1 2019-01-01 some sentence 1
....
n-m 2020-01-01 Stackoverlow is 1
....
n 2020-01-01 type now 1
Consider the sample dataframe:
Date Sentence
0 2019-01-01 This is some sentence
1 2019-01-01 Another random sentence
2 2020-01-01 Stackoverlow is great
3 2020-01-01 What should I type now
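For reference, a minimal way to build that sample dataframe so the snippet below runs as-is (values copied from the table above):

import pandas as pd

# Sample data from the table above
df = pd.DataFrame({
    "Date": ["2019-01-01", "2019-01-01", "2020-01-01", "2020-01-01"],
    "Sentence": [
        "This is some sentence",
        "Another random sentence",
        "Stackoverlow is great",
        "What should I type now",
    ],
})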
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
# fit on entire dataset to get count for a word across the dataset
vec.fit(df["Sentence"])
df.groupby("Date").apply(lambda x: vec.transform(x["Sentence"]).toarray())
This will give you the count of each word in each sentence for a given date. As mentioned in the comments, you can use get_feature_names() to map each index position back to its word:
In [34]: print(vec.get_feature_names())
['another', 'great', 'is', 'now', 'random', 'sentence', 'should', 'some', 'stackoverlow', 'this', 'type', 'what']
Output:
Date
2019-01-01 [[0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0], [1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0]]
2020-01-01 [[0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1]]
dtype: object
Consider [0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0], which corresponds to the first sentence of date 2019-01-01. The 1 at index 2 means that word ('is' in the feature names above) occurs once in that first sentence.
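The group-by result above counts words per sentence; the question, though, asks for one Date / Bi-gram / Frequency row per bigram and date. One way to get there is to fit the vectorizer with ngram_range=(2, 2), sum the counts within each date group, and unpivot the result. A minimal sketch, assuming the sample df above (the helper name bigram_counts and the column label Bi-gram are illustrative, not from the original post):

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Bigram vectorizer fitted on the whole dataset so every date shares one vocabulary
vec = CountVectorizer(ngram_range=(2, 2))
vec.fit(df["Sentence"])

def bigram_counts(group):
    # Sum bigram counts over all sentences belonging to one date
    counts = np.asarray(vec.transform(group["Sentence"]).sum(axis=0)).ravel()
    return pd.Series(counts, index=vec.get_feature_names())  # get_feature_names_out() in newer scikit-learn

result = (
    df.groupby("Date")
      .apply(bigram_counts)
      .stack()                      # long format: one row per (Date, bigram) pair
      .rename("Frequency")
      .reset_index()
      .rename(columns={"level_1": "Bi-gram"})
)

# Keep only bigrams that actually occur on a given date
result = result[result["Frequency"] > 0]
print(result)

Fitting on the full dataset keeps the bigram vocabulary identical across dates; fitting a separate vectorizer inside each group would instead give every date its own vocabulary.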