How to clean a string to get value_counts for words of interest by date?
I generated the following data from groupby('Datetime') and value_counts():
Datetime 0
01/01/2020 Paul 8
03 2
01/02/2020 Paul 2
10982360967 1
01/03/2020 religion 3
..
02/28/2020 l 18
02/29/2020 Paul 78
march 22
03/01/2020 church 63
l 21
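I want to remove one specific name (in this case 'Paul') and all numbers (in this particular example, 03 and 10982360967). I don't know why there is a character 'l', because I tried to remove stopwords, including letters (and numbers).
Do you know how I could clean this selection further?
Expected output, to avoid confusion: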
Datetime 0
01/03/2020 religion 3
..
02/29/2020 march 22
03/01/2020 church 63
I removed Paul, 03, 109..., and l.
Raw data:
Datetime Corpus
01/03/2020 Paul: examples of religion
01/03/2020 Paul:shinto is a religion 03
01/03/2020 don't talk to me about religion, Paul 03
...
02/29/2020 march is the third month of the year 10982360967
02/29/2020 during march, there are some cold days.
...
03/01/2020 she is at church right now
...
Since I have more than 100 sentences, I can't post all of the raw data.
The code I used is:
df.Corpus.groupby('Datetime').value_counts().groupby('Datetime').head(2)
Because this raised a KeyError, I had to edit the code as follows:
df.set_index('Datetime').Corpus.groupby('Datetime').value_counts().groupby('Datetime').head(2)
To extract the words, I used str.extractall.
Cleaning strings is a multi-step process
Create the dataframe
import pandas as pd
from nltk.corpus import stopwords  # needs a one-time nltk.download('stopwords')
import string
# data and dataframe
data = {'Datetime': ['01/03/2020', '01/03/2020', '01/03/2020', '02/29/2020', '02/29/2020', '03/01/2020'],
'Corpus': ['Paul: Examples of religion',
'Paul:shinto is a religion 03',
"don't talk to me about religion, Paul 03",
'march is the third month of the year 10982360967',
'during march, there are some cold days.',
'she is at church right now']}
test = pd.DataFrame(data)
test.Datetime = pd.to_datetime(test.Datetime)
| | Datetime | Corpus |
|---:|:--------------------|:-------------------------------------------------|
| 0 | 2020-01-03 00:00:00 | Paul: Examples of religion |
| 1 | 2020-01-03 00:00:00 | Paul:shinto is a religion 03 |
| 2 | 2020-01-03 00:00:00 | don't talk to me about religion, Paul 03 |
| 3 | 2020-02-29 00:00:00 | march is the third month of the year 10982360967 |
| 4 | 2020-02-29 00:00:00 | during march, there are some cold days. |
| 5 | 2020-03-01 00:00:00 | she is at church right now |
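The dates in this example are month-first. As a small sketch of an optional refinement (not part of the original answer), the format can be passed explicitly so that pandas does not have to infer it; the `dates` series below is a stand-in for the Datetime column.

```python
import pandas as pd

# stand-in for the Datetime column, using the same month/day/year strings
dates = pd.Series(['01/03/2020', '02/29/2020', '03/01/2020'])

# an explicit format avoids any month-first vs. day-first ambiguity
parsed = pd.to_datetime(dates, format='%m/%d/%Y')

print(parsed.dt.strftime('%Y-%m-%d').tolist())
# ['2020-01-03', '2020-02-29', '2020-03-01']
```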
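Clean Corpus
- Add extra words to the remove_words list
  - They should be lowercase
- Some of the cleaning steps could be combined, but I don't recommend it
  - Going step by step makes it easier to tell whether you've made a mistake
- This is a small example of text cleaning.
  - There are many books on the subject.
- There is no context analysis
  - With example = 'We march to the church in March.', the value_count for 'march' in example.lower() is 2, even though one is a verb and the other is a month.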
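The point about missing context can be checked with a few lines of plain Python. This is only an illustrative sketch: it uses `collections.Counter` and `str.translate` rather than the pandas steps used in this answer.

```python
import string
from collections import Counter

example = 'We march to the church in March.'

# lowercase, delete punctuation, then split on whitespace
table = str.maketrans('', '', string.punctuation)
tokens = example.lower().translate(table).split()

# the verb and the month collapse into the same token
counts = Counter(tokens)
print(counts['march'])  # 2
```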
# words to remove
remove_words = list(stopwords.words('english'))
# extra words to remove
additional_remove_words = ['paul', 'shinto', 'examples', 'talk', 'third', 'month', 'year', 'cold', 'days', 'right']
remove_words.extend(additional_remove_words) # add other words to exclude in lowercase
# punctuation to remove
punctuation = string.punctuation
punc = r'[{}]'.format(punctuation)
test.dropna(inplace=True) # drop any na rows
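# clean text now; regex=True is needed because pandas >= 2.0 treats the pattern as a literal string by default
test.Corpus = test.Corpus.str.replace(r'\d+', '', regex=True) # remove numbers
test.Corpus = test.Corpus.str.replace(punc, ' ', regex=True) # remove punctuation
test.Corpus = test.Corpus.str.replace(r'\s+', ' ', regex=True) # remove occurrences of more than one whitespace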
test.Corpus = test.Corpus.str.strip() # remove whitespace from beginning and end of string
test.Corpus = test.Corpus.str.lower() # convert all to lowercase
test.Corpus = test.Corpus.apply(lambda x: list(word for word in x.split() if word not in remove_words)) # remove words
| | Datetime | Corpus |
|---:|:--------------------|:-------------|
| 0 | 2020-01-03 00:00:00 | ['religion'] |
| 1 | 2020-01-03 00:00:00 | ['religion'] |
| 2 | 2020-01-03 00:00:00 | ['religion'] |
| 3 | 2020-02-29 00:00:00 | ['march'] |
| 4 | 2020-02-29 00:00:00 | ['march'] |
| 5 | 2020-03-01 00:00:00 | ['church'] |
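The question does not show where the stray 'l' comes from, so the following is only a guess at a remedy: once the text has been split into tokens, single-character tokens can be dropped with a length filter. The `tokens` series is hypothetical data standing in for the cleaned Corpus column.

```python
import pandas as pd

# hypothetical tokenised rows, standing in for the cleaned Corpus column
tokens = pd.Series([['religion', 'l'], ['march'], ['l', 'church', 'x']])

# keep only tokens that are longer than one character
filtered = tokens.apply(lambda words: [w for w in words if len(w) > 1])

print(filtered.tolist())
# [['religion'], ['march'], ['church']]
```

The same condition could also be added to the list comprehension that filters on remove_words, if a separate step is not wanted.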
Explode Corpus & groupby
# explode list
test = test.explode('Corpus')
# dropna in case there are empty rows from filtering
test.dropna(inplace=True)
# groupby
test.groupby('Datetime').agg({'Corpus': 'value_counts'}).rename(columns={'Corpus': 'word_count'})
word_count
Datetime Corpus
2020-01-03 religion 3
2020-02-29 march 2
2020-03-01 church 1
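On the KeyError in the question: `df.Corpus` is a bare Series, and `Series.groupby('Datetime')` looks for an index level with that name, which a default integer index does not have. Grouping the DataFrame first and then selecting the column avoids the `set_index` workaround. The sketch below uses a small hypothetical `words` frame (one word per row, as after `explode`) and also keeps the two most frequent words per date, as the question's `head(2)` intended.

```python
import pandas as pd

# hypothetical exploded frame: one cleaned word per row
words = pd.DataFrame({
    'Datetime': pd.to_datetime(['01/03/2020'] * 4 + ['02/29/2020'] * 3),
    'Corpus': ['religion', 'religion', 'faith', 'religion',
               'march', 'march', 'cold'],
})

# group the DataFrame (not the bare Series) so 'Datetime' is a valid key
counts = words.groupby('Datetime')['Corpus'].value_counts()

# value_counts sorts within each group by default,
# so head(2) keeps the two most frequent words per date
top2 = counts.groupby(level='Datetime').head(2)
print(top2)
```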