How to clean a string to get value_counts for words of interest by date?

I generated the following data from groupby('Datetime').value_counts():
Datetime        0          
01/01/2020  Paul            8
            03              2
01/02/2020  Paul            2
            10982360967     1
01/03/2020  religion        3
                           ..
02/28/2020  l              18
02/29/2020  Paul           78
            march          22
03/01/2020  church         63
            l              21

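I would like to remove a specific name (in this case 'Paul') and all numbers (03 and 10982360967 in this particular example). I don't know why there is a character 'l', since I tried to remove stopwords, including letters (and numbers). Do you know how I could clean this selection further?

Expected output, to avoid confusion: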

Datetime        0          
01/03/2020  religion        3
                           ..
02/29/2020  march          22
03/01/2020  church         63

Here I removed Paul, 03, 109... and l.

Original data:

Datetime        Corpus          
01/03/2020      Paul: examples of religion
01/03/2020      Paul:shinto is a religion 03
01/03/2020      don't talk to me about religion, Paul 03
...
02/29/2020     march is the third month of the year 10982360967
02/29/2020     during march, there are some cold days.
...
03/01/2020     she is at church right now
...

I can't post all of the raw data because I have more than 100 sentences.

The code I used is:

df.Corpus.groupby('Datetime').value_counts().groupby('Datetime').head(2)

Because of a KeyError, I had to edit the code as follows:

df.set_index('Datetime').Corpus.groupby('Datetime').value_counts().groupby('Datetime').head(2)

To extract the words, I used str.extractall.

Cleaning strings is a multi-step process

Create the dataframe

import pandas as pd
from nltk.corpus import stopwords
import string

# data and dataframe
data = {'Datetime': ['01/03/2020', '01/03/2020', '01/03/2020', '02/29/2020', '02/29/2020', '03/01/2020'],
        'Corpus': ['Paul: Examples of religion',
                   'Paul:shinto is a religion 03',
                   "don't talk to me about religion, Paul 03",
                   'march is the third month of the year 10982360967',
                   'during march, there are some cold days.',
                   'she is at church right now']}

test = pd.DataFrame(data)
test.Datetime = pd.to_datetime(test.Datetime)

|    | Datetime            | Corpus                                           |
|---:|:--------------------|:-------------------------------------------------|
|  0 | 2020-01-03 00:00:00 | Paul: Examples of religion                       |
|  1 | 2020-01-03 00:00:00 | Paul:shinto is a religion 03                     |
|  2 | 2020-01-03 00:00:00 | don't talk to me about religion, Paul 03         |
|  3 | 2020-02-29 00:00:00 | march is the third month of the year 10982360967 |
|  4 | 2020-02-29 00:00:00 | during march, there are some cold days.          |
|  5 | 2020-03-01 00:00:00 | she is at church right now                       |

Clean Corpus

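  • Add extra words to the remove_words list
    • They should be lowercase
  • Some cleaning steps could be combined, but I don't recommend it
    • Going step by step makes it easier to see where you made a mistake
  • This is a small example of text cleaning.
    • There are many books on the subject.
    • There is no contextual analysis
      • example = 'We march to the church in March.'
      • The value_counts for 'march' in example.lower() is 2
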
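
The last bullet can be checked with a few lines of standard-library Python. This is only a sketch: the tokenizer (a regex that keeps runs of letters) is an assumption chosen for brevity, not the cleaning pipeline used in this answer.

```python
import re
from collections import Counter

example = 'We march to the church in March.'

# lowercase first, then treat every run of letters as one token
# (assumption: a simple regex tokenizer is good enough for this illustration)
tokens = re.findall(r'[a-z]+', example.lower())

counts = Counter(tokens)
print(counts['march'])  # the verb and the month are counted as the same word
```

Because the text is lowercased before counting, the verb "march" and the month "March" collapse into a single entry; telling them apart would need context-aware processing, which is outside the scope of this cleaning approach.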

import re

# words to remove
remove_words = list(stopwords.words('english'))
# extra words to remove
additional_remove_words = ['paul', 'shinto', 'examples', 'talk', 'third', 'month', 'year', 'cold', 'days', 'right']
remove_words.extend(additional_remove_words)  # add other words to exclude in lowercase

# punctuation to remove
punctuation = string.punctuation
# re.escape keeps characters such as ] and \ literal inside the character class
punc = r'[{}]'.format(re.escape(punctuation))

test.dropna(inplace=True)  # drop any na rows

# clean text now
# regex=True is passed explicitly: in pandas >= 2.0 str.replace treats the pattern as a literal string by default
test.Corpus = test.Corpus.str.replace(r'\d+', '', regex=True)  # remove numbers

test.Corpus = test.Corpus.str.replace(punc, ' ', regex=True)  # replace punctuation with a space

test.Corpus = test.Corpus.str.replace(r'\s+', ' ', regex=True)  # collapse runs of whitespace into one space

test.Corpus = test.Corpus.str.strip()  # remove whitespace from beginning and end of string

test.Corpus = test.Corpus.str.lower()  # convert all to lowercase

test.Corpus = test.Corpus.apply(lambda x: [word for word in x.split() if word not in remove_words])  # remove words

|    | Datetime            | Corpus       |
|---:|:--------------------|:-------------|
|  0 | 2020-01-03 00:00:00 | ['religion'] |
|  1 | 2020-01-03 00:00:00 | ['religion'] |
|  2 | 2020-01-03 00:00:00 | ['religion'] |
|  3 | 2020-02-29 00:00:00 | ['march']    |
|  4 | 2020-02-29 00:00:00 | ['march']    |
|  5 | 2020-03-01 00:00:00 | ['church']   |
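
The question also mentions a stray 'l' in the counts. One plausible cause (an assumption, since the raw rows are not shown) is that removing digits or punctuation from inside a token leaves a single letter behind, for example 'l8' becoming 'l'. A minimal sketch of a length filter follows; `clean_tokens` and `min_len` are hypothetical names, and the small `remove_words` set stands in for the NLTK stopword list so the block runs on its own.

```python
import re

def clean_tokens(text, remove_words, min_len=2):
    """Hypothetical helper: lowercase, strip digits and punctuation, drop short tokens."""
    text = re.sub(r'\d+', '', text.lower())   # remove numbers
    text = re.sub(r'[^\w\s]|_', ' ', text)    # replace punctuation with a space
    return [word for word in text.split()
            if len(word) >= min_len and word not in remove_words]

# stand-in for the real stopword list (assumption, not the NLTK list)
remove_words = {'paul', 'is'}

print(clean_tokens('Paul: c u l8, religion is a topic 03', remove_words))
```

In the pandas pipeline above, the same idea could be applied to the list column with something like `test.Corpus.apply(lambda x: [w for w in x if len(w) > 1])`, run before the explode step.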

Explode Corpus & groupby

# explode list
test = test.explode('Corpus')

# dropna in case there are empty rows from filtering
test.dropna(inplace=True)

# groupby
test.groupby('Datetime').agg({'Corpus': 'value_counts'}).rename(columns={'Corpus': 'word_count'})

                     word_count
Datetime   Corpus              
2020-01-03 religion           3
2020-02-29 march              2
2020-03-01 church             1
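
The question used `.head(2)` to keep the top two words per date. A sketch of how that could be done on the exploded frame is below. The `words` frame is a made-up stand-in (one word per row, with an extra 'faith' row added purely for illustration), not the asker's data. Grouping the DataFrame by 'Datetime' and then selecting the 'Corpus' column also avoids the KeyError from the question, which came from calling groupby('Datetime') on a Series that has no such index level.

```python
import pandas as pd

# small stand-in for the exploded frame: one word per row (illustrative data only)
words = pd.DataFrame({
    'Datetime': pd.to_datetime(['01/03/2020'] * 3 + ['02/29/2020'] * 3 + ['03/01/2020']),
    'Corpus': ['religion', 'religion', 'faith', 'march', 'march', 'march', 'church'],
})

# count words per date; within each date the counts come back sorted, highest first
counts = words.groupby('Datetime')['Corpus'].value_counts()

# keep at most the two most frequent words for each date
top2 = counts.groupby(level='Datetime').head(2)
print(top2)
```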