How to ignore other instances of a word when looping it via pandas.groupby.agg?
I have some code (see below) that counts how often a word occurs per location. My problem is that it also counts every other instance of the word.
For example, this is what I want it to produce, but the code below counts all occurrences of 'help', including 'helping' and 'helped':
tidytext2                      | Location | occurrences
she used to help me            | Aus      | 1
help is on the way             | UK       | 1
Helping is a kind gift         | UK       | 0
She helped me when I needed it | Japan    | 0
Why dont u help me?            | SA       | 1
Help me! Im hungry help        | Rwanda   | 2
words = [i[0] for i in pos_freq.most_common()]
for i in words:
    positivedf[i] = positivedf.tidytext2.str.count(i)
funs = {i: 'sum' for i in words}
groupedpos = positivedf.groupby('Location').agg(funs)
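A quick way to see the overcounting, using a small made-up series rather than the real data:

```python
import pandas as pd

# str.count treats the word as a bare regex with no boundaries, so it also
# matches inside longer words such as 'helped'.
s = pd.Series(['she used to help me', 'Helping is a kind gift', 'She helped me'])
print(s.str.count('help').tolist())  # [1, 0, 1] -- the 'help' in 'helped' is counted too
```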
I obtained pos_freq.most_common() with the following code. It returns:
import nltk
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
import string
def process_text(text):
    tokens = []
    for line in text:
        toks = tokenizer.tokenize(line)
        toks = [t.lower() for t in toks if t.lower() not in stopwords_list]
        tokens.extend(toks)
    return tokens

tokenizer = TweetTokenizer()
punct = list(string.punctuation)
stopwords_list = stopwords.words('english') + punct
pos_lines = list(positivedf.tidytext2)
pos_tokens = process_text(pos_lines)
pos_freq = nltk.FreqDist(pos_tokens)
pos_freq.most_common()
[('help', 7)]
You need to use a regular expression for this:
for i in words:
    positivedf[i] = positivedf.tidytext2.str.count(r'(?<!\S)' + i + r'(?!\S)')
If you want it to be case-insensitive:
for i in words:
    positivedf[i] = positivedf.tidytext2.str.count(r'(?i)(?<!\S)' + i + r'(?!\S)')
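Putting it together on the sample rows from the question (the data frame here is reconstructed from the table above, so treat it as a sketch):

```python
import pandas as pd

# Sample data reconstructed from the question's table.
positivedf = pd.DataFrame({
    'tidytext2': [
        'she used to help me',
        'help is on the way',
        'Helping is a kind gift',
        'She helped me when I needed it',
        'Why dont u help me?',
        'Help me! Im hungry help',
    ],
    'Location': ['Aus', 'UK', 'UK', 'Japan', 'SA', 'Rwanda'],
})

words = ['help']
for i in words:
    # (?<!\S) / (?!\S) require whitespace (or a string edge) on each side,
    # so 'Helping' and 'helped' no longer match; (?i) makes it case-insensitive.
    positivedf[i] = positivedf.tidytext2.str.count(r'(?i)(?<!\S)' + i + r'(?!\S)')

funs = {i: 'sum' for i in words}
groupedpos = positivedf.groupby('Location').agg(funs)
print(positivedf[['Location', 'help']])
print(groupedpos)
```

This yields the per-row counts 1, 1, 0, 0, 1, 2 from the desired table. Note the whitespace lookarounds will still miss a word glued to punctuation (e.g. 'help.'); if the words can contain regex metacharacters, wrap them in re.escape() before concatenating.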