Counting unique words in a pandas column
I am having some difficulty with the following data (from a pandas DataFrame):
Text
0 Selected moments from Fifa game t...
1 What I learned is that I am ...
3 Bill Gates kept telling us it was comi...
5 scenario created a month before the...
... ...
1899 Events for May 19 – October 7 - October CTOvision.com
1900 Office of Event Services and Campus Center Ope...
1901 How the CARES Act May Affect Gift Planning in ...
1902 City of Rohnert Park: Home
1903 iHeartMedia, Inc.
I need to extract the number of unique words in each row (after removing punctuation). So, for example:
Unique
0 6
1 6
3 8
5 6
... ...
1899 8
1900 8
1901 9
1902 5
1903 2
I tried the following:
df["Unique"]=df['Text'].str.lower()
df["Unique"]==Counter(word_tokenize('\n'.join( file["Unique"])))
But I do not get any counts, only a list of words (without their frequency in that row).
Can you tell me what is going wrong?
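If punctuation should not be counted, remove all punctuation first. Then leverage sets: str.split().map(set) gives you a set of words for each row. After that, count the elements in each set. A set keeps each distinct element only once, so its length is the number of unique words.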
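Chained
# regex=True is needed on pandas 2.0 and later, where str.replace treats the pattern as a literal by default
df['Text'].str.replace(r'[^\w\s]+', '', regex=True).str.split().map(set).str.len()
Step by step
df['Text'] = df['Text'].str.replace(r'[^\w\s]+', '', regex=True)
df['New Text'] = df['Text'].str.split().map(set).str.len()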
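As a minimal end-to-end sketch of the set-based approach, the example below runs the same chain on a tiny DataFrame. The sample rows are made up for illustration (two of them mirror titles shown in the question), and the lowercasing step is an assumption taken from the asker's own attempt, so that "The" and "the" count as one word. It uses map(len) rather than str.len() for the final step, which should give the same result.

```python
import pandas as pd

# Hypothetical sample rows, not the asker's full data
df = pd.DataFrame({"Text": ["City of Rohnert Park: Home",
                            "iHeartMedia, Inc.",
                            "The cat and the other cat"]})

df["Unique"] = (
    df["Text"]
    .str.lower()                               # treat "The" and "the" as the same word
    .str.replace(r"[^\w\s]+", "", regex=True)  # drop punctuation
    .str.split()                               # split on any run of whitespace
    .map(set)                                  # keep each word once
    .map(len)                                  # size of each set
)
print(df)
```

Drop the str.lower() step if case differences should count as different words.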
So, I am just updating this based on the comments. This solution also handles punctuation.
import string
df['Unique'] = df['Text'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation))).str.split().apply(lambda words: len(set(words)))
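Two details of the translate-based approach are worth checking on real data: string.punctuation only contains ASCII punctuation, and splitting on a single space keeps empty strings wherever two spaces meet. The sketch below illustrates both, using the title from row 1899 of the question, which contains an en dash. The variable names are only for illustration.

```python
import re
import string

text = "Events for May 19 – October 7 - October CTOvision.com"

# ASCII-only removal: the en dash is not in string.punctuation, so it survives
ascii_stripped = text.translate(str.maketrans("", "", string.punctuation))
# Regex removal: drops every character that is neither a word character nor whitespace
regex_stripped = re.sub(r"[^\w\s]+", "", text)

# Splitting on a single space leaves an empty string where the hyphen was removed
print(ascii_stripped.split(" "))
# split() with no argument splits on any run of whitespace instead
print(len(set(ascii_stripped.split())))
print(len(set(regex_stripped.split())))
```

Here the two cleaning methods give different unique counts, because the surviving en dash is counted as a word of its own in the first case.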
Try this:
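import pandas as pd
from collections import Counter

# Sample data; named 'data' to avoid shadowing the built-in dict
data = {'A': {0:'John', 1:'Bob'},
'Desc': {0:'Bill ,Gates Started Microsoft at 18 Bill', 1:'Bill Gates, Again .Bill Gates and Larry Ellison'}}
df = pd.DataFrame(data)
# Strip punctuation (regex=True is needed on pandas 2.0 and later)
df['Desc']=df['Desc'].str.replace(r'[^\w\s]+', '', regex=True)
print(df.loc[:,"Desc"])
# Word frequencies for the first row only
print(Counter(" ".join(df.loc[0:0,"Desc"]).split()).items())
# Number of distinct words in the first row
print(len(Counter(" ".join(df.loc[0:0,"Desc"]).split())))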
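The snippet above only inspects the first row. A possible way to extend it to every row is to build one Counter per row with map, rather than joining the rows into a single string as the question's attempt does. This is a sketch on the same sample data. The column names Freq and Unique are assumptions chosen for illustration.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({"Desc": ["Bill ,Gates Started Microsoft at 18 Bill",
                            "Bill Gates, Again .Bill Gates and Larry Ellison"]})

# Remove punctuation from every row
cleaned = df["Desc"].str.replace(r"[^\w\s]+", "", regex=True)
# One Counter per row: word -> number of occurrences in that row
df["Freq"] = cleaned.map(lambda s: Counter(s.split()))
# The number of keys in each Counter is the number of distinct words in that row
df["Unique"] = df["Freq"].map(len)
print(df[["Freq", "Unique"]])
```

If only the unique count is needed, a set is enough. The Counter is useful when the per-row frequencies are wanted as well.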