Python: Count occurrences of a substring in pandas by row, appending distinct strings as columns
Initial note: I can't use many third-party packages, and most likely I won't be able to use whatever package you suggest, so please keep solutions to pandas, NumPy, or the Python 3.7 built-in libraries. My end goal is a word-bubble-style chart in which word frequency is encoded by categoricalInt.
Suppose I have a pandas DataFrame such as:
index | categoricalInt1 | categoricalInt2 | sanitizedStrings
0 | -4 | -5 | some lowercase strings
1 | 2 | 4 | addtnl lowercase strings here
2 | 3 | 3 | words
Is there a simpler way than iterating over every value in sanitizedStrings to return a structure like:
index | categoricalInt1 | categoricalInt2 | sanitizedStrings | some | lowercase | strings | addtnl | here | words
0 | -4 | -5 | ... | 1 | 1 | 1 | 0 | 0 | 0
1 | 2 | 4 | ... | 0 | 1 | 1 | 1 | 1 | 0
2 | 3 | 3 | ... | 0 | 0 | 0 | 0 | 0 | 1
My overall goal is simple: count all substrings, grouped by the categorical columns. I've managed to aggregate the strings and collapse them into categorical bins, but I'm struggling to put the counts together.
My code so far looks like this:
import string
import pandas as pd

df['Comments'] = df['Comments'].str.lower()

# strip punctuation, keeping '|' as the field separator
punct = string.punctuation.replace('|', '')
transtab = str.maketrans(dict.fromkeys(punct, ''))
df['Comments'] = '|'.join(df['Comments'].tolist()).translate(transtab).split('|')

# remove common filler words; commonStrings is defined elsewhere
pattern = '|'.join([r'\b{}\b'.format(w) for w in commonStrings])
df['SanitizedStrings'] = df['Comments'].str.replace(pattern, '', regex=True)  # regex=True so the \b pattern is treated as a regex
df = df.drop(columns='Comments')
# end splitting bad values out of strings

# group the dataframe on like categories
groupedComments = df.groupby(['categoricalInt1', 'categoricalInt2'], as_index=False, sort=False).agg(' '.join)
print(groupedComments)
Before I realized I needed to categorize these strings by the categoricalInt columns, I was using the following:
groupedComments['SanitizedStrings'].str.split(expand=True).stack().value_counts()
If I could get that to stack by row rather than across columns, I bet we'd be pretty close.
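For reference, here is a minimal pandas-only sketch of that kind of per-row stacking, run on a small stand-in for groupedComments (the data and column names below are only illustrative):

import pandas as pd

# stand-in for groupedComments from the snippet above
groupedComments = pd.DataFrame({
    'categoricalInt1': [-4, 2, 3],
    'categoricalInt2': [-5, 4, 3],
    'SanitizedStrings': ['some lowercase strings',
                         'addtnl lowercase strings here',
                         'words'],
})

# split(expand=True).stack() gives one word per (row, position) pair;
# grouping on level 0 counts words per original row instead of overall
counts = (groupedComments['SanitizedStrings']
          .str.split(expand=True)
          .stack()
          .groupby(level=0)
          .value_counts()
          .unstack(fill_value=0))

print(groupedComments.join(counts))

The join places the per-word counts next to the original columns, which is roughly the shape of the desired table above.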
This isn't a particularly elegant solution, and I'm not sure how much data you're working with, but you could use an apply function to add the extra columns.
After reading your comments, it looks like you also want to group by the categorical columns.
That can be done as well by adjusting the columns you create.
import pandas as pd
import numpy as np

## Make test data
string_list = ['some lowercase string', 'addtnl lowercase strings here', 'words']
categorical_int = [-4, 3, 2]
df = pd.DataFrame(zip(categorical_int, string_list), columns=['categoricalInt1', 'sanitizedStrings'])

# create apply function
def add_cols(row):
    col_dict = {}
    new_cols = row['sanitizedStrings'].split(' ')
    for col in new_cols:
        if col not in col_dict.keys():
            col_dict[col] = 1
        else:
            col_dict[col] += 1
    for key, value in col_dict.items():
        # add _ so we can query these columns later
        row['_' + key] = value
    return row

# run apply function on dataframe
final_df = df.apply(add_cols, axis=1).fillna(0)
final_df
_addtnl _here _lowercase _some _string _strings _words \
0 0.0 0.0 1.0 1.0 1.0 0.0 0.0
1 1.0 1.0 1.0 0.0 0.0 1.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 1.0
categoricalInt1 sanitizedStrings
0 -4 some lowercase string
1 3 addtnl lowercase strings here
2 2 words
#add the group by and sum
final_group = final_df.groupby(['categoricalInt1'])[[col for col in final_df.columns if col.startswith('_')]].sum()
final_group.columns = [col.replace('_','') for col in final_group.columns]
final_group
addtnl here lowercase some string strings words
categoricalInt1
-4 0.0 0.0 1.0 1.0 1.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 1.0
3 1.0 1.0 1.0 0.0 0.0 1.0 0.0
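If apply ever becomes a bottleneck on a larger frame, the same counts can be built with vectorized pandas only. Here is a rough sketch on the same toy data (Series.explode requires pandas 0.25+, and the column names are carried over from the example, not anything mandated by your setup):

import pandas as pd

# same toy data as above
df = pd.DataFrame({
    'categoricalInt1': [-4, 3, 2],
    'sanitizedStrings': ['some lowercase string',
                         'addtnl lowercase strings here',
                         'words'],
})

# one row per word, carrying the categorical label in the index
words = (df.set_index('categoricalInt1')['sanitizedStrings']
           .str.split()
           .explode())  # Series.explode needs pandas 0.25+

# count each word per categoricalInt1 value
word_counts = pd.crosstab(words.index, words,
                          rownames=['categoricalInt1'], colnames=['word'])
print(word_counts)

This yields the same table as final_group, just without building and then stripping the underscore-prefixed columns.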