Python: Count occurrences of a substring in pandas by row, appending distinct strings as columns
Initial note: I can't use many third-party packages, and most likely I won't be able to use whatever package you suggest, so please keep solutions to pandas, NumPy, or the Python 3.7 built-in libraries. My end goal is a word-bubble-style chart in which word frequency is encoded by categoricalInt.
Suppose I have a pandas DataFrame such as:
index | categoricalInt1 | categoricalInt2 | sanitizedStrings
0 | -4 | -5 | some lowercase strings
1 | 2 | 4 | addtnl lowercase strings here
2 | 3 | 3 | words
Is there a simpler way than iterating over every value in sanitizedStrings to return a structure like:
index | categoricalInt1 | categoricalInt2 | sanitizedStrings | some | lowercase | strings | addtnl | here | words
0 | -4 | -5 | ... | 1 | 1 | 1 | 0 | 0 | 0
1 | 2 | 4 | ... | 0 | 1 | 1 | 1 | 1 | 0
2 | 3 | 3 | ... | 0 | 0 | 0 | 0 | 0 | 1
My overall goal is simple: count all substrings, grouped by the categorical columns. I've managed to aggregate the strings and collapse them into categorical bins, but I'm struggling to put the counts together.
My code so far looks like this:
import string
import pandas as pd

df['Comments'] = df['Comments'].str.lower()

# strip punctuation, keeping '|' as the field separator
punct = string.punctuation.replace('|', '')
transtab = str.maketrans(dict.fromkeys(punct, ''))
df['Comments'] = '|'.join(df['Comments'].tolist()).translate(transtab).split('|')

# remove common filler words; commonStrings is defined elsewhere
pattern = '|'.join([r'\b{}\b'.format(w) for w in commonStrings])
df['SanitizedStrings'] = df['Comments'].str.replace(pattern, '', regex=True)  # regex=True so the \b pattern is treated as a regex
df = df.drop(columns='Comments')
# end splitting bad values out of strings

# group the dataframe on like categories
groupedComments = df.groupby(['categoricalInt1', 'categoricalInt2'], as_index=False, sort=False).agg(' '.join)
print(groupedComments)
Before I realized I needed to categorize these strings by the categoricalInt columns, I was using the following:
groupedComments['SanitizedStrings'].str.split(expand=True).stack().value_counts()
If I could get that to stack by row rather than across columns, I bet we'd be pretty close.
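For reference, here is a minimal pandas-only sketch of that kind of per-row stacking, run on a small stand-in for groupedComments (the data and column names below are only illustrative):

import pandas as pd

# stand-in for groupedComments from the snippet above
groupedComments = pd.DataFrame({
    'categoricalInt1': [-4, 2, 3],
    'categoricalInt2': [-5, 4, 3],
    'SanitizedStrings': ['some lowercase strings',
                         'addtnl lowercase strings here',
                         'words'],
})

# split(expand=True).stack() gives one word per (row, position) pair;
# grouping on level 0 counts words per original row instead of overall
counts = (groupedComments['SanitizedStrings']
          .str.split(expand=True)
          .stack()
          .groupby(level=0)
          .value_counts()
          .unstack(fill_value=0))

print(groupedComments.join(counts))

The join places the per-word counts next to the original columns, which is roughly the shape of the desired table above.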
This isn't a particularly elegant solution, and I'm not sure how much data you're working with, but you could use an apply function to add the extra columns.
After reading your comments, it looks like you also want to group by the categorical columns.
That can be done as well by adjusting the columns you create.
import pandas as pd
import numpy as np

## Make test data
string_list = ['some lowercase string', 'addtnl lowercase strings here', 'words']
categorical_int = [-4, 3, 2]
df = pd.DataFrame(zip(categorical_int, string_list), columns=['categoricalInt1', 'sanitizedStrings'])

# create apply function
def add_cols(row):
    col_dict = {}
    new_cols = row['sanitizedStrings'].split(' ')
    for col in new_cols:
        if col not in col_dict.keys():
            col_dict[col] = 1
        else:
            col_dict[col] += 1
    for key, value in col_dict.items():
        # add _ so we can query these columns later
        row['_' + key] = value
    return row

# run apply function on dataframe
final_df = df.apply(add_cols, axis=1).fillna(0)
final_df
_addtnl _here _lowercase _some _string _strings _words \
0 0.0 0.0 1.0 1.0 1.0 0.0 0.0
1 1.0 1.0 1.0 0.0 0.0 1.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 1.0
categoricalInt1 sanitizedStrings
0 -4 some lowercase string
1 3 addtnl lowercase strings here
2 2 words
#add the group by and sum
final_group = final_df.groupby(['categoricalInt1'])[[col for col in final_df.columns if col.startswith('_')]].sum()
final_group.columns = [col.replace('_','') for col in final_group.columns]
final_group
addtnl here lowercase some string strings words
categoricalInt1
-4 0.0 0.0 1.0 1.0 1.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 1.0
3 1.0 1.0 1.0 0.0 0.0 1.0 0.0
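If apply ever becomes a bottleneck on a larger frame, the same counts can be built with vectorized pandas only. Here is a rough sketch on the same toy data (Series.explode requires pandas 0.25+, and the column names are carried over from the example, not anything mandated by your setup):

import pandas as pd

# same toy data as above
df = pd.DataFrame({
    'categoricalInt1': [-4, 3, 2],
    'sanitizedStrings': ['some lowercase string',
                         'addtnl lowercase strings here',
                         'words'],
})

# one row per word, carrying the categorical label in the index
words = (df.set_index('categoricalInt1')['sanitizedStrings']
           .str.split()
           .explode())  # Series.explode needs pandas 0.25+

# count each word per categoricalInt1 value
word_counts = pd.crosstab(words.index, words,
                          rownames=['categoricalInt1'], colnames=['word'])
print(word_counts)

This yields the same table as final_group, just without building and then stripping the underscore-prefixed columns.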