Groupby and A) concatenate matching strings (and/or substrings) B) sum the values

Question

I have a df:

row_numbers    ID     code        amount
   1           med    a           1
   2           med    a, b        1
   3           med    b, c        1
   4           med    c           1
   5           med    d           10
   6           cad    a, b        1 
   7           cad    a, b, d     0
   8           cad    e           2

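Paste of the df above:

I want to group by column ID and A) combine the strings if a string/substring matches (on column code), B) sum the values of column amount.

Expected result:

Explanation:

The row_numbers column has no role in the df here. I only use it to explain the output.

A) Group on column ID and look at column code. The row 1 string, a, matches a substring of row 2. Row 2's substring b matches a substring of row 3. Row 3's substring c matches the row 4 string. Hence row1, row2, row3, and row4 are combined. The row 5 string does not match any string/substring, so it is a separate group.
B) On that basis, add up the values of row1, row2, row3, and row4, and keep row 5 as a separate group.

Thanks in advance for your time and thoughts :).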

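Edit - 1

Live paste.

Expected output:

Explanation:

The data has to be grouped on column id, the values of column code concatenated, and the values of columns units and volume summed. The matching (to be linked) values of column code are color-coded. Row 1 links with row 5 and row 9. Row 9 in turn links with row 3. Hence row1, row5, row9, and row3 are combined. Similarly row2 and row7, and so on. Row8 has no link with any value in group med (column id), so it stays as a separate row.

Thanks!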

Answer 1

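I think the three main ideas for doing what you want are:

Create an accumulator data structure (in this case a DataFrame)
Iterate over pairs of rows, so that each iteration has (currentRow, nextRow)
Pattern-match the current row against the next row, and the next row against the accumulated rows

It is not entirely clear exactly which pattern you want to match, so I assumed that if any letter of the currentRow code appears in the next row's code, the rows are joined.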
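To make that assumption concrete, here is a minimal sketch of the matching rule on its own. The helper name `shares_letter` is hypothetical and not part of the code in this answer; it treats each comma-separated code as a set and reports whether two rows have at least one item in common.

```python
def shares_letter(curr_code, next_code):
    """Return True if the two comma-separated codes share at least one item."""
    curr_letters = {item.strip() for item in curr_code.split(",")}
    next_letters = {item.strip() for item in next_code.split(",")}
    return not curr_letters.isdisjoint(next_letters)


# Codes of rows 1-5 (the "med" group) from the sample data
codes = ["a", "a,b", "b,c", "c", "d"]
for curr_code, next_code in zip(codes, codes[1:]):
    print(curr_code, "->", next_code, shares_letter(curr_code, next_code))
```

The first three pairs match and the last pair (c, d) does not, which is why rows 1 to 4 end up merged and row 5 stays on its own.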

Using data.csv (space-separated) as the example:

row_numbers ID code amount
1 med a 1
2 med a,b 1
3 med b,c 1
4 med c 1
5 med d 10
6 cad a,b 1
7 cad a,b,d 0
8 cad e 2

import pandas as pd
from itertools import zip_longest

def generate_pairs(group):    
    ''' generate pairs (currentRow, nextRow) '''
    group_curriterrows = group.iterrows()
    group_nextiterrows = group.iterrows()
    next(group_nextiterrows)  # advance so this iterator runs one row ahead
    zip_list = zip_longest(group_curriterrows, group_nextiterrows) 
    return zip_list

def generate_lists_to_check(currRow, nextRow, accumulated_rows):
    ''' generate list if any next letters are in current ones and
    another list if any next letters are in the accumulated codes '''
    currLetters = str(currRow["code"]).split(",")
    nextLetters = str(nextRow["code"]).split(",")
    letter_inNext = [letter in nextLetters for letter in currLetters]
    unique_acc_codes = [str(v) for v in accumulated_rows["code"].unique()]
    letter_inHistory = [any(letter in unq for letter in nextLetters) 
                                           for unq in unique_acc_codes]
    return letter_inNext, letter_inHistory

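def create_newRow(accumulated_rows, nextRow):
    ''' append nextRow as a new accumulated row '''
    nextRow = nextRow.copy()  # do not modify the row that belongs to the group
    nextRow["row_numbers"] = str(nextRow["row_numbers"])
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    accumulated_rows = pd.concat([accumulated_rows, nextRow.to_frame().T],
                                 ignore_index=True)
    return accumulated_rows

def update_existingRow(accumulated_rows, match_idx, Row):
    ''' merge Row into the accumulated row labelled match_idx '''
    # a single .loc[row, column] call writes to the frame itself;
    # chained indexing (.loc[row][column]) may write to a temporary copy
    accumulated_rows.loc[match_idx, "code"] += "," + Row["code"]
    accumulated_rows.loc[match_idx, "amount"] += Row["amount"]
    # remove the next line if the data has no volume column
    accumulated_rows.loc[match_idx, "volume"] += Row["volume"]
    accumulated_rows.loc[match_idx, "row_numbers"] += "," + str(Row["row_numbers"])
    return accumulated_rows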

if __name__ == "__main__":
    df = pd.read_csv("extended.tsv", sep=" ")
    groups = []  # one accumulated DataFrame per ID
    for ID, group in df.groupby("ID", sort=False):
        accumulated_rows = pd.DataFrame(columns=df.columns)
        group_firstRow = group.iloc[0]
        accumulated_rows.loc[len(accumulated_rows)] = group_firstRow.values
        # keep row_numbers as a string so later row numbers can be appended to it
        # (DataFrame.set_value was removed from pandas; .at is the replacement)
        accumulated_rows.at[0, 'row_numbers'] = str(group_firstRow.values[0])
        zip_list = generate_pairs(group)
        for (currRow_idx, currRow), Next in zip_list:
            if Next is not None:
                (nextRow_idx, nextRow) = Next
                letter_inNext, letter_inHistory = \
                        generate_lists_to_check(currRow, nextRow, accumulated_rows)

                if any(letter_inNext):
                    # the next row continues the chain: merge it into the latest accumulated row
                    accumulated_rows = update_existingRow(accumulated_rows, (len(accumulated_rows)-1), nextRow)
                elif any(letter_inHistory):
                    # the next row matches earlier accumulated rows: merge it into the first match,
                    # then fold the remaining matches into that same row
                    matches = [idx for (idx, bool_val) in enumerate(letter_inHistory) if bool_val]
                    first_match_idx = matches[0]
                    accumulated_rows = update_existingRow(accumulated_rows, first_match_idx, nextRow)
                    for match_idx in matches[1:]:
                        accumulated_rows = update_existingRow(accumulated_rows, first_match_idx, accumulated_rows.loc[match_idx])
                        accumulated_rows = accumulated_rows.drop(match_idx)
                else:
                    # no match anywhere: start a new accumulated row
                    accumulated_rows = create_newRow(accumulated_rows, nextRow)
        groups.append(accumulated_rows)
    # DataFrame.append was removed in pandas 2.0; concatenate all groups once
    groups = pd.concat(groups, ignore_index=True)
    print(groups)

OUTPUT for the original rows, using the current code with the volume line removed, because the first example has no volume column:

  row_numbers   ID         code amount
0           1  med  a,a,b,b,c,c      4
1           5  med            d     10
2           6  cad    a,b,a,b,d      1
3           8  cad            e      2

Output for the new example:

  row_numbers   ID                                   code amount volume
0     1,5,9,3  med  CO-96,CO-B15,CO-B15,CO-96,OA-18,OA-18      4      4
1         2,7  med                     CO-16,CO-B20,CO-16      3      3
2         4,6  med                    CO-252,CO-252,CO-45      3      3
3           8  med                                 OA-258      1      1
4       10,13  cad                     PR-96,PR-96,CO-243      4      4
5       11,12  cad                     PR-87,OA-258,PR-87      3      3

Answer 2

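Update: Judging from your latest sample data, this is not a simple data munging problem, and there is no vectorized solution. It relates to graph theory: within each ID group you need to find the connected components and do the calculation on each connected component.

Think of each string as a node of a graph. If 2 strings overlap, they are connected nodes. For every node, you need to traverse all paths connected to it and do the calculation over all connected nodes along those paths. This traversal can be done using depth-first search logic.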
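As a rough, pandas-free illustration of that idea, the sketch below groups row positions into connected components with a depth-first search that uses an explicit stack. The function name `connected_components` is made up for this example, and it assumes each row's code has already been turned into a set.

```python
def connected_components(code_sets):
    """Group positions of code_sets whose sets overlap directly or through a chain."""
    seen = set()
    components = []
    for start in range(len(code_sets)):
        if start in seen:
            continue
        stack = [start]
        component = []
        while stack:
            i = stack.pop()
            if i in seen:
                continue
            seen.add(i)
            component.append(i)
            # push every unvisited row that shares at least one code with row i
            for j, other in enumerate(code_sets):
                if j not in seen and not code_sets[i].isdisjoint(other):
                    stack.append(j)
        components.append(sorted(component))
    return components


# Codes of the "med" rows from the first sample (rows 1-5)
med = [{"a"}, {"a", "b"}, {"b", "c"}, {"c"}, {"d"}]
print(connected_components(med))  # [[0, 1, 2, 3], [4]]
```

Rows 1 to 4 form one component even though row 1 and row 4 share no code directly; they are linked through rows 2 and 3. Summing a column per component then gives the grouped totals.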

However, before handling the depth-first search, you need to preprocess the strings into sets so that overlap can be checked.
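A minimal sketch of that preprocessing on plain strings, outside pandas: the helper name `to_code_set` is hypothetical. Spaces are removed, the string is split on commas, and the pieces become a set, so `set.isdisjoint` can tell whether two rows overlap.

```python
def to_code_set(code):
    """Remove spaces, split on commas, and return the items as a set."""
    return set(code.replace(" ", "").split(","))


row2 = to_code_set("a, b")
row3 = to_code_set("b, c")
row5 = to_code_set("d")

print(row2.isdisjoint(row3))  # False: both contain "b", so the rows are connected
print(row2.isdisjoint(row5))  # True: nothing in common
```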

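Method 1: Recursion

Do the following:

Define a function dfs that runs the depth-first search recursively
Define a function gfunc to use with groupby apply. This function traverses the elements of each ID group and returns the desired dataframe.
Remove any spaces from each string, split it, and convert it to a set using replace, split, and map, then assign the result to a new column new_code in df
Call groupby on ID and apply with the function gfunc. Call droplevel and reset_index to get the desired output

The code is as follows: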

import numpy as np
import pandas as pd

def dfs(node, index, glist, g_checked_rows):        
    # copy so the in-place updates below never write into df's own data
    ret_arr = df.loc[index, ['code', 'amount', 'volume']].values.copy()
    g_checked_rows.add(index)   
    for j, s in glist:
        if j not in g_checked_rows and not node.isdisjoint(s):      
            t_arr = dfs(s, j, glist, g_checked_rows)
            ret_arr[0]  += ', ' + t_arr[0]
            ret_arr[1:] += t_arr[1:]

    return ret_arr

def gfunc(x):   
    checked_rows = set()    
    final = []  
    code_list = list(x.new_code.items())
    for i, row in code_list:
        if i not in checked_rows:       
            final.append(dfs(row, i, code_list, checked_rows))

    return pd.DataFrame(final, columns=['code','units','vol'])

df['new_code'] = df.code.str.replace(' ','').str.split(',').map(set)
df_final = df.groupby('ID', sort=False).apply(gfunc).droplevel(1).reset_index()

Out[16]:
    ID                                        code  units  vol
0  med  CO-96, CO-B15, CO-B15, CO-96, OA-18, OA-18      4    4
1  med                        CO-16, CO-B20, CO-16      3    3
2  med                       CO-252, CO-252, CO-45      3    3
3  med                                      OA-258      1    1
4  cad                        PR-96, PR-96, CO-243      4    4
5  cad                        PR-87, OA-258, PR-87      3    3

Note: I assume your pandas version is 0.24+. If it is < 0.24, the last step needs reset_index and drop instead of droplevel and reset_index, as follows:

df_final = df.groupby('ID', sort=False).apply(gfunc).reset_index().drop('level_1', axis=1)

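Method 2: Iteration

To make this complete, I implemented a version of gfunc that uses an iterative process instead of recursion. The iterative process needs only one function, although that function is more complex. The logic of the iterative process is as follows:

Push the first node onto a deque. Check that the deque is not empty and pop the top node.
If the node is not marked as checked, process it and mark it as checked
Find all of its unmarked neighbors, in the reversed order of the list of nodes, and push them onto the deque
Check whether the deque is empty; if it is not, pop a node from the top of the deque and repeat the process from step 2

The code is as follows: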

from collections import deque

def gfunc_iter(x):
    checked_rows = set()    
    final = []  
    q = deque() 
    code_list = list(x.new_code.items())
    code_list_rev = code_list[::-1]
    for i, row in code_list:
        if i not in checked_rows:           
            q.append((i, row))
            ret_arr = np.array(['', 0, 0], dtype='O')
            while (q):
                n, node = q.pop()
                if n in  checked_rows:
                    continue                
                ret_arr_child = df.loc[n, ['code', 'amount', 'volume']].values
                if not ret_arr[0]:
                    ret_arr = ret_arr_child.copy()
                else:
                    ret_arr[0]  += ', ' + ret_arr_child[0]
                    ret_arr[1:] += ret_arr_child[1:]                    
                checked_rows.add(n)         
                #push to `q` all neighbors in the reversed list of nodes    
                for j, s in code_list_rev: 
                    if j not in checked_rows and not node.isdisjoint(s):
                        q.append((j, s))
            final.append(ret_arr)

    return pd.DataFrame(final, columns=['code','units','vol'])

df['new_code'] = df.code.str.replace(' ','').str.split(',').map(set)
df_final = df.groupby('ID', sort=False).apply(gfunc_iter).droplevel(1).reset_index()

string-concatenation

pattern-matching

pandas

pandas-groupby