Pandas - concatenating selected rows from several groups with additional condition

I have an example DataFrame:

+─────────────+────────+────────+────────+
| main_group  | COL_A  | COL_B  | COL_C  |
+─────────────+────────+────────+────────+
| 0           | TXT1   |        | None   |
| 0           | TXT2   |        | None   |
| 0           | 5      |        | None   |
| 0           | 1.93   | 1.93   | 0      |
| 0           | 7.60   | 7.60   | 1      |
| 0           | 2.46   | 2.46   | 1      |
| 1           | TXT11  |        | None   |
| 1           | TXT12  |        | None   |
| 1           | 0.50   |        | None   |
| 1           | 0.45   | 0.45   | 0      |
| 1           | 0.31   | 0.31   | 1      |
| 1           | 0.35   | 0.35   | 1      |
| 1           | 0.73   | 0.73   | 1      |
| 2           | 0.5    |        | None   |
| 2           | 4.15   | 4.15   | 0      |
| 2           | 2.98   | 2.98   | 0      |
| 2           | 1.53   | 1.53   | 0      |
| 3           | 4.46   |        | None   |
| 3           | 4.00   | 4.00   | 0      |
| 3           | 0.95   | 0.95   | 1      |
| 3           | 1.35   | 1.35   | 1      |
| 3           | 1.79   | 1.79   | 1      |
+─────────────+────────+────────+────────+
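To reproduce the example, the frame above could be built roughly as follows. This is only a sketch: the question does not state the dtypes, so it assumes that every COL_A and COL_B entry is a string, that empty COL_B cells are empty strings, and that missing COL_C entries are None (which pandas may store as NaN).

```python
import pandas as pd

# Assumption: COL_A / COL_B are strings, blank COL_B cells are "",
# and a missing COL_C is None (pandas may convert it to NaN).
rows = [
    (0, "TXT1", "", None), (0, "TXT2", "", None), (0, "5", "", None),
    (0, "1.93", "1.93", 0), (0, "7.60", "7.60", 1), (0, "2.46", "2.46", 1),
    (1, "TXT11", "", None), (1, "TXT12", "", None), (1, "0.50", "", None),
    (1, "0.45", "0.45", 0), (1, "0.31", "0.31", 1), (1, "0.35", "0.35", 1),
    (1, "0.73", "0.73", 1),
    (2, "0.5", "", None), (2, "4.15", "4.15", 0), (2, "2.98", "2.98", 0),
    (2, "1.53", "1.53", 0),
    (3, "4.46", "", None), (3, "4.00", "4.00", 0), (3, "0.95", "0.95", 1),
    (3, "1.35", "1.35", 1), (3, "1.79", "1.79", 1),
]
df = pd.DataFrame(rows, columns=["main_group", "COL_A", "COL_B", "COL_C"])

# Number of rows per group and number of rows with a missing COL_C
group_sizes = df.groupby("main_group").size()
n_missing = int(df["COL_C"].isna().sum())
```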

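Within each main_group, I want to take the COL_A value from the last row that has None in COL_C and move it into COL_B of the first row of that group. The row the value came from should then be deleted.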

This is how it should look for main_group == 0:

+─────────────+────────+────────+────────+
| main_group  | COL_A  | COL_B  | COL_C  |
+─────────────+────────+────────+────────+
| 0           | TXT1   | 5      | None   | <--- value "5" from the last row with "None" in `COL_C` in `main_group` == 0 was moved to the first row in the same group
| 0           | TXT2   |        | None   |
| 0           | 5      |        | None   | <--- After that this row should be deleted
| 0           | 1.93   | 1.93   | 0      |
| 0           | 7.60   | 7.60   | 1      |
| 0           | 2.46   | 2.46   | 1      |
+─────────────+────────+────────+────────+
+─────────────+────────+────────+────────+
| main_group  | COL_A  | COL_B  | COL_C  |
+─────────────+────────+────────+────────+
| 2           | 0.5    | 0.5    | None   | <---  value in column `COL_B` should be same as in column `COL_A` because there are no other rows in the same `main_group` with "None" in column `COL_C`
| 2           | 4.15   | 4.15   | 0      |
| 2           | 2.98   | 2.98   | 0      |
| 2           | 1.53   | 1.53   | 0      |
| 3           | 4.46   | 4.46   | None   | <---  value in column `COL_B` should be same as in column `COL_A` because there are no other rows in the same `main_group` with "None" in column `COL_C`
| 3           | 4.00   | 4.00   | 0      |
| 3           | 0.95   | 0.95   | 1      |
| 3           | 1.35   | 1.35   | 1      |
| 3           | 1.79   | 1.79   | 1      |
+─────────────+────────+────────+────────+

After this operation, df should look like this:

+─────────────+────────+────────+────────+
| main_group  | COL_A  | COL_B  | COL_C  |
+─────────────+────────+────────+────────+
| 0           | TXT1   | 5      | None   |
| 0           | TXT2   |        | None   |
| 0           | 1.93   | 1.93   | 0      |
| 0           | 7.60   | 7.60   | 1      |
| 0           | 2.46   | 2.46   | 1      |
| 1           | TXT11  | 0.50   | None   |
| 1           | TXT12  |        | None   |
| 1           | 0.45   | 0.45   | 0      |
| 1           | 0.31   | 0.31   | 1      |
| 1           | 0.35   | 0.35   | 1      |
| 1           | 0.73   | 0.73   | 1      |
| 2           | 0.5    | 0.5    | None   |
| 2           | 4.15   | 4.15   | 0      |
| 2           | 2.98   | 2.98   | 0      |
| 2           | 1.53   | 1.53   | 0      |
| 3           | 4.46   | 4.46   | None   |
| 3           | 4.00   | 4.00   | 0      |
| 3           | 0.95   | 0.95   | 1      |
| 3           | 1.35   | 1.35   | 1      |
| 3           | 1.79   | 1.79   | 1      |
+─────────────+────────+────────+────────+
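The move-and-delete step described above can be sketched without an explicit loop. This is a minimal illustration on a reduced frame, not a tested solution for the full data: the names `none_rows` and `pos` are made up for the example, and it assumes that the None rows come first in each group, so that the first None row is also the first row of the group.

```python
import pandas as pd

# Reduced example: group 0 has three None rows, group 2 has only one.
df = pd.DataFrame({
    "main_group": [0, 0, 0, 0, 2, 2],
    "COL_A": ["TXT1", "TXT2", "5", "1.93", "0.5", "4.15"],
    "COL_B": ["", "", "", "1.93", "", "4.15"],
    "COL_C": [None, None, None, 0, None, 0],
})

# Rows whose COL_C is missing
none_rows = df[df["COL_C"].isna()]

# Index label of the first and last None row of every group
pos = (none_rows.index.to_series()
       .groupby(none_rows["main_group"])
       .agg(["first", "last"]))

# Copy COL_A of the last None row into COL_B of the first None row
df.loc[pos["first"].to_numpy(), "COL_B"] = (
    df.loc[pos["last"].to_numpy(), "COL_A"].to_numpy()
)

# Delete the donor row, except where it is also the receiving row
to_drop = pos.loc[pos["last"] != pos["first"], "last"]
df = df.drop(to_drop).reset_index(drop=True)
```

Keeping the row when the first and last None rows coincide is what leaves groups such as 2 and 3 with their single None row and COL_B equal to COL_A.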

In the final step, I want to concatenate, within each main_group, the COL_A values of the rows where COL_C is None.

Example:

+─────────────+────────+────────+────────+
| main_group  | COL_A  | COL_B  | COL_C  |
+─────────────+────────+────────+────────+
| 0           | TXT1   | 5      | None   |
| 0           | TXT2   |        | None   |
| 0           | 1.93   | 1.93   | 0      |
| 0           | 7.60   | 7.60   | 1      |
| 0           | 2.46   | 2.46   | 1      |

↓↓↓↓↓↓↓↓↓↓↓

+─────────────+────────────+────────+────────+
| main_group  | COL_A      | COL_B  | COL_C  |
+─────────────+────────────+────────+────────+
| 0           | TXT1 TXT2  | 5      | None   | <--- If there are more than 1 row with "None" in column `COL_C` in each group, then values in column `COL_A` should be "merged" into one row, and all others should be deleted
| 0           | 1.93       | 1.93   | 0      |
| 0           | 7.60       | 7.60   | 1      |
| 0           | 2.46       | 2.46   | 1      |
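The merge shown above can also be sketched with groupby. Again this is only an illustration under assumptions: COL_A holds strings, and the helper names `is_none`, `joined` and `extra_none` are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "main_group": [0, 0, 0, 0, 0],
    "COL_A": ["TXT1", "TXT2", "1.93", "7.60", "2.46"],
    "COL_B": ["5", "", "1.93", "7.60", "2.46"],
    "COL_C": [None, None, 0, 1, 1],
})

is_none = df["COL_C"].isna()

# Join the COL_A values of the None rows per group and broadcast the
# result back onto every None row of that group
joined = (df.loc[is_none]
          .groupby("main_group")["COL_A"]
          .transform(lambda s: " ".join(s)))
df.loc[is_none, "COL_A"] = joined

# Within each group, every None row after the first one is now redundant
extra_none = is_none & is_none.groupby(df["main_group"]).cumsum().gt(1)
df = df[~extra_none].reset_index(drop=True)
```

Dropping the surplus rows with a boolean mask, rather than assigning into them, avoids leaving behind rows filled with NaN.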

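My first attempt was to use .loc to select, in each group, the rows with None in COL_C, and then to assign the value of the last of those rows to the first one (with .iloc). However, this solution is not quite right. I am also sure this can be done with .groupby instead of iterating over each group and searching for the elements, but I cannot get it to work.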

This is the result I get that way:

+─────────────+────────────+────────+────────+
| main_group  | COL_A      | COL_B  | COL_C  |
+─────────────+────────────+────────+────────+
| 0           | TXT1 TXT2  | 5      | None   |
| NaN         | NaN        | NaN    | NaN    |
| NaN         | NaN        | NaN    | NaN    |
| 0           | 1.93       | 1.93   | 0      |
| 0           | 7.60       | 7.60   | 1      |
| 0           | 2.46       | 2.46   | 1      |

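The values are transferred partly correctly, but rows of NaN values remain that should no longer be there. Of course I could drop those rows and reindex df, but this solution relies on loops, which is certainly inefficient for a large df.

How can I avoid looping over the individual groups and swapping values with .loc / .iloc?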

IIUC, try the following (explanations in the comments):

#create indicator column for where COL_C is None
df["indicator"] = df["COL_C"].isnull()

#get the index of the last None value for each main_group
max_null = df.groupby("main_group")["indicator"].transform(lambda x: x.cumsum().idxmax())

#move the COL_A to COL_B for the first index of each group
df["COL_B"] = df["COL_B"].where(df.groupby("main_group").cumcount().ne(0), max_null.map(df["COL_A"]))

#remove the last row with a None value for each main_group
df = df.drop(max_null.unique()).reset_index(drop=True)

#concatenate COL_A per main_group
strings = df.groupby("main_group").apply(lambda x: x[x["indicator"]]["COL_A"].str.cat(sep=","))

#assign concatenated strings to COL_A
df["COL_A"] = df["COL_A"].where(~df["indicator"], df["main_group"].map(strings))

#drop duplicates from COL_A per group and drop the indicator column
df = df.drop_duplicates(["main_group","COL_A"]).drop("indicator", axis=1).reset_index(drop=True)

>>> df
    main_group        COL_A COL_B  COL_C
0            0    TXT1,TXT2     5    NaN
1            0         1.93  1.93    0.0
2            0          7.6   7.6    1.0
3            0         2.46  2.46    1.0
4            1  TXT11,TXT12   0.5    NaN
5            1         0.45  0.45    0.0
6            1         0.31  0.31    1.0
7            1         0.35  0.35    1.0
8            1         0.73  0.73    1.0
9            2         4.15  4.15    0.0
10           2         2.98  2.98    0.0
11           2         1.53  1.53    0.0
12           3            4   4.0    0.0
13           3         0.95  0.95    1.0
14           3         1.35  1.35    1.0
15           3         1.79  1.79    1.0
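The `cumsum().idxmax()` expression used above to locate the last None row may not be obvious, so here is a small standalone illustration. The Series and its index labels are made up for the example.

```python
import pandas as pd

# True marks a row whose COL_C is missing; the labels are arbitrary
indicator = pd.Series([True, True, True, False, False],
                      index=[10, 11, 12, 13, 14])

# The running count stops growing after the last True: 1, 2, 3, 3, 3
running = indicator.cumsum()

# idxmax() returns the first label at which the maximum is reached,
# which is the label of the last True
last_true = running.idxmax()

# With no True at all the running count is all zeros, so idxmax()
# falls back to the first label
no_true = pd.Series([False, False], index=[5, 6]).cumsum().idxmax()
```

When a group has a single None row in first position, this label is also the group's first row, so dropping it removes that row entirely. That appears to be why groups 2 and 3 have no None row left in the output above.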

Here is a way to get the final result you asked for:

import pandas as pd

print('\nInput df:'); print(df)

df = df.assign(range_index=df.index)
gb = df[df['COL_C'].isna()].groupby(['main_group'])
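df2 = pd.concat([
    # first() gives the same value as nth(0) here; since pandas 2.0, nth() keeps
    # the original row index, which would misalign this concat
    gb['range_index'].first(),
    gb.last()['COL_A'].copy().rename('COL_B_update'),
    gb['COL_A'].apply(list).str.slice(stop=-1).str.join(' ')
    ], axis=1).set_index('range_index')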
emptyColA = df2['COL_A'].str.len() == 0
df2.loc[emptyColA, 'COL_A'] = df2.loc[emptyColA, 'COL_B_update']

print('\ndf2:'); print(df2)

df = df.join(df2, on='range_index', rsuffix='_list')

print('\ndf just after join():'); print(df)

df.loc[~df.COL_A_list.isna(), 'COL_A_update'] = df.COL_A_list
df.loc[~df['COL_C'].isna(), 'COL_A_update'] = df.COL_A
df = df.loc[~df.COL_A_update.isna()].drop(columns=['range_index', 'COL_A_list'])

print('\ndf after creating COL_A_update, deleting unwanted rows, and dropping intermediate columns range_index and COL_A_list:'); print(df)

df.loc[df['COL_C'].isna(), 'COL_B'] = df.loc[df['COL_C'].isna(), 'COL_B_update']
df.loc[df['COL_C'].isna(), 'COL_A'] = df.loc[df['COL_C'].isna(), 'COL_A_update']
df = df.drop(columns=['COL_B_update', 'COL_A_update']).reset_index(drop=True)

print('\nOutput df after updating COL_A and COL_B, and dropping intermediate columns COL_A_update and COL_B_update:'); print(df)

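Explanation:

  • Copy the index into a new column range_index
  • Create a groupby object on main_group for the rows where COL_C is None
  • With the groupby object:
    • use nth(0) to get the range_index of the first row in each group's run of None values in COL_C (first() gives the same value here)
    • use last() and Series.rename() to create the column COL_B_update, holding the required value copied from COL_A
    • use apply(list), then slice() and join() on Series.str (a sequence accessor that, confusingly, works on list as well as str), to turn all but the last of each group's COL_A values into a space-separated string
    • concat these three Series into df2, indexed by range_index
  • Where the joined string is empty (the group has only one None row), fill COL_A in df2 from COL_B_update
  • Use join to add the new columns COL_B_update and COL_A_list (COL_A from df2, renamed by rsuffix)
  • Create a new column COL_A_update that holds the joined string for the rows that have one, and the COL_A value for the rows where COL_C is not None
  • Drop all other rows (that is, every row except the first in each run of consecutive rows where COL_C is None), and drop the intermediate columns range_index and COL_A_list
  • Using COL_B_update and COL_A_update, update COL_B and COL_A in the rows where COL_C is None, and drop the intermediate columns COL_A_update and COL_B_update.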
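Since the use of Series.str on lists is the least obvious part of the steps above, here is a small standalone illustration. The two lists are invented for the example and stand for the per-group COL_A values produced by apply(list).

```python
import pandas as pd

# One list per group: a group with three None rows and a group with one
lists = pd.Series([["TXT1", "TXT2", "5"], ["0.5"]])

# .str.slice works element-wise on lists: drop the last item of each list
all_but_last = lists.str.slice(stop=-1)

# .str.join joins the items of each list; an empty list gives ""
joined = all_but_last.str.join(" ")

# Groups with a single None row end up with an empty string, which is
# what the str.len() == 0 check in the answer's code detects
is_empty = joined.str.len() == 0
```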

Input:

   main_group  COL_A COL_B COL_C
0           0   TXT1        None
1           0   TXT2        None
2           0      5        None
3           0   1.93  1.93     0
4           0   7.60  7.60     1
5           0   2.46  2.46     1
6           1  TXT11        None
7           1  TXT12        None
8           1   0.50        None
9           1   0.45  0.45     0
10          1   0.31  0.31     1
11          1   0.35  0.35     1
12          1   0.73  0.73     1
13          2    0.5        None
14          2   4.15  4.15     0
15          2   2.98  2.98     0
16          2   1.53  1.53     0
17          3   4.46        None
18          3   4.00  4.00     0
19          3   0.95  0.95     1
20          3   1.35  1.35     1
21          3   1.79  1.79     1

This is df2 just before the join:

            COL_B_update        COL_A
range_index
0                      5    TXT1 TXT2
6                   0.50  TXT11 TXT12
13                   0.5          0.5
17                  4.46         4.46

df immediately after the join, with the new columns range_index, COL_B_update and COL_A_list:

   main_group  COL_A COL_B COL_C  range_index COL_B_update   COL_A_list
0           0   TXT1        None            0            5    TXT1 TXT2
1           0   TXT2        None            1          NaN          NaN
2           0      5        None            2          NaN          NaN
3           0   1.93  1.93     0            3          NaN          NaN
4           0   7.60  7.60     1            4          NaN          NaN
5           0   2.46  2.46     1            5          NaN          NaN
6           1  TXT11        None            6         0.50  TXT11 TXT12
7           1  TXT12        None            7          NaN          NaN
8           1   0.50        None            8          NaN          NaN
9           1   0.45  0.45     0            9          NaN          NaN
10          1   0.31  0.31     1           10          NaN          NaN
11          1   0.35  0.35     1           11          NaN          NaN
12          1   0.73  0.73     1           12          NaN          NaN
13          2    0.5        None           13          0.5          0.5
14          2   4.15  4.15     0           14          NaN          NaN
15          2   2.98  2.98     0           15          NaN          NaN
16          2   1.53  1.53     0           16          NaN          NaN
17          3   4.46        None           17         4.46         4.46
18          3   4.00  4.00     0           18          NaN          NaN
19          3   0.95  0.95     1           19          NaN          NaN
20          3   1.35  1.35     1           20          NaN          NaN
21          3   1.79  1.79     1           21          NaN          NaN

This is df after creating COL_A_update, deleting the unwanted rows, and dropping the intermediate columns range_index and COL_A_list:

   main_group  COL_A COL_B COL_C COL_B_update COL_A_update
0           0   TXT1        None            5    TXT1 TXT2
3           0   1.93  1.93     0          NaN         1.93
4           0   7.60  7.60     1          NaN         7.60
5           0   2.46  2.46     1          NaN         2.46
6           1  TXT11        None         0.50  TXT11 TXT12
9           1   0.45  0.45     0          NaN         0.45
10          1   0.31  0.31     1          NaN         0.31
11          1   0.35  0.35     1          NaN         0.35
12          1   0.73  0.73     1          NaN         0.73
13          2    0.5        None          0.5          0.5
14          2   4.15  4.15     0          NaN         4.15
15          2   2.98  2.98     0          NaN         2.98
16          2   1.53  1.53     0          NaN         1.53
17          3   4.46        None         4.46         4.46
18          3   4.00  4.00     0          NaN         4.00
19          3   0.95  0.95     1          NaN         0.95
20          3   1.35  1.35     1          NaN         1.35
21          3   1.79  1.79     1          NaN         1.79

Output df after updating COL_A and COL_B, and dropping the intermediate columns COL_A_update and COL_B_update:

   main_group        COL_A COL_B COL_C
0           0    TXT1 TXT2     5  None
1           0         1.93  1.93     0
2           0         7.60  7.60     1
3           0         2.46  2.46     1
4           1  TXT11 TXT12  0.50  None
5           1         0.45  0.45     0
6           1         0.31  0.31     1
7           1         0.35  0.35     1
8           1         0.73  0.73     1
9           2          0.5   0.5  None
10          2         4.15  4.15     0
11          2         2.98  2.98     0
12          2         1.53  1.53     0
13          3         4.46  4.46  None
14          3         4.00  4.00     0
15          3         0.95  0.95     1
16          3         1.35  1.35     1
17          3         1.79  1.79     1