Iterating over a dataframe twice: which is the ideal way?

Question

I am trying to build a dataframe for a Sankey chart in Power BI, which needs a source and a destination column like this:

id	Source	Destination
1	Starting a	next point b
1	next point b	final point c
1	final point c	end
2	Starting a	next point b
2	next point b
3	Starting a	next point b
3	next point b	final point c
3	final point c	end

I have a dataframe like this:

ID	flow
1	Starting a
1	next point b
1	final point c
2	Starting a
2	next point b
3	Starting a
3	next point b
3	final point c

I tried iterating over the dataframe twice, as shown below:

df['Destination'] = None  # create the target column before assigning into it
for index, row in df.iterrows():
    for j, r in df.iterrows():
        if row['ID'] == r['ID']:
            if index + 1 == j and "final point c" not in row['flow']:
                # the next row with the same ID is the destination
                df.loc[index, 'Destination'] = r['flow']
            elif "final point c" in row['flow']:
                df.loc[index, 'Destination'] = 'End of flow'

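Because this iterates over the same dataframe twice, the work grows with the square of the number of rows, so it takes a long time when there are many records.

Is there a better way? I looked through similar questions but could not find anything that matches my problem.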

Answer 1

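You can use groupby + shift together with a boolean mask:

# True for rows whose flow is the last step of a complete path
end = df['flow'].str.startswith('final point')
# shift(-1) within each ID gives the next flow; final steps are overwritten with 'end'
df2 = (df.assign(destination=df.groupby('ID')['flow'].shift(-1)
                               .mask(end, 'end')
                 )
         .rename(columns={'flow': 'source'})
       )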

Output:

   ID         source    destination
0   1     Starting a   next point b
1   1   next point b  final point c
2   1  final point c            end
3   2     Starting a   next point b
4   2   next point b            NaN
5   3     Starting a   next point b
6   3   next point b  final point c
7   3  final point c            end
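
For anyone who wants to run the approach above end to end, here is a minimal self-contained sketch. The sample frame is rebuilt from the question's data; the helper name `add_destination` and its `end_prefix` / `end_label` parameters are hypothetical, introduced only for illustration.

```python
import pandas as pd

# Sample data reconstructed from the question
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3, 3, 3],
    "flow": ["Starting a", "next point b", "final point c",
             "Starting a", "next point b",
             "Starting a", "next point b", "final point c"],
})


def add_destination(frame, end_prefix="final point", end_label="end"):
    """Hypothetical helper: return a copy with 'source' and 'destination' columns."""
    # Rows that close a complete path
    is_end = frame["flow"].str.startswith(end_prefix)
    # Next flow within the same ID; the last row of each ID becomes NaN
    next_flow = frame.groupby("ID")["flow"].shift(-1)
    # Overwrite the closing rows with the end label
    destination = next_flow.mask(is_end, end_label)
    return frame.assign(destination=destination).rename(columns={"flow": "source"})


df2 = add_destination(df)
print(df2)
```

Each row is visited once per vectorized step, so there is no nested loop over the frame. Rows that end a path without reaching a "final point" step (ID 2 here) keep a missing destination.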

Alternatively, use combine_first to fill the NaN values:

end = df['flow'].str.startswith('final point').map({True: 'end', False: ''})
df2 = (df.assign(destination=df.groupby('ID')['flow'].shift(-1).combine_first(end))
         .rename(columns={'flow': 'source'})
       )

Output:

   ID         source    destination
0   1     Starting a   next point b
1   1   next point b  final point c
2   1  final point c            end
3   2     Starting a   next point b
4   2   next point b               
5   3     Starting a   next point b
6   3   next point b  final point c
7   3  final point c            end
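
To make the role of combine_first concrete, here is a tiny sketch on two hand-made Series (the variable names `shifted`, `fallback`, and `filled` are illustrative only). It keeps each value from the calling Series where one is present and falls back to the other Series only where the value is missing.

```python
import pandas as pd

# Stand-in for the shifted 'flow' column: missing where there is no next step
shifted = pd.Series(["next point b", None, None])
# Stand-in for the fallback labels: 'end' for final steps, '' otherwise
fallback = pd.Series(["", "end", ""])

# Values from `shifted` win; gaps are filled from `fallback` by index
filled = shifted.combine_first(fallback)
print(filled.tolist())  # ['next point b', 'end', '']
```

This is why the second variant shows an empty string rather than NaN for the incomplete path of ID 2.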

