Iterating over a dataframe twice: which is the ideal way?
I am trying to create a dataframe for a Sankey chart in Power BI, which needs source and destination columns like this:
id | Source | Destination |
---|---|---|
1 | Starting a | next point b |
1 | next point b | final point c |
1 | final point c | end |
2 | Starting a | next point b |
2 | next point b | |
3 | Starting a | next point b |
3 | next point b | final point c |
3 | final point c | end |
I have a dataframe like this:
ID | flow |
---|---|
1 | Starting a |
1 | next point b |
1 | final point c |
2 | Starting a |
2 | next point b |
3 | Starting a |
3 | next point b |
3 | final point c |
I tried iterating over the dataframe twice, like this:
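# Assumes df has the default integer index (0, 1, 2, ...).
df['Destination'] = None  # create the column before filling it row by row
for index, row in df.iterrows():
    for j, r in df.iterrows():
        if row['ID'] == r['ID']:
            # j is the row directly after the current one, within the same ID
            if index + 1 == j and "final point c" not in row['flow']:
                df.loc[index, 'Destination'] = df.loc[j, 'flow']
            elif "final point c" in row['flow']:
                df.loc[index, 'Destination'] = 'End of flow'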
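Because this is a nested loop over the same dataframe, the work grows quadratically with the number of rows, so it takes a long time when there are many records.
Is there a better way? I looked through the similar questions but could not find anything relevant to my problem.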
You can use groupby + shift and some masking:
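# True on rows whose flow is the last step of a completed path
end = df['flow'].str.startswith('final point')
# shift(-1) within each ID gives the next step; mask() writes 'end' on the flagged rows
df2 = (df.assign(destination=df.groupby('ID')['flow'].shift(-1)
                               .mask(end, end.map({True: 'end'}))
                 )
         .rename(columns={'flow': 'source'})
       )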
Output:
   ID         source    destination
0   1     Starting a   next point b
1   1   next point b  final point c
2   1  final point c            end
3   2     Starting a   next point b
4   2   next point b            NaN
5   3     Starting a   next point b
6   3   next point b  final point c
7   3  final point c            end
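To make the groupby + shift idea easier to follow, here is a minimal self-contained sketch that rebuilds the sample data from the question and applies the same steps one at a time. The intermediate names next_flow, is_end and result are only illustrative, and passing the scalar 'end' to mask is a small simplification of the mapping used in the snippet above; it should behave the same on this data.

```python
import pandas as pd

# Sample input copied from the question.
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3, 3, 3],
    "flow": ["Starting a", "next point b", "final point c",
             "Starting a", "next point b",
             "Starting a", "next point b", "final point c"],
})

# Step 1: within each ID, pull the next row's flow up by one position.
# The last row of every group has no successor, so it becomes NaN.
next_flow = df.groupby("ID")["flow"].shift(-1)

# Step 2: flag the rows whose flow marks the end of a completed path.
is_end = df["flow"].str.startswith("final point")

# Step 3: overwrite the flagged rows with the literal label 'end'.
destination = next_flow.mask(is_end, "end")

result = df.rename(columns={"flow": "source"}).assign(destination=destination)
print(result)
```

Since shift works per group, ID 2 never picks up the first row of ID 3 as its destination, which is the check the inner loop in the question was doing by hand.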
Alternative with combine_first to fill the NaN:
end = df['flow'].str.startswith('final point').map({True: 'end', False: ''})
df2 = (df.assign(destination=df.groupby('ID')['flow'].shift(-1).combine_first(end))
         .rename(columns={'flow': 'source'})
       )
Output:
   ID         source    destination
0   1     Starting a   next point b
1   1   next point b  final point c
2   1  final point c            end
3   2     Starting a   next point b
4   2   next point b
5   3     Starting a   next point b
6   3   next point b  final point c
7   3  final point c            end
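If you need this in more than one place, the combine_first variant can be wrapped in a small function. The sketch below is only one possible way to package it: add_destination and all of its parameters are hypothetical names introduced for illustration, not part of pandas, and the defaults simply mirror the values used above.

```python
import pandas as pd


def add_destination(df, id_col="ID", flow_col="flow",
                    end_prefix="final point", end_label="end", fill=""):
    """Hypothetical helper: build a source/destination table from a flow column."""
    # Next step within the same ID; NaN on the last row of each group.
    next_flow = df.groupby(id_col)[flow_col].shift(-1)
    # Fallback for rows without a successor: end_label for completed paths,
    # fill for paths that stop early.
    is_end = df[flow_col].str.startswith(end_prefix)
    fallback = is_end.map({True: end_label, False: fill})
    # combine_first keeps next_flow where it is present and uses fallback elsewhere.
    return (df.assign(destination=next_flow.combine_first(fallback))
              .rename(columns={flow_col: "source"}))


# Sample input copied from the question.
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3, 3, 3],
    "flow": ["Starting a", "next point b", "final point c",
             "Starting a", "next point b",
             "Starting a", "next point b", "final point c"],
})

out = add_destination(df)
print(out)
```

Both variants assume the rows of each ID are already in path order, because shift only looks at row position within the group.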