How do I conditionally select rows in a Pandas data frame?
I have the following Pandas data frame (first ten rows shown):
   index_x      time_x  total_def_x  index_y      time_y  total_def_y   event_time
0        2  2005.25394     15.72761        3  2005.25667      8.66223  2005.254962
1        4  2005.25941     11.31783        5  2005.26215      2.79943  2005.260101
2       11  2005.27858      8.74810       12  2005.28131      8.50871  2005.279085
3       18  2005.29774      6.31637       19  2005.30048      10.0420  2005.297804
4       52  2005.39083      0.18209       53  2005.39357      4.42270  2005.393209
5       65  2005.42642      2.71002       66  2005.42916      2.61663  2005.428290
6      106  2005.53867     -0.86598      107  2005.54141      0.26263  2005.539240
7      173  2005.72211      7.91387      174  2005.72485     -4.00652  2005.724622
8      201  2005.79877      4.09495      202  2005.80151      8.35356  2005.800502
9      217  2005.84257      6.63870      218  2005.84531     -1.81069  2005.843362
...
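What I want to do is select the time (time_x or time_y) that is closest to event_time, together with the corresponding deformation value (total_def_x or total_def_y), and put those values in a new data frame. The code I have written so far to achieve this is: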
nearest_df = pd.DataFrame(columns=["time", "total_def"])
for et in new_df["event_time"]:
    if abs(et - new_df["time_x"].values) < abs(et - new_df["time_y"].values):
        nearest_df.append(new_df["time_x", "total_def_x"])
    else:
        nearest_df.append(new_df["time_y", "total_def_y"])
However, every time I run this it returns the following error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
When I modify the condition like this:
if (abs(et - new_df['time_x'].values) < abs(et - new_df['time_y'].values)).all():
I get this error instead:
KeyError: ('time_x', 'total_def_x')
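An example of the expected output is a data frame (nearest_df) like the one below, where whichever of time_x and time_y differs less from event_time is selected, along with its respective deformation (total_def_x or total_def_y):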
      time  total_def
2005.25667    8.66223
2005.25941   11.31783
2005.27858    8.74810
Any help would be appreciated.
You can try this:
# Create temporary columns
df["dist_x"] = (df["event_time"] - df["time_x"]).abs()
df["dist_y"] = (df["event_time"] - df["time_y"]).abs()
# Select proper rows
df_x = df.loc[df["dist_x"] < df["dist_y"], ["time_x", "total_def_x"]]
df_y = df.loc[df["dist_x"] >= df["dist_y"], ["time_y", "total_def_y"]]
# Rename and append results
df_x.columns = df_y.columns = ["time", "total_def"]
new_df = pd.concat(objs=[df_x, df_y]).sort_index()
print(new_df)
# Outputs
         time  total_def
0  2005.25394   15.72761
1  2005.25941   11.31783
2  2005.27858    8.74810
3  2005.29774    6.31637
4  2005.39357    4.42270
5  2005.42916    2.61663
6  2005.53867   -0.86598
7  2005.72485   -4.00652
8  2005.80151    8.35356
9  2005.84257    6.63870
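As for the two errors in the question: the `if` raised ValueError because it compared a scalar against whole columns, which yields an array of booleans rather than a single one, and `new_df["time_x", "total_def_x"]` raised KeyError because it looks up one tuple key, whereas a list inside the brackets (`new_df[["time_x", "total_def_x"]]`) selects several columns. Note also that `DataFrame.append` returned a new frame instead of modifying `nearest_df` in place, and it was removed in pandas 2.0. If you would rather not add temporary columns to `df`, the same row-wise choice can be sketched with `numpy.where`. This is only an illustration on the first three rows of your data; the name `x_is_closer` is my own.

```python
import numpy as np
import pandas as pd

# First three rows of the question's data (only the columns used here).
df = pd.DataFrame({
    "time_x": [2005.25394, 2005.25941, 2005.27858],
    "total_def_x": [15.72761, 11.31783, 8.74810],
    "time_y": [2005.25667, 2005.26215, 2005.28131],
    "total_def_y": [8.66223, 2.79943, 8.50871],
    "event_time": [2005.254962, 2005.260101, 2005.279085],
})

# True where time_x is strictly closer to event_time than time_y
# (ties go to the y columns, matching the >= in the answer above).
dist_x = (df["event_time"] - df["time_x"]).abs()
dist_y = (df["event_time"] - df["time_y"]).abs()
x_is_closer = dist_x < dist_y

# Pick each value row by row from the x or y columns; df itself is not modified.
nearest_df = pd.DataFrame(
    {
        "time": np.where(x_is_closer, df["time_x"], df["time_y"]),
        "total_def": np.where(x_is_closer, df["total_def_x"], df["total_def_y"]),
    },
    index=df.index,
)
print(nearest_df)
```

Because the result is built in one step, the original row order is kept without needing `sort_index()`.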