pandas DataFrame: match a dataframe and a dict by intervals

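I have a question about DataFrames. I have a dataframe of 0.1-second intervals, together with the features that belong to each interval. I want to add a column containing the prediction from an earlier algorithm (whether the interval is silence or sound). I have a dictionary that holds all predicted silent intervals for each recording. My dataframe looks like the following. Here df is filtered on audio_id == 0 and sorted by interval_x.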
     audio_id  interval_x  interval_y  predicted_value
0           0    0.579367    0.679367                0
1           0    0.679367    0.779367                0
2           0    0.779367    0.879367                0
3           0    0.879367    0.979367                0
4           0    0.979367    1.079367                0
...       ...         ...         ...              ...
518         0   50.805830   50.905830                0
519         0   50.905830   51.005830                0
520         0   51.005830   51.105830                0
521         0   51.105830   51.205830                0
522         0   51.205830   51.212938                0

My dictionary containing the silent intervals looks like this:

{'0': [[1.4501383219954658, 2.058138321995466],
 [3.298138321995466, 4.762138321995465],
 [7.682138321995467, 8.266138321995465],
 [11.266138321995466, 11.938138321995465],
 [13.242138321995466, 13.706138321995466],
 [16.73013832199547, 17.82613832199547],
 [24.53813832199547, 25.130138321995467],
 [26.394138321995467, 27.042138321995466],
 [28.21013832199547, 28.722138321995466]],

'1': [[0.0, 0.31253968253968023],
 [4.296539682539681, 5.040539682539681],
 [8.64053968253968, 9.296539682539679],

and so on for each audio file.

What is an efficient way to do this?

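Here is a solution that uses merge_asof to match each interval with the closest silent span, that is, the last one starting at or before the interval's start. An interval is then marked as silent only if it lies entirely inside that span. d is the dictionary from the question and intervals is the dataframe.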

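import pandas as pd

# Flatten the dict into one row per silent span: (audio_id, from_time, to_time).
silent_times = pd.DataFrame.from_records(
    [(file, from_time, to_time) for file, values in d.items()
     for from_time, to_time in values],
    columns=["audio_id", "from_time", "to_time"])
silent_times["audio_id"] = silent_times["audio_id"].astype(int)

parts = []
for inx in intervals.audio_id.unique():
    # merge_asof requires both sides to be sorted by their key column.
    intervals_slice = intervals[intervals.audio_id == inx].sort_values("interval_x")
    silent_times_slice = silent_times[silent_times.audio_id == inx].sort_values("from_time")
    # Attach to each interval the last silent span that starts at or before interval_x.
    t = pd.merge_asof(intervals_slice, silent_times_slice, left_on="interval_x", right_on="from_time")
    # Mark the interval as silent only if it lies entirely inside that span.
    t.loc[(t.interval_x >= t.from_time) & (t.interval_y <= t.to_time), "predicted_value"] = 1
    parts.append(t)
# DataFrame.append was removed in pandas 2.0, so concatenate the pieces instead.
res = pd.concat(parts)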
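
The explicit loop over audio_id values can probably be avoided, since merge_asof accepts a `by` argument that restricts matches to rows with the same group key. The following is a minimal sketch of that variant, not part of the original answer. The sample data is made up for illustration, and the names `left`, `right`, `merged`, `inside` and `result` are hypothetical.

```python
import pandas as pd

# Hypothetical sample data: two recordings with a few intervals each.
intervals = pd.DataFrame({
    "audio_id": [0, 0, 0, 1, 1],
    "interval_x": [1.40, 1.50, 2.00, 0.00, 0.30],
    "interval_y": [1.50, 1.60, 2.10, 0.10, 0.40],
    "predicted_value": 0,
})
d = {"0": [[1.45, 2.05]], "1": [[0.0, 0.31]]}

# One row per silent span, with audio_id converted to int so it matches `intervals`.
silent_times = pd.DataFrame(
    [(int(key), start, end) for key, spans in d.items() for start, end in spans],
    columns=["audio_id", "from_time", "to_time"],
)

# merge_asof needs both frames sorted by the asof key (globally, not per group);
# `by="audio_id"` keeps matches within the same recording.
left = intervals.sort_values("interval_x")
right = silent_times.sort_values("from_time")
merged = pd.merge_asof(left, right, left_on="interval_x",
                       right_on="from_time", by="audio_id")

# Rows without a matching span have NaN bounds, so the comparison is False for them.
inside = (merged["interval_x"] >= merged["from_time"]) & (merged["interval_y"] <= merged["to_time"])
merged.loc[inside, "predicted_value"] = 1

# Restore a per-recording ordering.
result = merged.sort_values(["audio_id", "interval_x"]).reset_index(drop=True)
print(result[["audio_id", "interval_x", "interval_y", "predicted_value"]])
```

With `by`, the key column keeps its name `audio_id` in the output rather than being split into `audio_id_x` and `audio_id_y`, because it is used as a join key instead of being carried along from both sides.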

The result for the dataframe in the question, using these silent intervals:

d = {'0': [
 [1.4501383219954658, 2.058138321995466],
 [3.298138321995466, 4.762138321995465],
 [7.682138321995467, 8.266138321995465],
 [50.01, 51.01]           
 ],
 '1': [
 [0.0, 0.31253968253968023],
 [4.296539682539681, 5.040539682539681],
 [8.64053968253968, 9.296539682539679]]}

is as follows:

print(res[["audio_id_x", "interval_x", "interval_y", "predicted_value"]])
   audio_id_x  interval_x  interval_y  predicted_value
0           0    0.579367    0.679367                0
1           0    0.679367    0.779367                0
2           0    0.779367    0.879367                0
3           0    0.879367    0.979367                0
4           0    0.979367    1.079367                0
5           0   50.805830   50.905830                1
6           0   50.905830   51.005830                1
7           0   51.005830   51.105830                0
8           0   51.105830   51.205830                0
9           0   51.205830   51.212938                0
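
For small inputs, the printed labels can be cross-checked with a brute-force containment test that does not use merge_asof at all. This is only a reference sketch for validation, not the answer's method and not an efficient one. The helper name `is_silent` is hypothetical, and the frame is rebuilt from the ten rows printed above.

```python
import pandas as pd

# The ten rows shown in the printed result (audio_id == 0).
starts = [0.579367, 0.679367, 0.779367, 0.879367, 0.979367,
          50.805830, 50.905830, 51.005830, 51.105830, 51.205830]
ends = [0.679367, 0.779367, 0.879367, 0.979367, 1.079367,
        50.905830, 51.005830, 51.105830, 51.205830, 51.212938]
intervals = pd.DataFrame({"audio_id": 0, "interval_x": starts, "interval_y": ends})

# Silent spans for recording '0', as in the example dictionary.
d = {"0": [[1.4501383219954658, 2.058138321995466],
           [3.298138321995466, 4.762138321995465],
           [7.682138321995467, 8.266138321995465],
           [50.01, 51.01]]}

def is_silent(row):
    """Return 1 if the row's interval lies entirely inside any silent span."""
    spans = d.get(str(int(row["audio_id"])), [])
    return int(any(start <= row["interval_x"] and row["interval_y"] <= end
                   for start, end in spans))

# Row-wise check: O(rows * spans), so only suitable for validating small samples.
intervals["predicted_value"] = intervals.apply(is_silent, axis=1)
print(intervals)
```

Only the two intervals that fall completely within [50.01, 51.01] are labelled 1, which agrees with the output above.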