How to slice a pandas DataFrame based on values from another DataFrame without using a for loop?

Question

I have a DataFrame df1:

df1.head() =    
             id    type    position
dates
2000-01-03  17378   600       400
2000-01-03   4203   600       150
2000-01-03  18321   600      5000
2000-01-03   6158   600      1000
2000-01-03    886   600     10000
2000-01-03  17127   600       800
2000-01-03  18317  1300       110
2000-01-03   5536   600       207
2000-01-03   5132   600     20000
2000-01-03  18191   600      2000

I also have a second DataFrame df2:

df2.head() = 

               dt_f       dt_l
id_y  id_x
670   715   2000-02-14 2003-09-30
704   2963  2000-02-11 2004-01-13
886   18350 2000-02-09 2001-09-24
1451  18159 2005-11-14 2007-03-06
2175  8648  2007-02-28 2007-09-19
2236  18321 2001-04-05 2002-07-02
2283  2352  2007-03-07 2007-09-19
      6694  2007-03-07 2007-09-17
      13865 2007-04-19 2007-09-19
      14348 2007-08-10 2007-09-19
      15415 2007-03-07 2007-09-19
2300  2963  2001-05-30 2007-09-26

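I need to slice df1 for each value of id_x and count the number of rows within the interval dt_f:dt_l. The same must then be done for the values of id_y. Finally, the results should be merged onto df2, giving the following DataFrame: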

df_result.head() = 

               dt_f       dt_l     n_x   n_y
id_y  id_x
670   715   2000-02-14 2003-09-30   8     10 
704   2963  2000-02-11 2004-01-13   13    25 
886   18350 2000-02-09 2001-09-24   32    75
1451  18159 2005-11-14 2007-03-06   48    6

where n_x (n_y) is the number of rows that fall within the interval dt_f:dt_l for each value of id_x (id_y).

Here is the for loop I am using:

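idx_list = df2.index.tolist()
n_x, n_y = [], []
for k, j in enumerate(idx_list):
    # Date window for this (id_y, id_x) pair
    start, end = df2['dt_f'].iloc[k], df2['dt_l'].iloc[k]
    # Count the df1 rows of each id that fall inside the window
    n_y.append(df1[df1.id == j[0]].loc[start:end, 'id'].count())
    n_x.append(df1[df1.id == j[1]].loc[start:end, 'id'].count())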

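Is it possible to do this without a for loop? df1 contains about 30,000 rows, and I'm afraid a loop will slow things down considerably, since this is only a small part of the whole script.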

Answer 1

You want something like this:

# Merge the tables, keeping df1's date index as a regular column ('dates')
mg = df1.reset_index().merge(df2.reset_index(), left_on='id', right_on='id_x')

# Keep only the rows whose date falls within [dt_f, dt_l] (inclusive, like the label slice)
mg = mg[(mg['dates'] >= mg['dt_f']) & (mg['dates'] <= mg['dt_l'])]

# Finally, count the rows for each (id_y, id_x) pair
mg.groupby(['id_y', 'id_x']).size()

After that, you need to tidy up the columns and repeat the process for id_y.
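For completeness, here is a rough sketch of how the tidy-up and the id_y pass could look. The toy data and the helper name count_in_window are made up for illustration and are not from the question. It also assumes the date index of df1 is named dates, as in the printout, and that the interval bounds are inclusive. Pairs with no matching rows are filled with 0.

```python
import pandas as pd

# Toy stand-ins for df1 and df2 (values invented for illustration only)
df1 = pd.DataFrame(
    {"id": [886, 886, 18350, 886, 18350, 715],
     "type": [600] * 6,
     "position": [100, 200, 300, 400, 500, 600]},
    index=pd.DatetimeIndex(
        ["2000-01-03", "2000-03-01", "2000-03-01",
         "2001-01-10", "2002-01-01", "2000-05-05"], name="dates"),
)
df2 = pd.DataFrame(
    {"dt_f": pd.to_datetime(["2000-02-14", "2000-02-09"]),
     "dt_l": pd.to_datetime(["2003-09-30", "2001-09-24"])},
    index=pd.MultiIndex.from_tuples([(670, 715), (886, 18350)],
                                    names=["id_y", "id_x"]),
)


def count_in_window(df1, df2, key):
    """Hypothetical helper: for each (id_y, id_x) pair, count the df1 rows
    whose id equals the pair's `key` value and whose date is in [dt_f, dt_l]."""
    pairs = df2.reset_index()
    mg = df1.reset_index().merge(pairs, left_on="id", right_on=key)
    inside = mg[(mg["dates"] >= mg["dt_f"]) & (mg["dates"] <= mg["dt_l"])]
    # Group by the full pair, because the same id can occur in several pairs
    return inside.groupby(["id_y", "id_x"]).size()


n_x = count_in_window(df1, df2, "id_x").rename("n_x")
n_y = count_in_window(df1, df2, "id_y").rename("n_y")

# Left-join the counts back onto df2; pairs without any match get 0
df_result = (
    df2.join(n_x).join(n_y)
       .fillna({"n_x": 0, "n_y": 0})
       .astype({"n_x": int, "n_y": int})
)
print(df_result)
```

Note that the merge creates one row per (df1 row, matching pair) before filtering, so memory use grows if an id appears in many pairs. For roughly 30,000 rows that should normally be fine.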


python

merge

vectorization

slice

pandas