How to vectorize code with nested if statements and loops in Python?

Question

I have a dataframe like the one shown below

import pandas as pd

df = pd.DataFrame({
    'subject_id' :[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2],
    'day':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
    'PEEP' :[7,5,10,10,11,11,14,14,17,17,21,21,23,23,25,25,22,20,26,26,5,7,8,8,9,9,13,13,15,15,12,12,15,15,19,19,19,22,22,15]
})
df['fake_flag'] = ''

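I am applying the logic shown in the code below. The code works fine and produces the expected output, but I can't use this approach on the real dataset because it has more than a million records.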

t1 = df['PEEP']
for i in t1.index:
    if i >= 2:
        print("current value is  ", t1[i])
        print("preceding 1st (n-1) ", t1[i-1])
        print("preceding 2nd (n-2) ", t1[i-2])
        # rule 1: the two preceding values are equal, or (n-2) is the larger one
        if (t1[i-1] == t1[i-2] or t1[i-2] >= t1[i-1]):
            r1_output = t1[i-2]  # the max of the two preceding values; when they are equal, either one works
            print("rule 1 output is ", r1_output)
            # rule 2: the current value is at least 3 above the rule 1 output
            if t1[i] >= r1_output + 3:
                print("found a value for rule 2", t1[i])
                # note: t1[i+1] raises a KeyError if this branch is reached on the last row
                print("check for next value is same as current value", t1[i+1])
                if (t1[i] == t1[i+1]):
                    print("fake flag is being set")
                    df.loc[i, 'fake_flag'] = 'fake_vac'  # .loc avoids chained assignment

However, I can't apply this to the real data because it has more than a million records. I am learning Python; can you help me understand how to vectorize my code?

You can refer to this related post to understand the logic. Since the logic itself is correct, I created this post mainly to get help with vectorizing and fixing my code.

I expect my output to look like the following

subject_id = 1

subject_id = 2

Is there an efficient and elegant way to speed up this code for a dataset with a million records?

Answer 1

Does this help?

df.groupby('subject_id')\
  .rolling(3)['PEEP'].apply(lambda x: (x[-1] - x[:2].max()) >= 3, raw=True).fillna(0).astype(bool)

Output:

subject_id    
1           0     False
            1     False
            2      True
            3     False
            4     False
            5     False
            6      True
            7     False
            8      True
            9     False
            10     True
            11    False
            12    False
            13    False
            14    False
            15    False
            16    False
            17    False
            18     True
            19    False
2           20    False
            21    False
            22    False
            23    False
            24    False
            25    False
            26     True
            27    False
            28    False
            29    False
            30    False
            31    False
            32     True
            33    False
            34     True
            35    False
            36    False
            37     True
            38    False
            39    False
Name: PEEP, dtype: bool

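Details:

Group the rows by 'subject_id' with groupby.
Apply rolling with n=3, i.e. a window of three rows.
Take the last value in the window with index -1 and subtract the maximum of the first two values in the window, obtained by slicing.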
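
The rolling lambda above still runs Python code once per window, which may remain slow on millions of rows. Below is a minimal sketch of the same comparison built from per-group shifts instead. The small sample `df` and the names `prev_max` and `jump` are illustrative assumptions, not part of the answer.

```python
import numpy as np
import pandas as pd

# Made-up sample data: two subjects with five readings each.
df = pd.DataFrame({
    'subject_id': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'PEEP':       [7, 5, 10, 10, 11, 5, 7, 8, 12, 12],
})

g = df.groupby('subject_id')['PEEP']

# Largest of the two preceding values within each subject.
# The first two rows of every subject have an incomplete window, so they are NaN.
prev_max = np.maximum(g.shift(1), g.shift(2))

# True where the current value exceeds that maximum by at least 3.
# Comparisons against NaN evaluate to False, so incomplete windows stay False.
jump = (df['PEEP'] - prev_max).ge(3)

print(jump.tolist())
```

The result keeps the original row index of `df`, so it can be assigned straight back as a new column without dealing with the MultiIndex that `groupby(...).rolling(...)` returns.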

Answer 2

Not sure what the story behind this is, but you can certainly vectorize the three if conditions independently and combine them:

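import numpy as np

con1 = t1.shift(2).ge(t1.shift(1))     # rule 1: value at n-2 >= value at n-1
con2 = t1.ge(t1.shift(2).add(3))       # rule 2: current value >= value at n-2 plus 3
con3 = t1.eq(t1.shift(-1))             # rule 3: next value equals the current value

df['fake_flag'] = np.where(con1 & con2 & con3, 'fake VAC', '')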
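
One way to gain confidence in the three masks is to compare them with a plain loop on a small input. The sketch below does that; the helper names `flag_loop` and `flag_vectorized` and the sample values are assumptions made for illustration. The loop here stops one row before the end so that the look-ahead never runs past the last value.

```python
import numpy as np
import pandas as pd

def flag_loop(peep):
    """Reference version: the nested-if rules written as a plain loop."""
    vals = list(peep)
    out = [''] * len(vals)
    # Start at 2 (needs two preceding values), stop before the last row (needs a next value).
    for i in range(2, len(vals) - 1):
        if vals[i - 2] >= vals[i - 1]:              # rule 1
            if vals[i] >= vals[i - 2] + 3:          # rule 2
                if vals[i] == vals[i + 1]:          # rule 3
                    out[i] = 'fake VAC'
    return out

def flag_vectorized(peep):
    """The same three rules expressed as shifted boolean masks."""
    con1 = peep.shift(2).ge(peep.shift(1))
    con2 = peep.ge(peep.shift(2).add(3))
    con3 = peep.eq(peep.shift(-1))
    return np.where(con1 & con2 & con3, 'fake VAC', '')

peep = pd.Series([7, 5, 10, 10, 11, 11, 14, 14])
loop_flags = flag_loop(peep)
vec_flags = flag_vectorized(peep).tolist()

print(loop_flags)
print(vec_flags == loop_flags)
```

Rows without two predecessors or without a successor produce NaN after the shift, and comparisons with NaN are False, which is why the masks need no explicit bounds checks.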

Edit (group by subject_id)

con = lambda x: (x.shift(2).ge(x.shift(1))) & (x.ge(x.shift(2).add(3))) & (x.eq(x.shift(-1)))

df['fake_flag'] = df.groupby('subject_id')['PEEP'].transform(con).map({True:'fake VAC',False:''})
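
To see why the grouping matters, here is a small sketch on made-up data where shifting the whole column lets one subject's last readings act as the "preceding values" of the next subject. The sample `df` and the names `leaky` and `grouped` are illustrative assumptions; the grouped variant calls `shift` on the groupby object directly, which is one possible alternative to `transform` with a lambda.

```python
import pandas as pd

# Made-up data: subject 1 ends with 9, 9 and subject 2 starts with 12, 12.
df = pd.DataFrame({
    'subject_id': [1, 1, 1, 2, 2, 2],
    'PEEP':       [8, 9, 9, 12, 12, 13],
})
s = df['PEEP']

# Shifting the whole column: row 3 (first row of subject 2) sees
# subject 1's values 9, 9 as its two predecessors.
leaky = s.shift(2).ge(s.shift(1)) & s.ge(s.shift(2).add(3)) & s.eq(s.shift(-1))

# Shifting within each subject: predecessors outside the group are NaN,
# and comparisons with NaN are False.
g = df.groupby('subject_id')['PEEP']
grouped = g.shift(2).ge(g.shift(1)) & s.ge(g.shift(2).add(3)) & s.eq(g.shift(-1))

print(leaky.tolist())
print(grouped.tolist())
```

In this example the ungrouped masks flag the first row of subject 2, while the per-subject shifts flag nothing.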

python

vectorization

python-3.x

pandas

pandas-groupby
