How to vectorize a function that uses both row and column elements of a dataframe

Question

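I have two inputs in a dataframe and need to create an output that depends on both inputs (same row, different columns) and also on its own previous value (same column, previous row).

This dataframe command creates an example of what I need: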

import pandas as pd
df = pd.DataFrame([[0,0,0], [0,1,0], [0,0,0], [1,1,1], [0,1,1], [0,1,1], [0,0,0], [0,1,0], [0,1,0], [1,1,1], [1,1,1], [0,1,1], [0,1,1], [1,1,1], [0,1,1], [0,1,1], [0,0,0], [0,1,0]], columns=['input_1', 'input_2', 'output'])

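The rules are simple:

If input_1 is 1, the output is 1 (input_1 is the trigger).
The output stays 1 as long as input_2 is also 1 (input_2 acts somewhat like a memory).
In all other cases, the output is 0.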

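The rows are in chronological order: the output of row 0 affects the output of row 1, the output of row 1 affects the output of row 2, and so on. So the output depends on input_1 and input_2, but also on its own previous value.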
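To pin down the rules before optimizing anything, here is a minimal plain-Python sketch of the state machine described above. The function name `reference_output` and the short sample data are made up for illustration, and it assumes that the state before the first row is 0 and that input_1 = 1 always wins, even when input_2 is 0.

```python
import pandas as pd

def reference_output(input_1, input_2):
    """Hypothetical reference: input_1 triggers, input_2 holds, otherwise reset."""
    out = []
    prev = 0  # assumption: the output before the first row is 0
    for a, b in zip(input_1, input_2):
        if a == 1:
            prev = 1      # trigger: output becomes 1
        elif b != 1:
            prev = 0      # neither trigger nor hold: output resets to 0
        # otherwise (a == 0 and b == 1): hold, prev is left unchanged
        out.append(prev)
    return out

df = pd.DataFrame({"input_1": [0, 0, 0, 1, 0, 0, 0, 0],
                   "input_2": [0, 1, 0, 1, 1, 1, 0, 1]})
df["output"] = reference_output(df["input_1"], df["input_2"])
print(df)
```

A slow but obviously correct version like this is mainly useful as a baseline to check faster variants against.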

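I can write code that loops through the dataframe, computing and assigning values with iloc, but it is very slow. I need to run this over thousands of rows in tens of thousands of dataframes, so I am looking for the most efficient way to do it (preferably vectorized). It can be with numpy or any other library/method you know.

I searched and found some questions about vectorization and looping over rows, but I still don't see how to apply those techniques. Example questions: How to iterate over rows in a DataFrame in Pandas? Also this one: What is the most efficient way to loop through dataframes with pandas?

Thanks for your help.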

Answer 1

If I understand correctly, you want to know how to compute the output column. You can do it like this, for example:

import numpy as np
df['output_2'] = (df['input_1'] + df['input_2']).replace(1, np.nan).ffill().replace(2, 1).astype(int)
print(df)

This prints:

    input_1  input_2  output  output_2
0         0        0       0         0
1         0        1       0         0
2         0        0       0         0
3         1        1       1         1
4         0        1       1         1
5         0        1       1         1
6         0        0       0         0
7         0        1       0         0
8         0        1       0         0
9         1        1       1         1
10        1        1       1         1
11        0        1       1         1
12        0        1       1         1
13        1        1       1         1
14        0        1       1         1
15        0        1       1         1
16        0        0       0         0
17        0        1       0         0
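To see why this one-liner works, it may help to split it into its intermediate steps. The sketch below does that on a shorter, made-up sample; the variable names `s`, `held`, `filled` and `output` are only for illustration. Note that this encoding treats every row where exactly one input is 1 as a "hold" row, so it assumes the combination input_1 = 1, input_2 = 0 never occurs.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"input_1": [0, 0, 1, 0, 0, 0],
                   "input_2": [0, 1, 1, 1, 0, 1]})

s = df["input_1"] + df["input_2"]   # 0 = both off, 1 = exactly one on, 2 = both on
held = s.replace(1, np.nan)         # rows with exactly one input on become "unknown"
filled = held.ffill()               # unknown rows copy the last known value
output = filled.replace(2, 1).astype(int)  # map "both on" to 1; 0 stays 0

print(pd.DataFrame({"s": s, "held": held, "filled": filled, "output": output}))
```

The forward fill is what carries the previous row's state into the hold rows, which is how the row-to-row dependency is expressed without an explicit loop.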

Answer 2

As you explained in the discussion above, we have only two inputs, loaded into a pandas dataframe:

df=pd.DataFrame([[0,0], [0,1], [0,0], [1,1], [0,1], [0,1], [0,0], [0,1], [0,1], [1,1], [1,1], [0,1], [0,1], [1,1], [0,1], [0,1], [0,0], [0,1]], columns=['input_1', 'input_2'])

We have to create the output using the following rules:

#1 if input_1 is one, the output is one
#2 if both inputs are zero, the output is zero
#3 if input_1 is zero and input_2 is one, the output holds the previous value
#4 the initial output value is zero

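To generate the output, we can:

copy input_1 to the output
if input_1 is zero and input_2 is one, update the output with the previous value

Because of the rules above, we don't need to update the first output.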

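df['output'] = df.input_1

for idx, row in df.iterrows():
    # Hold rule: keep the previous output when input_1 is 0 and input_2 is 1.
    # Assign through .loc; chained assignment (df.output[idx] = ...) may not update df.
    if (idx > 0) and (row.input_1 == 0) and (row.input_2 == 1):
        df.loc[idx, 'output'] = df.loc[idx - 1, 'output']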

print(df)

The output is:

>>> print(df)
    input_1  input_2  output
0         0        0       0
1         0        1       0
2         0        0       0
3         1        1       1
4         0        1       1
5         0        1       1
6         0        0       0
7         0        1       0
8         0        1       0
9         1        1       1
10        1        1       1
11        0        1       1
12        0        1       1
13        1        1       1
14        0        1       1
15        0        1       1
16        0        0       0
17        0        1       0
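If an explicit loop is kept, one possible variation is to run the same logic over plain NumPy arrays rather than through iterrows and per-cell dataframe assignment, and write the result back once. The helper name `hold_loop` is hypothetical; this is a sketch of the same four rules, not a benchmarked replacement.

```python
import numpy as np
import pandas as pd

def hold_loop(input_1, input_2):
    """Hypothetical helper: same rules as the loop above, on NumPy arrays."""
    a = np.asarray(input_1)
    b = np.asarray(input_2)
    out = a.copy()                      # rules #1, #2 and #4: start from input_1
    for i in range(1, len(out)):
        if a[i] == 0 and b[i] == 1:     # rule #3: hold the previous output
            out[i] = out[i - 1]
    return out

df = pd.DataFrame([[0,0], [0,1], [0,0], [1,1], [0,1], [0,1], [0,0], [0,1], [0,1],
                   [1,1], [1,1], [0,1], [0,1], [1,1], [0,1], [0,1], [0,0], [0,1]],
                  columns=["input_1", "input_2"])
df["output"] = hold_loop(df["input_1"].to_numpy(), df["input_2"].to_numpy())
print(df)
```

Working on arrays avoids building a Series for every row, which is where much of the cost of iterrows tends to come from.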

Update 1

A faster way is to modify the formula proposed by @Andrej:

df['output_2'] = (df['input_1'] + df['input_2'] * 2).replace(2, np.nan).ffill().replace(3, 1).astype(int)

Unmodified, his formula produces the wrong output for the input combination [1, 0]: it keeps the previous output instead of setting it to 1.
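The same forward-fill idea can also be written with NumPy alone. The sketch below is a different technique from the replace/ffill formula, not a rewrite of it: it uses np.maximum.accumulate to find, for each row, the index of the most recent row that is not a hold row. The function name `hold_vectorized` is hypothetical, and rows that hold before any other row has occurred are assumed to output 0.

```python
import numpy as np

def hold_vectorized(input_1, input_2):
    """Hypothetical NumPy-only variant of the trigger/hold rules."""
    a = np.asarray(input_1)
    b = np.asarray(input_2)
    hold = (a == 0) & (b == 1)                   # rows that keep the previous output
    # Own index on non-hold rows, -1 on hold rows; the running maximum then gives
    # the index of the latest non-hold row at or before each position.
    idx = np.where(hold, -1, np.arange(len(a)))
    last = np.maximum.accumulate(idx)
    # On a non-hold row the output equals input_1, so hold rows copy a[last].
    # last == -1 means nothing to copy yet; those rows get the assumed initial 0.
    return np.where(last >= 0, a[np.maximum(last, 0)], 0)

in_1 = [0, 1, 0, 1, 0, 0, 0, 0]
in_2 = [0, 0, 1, 1, 1, 1, 0, 1]
print(hold_vectorized(in_1, in_2))
```

Because it never produces NaN, this variant does not depend on what the first row contains.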

Update 2

This is just to compare the results:

df=pd.DataFrame([[0,0], [1,0], [0,1], [1,1], [0,1], [0,1], [0,0], [0,1], [0,1], [1,1], [1,1], [0,1], [0,1], [1,1], [0,1], [0,1], [0,0], [0,1]], columns=['input_1', 'input_2'])

df['output'] = df.input_1
for idx, row in df.iterrows():
    # Hold rule; assign through .loc instead of chained assignment.
    if (idx > 0) and (row.input_1 == 0) and (row.input_2 == 1):
        df.loc[idx, 'output'] = df.loc[idx - 1, 'output']

df['output_1'] = (df['input_1'] + df['input_2'] * 2).replace(2, np.nan).ffill().replace(3, 1).astype(int)
df['output_2'] = (df['input_1'] + df['input_2']).replace(1, np.nan).ffill().replace(2, 1).astype(int)
print(df)

The result is:

>>> print(df)
    input_1  input_2  output  output_1  output_2
0         0        0       0         0         0
1         1        0       1         1         0
2         0        1       1         1         0
3         1        1       1         1         1
4         0        1       1         1         1
5         0        1       1         1         1
6         0        0       0         0         0
7         0        1       0         0         0
8         0        1       0         0         0
9         1        1       1         1         1
10        1        1       1         1         1
11        0        1       1         1         1
12        0        1       1         1         1
13        1        1       1         1         1
14        0        1       1         1         1
15        0        1       1         1         1
16        0        0       0         0         0
17        0        1       0         0         0
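A comparison on one hand-written table only covers the cases that happen to be in it. As a rough sketch, the modified formula can also be checked against the loop on random inputs. The seed, the size of 1000 rows and the variable names are arbitrary choices for illustration. One assumption is added here: a fillna(0) after ffill, so that a dataframe starting with input_1 = 0, input_2 = 1 gets the initial value 0 from rule #4 instead of a NaN that astype(int) cannot convert.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 2, size=(1000, 2)), columns=["input_1", "input_2"])

# Reference result from an explicit loop over arrays (rules #1 to #4).
a = df["input_1"].to_numpy()
b = df["input_2"].to_numpy()
expected = a.copy()
for i in range(1, len(expected)):
    if a[i] == 0 and b[i] == 1:
        expected[i] = expected[i - 1]

# Modified formula: 0 = reset, 1 or 3 = trigger, 2 = hold.
code = df["input_1"] + df["input_2"] * 2
formula = code.replace(2, np.nan).ffill().fillna(0).replace(3, 1).astype(int)

print((formula.to_numpy() == expected).all())
```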

Tags: python, numpy, vectorization, pandas