Vectorized function with counter on pandas dataframe column
Consider this pandas dataframe, where the condition column is 1 when value is below 5 (or any other threshold).
import pandas as pd
d = {'value': [30,100,4,0,80,0,1,4,70,70],'condition':[0,0,1,1,0,1,1,1,0,0]}
df = pd.DataFrame(data=d)
df
Out[1]:
   value  condition
0     30          0
1    100          0
2      4          1
3      0          1
4     80          0
5      0          1
6      1          1
7      4          1
8     70          0
9     70          0
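What I want is for all consecutive values below 5 to share the same id, and for all values above 5 to get 0 (or NA, or a negative value; it doesn't matter, they just need to be the same). I want to create a new column called new_id that contains these cumulative ids, like this: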
   value  condition  new_id
0     30          0       0
1    100          0       0
2      4          1       1
3      0          1       1
4     80          0       0
5      0          1       2
6      1          1       2
7      4          1       2
8     70          0       0
9     70          0       0
In a very inefficient for loop I would do this (it works):
counter = 1 # next id to hand out; has to be initialised before the loop
for i in range(0,df.shape[0]):
    # note: for i == 0, df.index[i-1] wraps around to the last row
    if (df.loc[df.index[i],'condition'] == 1) & (df.loc[df.index[i-1],'condition']==0):
        new_id = counter # assign new id
        counter += 1
    elif (df.loc[df.index[i],'condition']==1) & (df.loc[df.index[i-1],'condition']!=0):
        new_id = counter-1 # assign current id
    elif (df.loc[df.index[i],'condition']==0):
        new_id = df.loc[df.index[i],'condition'] # assign 0
    df.loc[df.index[i],'new_id'] = new_id
df
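But this is very inefficient, and I have a very large dataset. So I tried different kinds of vectorization, but so far I have not managed to stop it from counting within each "cluster" of consecutive points: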
# First try using cumsum():
df['new_id'] = 0
df['new_id_temp'] = ((df['condition'] == 1)).astype(int).cumsum()
df.loc[(df['condition'] == 1), 'new_id'] = df['new_id_temp']
df[['value', 'condition', 'new_id']]
# Another try using list comprehension but this just does +1:
[row+1 for ind, row in enumerate(df['condition']) if (row != row-1)]
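To make the problem concrete, here is a minimal sketch of what the first attempt produces on the example data. The name first_try is only illustrative, and where() stands in for the temporary column; the counting logic is the same cumsum() as above.

```python
import pandas as pd

df = pd.DataFrame({
    'value': [30, 100, 4, 0, 80, 0, 1, 4, 70, 70],
    'condition': [0, 0, 1, 1, 0, 1, 1, 1, 0, 0],
})

# cumsum() adds one for every row where condition == 1,
# so each row inside a run gets its own number.
running = (df['condition'] == 1).astype(int).cumsum()

# Keep the running count where condition == 1, use 0 elsewhere.
first_try = running.where(df['condition'] == 1, 0)

print(first_try.tolist())  # [0, 0, 1, 2, 0, 3, 4, 5, 0, 0]
```

The ids keep increasing inside a run (1, 2 and then 3, 4, 5) instead of staying constant (1, 1 and then 2, 2, 2).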
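I also tried using apply() with a custom if/else function, but it seems this does not let me use a counter.
There are already plenty of similar posts on this, but none of them keeps the same id for consecutive rows.
Example posts are: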
Maintain count in python list comprehension
python pandas conditional cumulative sum
Welcome to SO! Why not just rely on base Python?
def counter_func(l):
    new_id = [0] # First value is zero in any case
    counter = 0
    for i in range(1, len(l)):
        if l[i] == 0:
            new_id.append(0)
        elif l[i] == 1 and l[i-1] == 0:
            counter += 1
            new_id.append(counter)
        elif l[i] == l[i-1] == 1:
            new_id.append(counter)
        else:
            new_id.append(None)
    return new_id

df["new_id"] = counter_func(df["condition"])
which looks like this:
   value  condition  new_id
0     30          0       0
1    100          0       0
2      4          1       1
3      0          1       1
4     80          0       0
5      0          1       2
6      1          1       2
7      4          1       2
8     70          0       0
9     70          0       0
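Note that counter_func hard-codes 0 for the first element, so it assumes the data never starts with condition == 1. If that cannot be guaranteed, a variant along the following lines might help. This is only a sketch: run_ids is a hypothetical name, and it tracks the previous value explicitly instead of indexing with i-1.

```python
def run_ids(flags):
    """Label each run of consecutive 1s with 1, 2, 3, ...; 0s stay 0."""
    ids = []
    counter = 0
    prev = 0  # treat the position before the first element as a 0
    for flag in flags:
        if flag == 1 and prev == 0:
            counter += 1  # a new run starts here
        ids.append(counter if flag == 1 else 0)
        prev = flag
    return ids

print(run_ids([0, 0, 1, 1, 0, 1, 1, 1, 0, 0]))  # [0, 0, 1, 1, 0, 2, 2, 2, 0, 0]
print(run_ids([1, 1, 0, 1]))                    # [1, 1, 0, 2]
```

Because it iterates over the values directly, it could be called as run_ids(df["condition"].tolist()) regardless of how the dataframe is indexed.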
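Edit:
You can also use numba, which sped up the function considerably for me: from about 1 second down to ~60 ms.
To use it you need to pass a numpy array to the function, which means you have to use df["condition"].values.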
from numba import njit
import numpy as np

@njit
def func(arr):
    res = np.empty(arr.shape[0])
    counter = 0
    res[0] = 0 # First value is zero anyway
    for i in range(1, arr.shape[0]):
        if arr[i] == 0:
            res[i] = 0
        elif arr[i] and arr[i-1] == 0:
            counter += 1
            res[i] = counter
        elif arr[i] == arr[i-1] == 1:
            res[i] = counter
        else:
            res[i] = np.nan
    return res

df["new_id"] = func(df["condition"].values)
You can use cumsum() as in your first attempt, with just a slight modification:
# calculate delta
df['delta'] = df['condition']-df['condition'].shift(1)
# get rid of -1 for the cumsum (replace it by 0)
df['delta'] = df['delta'].replace(-1,0)
# cumulative sum conditional: multiply with condition column
df['cumsum_x'] = df['delta'].cumsum()*df['condition']
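Put together on the example data, the same idea might look like the sketch below. It is a compact variant rather than the exact code above: it assumes shift(fill_value=0), so the first row is treated as if it were preceded by a 0, which avoids the NaN that a plain shift(1) leaves in the first row. The names cond and starts are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'value': [30, 100, 4, 0, 80, 0, 1, 4, 70, 70],
    'condition': [0, 0, 1, 1, 0, 1, 1, 1, 0, 0],
})

cond = df['condition']

# A run starts where condition is 1 and the previous row was 0.
starts = (cond == 1) & (cond.shift(fill_value=0) == 0)

# cumsum() over the run starts numbers the runs; multiplying by
# condition resets rows outside any run to 0.
df['new_id'] = starts.cumsum() * cond

print(df['new_id'].tolist())  # [0, 0, 1, 1, 0, 2, 2, 2, 0, 0]
```

Only the first row of each run increments the counter, so every row in a run shares the same id.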