Vectorizing comparison in pandas
An earlier version of this question has already been answered.
Now I have added a new condition on Machine:
+---------+-----+-------+---------+
| Machine | nr | Time | Event |
+---------+-----+-------+---------+
| a | 70 | 8 | 1 |
| a | 70 | 0 | 1 |
| b | 70 | 0 | 1 |
| c | 74 | 52 | 1 |
| c | 74 | 12 | 2 |
| c | 74 | 0 | 2 |
+---------+-----+-------+---------+
I want to assign the events in the last column. The first entry of each Machine defaults to 1. That is, whenever a new Machine starts, Event restarts from 1.
If Time[i] < 7 and nr[i] != nr[i-1], then Event[i]=Event[i-1]+1.
If Time[i] < 7 and nr[i] == nr[i-1], then Event[i]=Event[i-1].
If Time[i] > 7 then Event[i]=Event[i-1]+1.
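As a baseline for checking any vectorized version, the three rules can be written as a plain loop. This is only a minimal sketch: the function name `assign_events_loop` is made up for illustration, rows of the same Machine are assumed to be contiguous, and Time == 7 (which the rules do not cover) is assumed to behave like Time > 7.

```python
import pandas as pd

def assign_events_loop(df):
    """Reference implementation of the rules, one row at a time (hypothetical helper)."""
    events = []
    for i in range(len(df)):
        # The first row of every Machine restarts the count at 1
        if i == 0 or df["Machine"].iloc[i] != df["Machine"].iloc[i - 1]:
            events.append(1)
        # Time < 7 and unchanged nr: keep the previous Event
        elif df["Time"].iloc[i] < 7 and df["nr"].iloc[i] == df["nr"].iloc[i - 1]:
            events.append(events[-1])
        # Every other case increments (assumption: this includes Time == 7)
        else:
            events.append(events[-1] + 1)
    return pd.Series(events, index=df.index, name="Event")

df = pd.DataFrame({
    "Machine": ["a", "a", "b", "c", "c", "c"],
    "nr": [70, 70, 70, 74, 74, 74],
    "Time": [8, 0, 0, 52, 12, 0],
})
print(assign_events_loop(df).tolist())  # [1, 1, 1, 1, 2, 2]
```

The loop is slow on large frames, but it gives an unambiguous expected result to compare a vectorized solution against.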
How can I vectorize this efficiently? I want to avoid loops.
I tried to extend the existing solution with
m = df.Machine.ne(df.Machine.shift())
o = np.select([t & n, t & ~n, m], [1, 0, 1], 1)
But I realized this does not reset Event to 1 for a new Machine; it only increments it. Any pointers on how to incorporate this?
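The behaviour described above can be reproduced on the sample data. The sketch below assumes the attempt ends with a running total over the whole frame (`o.cumsum()`), since the rest of the attempt is not shown in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Machine": ["a", "a", "b", "c", "c", "c"],
    "nr": [70, 70, 70, 74, 74, 74],
    "Time": [8, 0, 0, 52, 12, 0],
})

t = df.Time.lt(7)                      # Time below the threshold
n = df.nr.ne(df.nr.shift())            # nr differs from the previous row
m = df.Machine.ne(df.Machine.shift())  # Machine differs from the previous row

# np.select takes the first condition that matches, so a row where
# t & ~n is True never reaches m (row 2, the first 'b' row, gets 0)
o = np.select([t & n, t & ~n, m], [1, 0, 1], 1)
print(o.tolist())           # [1, 0, 0, 1, 1, 0]

# Assumption: the running total is taken over the whole frame
print(o.cumsum().tolist())  # [1, 1, 1, 2, 3, 3]
```

A 1 at a Machine boundary is just another increment for a frame-wide cumulative sum, so the count keeps growing across machines instead of restarting.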
The following should produce the output you are looking for:
# Given you have a dataframe as df
# Create a series for grouping and looking for consecutive runs
mach_nr = df["Machine"] + df["nr"].astype("str")
mach_nr_runs = mach_nr.eq(mach_nr.shift())
# Group by the 'Machine'/'nr' combination and take the cumulative sum of the
# consecutive-run flags, so every repeat of a combination adds 1;
# then add 1 so the first row of each combination starts at 1
df["Event"] = (
    mach_nr_runs.groupby(mach_nr)
    .cumsum()
    .astype("int")
    .add(1)
)
# Correct the rows where there were consecutive runs, and where 'Time' < 7
lt_7_runs = (df["Time"] < 7) & mach_nr_runs
df["Event"] -= (
    lt_7_runs.groupby(mach_nr)
    .cumsum()
    .astype("int")
)
df now looks like this:
Machine nr Time Event
0 a 70 8 1
1 a 70 0 1
2 b 70 0 1
3 c 74 52 1
4 c 74 12 2
5 c 74 0 2
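The sample never changes nr within a single Machine, so it may be worth checking such a case separately, because the grouping key in this approach combines Machine and nr. The two-row frame below is made up for illustration, and the boolean flags are cast to int before grouping, which should not change the result.

```python
import pandas as pd

# Hypothetical input: nr changes from 70 to 71 within machine 'a'
df = pd.DataFrame({"Machine": ["a", "a"], "nr": [70, 71], "Time": [8, 0]})

mach_nr = df["Machine"] + df["nr"].astype("str")
mach_nr_runs = mach_nr.eq(mach_nr.shift())

# Same two steps as in the answer: count repeats per combination, then
# subtract the repeats that had Time < 7
event = mach_nr_runs.astype("int").groupby(mach_nr).cumsum().add(1)
lt_7_runs = (df["Time"] < 7) & mach_nr_runs
event -= lt_7_runs.astype("int").groupby(mach_nr).cumsum()

print(event.tolist())  # [1, 1]
```

By the stated rules the second row has Time < 7 and a changed nr, so it would be expected to get Event 2. The count restarts here because each Machine/nr combination forms its own group.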
Based on your previous question (and its excellent answer), you can groupby('Machine') and apply the function as if you only had a single dataframe.
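import numpy as np
import pandas as pd

def get_event(x):
    # x holds the rows of a single Machine
    t = x.Time.lt(7)
    n = x.nr.ne(x.nr.shift())
    # 0 keeps the previous Event (Time < 7 and same nr); every other case adds 1
    o = np.select([t & n, t & ~n], [1, 0], 1)
    o[0] = 1  # You say first value is 1
    return pd.Series(o.cumsum(), index=x.index)

df['Event'] = df.groupby('Machine', group_keys=False).apply(get_event)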
This builds on your previous solution. It gives the correct result on your sample:
t = df.Time.lt(7)
n = df.nr.ne(df.nr.shift())
m = df.Machine.ne(df.Machine.shift())
df['Event'] = np.select([m | (t & n), t & ~n], [1, 0], 1)
df['Event'] = df.groupby('Machine').Event.cumsum()
Out[279]:
Machine nr Time Event
0 a 70 8 1
1 a 70 0 1
2 b 70 0 1
3 c 74 52 1
4 c 74 12 2
5 c 74 0 2
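A possible variation on the same idea is to shift nr within each Machine instead of across the whole frame. The first row of every Machine is then compared against NaN and counts as a change without a separate Machine flag. This is a sketch under the same reading of the rules (Time == 7 is treated like Time > 7), and the name `step` is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Machine": ["a", "a", "b", "c", "c", "c"],
    "nr": [70, 70, 70, 74, 74, 74],
    "Time": [8, 0, 0, 52, 12, 0],
})

t = df["Time"].lt(7)
# Per-machine shift: the first row of each Machine sees NaN, so n is True there
n = df["nr"].ne(df.groupby("Machine")["nr"].shift())

# Only "Time < 7 and unchanged nr" keeps the previous Event; all else adds 1
step = pd.Series(np.where(t & ~n, 0, 1), index=df.index)

# Summing per Machine makes the count restart at 1 for every Machine
df["Event"] = step.groupby(df["Machine"]).cumsum()
print(df["Event"].tolist())  # [1, 1, 1, 1, 2, 2]
```

Since both the shift and the cumulative sum are grouped, this version should not depend on the rows of a Machine being contiguous.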