How to construct a DataFrame column recursively with pandas (Python)?
Given a data frame df like this:
id_ val
11111 12
12003 22
88763 19
43721 77
...
I want to add a column diff to df in which each row equals the val in that row minus the diff from the previous row, multiplied by 0.4, plus the diff from the previous row:
diff = (val - diff_previousDay) * 0.4 + diff_previousDay
The diff in the first row equals val * 0.4 for that row. That is, the expected df should be:
id_ val diff
11111 12 4.8
12003 22 11.68
88763 19 14.608
43721 77 ...
I tried:
mul = 0.4
df['diff'] = df.apply(lambda row: (row['val'] - df.loc[row.name, 'diff']) * mul + df.loc[row.name, 'diff'] if int(row.name) > 0 else row['val'] * mul, axis=1)
But I got this error:
TypeError: ("unsupported operand type(s) for -: 'float' and 'NoneType'", 'occurred at index 1')
Do you know how to solve this? Thanks in advance!
You can use:
df.loc[0, 'diff'] = df.loc[0, 'val'] * 0.4
for i in range(1, len(df)):
    df.loc[i, 'diff'] = (df.loc[i, 'val'] - df.loc[i-1, 'diff']) * 0.4 + df.loc[i-1, 'diff']
print(df)
id_ val diff
0 11111 12 4.8000
1 12003 22 11.6800
2 88763 19 14.6080
3 43721 77 39.5648
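Because each step takes the result of the previous step as input, the computation is inherently iterative, which complicates vectorization. You could perhaps use apply with a function that performs the same computation as the loop, but behind the scenes that is also a loop.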
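The sequential dependency described here is a running scan, which can be sketched with `itertools.accumulate` from the standard library. This is only an illustration on the four sample values from the question; the names `MUL`, `vals`, and `step` are hypothetical, and seeding the scan with 0.0 is an assumption chosen so that the first result equals val * 0.4.

```python
from itertools import accumulate

MUL = 0.4
vals = [12, 22, 19, 77]  # sample 'val' column from the question

def step(prev_diff, val):
    # Same update as the loop: move prev_diff 40% of the way toward val
    return (val - prev_diff) * MUL + prev_diff

# initial=0.0 seeds the scan (Python 3.8+); the first emitted item is the
# seed itself, so it is dropped
diffs = list(accumulate(vals, step, initial=0.0))[1:]
print([round(d, 4) for d in diffs])  # [4.8, 11.68, 14.608, 39.5648]
```

It is still a Python-level loop under the hood, so it illustrates the structure of the problem rather than a speedup.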
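If you use apply in pandas, you should not reference the data frame again inside the lambda function.
In all cases, the object used inside the lambda function should be 'row'.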
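Recursive functions are not easily vectorized. However, you can use numba to optimize your algorithm. This should be preferable to a regular loop.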
import numpy as np
from numba import jit

@jit(nopython=True)
def foo(val):
    # Output array, same shape as the input values
    diff = np.zeros(val.shape)
    diff[0] = val[0] * 0.4
    # Each element depends on the previously computed one
    for i in range(1, diff.shape[0]):
        diff[i] = (val[i] - diff[i-1]) * 0.4 + diff[i-1]
    return diff

df['diff'] = foo(df['val'].values)
print(df)
id_ val diff
0 11111 12 4.8000
1 12003 22 11.6800
2 88763 19 14.6080
3 43721 77 39.5648
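For this particular recurrence, which can be rewritten as diff = 0.4 * val + 0.6 * diff_previous, an exponentially weighted mean with `adjust=False` appears to give the same numbers. The following is a minimal sketch under that assumption, not a general answer for arbitrary recursive columns; the names `alpha`, `seeded`, and `smoothed` are hypothetical, and the prepended 0.0 is an assumption that makes the first row equal val * 0.4.

```python
import pandas as pd

df = pd.DataFrame({"id_": [11111, 12003, 88763, 43721],
                   "val": [12, 22, 19, 77]})

alpha = 0.4
# Prepend a 0.0 seed: with adjust=False, ewm starts from the first value,
# so the first real row becomes 0.6 * 0 + 0.4 * val
seeded = pd.concat([pd.Series([0.0]), df["val"].astype(float)],
                   ignore_index=True)
smoothed = seeded.ewm(alpha=alpha, adjust=False).mean()

# Drop the seed and assign positionally to avoid index misalignment
df["diff"] = smoothed.iloc[1:].to_numpy()
print(df)
```

On the sample data this should reproduce the 4.8, 11.68, 14.608, 39.5648 column shown in the answers.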
I just want to add another alternative to jezrael's answer. My answer is similar, but I found it to be much faster:
import pandas as pd

def calc_diff(val: pd.Series) -> pd.Series:
    # Assumes val has the default 0..n-1 integer index
    diff = pd.Series(0.0, index=range(len(val)))
    diff[0] = val[0] * 0.4
    for i in range(1, len(val)):
        diff[i] = (val[i] - diff[i-1]) * 0.4 + diff[i-1]
    return diff

df['diff'] = calc_diff(df['val'])
I tested with 10,000 rows of random numbers: this took 194 ms, compared with 4 seconds for jezrael's method.
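A comparison like this could be reproduced with a small timing harness such as the sketch below. The function names `loop_on_frame` and `loop_on_series`, the row count, and the random seed are all hypothetical choices for illustration; the timings will vary by machine and pandas version, so no figures are claimed here.

```python
import time
import numpy as np
import pandas as pd

def loop_on_frame(df, mul=0.4):
    # Row-by-row reads and writes through DataFrame.loc
    out = df.copy()
    out.loc[0, "diff"] = out.loc[0, "val"] * mul
    for i in range(1, len(out)):
        prev = out.loc[i - 1, "diff"]
        out.loc[i, "diff"] = (out.loc[i, "val"] - prev) * mul + prev
    return out["diff"]

def loop_on_series(df, mul=0.4):
    # Same recurrence, but on standalone Series objects
    val = df["val"]
    diff = pd.Series(0.0, index=range(len(val)))
    diff[0] = val[0] * mul
    for i in range(1, len(val)):
        diff[i] = (val[i] - diff[i - 1]) * mul + diff[i - 1]
    return diff

rng = np.random.default_rng(0)
df = pd.DataFrame({"val": rng.random(2_000)})

for fn in (loop_on_frame, loop_on_series):
    start = time.perf_counter()
    fn(df)
    print(f"{fn.__name__}: {time.perf_counter() - start:.4f} s")
```

Both functions should return the same values, so any difference in the printed times comes from the indexing overhead rather than from the arithmetic.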