Improving the performance of my code with a Pandas DataFrame (column operations)
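I'm learning Pandas DataFrames and have a question about performance optimization. I'm new to this, and although my code produces the correct output, it doesn't seem to be written well and it performs poorly.
Problem: I have bit patterns of 0s and 1s. I need to find the stride of the 1s (the count of consecutive 1s, for my analysis). My DataFrame is 200,000 columns x 200 rows. It is very slow right now, and I'm looking for a better approach: either a complete solution, or a vectorized operation over all columns to replace the 'for loop'. Example: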
Input: 1,1,1,1,0,0,1,1,0,0,1,1,1
Output: 4,4,4,4,0,0,2,2,0,0,3,3,3 (1 is replaced with the stride for 1)
I've extracted sample code for review. I'd appreciate it if someone could help a newbie.
import timeit
import pandas as pd

start_time = timeit.default_timer()
# Small sample
AA = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0]
AB = [0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0]
AC = [1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0]
AD = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
AE = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
df = pd.DataFrame({"A0": AA, "A1": AB, "A2": AB, "A3": AB, "A4": AB, "A5": AC, "A6": AD, "A7": AE, "A8": AE, "A9": AE})
# End of Debug Data Frame
df2=pd.DataFrame() # initialize to empty
print("Starting")
start_time = timeit.default_timer()
df1=pd.DataFrame(df != df.shift()).cumsum() # Operation-1: detects edges and increments at edge
print("Processing columns. Time=", timeit.default_timer() - start_time)
for c in df1.columns:
    df2[c] = df1.groupby(c)[c].transform('count') * df[c] # This takes maximum time as I am counting column by column
print("Done Processing columns. Time=", timeit.default_timer() - start_time)
For my DataFrame (200,000 columns x 200 rows), the 'for loop' takes about 700 seconds:
Starting
Processing columns. Time= 0.9377922620624304
Done Processing columns. Time= 701.7339988127351
Done generating data. Time= 702.0729111488909
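For checking any faster version against the definition in the example above, a plain-Python reference for a single column may help. This is only a minimal sketch: the function name `stride_of_ones` is made up for illustration, and it assumes the input contains nothing but 0s and 1s.

```python
from itertools import groupby

def stride_of_ones(bits):
    """Replace each 1 with the length of the run of 1s it belongs to; 0s stay 0."""
    out = []
    for value, run in groupby(bits):
        n = len(list(run))
        # A run of 1s contributes its length n times; a run of 0s contributes n zeros.
        out.extend([n if value == 1 else 0] * n)
    return out

print(stride_of_ones([1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1]))
# [4, 4, 4, 4, 0, 0, 2, 2, 0, 0, 3, 3, 3]
```

It is far too slow for 200,000 columns, but it is handy as a ground truth on a few sample columns.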
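Here is an alternative. On the sample DataFrame I'm not sure the speed difference is significant, but it should be on a larger DataFrame. The idea is to use cumsum down the rows (for every column at once), then use mask with the original df cast to Boolean to replace the values in the cumsummed df with missing values (NaN) wherever df is 1. After that, you need some bfill, ffill and fillna to get the expected result.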
df_ = df.cumsum().mask(df.astype(bool)) # Removing pd.NaT helped
df2 = (df_.bfill() - df_.ffill().fillna(0)).fillna(0)
print(df2)
A0 A1 A2 A3 A4 A5 A6 A7 A8 A9
0 1 0 0 0 0 2 0 10 10 10
1 0 8 8 8 8 2 1 10 10 10
2 0 8 8 8 8 0 0 10 10 10
3 2 8 8 8 8 2 1 10 10 10
4 2 8 8 8 8 2 0 10 10 10
5 0 8 8 8 8 0 1 10 10 10
6 0 8 8 8 8 0 0 10 10 10
7 0 8 8 8 8 0 1 10 10 10
8 1 8 8 8 8 1 0 10 10 10
9 0 0 0 0 0 0 1 10 10 10
10 1 1 1 1 1 1 0 0 0 0
11 0 0 0 0 0 0 1 0 0 0
12 5 5 5 5 5 5 0 0 0 0
13 5 5 5 5 5 5 1 0 0 0
14 5 5 5 5 5 5 0 0 0 0
15 5 5 5 5 5 5 1 0 0 0
16 5 5 5 5 5 5 0 0 0 0
17 0 0 0 0 0 0 1 0 0 0
18 0 0 0 0 0 0 0 0 0 0
19 0 0 0 0 0 0 0 0 0 0
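One thing to watch in the output above: column A6 ends with a 1 in row 19, yet the result there is 0. A run of 1s that reaches the last row has no 0 below it to back-fill from, so the final fillna(0) zeroes it out. A possible variant, sketched under the assumption that the data is integer 0/1 with no missing values, fills that trailing gap with the column total instead. The function name `stride_of_ones_df` and the small test frame are made up for illustration.

```python
import pandas as pd

def stride_of_ones_df(df):
    """Run lengths of 1s for every column of a 0/1 DataFrame, without a column loop."""
    csum = df.cumsum()
    # Keep the running count only on the 0 rows; rows holding a 1 become NaN.
    at_zeros = csum.mask(df.astype(bool))
    # Count at the next 0 below each row. A run that reaches the last row has
    # no 0 below it, so the column total is used there instead.
    upper = at_zeros.bfill().fillna(df.sum())
    # Count at the previous 0 above each row; 0 if the column starts with 1s.
    lower = at_zeros.ffill().fillna(0)
    return (upper - lower).astype(int)

df = pd.DataFrame({
    "ends_with_one": [0, 1, 0, 1, 0, 1],
    "all_ones": [1, 1, 1, 1, 1, 1],
    "mixed": [1, 1, 0, 0, 1, 0],
})
print(stride_of_ones_df(df))
```

The cast back to int at the end is optional; it only undoes the float conversion that the NaN masking introduces.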