How to combine multi-index column values on condition using numpy broadcasting
I have a problem that I'm 99% sure has a numpy broadcasting solution, but I can't figure it out. Suppose I have the following dataframe:
import numpy as np
import pandas as pd

iterables = [['US', 'DE'], ['A', 'B'], [1, 2, 3, 4, 5]]
idx3 = pd.MultiIndex.from_product(iterables, names=['v1', 'v2', 'v3'])
df3 = pd.DataFrame(data=np.random.randn(20,2), index=idx3, columns=['c1', 'c2'])
print(df3)
                c1        c2
v1 v2 v3
US A  1  -0.023208 -1.047208
      2   1.128917  0.292252
      3  -0.441574  0.038714
      4   1.057893  1.313874
      5   0.938736 -0.130192
   B  1  -0.479439 -0.311465
      2  -1.730325 -1.300829
      3  -0.112920 -0.269385
      4   1.436866  0.197434
      5   1.659529  2.107746
DE A  1   0.533169 -0.539891
      2   0.225635  1.406626
      3  -0.928966  0.979749
      4  -0.109132  0.862450
      5  -0.481120  1.425678
   B  1   0.592646 -0.573862
      2  -1.135009 -0.365472
      3   0.728357  0.744631
      4   0.156970  0.623244
      5  -0.071628 -0.089194
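Now suppose I want a column c3 such that for values 1-3 of index level v3, c3 equals column c1, and for values 4-5 of index level v3, c3 equals column c2.
Using apply, this should be easy:
df3.reset_index('v3').apply(lambda df: df.c1 if df.v3<=3 else df.c2, axis=1)
But this iterates over every row and checks the condition.
Using boolean indexing I can get this far: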
bool1 = df3.loc[df3.index.get_level_values('v3')<=3,['c1']]
bool2 = df3.loc[df3.index.get_level_values('v3')>3,['c2']]
print(bool1)
                c1
v1 v2 v3
US A  1  -0.023208
      2   1.128917
      3  -0.441574
   B  1  -0.479439
      2  -1.730325
      3  -0.112920
DE A  1   0.533169
      2   0.225635
      3  -0.928966
   B  1   0.592646
      2  -1.135009
      3   0.728357
print(bool2)
                c2
v1 v2 v3
US A  4   1.313874
      5  -0.130192
   B  4   0.197434
      5   2.107746
DE A  4   0.862450
      5   1.425678
   B  4   0.623244
      5  -0.089194
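But I can't figure out how to get this back into my original dataframe. I feel like I'm basically there, but I keep running into a dead end.
Based on your code: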
df3['c3']=pd.concat([bool1.rename(columns={'c1':'c3'}),bool2.rename(columns={'c2':'c3'})])
This is what we usually do, with np.where. The condition is <= 3 so that v3 == 3 takes its value from c1, as the question asks:
df3['c3']=np.where(df3.index.get_level_values('v3')<=3,df3.c1,df3.c2)
df3
Out[1124]:
                c1        c2        c3
v1 v2 v3
US A  1   0.141297  0.304322  0.141297
      2  -0.532937  0.599611 -0.532937
      3   0.480130 -0.601851  0.480130
      4  -0.208570  0.428122  0.428122
      5  -0.775055 -1.842595 -1.842595
   B  1  -0.985807 -0.259167 -0.985807
      2  -0.211140  0.514273 -0.211140
      3   0.006876  0.261158  0.006876
      4  -1.001227  0.069682  0.069682
      5  -0.937359 -0.364904 -0.364904
DE A  1  -0.510380 -1.815965 -0.510380
      2   0.730677  1.901079  0.730677
      3  -0.439140  1.068193 -0.439140
      4   0.183268  1.879705  1.879705
      5  -1.455026  0.958647  0.958647
   B  1   1.491328  2.139492  1.491328
      2  -0.035495  1.487377 -0.035495
      3  -0.503681  0.837837 -0.503681
      4  -2.320546  0.683476  0.683476
      5  -2.407492  0.962752  0.962752
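For anyone who wants to reproduce the np.where approach end to end, here is a minimal self-contained sketch. The frame name demo, the mask name use_c1, and the fixed seed are assumptions added for illustration; the thread's data was unseeded.

```python
import numpy as np
import pandas as pd

# Rebuild a frame shaped like the question's df3 (seed is an assumption).
rng = np.random.default_rng(0)
idx = pd.MultiIndex.from_product(
    [['US', 'DE'], ['A', 'B'], [1, 2, 3, 4, 5]], names=['v1', 'v2', 'v3'])
demo = pd.DataFrame(rng.standard_normal((20, 2)), index=idx,
                    columns=['c1', 'c2'])

# One boolean per row, read straight from the v3 index level.
use_c1 = demo.index.get_level_values('v3') <= 3

# np.where takes c1 where the mask is True and c2 everywhere else.
demo['c3'] = np.where(use_c1, demo['c1'], demo['c2'])
```

The mask is built once for the whole index, so no Python-level loop over rows is involved.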
You can use the Series.where() function to do this efficiently. For example:
>>> df = df3.reset_index('v3')
>>> df['c3'] = df['c1'].where(df['v3'] <= 3, df['c2'])
>>> df
       v3        c1        c2        c3
v1 v2
US A    1  0.220979 -1.361330  0.220979
   A    2 -0.902486  0.931644 -0.902486
   A    3 -0.324257  0.582866 -0.324257
   A    4  0.130595  0.809319  0.809319
   A    5 -1.432045 -1.299859 -1.299859
   B    1 -0.221528 -1.171605 -0.221528
   B    2 -0.025748 -0.244276 -0.025748
   B    3 -0.842640 -0.381956 -0.842640
   B    4  3.051674  0.675167  0.675167
   B    5 -0.232921 -0.553047 -0.553047
DE A    1  0.011917  0.528074  0.011917
   A    2  0.793363 -1.037817  0.793363
   A    3 -0.647931  0.458625 -0.647931
   A    4  0.675414  0.775137  0.775137
   A    5  0.648263  0.462900  0.462900
   B    1 -0.040314  1.427158 -0.040314
   B    2 -1.354950  0.807179 -1.354950
   B    3 -1.051297 -0.671725 -1.051297
   B    4  0.305435 -0.482608 -0.482608
   B    5  1.788918  0.527372  0.527372
Option 0
pd.Series.mask and pd.DataFrame.eval
df3.assign(c3=df3.c1.mask(df3.eval('v3 > 3'), df3.c2))
                c1        c2        c3
v1 v2 v3
US A  1  -0.725168  0.267357 -0.725168
      2   0.737184 -0.675266  0.737184
      3   0.860002  1.158949  0.860002
      4  -0.243702 -0.036094 -0.036094
      5  -0.700788  0.042080  0.042080
   B  1   0.955489  0.207721  0.955489
      2   1.167202 -1.132584  1.167202
      3   1.937948 -1.476343  1.937948
      4   0.385508  0.731786  0.731786
      5  -1.356454 -1.815996 -1.815996
DE A  1  -0.164354 -1.354613 -0.164354
      2  -0.264868  0.182453 -0.264868
      3   1.768679  0.568956  1.768679
      4  -1.790169 -0.298174 -0.298174
      5  -1.242662  1.445414  1.445414
   B  1  -0.081639 -0.464066 -0.081639
      2   0.071672  0.409464  0.071672
      3  -0.770912 -0.432803 -0.770912
      4  -1.616662 -0.642879 -0.642879
      5  -0.815786  0.991889  0.991889
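Series.mask is the mirror image of Series.where: where keeps the original value when the condition is True, while mask replaces it when the condition is True. A small sketch on toy data (the series s1, s2 and v3 below are made up for illustration) shows that the two spellings agree once the condition is inverted:

```python
import pandas as pd

s1 = pd.Series([10, 20, 30, 40, 50])   # stands in for column c1
s2 = pd.Series([-1, -2, -3, -4, -5])   # stands in for column c2
v3 = pd.Series([1, 2, 3, 4, 5])        # stands in for index level v3

kept = s1.where(v3 <= 3, s2)    # keep s1 where v3 <= 3, otherwise take s2
masked = s1.mask(v3 > 3, s2)    # replace s1 where v3 > 3 with s2

print(kept.tolist())            # [10, 20, 30, -4, -5]
print(kept.equals(masked))      # True
```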
Option 1
pd.DataFrame.query and pd.Series.append
df3.assign(c3=df3.query('v3 in [1, 2, 3]').c1.append(df3.query('v3 in [4, 5]').c2))
                c1        c2        c3
v1 v2 v3
US A  1  -0.725168  0.267357 -0.725168
      2   0.737184 -0.675266  0.737184
      3   0.860002  1.158949  0.860002
      4  -0.243702 -0.036094 -0.036094
      5  -0.700788  0.042080  0.042080
   B  1   0.955489  0.207721  0.955489
      2   1.167202 -1.132584  1.167202
      3   1.937948 -1.476343  1.937948
      4   0.385508  0.731786  0.731786
      5  -1.356454 -1.815996 -1.815996
DE A  1  -0.164354 -1.354613 -0.164354
      2  -0.264868  0.182453 -0.264868
      3   1.768679  0.568956  1.768679
      4  -1.790169 -0.298174 -0.298174
      5  -1.242662  1.445414  1.445414
   B  1  -0.081639 -0.464066 -0.081639
      2   0.071672  0.409464  0.071672
      3  -0.770912 -0.432803 -0.770912
      4  -1.616662 -0.642879 -0.642879
      5  -0.815786  0.991889  0.991889
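Series.append was removed in pandas 2.0, so on current versions the query-based option needs pd.concat instead. The sketch below is one way to write it, assuming a frame named demo with deterministic values (both are illustrative choices, not from the thread):

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product(
    [['US', 'DE'], ['A', 'B'], [1, 2, 3, 4, 5]], names=['v1', 'v2', 'v3'])
# Deterministic data: row k holds c1 = 2k and c2 = 2k + 1.
demo = pd.DataFrame(np.arange(40.0).reshape(20, 2), index=idx,
                    columns=['c1', 'c2'])

low = demo.query('v3 in [1, 2, 3]')['c1']    # c1 for v3 in 1..3
high = demo.query('v3 in [4, 5]')['c2']      # c2 for v3 in 4..5

# pd.concat stacks the two pieces; assign() realigns them on the index,
# so the order of the concatenated rows does not matter.
result = demo.assign(c3=pd.concat([low, high]))
```

The same substitution applies wherever .append is called on a Series elsewhere in this thread.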
Option 2
pd.IndexSlice
i1 = pd.IndexSlice[:, :, 1:3]
i2 = pd.IndexSlice[:, :, 4:5]
h = lambda d: d.loc[i1, 'c1'].append(d.loc[i2, 'c2'])
df3.assign(c3=df3.sort_index().pipe(h))
                c1        c2        c3
v1 v2 v3
US A  1  -0.725168  0.267357 -0.725168
      2   0.737184 -0.675266  0.737184
      3   0.860002  1.158949  0.860002
      4  -0.243702 -0.036094 -0.036094
      5  -0.700788  0.042080  0.042080
   B  1   0.955489  0.207721  0.955489
      2   1.167202 -1.132584  1.167202
      3   1.937948 -1.476343  1.937948
      4   0.385508  0.731786  0.731786
      5  -1.356454 -1.815996 -1.815996
DE A  1  -0.164354 -1.354613 -0.164354
      2  -0.264868  0.182453 -0.264868
      3   1.768679  0.568956  1.768679
      4  -1.790169 -0.298174 -0.298174
      5  -1.242662  1.445414  1.445414
   B  1  -0.081639 -0.464066 -0.081639
      2   0.071672  0.409464  0.071672
      3  -0.770912 -0.432803 -0.770912
      4  -1.616662 -0.642879 -0.642879
      5  -0.815786  0.991889  0.991889
Option 3
Tricky pd.DataFrame.eval and numpy slicing
df3.assign(c3=df3.values[np.arange(len(df3)), df3.eval('v3').gt(3).astype(int)])
                c1        c2        c3
v1 v2 v3
US A  1  -0.725168  0.267357 -0.725168
      2   0.737184 -0.675266  0.737184
      3   0.860002  1.158949  0.860002
      4  -0.243702 -0.036094 -0.036094
      5  -0.700788  0.042080  0.042080
   B  1   0.955489  0.207721  0.955489
      2   1.167202 -1.132584  1.167202
      3   1.937948 -1.476343  1.937948
      4   0.385508  0.731786  0.731786
      5  -1.356454 -1.815996 -1.815996
DE A  1  -0.164354 -1.354613 -0.164354
      2  -0.264868  0.182453 -0.264868
      3   1.768679  0.568956  1.768679
      4  -1.790169 -0.298174 -0.298174
      5  -1.242662  1.445414  1.445414
   B  1  -0.081639 -0.464066 -0.081639
      2   0.071672  0.409464  0.071672
      3  -0.770912 -0.432803 -0.770912
      4  -1.616662 -0.642879 -0.642879
      5  -0.815786  0.991889  0.991889
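The "tricky" part of this option is NumPy integer-array indexing: passing a row-index array and a column-index array of the same length picks one element per (row, column) pair. A stripped-down sketch with made-up values (values and v3 are illustrative stand-ins for df3.values and the v3 level):

```python
import numpy as np

values = np.array([[1.0, -1.0],
                   [2.0, -2.0],
                   [3.0, -3.0],
                   [4.0, -4.0],
                   [5.0, -5.0]])   # column 0 plays c1, column 1 plays c2
v3 = np.array([1, 2, 3, 4, 5])

col = (v3 > 3).astype(int)         # 0 selects c1, 1 selects c2
rows = np.arange(len(values))      # 0, 1, ..., n-1
picked = values[rows, col]         # element i is values[rows[i], col[i]]

print(picked.tolist())             # [1.0, 2.0, 3.0, -4.0, -5.0]
```

This relies on c1 and c2 being columns 0 and 1 of the underlying array, so it is sensitive to column order.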
Here's a NumPy approach leveraging broadcasting -
# Get index level and the indices
l = df3.index.names.index('v3')
i = np.array(df3.index.levels[l])
# Get array data
a = df3.values.reshape(-1,len(i),2)
# Finally use np.where to choose between the two cols based on indices
df3['c3'] = (np.where(i <= 3,a[:,:,0],a[:,:,1])).ravel()
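To see where the broadcasting happens, it helps to look at the shapes. The sketch below uses a deterministic frame named demo (an illustrative stand-in for df3) and compares the reshaped result with a reference that does not depend on row order. It assumes, as the reshape itself does, that the index is a complete product with v3 varying fastest, which is what MultiIndex.from_product produces.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product(
    [['US', 'DE'], ['A', 'B'], [1, 2, 3, 4, 5]], names=['v1', 'v2', 'v3'])
demo = pd.DataFrame(np.arange(40.0).reshape(20, 2), index=idx,
                    columns=['c1', 'c2'])

pos = demo.index.names.index('v3')
levels = np.array(demo.index.levels[pos])        # shape (5,)
a = demo.values.reshape(-1, len(levels), 2)      # shape (4, 5, 2)

# The (5,) mask is broadcast across the 4 rows of each (4, 5) slice,
# so the comparison runs on 5 values instead of 20.
c3 = np.where(levels <= 3, a[:, :, 0], a[:, :, 1]).ravel()

# Reference computed per row from the index level values.
ref = np.where(demo.index.get_level_values('v3') <= 3,
               demo['c1'], demo['c2'])
print(np.array_equal(c3, ref))                   # True
```

If the index were unsorted or had missing combinations, the reshape would silently misalign rows, so the per-row reference is the safer choice in that case.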
Runtime test
Approaches -
def reset_where_app(df3): # @jakevdp's soln
    df = df3.reset_index('v3')
    df['c3'] = df['c1'].where(df['v3'] <= 3, df['c2'])
    return df

def concat_app(df3): # @Wen's soln
    bool1 = df3.loc[df3.index.get_level_values('v3')<=3,['c1']]
    bool2 = df3.loc[df3.index.get_level_values('v3')>3,['c2']]
    df3['c3']=pd.concat([bool1.rename(columns={'c1':'c3'}),\
                         bool2.rename(columns={'c2':'c3'})])
    return df3

def assign_app(df3): # @piRSquared's soln-1
    return df3.assign(c3=df3.c1.mask(df3.eval('v3 > 3'), df3.c2))

def indexslice_app(df3): # @piRSquared's soln-2
    i1 = pd.IndexSlice[:, :, 1:3]
    i2 = pd.IndexSlice[:, :, 4:5]
    h = lambda d: d.loc[i1, 'c1'].append(d.loc[i2, 'c2'])
    return df3.assign(c3=df3.sort_index().pipe(h))

def assigneval_app(df3): # @piRSquared's soln-3
    return df3.assign(c3=df3.values[np.arange(len(df3)),\
                                    df3.eval('v3').gt(3).astype(int)])

def numpy_app(df3):
    l = df3.index.names.index('v3')
    i = np.array(df3.index.levels[l])
    a = df3.values.reshape(-1,len(i),2)
    df3['c3'] = (np.where(i <= 3,a[:,:,0],a[:,:,1])).ravel()
    return df3
Timings on the sample data posted in the question -
In [256]: iterables = [['US', 'DE'], ['A', 'B'], [1, 2, 3, 4, 5]]
...: idx3 = pd.MultiIndex.from_product(iterables, names=['v1', 'v2', 'v3'])
...: df3 = pd.DataFrame(data=np.random.randn(20,2), index=idx3)
...: df3.columns = ['c1','c2']
In [257]: %timeit reset_where_app(df3.copy())
...: %timeit concat_app(df3.copy())
...: %timeit assign_app(df3.copy())
...: %timeit indexslice_app(df3.copy())
...: %timeit assigneval_app(df3.copy())
...: %timeit numpy_app(df3.copy())
100 loops, best of 3: 1.87 ms per loop
100 loops, best of 3: 3.97 ms per loop
100 loops, best of 3: 2.77 ms per loop
100 loops, best of 3: 3.92 ms per loop
1000 loops, best of 3: 1.43 ms per loop
1000 loops, best of 3: 314 µs per loop
Timings on a bigger dataset (100 x 100 x 100 index, 2 columns) -
In [258]: np.random.seed(0)
...: n = 100
...: r = range(1,n+1)
...: l = len(r)
...: iterables = [r,r,r]
...: pd.MultiIndex.from_product(iterables, names=['v1', 'v2', 'v3'])
...: idx3 = pd.MultiIndex.from_product(iterables, names=['v1', 'v2', 'v3'])
...: df3 = pd.DataFrame(data=np.random.randn(l*l*l,2), index=idx3)
...: df3.columns = ['c1','c2']
In [259]: %timeit reset_where_app(df3.copy())
...: %timeit concat_app(df3.copy())
...: %timeit assign_app(df3.copy())
...: %timeit indexslice_app(df3.copy())
...: %timeit assigneval_app(df3.copy())
...: %timeit numpy_app(df3.copy())
10 loops, best of 3: 42.6 ms per loop
1 loop, best of 3: 318 ms per loop
10 loops, best of 3: 62.2 ms per loop
1 loop, best of 3: 725 ms per loop
10 loops, best of 3: 27.2 ms per loop
100 loops, best of 3: 6 ms per loop