Operations with pandas Series
Suppose I have two Series:
a = [1,2,3,4,5]
b = [60,7,80,9,100]
I want to create a new variable computed as follows:
C = a/b if b > 10 else a/b + 1
I can do this with a list comprehension:
C = [a[i] / b[i] if b[i] > 10 else a[i] / b[i] + 1 for i in range(len(b))]
My question is:
Is there another way (for example using lambda, map, apply, etc.) that avoids the for loop?
(The Series a, b and c may also be columns of a pd.DataFrame.)
The first idea is to divide the values and add 1 by condition, converting the boolean mask to the integers 1 and 0:
c = a/b + (b <=10).astype(int)
#alternative
#c = a/b + (~(b > 10)).astype(int)
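To see why adding the mask works, here is a minimal sketch using the small example Series from the question. The names mask and increment are introduced here only for illustration. The comparison yields a boolean Series, and astype(int) turns True/False into 1/0, so 1 is added only where b <= 10.

```python
import pandas as pd

a = pd.Series([1, 2, 3, 4, 5])
b = pd.Series([60, 7, 80, 9, 100])

# Illustrative intermediate names, not part of the original answer
mask = b <= 10                # boolean Series: True where 1 must be added
increment = mask.astype(int)  # True -> 1, False -> 0
c = a / b + increment         # element-wise division plus conditional 1

print(increment.tolist())  # [0, 1, 0, 1, 0]
```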
Or add an array created by numpy.where:
c = a/b + np.where(b > 10, 0, 1)
It is also possible to divide twice, though this should be slower on large data:
c = pd.Series(np.where(b >10, a/b, a/b +1), index=a.index)
print (c)
0 0.016667
1 1.285714
2 0.037500
3 1.444444
4 0.050000
dtype: float64
Setup:
a = pd.Series([1,2,3,4,5])
b = pd.Series([60,7,80,9,100])
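The question mentions that a, b and c may also be columns of a DataFrame. A minimal sketch under that assumption follows; the frame name df and the column names "a", "b" and "c" are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame holding the two input Series as columns
df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [60, 7, 80, 9, 100]})

# np.where gives 0 where b > 10 and 1 elsewhere; the result is added element-wise
df["c"] = df["a"] / df["b"] + np.where(df["b"] > 10, 0, 1)

print(df)
```

The same expression with (df["b"] <= 10).astype(int) in place of np.where should give identical values.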
Performance:
np.random.seed(2019)
a = pd.Series(np.random.randint(1,100, size=100000))
b = pd.Series(np.random.randint(1,100, size=100000))
In [322]: %timeit [a[i] /b[i] if b[i] > 10 else a[i] /b[i] +1 for i in range(len(b))]
3.08 s ± 84.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [323]: %timeit a/b + (b <=10).astype(int)
1.71 ms ± 44.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [324]: %timeit a/b + np.where(b > 10, 0, 1)
1.67 ms ± 66.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [325]: %timeit np.where(b >10, a/b, a/b +1)
2.7 ms ± 13.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [326]: %timeit pd.Series(np.where(b >10, a/b, a/b +1), index=a.index)
2.74 ms ± 21.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
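Before relying on the timings, it can help to confirm that all variants return the same values. The sketch below does that on random data generated the same way as above. The smaller size of 1000 is an assumption made here to keep the loop quick, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(2019)
a = pd.Series(np.random.randint(1, 100, size=1000))  # values in 1..99, so no division by zero
b = pd.Series(np.random.randint(1, 100, size=1000))

# Reference result from the original list comprehension
loop = pd.Series([a[i] / b[i] if b[i] > 10 else a[i] / b[i] + 1 for i in range(len(b))])

# The three vectorized variants from the answer
mask_sum = a / b + (b <= 10).astype(int)
where_sum = a / b + np.where(b > 10, 0, 1)
where_two = pd.Series(np.where(b > 10, a / b, a / b + 1), index=a.index)

same = all(np.allclose(loop, other) for other in (mask_sum, where_sum, where_two))
print(same)
```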