No improvements using multiprocessing
I tested the performance of map, mp.dummy.Pool.map and mp.Pool.map:
import itertools
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool
import numpy as np

# wrapper function: call args[0] with the remaining items as arguments
def wrap(args): return args[0](*args[1:])

# make data arrays
x = np.random.rand(30, 100000)
y = np.random.rand(30, 100000)

# map
%timeit -n10 map(wrap, itertools.izip(itertools.repeat(np.correlate), x, y))

# mp.dummy.Pool.map
for i in range(2, 16, 2):
    print 'Thread Pool ', i, ' : ',
    t = ThreadPool(i)
    %timeit -n10 t.map(wrap, itertools.izip(itertools.repeat(np.correlate), x, y))
    t.close()
    t.join()

# mp.Pool.map
for i in range(2, 16, 2):
    print 'Process Pool ', i, ' : ',
    p = Pool(i)
    %timeit -n10 p.map(wrap, itertools.izip(itertools.repeat(np.correlate), x, y))
    p.close()
    p.join()
Output:
# in this case, one CPU core usage reaches 100%
10 loops, best of 3: 3.16 ms per loop
# in this case, all CPU core usages reach ~80%
Thread Pool 2 : 10 loops, best of 3: 4.03 ms per loop
Thread Pool 4 : 10 loops, best of 3: 3.3 ms per loop
Thread Pool 6 : 10 loops, best of 3: 3.16 ms per loop
Thread Pool 8 : 10 loops, best of 3: 4.48 ms per loop
Thread Pool 10 : 10 loops, best of 3: 4.19 ms per loop
Thread Pool 12 : 10 loops, best of 3: 4.03 ms per loop
Thread Pool 14 : 10 loops, best of 3: 4.61 ms per loop
# in this case, all CPU core usages reach 80-100%
Process Pool 2 : 10 loops, best of 3: 71.7 ms per loop
Process Pool 4 : 10 loops, best of 3: 128 ms per loop
Process Pool 6 : 10 loops, best of 3: 165 ms per loop
Process Pool 8 : 10 loops, best of 3: 145 ms per loop
Process Pool 10 : 10 loops, best of 3: 259 ms per loop
Process Pool 12 : 10 loops, best of 3: 176 ms per loop
Process Pool 14 : 10 loops, best of 3: 176 ms per loop
Multithreading shows no real speedup here, which is acceptable given locking (the GIL).
Multiprocessing slows things down a lot, which is surprising. I have eight 3.78 GHz CPUs, each with 4 cores.
Similar results can be seen if the shapes of x and y are enlarged to (300, 10000), i.e. with 10 times more rows. But for small arrays like (20, 1000):
10 loops, best of 3: 28.9 µs per loop
Thread Pool 2 : 10 loops, best of 3: 429 µs per loop
Thread Pool 4 : 10 loops, best of 3: 632 µs per loop
...
Process Pool 2 : 10 loops, best of 3: 525 µs per loop
Process Pool 4 : 10 loops, best of 3: 660 µs per loop
...
- Multiprocessing and multithreading have similar performance.
- A single process is much faster (because of the overhead of multiprocessing and multithreading?).
In short, it is surprising that multiprocessing performs so badly on such a simple function. How can that be?
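To put a rough number on the overhead hypothesis above, one can time a pool mapping a do-nothing function over the same number of items (this sketch is mine, not part of the original post; noop is an illustrative name):

from multiprocessing import Pool
import time

def noop(i):
    return i

p = Pool(4)
t0 = time.time()
p.map(noop, range(30))   # 30 trivial tasks: essentially pure dispatch + pickle cost
print 'dispatch overhead : %.3f ms' % ((time.time() - t0) * 1e3)
p.close()
p.join()

If this baseline is already in the millisecond range, a 30-task job whose real work takes only microseconds per task cannot benefit from the pool.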
Following @TrevorMerrifield's suggestion, I modified the code to avoid passing big arrays to wrap:
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool
import numpy as np

n = 30
m = 1000

# make data in wrap
def wrap(i):
    x = np.random.rand(m)
    y = np.random.rand(m)
    return np.correlate(x, y)

# map
print 'Single process :',
%timeit -n10 map(wrap, range(n))

# mp.dummy.Pool.map
print '---'
print 'Thread Pool %2d : '%(4),
t = ThreadPool(4)
%timeit -n10 t.map(wrap, range(n))
t.close()
t.join()
print '---'

# mp.Pool.map, function must be defined before making Pool
print 'Process Pool %2d : '%(4),
p = Pool(4)
%timeit -n10 p.map(wrap, range(n))
p.close()
p.join()
Output:
Single process :10 loops, best of 3: 688 µs per loop
---
Thread Pool 4 : 10 loops, best of 3: 1.67 ms per loop
---
Process Pool 4 : 10 loops, best of 3: 854 µs per loop
- No improvement.
I tried another approach: passing an index to wrap to fetch data from the global arrays x and y:
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool
import numpy as np
# make data arrays
n = 30
m = 10000
x = np.random.rand(n, m)
y = np.random.rand(n, m)
def wrap(i): return np.correlate(x[i], y[i])
# map
print 'Single process :',
%timeit -n10 map(wrap, range(n))
# mp.dummy.Pool.map
print '---'
print 'Thread Pool %2d : '%(4),
t = ThreadPool(4)
%timeit -n10 t.map(wrap, range(n))
t.close()
t.join()
print '---'
# mp.Pool.map, function must be defined before making Pool
print 'Process Pool %2d : '%(4),
p = Pool(4)
%timeit -n10 p.map(wrap, range(n))
p.close()
p.join()
Output:
Single process :10 loops, best of 3: 133 µs per loop
---
Thread Pool 4 : 10 loops, best of 3: 2.23 ms per loop
---
Process Pool 4 : 10 loops, best of 3: 10.4 ms per loop
- Even worse...
I tried another simple example (with a different wrap):
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool
# make data arrays
n = 30
m = 10000
# No big arrays passed to wrap
def wrap(i): return sum(range(i, i+m))
# map
print 'Single process :',
%timeit -n10 map(wrap, range(n))
# mp.dummy.Pool.map
print '---'
i = 4
print 'Thread Pool %2d : '%(i),
t = ThreadPool(i)
%timeit -n10 t.map(wrap, range(n))
t.close()
t.join()
print '---'
# mp.Pool.map, function must be defined before making Pool
print 'Process Pool %2d : '%(i),
p = Pool(i)
%timeit -n10 p.map(wrap, range(n))
p.close()
p.join()
Timings:
10 loops, best of 3: 4.28 ms per loop
---
Thread Pool 4 : 10 loops, best of 3: 5.8 ms per loop
---
Process Pool 4 : 10 loops, best of 3: 2.06 ms per loop
- Now multiprocessing is faster.
But if m is made 10 times bigger (i.e. 100000):
Single process :10 loops, best of 3: 48.2 ms per loop
---
Thread Pool 4 : 10 loops, best of 3: 61.4 ms per loop
---
Process Pool 4 : 10 loops, best of 3: 43.3 ms per loop
- Again, no improvement.
You are mapping wrap onto tuples (a, b, c), where a is a function and b and c are 100K-element vectors. All of this data is pickled when it is sent to the chosen process in the pool, then unpickled when it arrives there. This is to ensure that processes have mutually exclusive access to the data.
Your problem is that the pickling process is more expensive than the correlation. As a guideline, you want to minimize the amount of information sent between processes, and maximize the amount of work each process does, while still spreading the work across the cores of the system.
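As a rough illustration of that point (my sketch, not part of the original answer; absolute timings vary by machine), compare the cost of pickling one task tuple with the cost of the correlation itself:

import pickle
import timeit
import numpy as np

x = np.random.rand(100000)
y = np.random.rand(100000)

# one task as it would be sent to a worker: (function, vector, vector)
t_pickle = timeit.timeit(lambda: pickle.dumps((np.correlate, x, y)), number=100) / 100
t_corr = timeit.timeit(lambda: np.correlate(x, y), number=100) / 100

print 'pickle one task : %.3f ms' % (t_pickle * 1e3)
print 'correlate once  : %.3f ms' % (t_corr * 1e3)

Note that the worker also unpickles the task and pickles the result back, so the round-trip messaging cost is even higher than the one-way figure above.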
How you do that depends on the actual problem you are trying to solve. By tweaking your toy example so that the vectors are a bit bigger (1 million elements) and are randomly generated inside the wrap function, I could get a 2x speedup over a single core, using a process pool with 4 workers. The code looks like this:
from multiprocessing import Pool
import numpy as np

def wrap(a):
    # data is generated inside the worker, so only small values are pickled
    x = np.random.rand(1000000)
    y = np.random.rand(1000000)
    return np.correlate(x, y)

p = Pool(4)
p.map(wrap, range(30))
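One more knob worth knowing about (my addition, not part of the original answer): Pool.map takes a chunksize argument that packs many tasks into each pickled message, amortizing the per-task IPC cost. A minimal sketch, assuming a fork start method so the workers inherit the global arrays:

from multiprocessing import Pool
import numpy as np

n, m = 30, 100000
x = np.random.rand(n, m)
y = np.random.rand(n, m)

# workers inherit x and y on fork; only the index i and the small
# correlation result cross the process boundary
def wrap(i):
    return np.correlate(x[i], y[i])

p = Pool(4)
res = p.map(wrap, range(n), chunksize=n // 4)   # fewer, bigger messages
p.close()
p.join()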