Embarrassingly parallel problem in Python
I have 634 *.npy files, each containing a 2-D numpy array of shape (8194, 76). I want to apply STL decomposition to each column five times, at different frequencies. So what I do is:
for file in files:
    for column in columns:
        for freq in frequencies:
            res = STL(file[:, column], period=freq)
            decomposed = np.vstack((res.trend, res.seasonal, res.resid)).T
            np.save(out_path, decomposed)  # np.save needs a target path as its first argument
The final decomposed shape should be (8194, 1140). How can I parallelize this? The serial implementation would take more than two months to run.
You could do something like this:
from concurrent.futures import ProcessPoolExecutor

FILES = ["a", "b", "c", "d", "e", "f", "g", "h"]

def simulate_cpu_bound(file):
    2 ** 100000000  # CPU-heavy task
    # or just use time.sleep(n), where n is a number of seconds
    return file

if __name__ == '__main__':
    with ProcessPoolExecutor(8) as f:
        res = f.map(simulate_cpu_bound, FILES)
        res = list(res)
        print(res)