为什么 Dask 使用 from_pandas 计算数据帧比直接用 dask 读取更快？

Question

我在 dask 中有运行相同的数据集，但方式不同。我发现一种方法比其他方法快将近 10 倍！！！我试图找到原因，但没有成功。

1。完全与 dask

import dask.dataframe as dd
from multiprocessing import cpu_count

#Count the number of cores
cores = cpu_count()

#read and part the dataframes by the number of cores
english = dd.read_csv('/home/alberto/Escritorio/pycharm/NLP/ignore_files/es-en/europarl-v7.es-en.en',
               sep='\r', header=None, names=['ingles'], dtype={'ingles':str})
english = english.repartition(npartitions=cores)
spanish = dd.read_csv('/home/alberto/Escritorio/pycharm/NLP/ignore_files/es-en/europarl-v7.es-en.es',
              sep='\r', header=None, names=['espanol'], dtype={'espanol':str})
spanish = english.repartition(npartitions=cores)

#compute
%time total_dd = dd.merge(english, spanish, left_index=True, right_index=True).compute()

Out: 9.77 seg

2。 Pandas + 达斯克

import pandas as pd
import dask.dataframe as dd
from multiprocessing import cpu_count

#Count the number of cores
cores = cpu_count()

#Read the Dataframe and part by the number of cores
pd_english = pd.read_csv('/home/alberto/Escritorio/pycharm/NLP/ignore_files/es-en/europarl-v7.es-en.en',
                      sep='\r', header=None, names=['ingles'])

pd_spanish = pd.read_csv('/home/alberto/Escritorio/pycharm/NLP/ignore_files/es-en/europarl-v7.es-en.es',
                      sep='\r', header=None, names=['espanol'])
english_pd = dd.from_pandas(pd_english, npartitions=cores)
spanish_pd = dd.from_pandas(pd_spanish, npartitions=cores)

#compute
%time total_pd = dd.merge(english_pd, spanish_pd, left_index=True, right_index=True).compute()

Out: 1.31 seg

有人知道为什么吗？还有其他方法可以更快吗？

Answer 1

注意：

dd.read_csv(...) 实际上没有读任何东西。它只是构建计算树的一个步骤。
直到你运行计算时，整个计算树构建到目前为止实际执行，包括两个数据帧的读取。

因此在第一个变体中，定时操作包括：

读取两个数据帧，
重新分区，
最后合并本身。

在第二种变体中，就时间而言，情况有所不同。两个DataFrame之前都已经读取过，所以定时操作仅包括 repartition 和 merge.

显然源数据帧很大，读取它们需要花费相当长的时间，未在第二个变体中考虑。

尝试另一个测试：创建一个函数：

读取两个 DataFrame pd.read_csv(...)
执行剩余步骤（重新分区 和合并）。

然后计算这个函数的执行时间。

我想，执行时间甚至可能比长第一个变体，因为：

在第一个变体中，两个数据帧都被读取同时（按不同内核），
在上面提出的测试中，顺序读取。

为什么 Dask 使用 from_pandas 计算数据帧比直接用 dask 读取更快？

Why Dask compute faster the dataframes using from_pandas, than reading directly with dask?

python

python-3.x

pandas

dask

dask-distributed

1。完全与 dask

2。 Pandas + 达斯克