Apache Beam pipeline step not running in parallel? (Python)

I used a slightly modified version of the wordcount example (https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount.py), replacing the process function with the following:

  def process(self, element):
    """Returns an iterator over the words of this element.
    The element is a line of text.  If the line is blank, note that, too.
    Args:
      element: the element being processed
    Returns:
      The processed element.
    """
    import random
    import time
    n = random.randint(0, 1000)
    time.sleep(5)
    logging.getLogger().warning('PARALLEL START? ' + str(n))
    time.sleep(5)

    text_line = element.strip()
    if not text_line:
      self.empty_line_counter.inc(1)
    words = re.findall(r'[\w\']+', text_line, re.UNICODE)
    for w in words:
      self.words_counter.inc()
      self.word_lengths_counter.inc(len(w))
      self.word_lengths_dist.update(len(w))

    time.sleep(5)
    logging.getLogger().warning('PARALLEL END? ' + str(n))
    time.sleep(5)

    return words

The idea is to check whether this step is being executed in parallel. For example, the expected output would look something like:

PARALLEL START? 447
PARALLEL START? 994
PARALLEL END? 447
PARALLEL START? 351
PARALLEL START? 723
PARALLEL END? 994
PARALLEL END? 351
PARALLEL END? 723

However, the actual result looks like this, which suggests the step is not running concurrently:

PARALLEL START? 447
PARALLEL END? 447
PARALLEL START? 994
PARALLEL END? 994
PARALLEL START? 351
PARALLEL END? 351
PARALLEL START? 723
PARALLEL END? 723

I have tried the LocalRunner with direct_num_workers set manually, as well as the DataflowRunner with multiple workers, to no avail. How can I make sure this step actually runs in parallel?
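For reference, a sketch of the DirectRunner flags that control local parallelism in recent Beam Python SDKs (verify the exact names and supported modes against your installed version):

```shell
# Hedged sketch: recent Beam Python SDKs accept these DirectRunner flags.
# multi_processing runs workers in separate processes (avoids the GIL);
# multi_threading keeps them in one process; in_memory is single-worker.
python wordcount.py \
  --runner=DirectRunner \
  --direct_running_mode=multi_processing \
  --direct_num_workers=4 \
  --input gs://dataflow-samples/shakespeare/kinglear.txt \
  --output output/
```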

Update: the multiprocessing mode found here looks promising. However, when using it on the Windows command line (python wordcount.py --region us-east1 --setup_file setup.py --input_file gs://dataflow-samples/shakespeare/kinglear.txt --output output/), I get the following error:

Exception in thread run_worker:
Traceback (most recent call last):
    File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\threading.py", line 926, in _bootstrap_inner
        self.run()
    File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\threading.py", line 870, in run
        self._target(*self._args, **self._kwargs)
    File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\apache_beam\runners\portability\local_job_service.py", line 218, in run
        p = subprocess.Popen(self._worker_command_line, shell=True, env=env_dict)
    File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\subprocess.py", line 775, in __init__
        restore_signals, start_new_session)
    File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\subprocess.py", line 1119, in _execute_child
        args = list2cmdline(args)
    File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\subprocess.py", line 530, in list2cmdline
        needquote = (" " in arg) or ("\t" in arg) or not arg
TypeError: argument of type 'int' is not iterable

The standard Apache Beam examples use very small data inputs: gs://dataflow-samples/shakespeare/kinglear.txt is only a few KB, so the work cannot be split up well.

Apache Beam achieves parallelism by splitting the input data. For example, if you have many files, each file will be consumed in parallel. If you have one very large file, Beam can split that file into segments that will be consumed in parallel.
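To make the splitting idea concrete, here is a minimal stdlib sketch (not Beam itself, and not Beam's API): each "split" of the input is handed to a separate worker process, so their START/END log lines can interleave. With a tiny input there is effectively one split, and nothing to run in parallel. All names here are illustrative.

```python
import multiprocessing
import time

def process_split(split_id):
    """Simulate processing one split of the input, logging start and end."""
    print(f'PARALLEL START? {split_id}', flush=True)
    time.sleep(0.2)  # simulate per-split work
    print(f'PARALLEL END? {split_id}', flush=True)
    return split_id

if __name__ == '__main__':
    # Four splits, two worker processes: starts and ends interleave,
    # roughly like the "expected output" in the question.
    splits = [447, 994, 351, 723]
    with multiprocessing.Pool(processes=2) as pool:
        results = pool.map(process_split, splits)
    print(sorted(results))
```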

Your code should eventually show the parallelism correctly - but try it with a (significantly) larger input.
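A quick way to produce a larger local input is to generate several sizeable text files, since many files split across workers naturally. This is a hedged sketch; the directory and file names are made up, not anything Beam requires.

```python
import os

def make_big_input(out_dir='big_input', num_files=8, lines_per_file=50_000):
    """Write several sizeable text files so the input splits across workers."""
    os.makedirs(out_dir, exist_ok=True)
    line = 'the quick brown fox jumps over the lazy dog\n'
    paths = []
    for i in range(num_files):
        path = os.path.join(out_dir, f'part-{i:04d}.txt')
        with open(path, 'w') as f:
            f.writelines(line for _ in range(lines_per_file))
        paths.append(path)
    return paths

# Then point the pipeline at the whole directory, e.g.:
#   python wordcount.py --input 'big_input/*.txt' --output output/
```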