GNU Parallel to run Python script on huge file

Question

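I have a file in which each line contains one XML element, and I need to convert it to JSON. I have written a Python script that does the conversion, but it runs serially. I have two options, Hadoop or GNU Parallel. I have already tried Hadoop and would like to see how GNU Parallel could help; it should be simple.

My Python code is as follows: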

import sys
import json
import xmltodict

with open('/path/sample.xml') as fd:
    for line in fd:
        o=xmltodict.parse(line)
        t=json.dumps(o)
        with open('sample.json', 'a') as out:
            out.write(t+ "\n")

So can I use GNU Parallel to work directly on the huge file, or do I need to split it first?

Or is this the right way:

cat sample.xml | parallel python xmltojson.py >sample.json

Thanks

Answer 1

You need to change your Python code into a UNIX filter, i.e. a program that reads from standard input (stdin) and writes to standard output (stdout). Untested:

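import fileinput
import json
import xmltodict

# fileinput.input() reads stdin when no file names are given on the command line
for line in fileinput.input():
    o = xmltodict.parse(line)
    t = json.dumps(o)
    # print adds the newline itself; appending "\n" would emit a blank line after every record
    print(t)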
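If you want to check the conversion logic before running it over the whole file, one option is to separate the per-line conversion from the I/O. The following is only a sketch under assumptions: `convert_lines` and `main` are hypothetical names, `xmltodict` is assumed to be installed, and skipping blank lines is a choice made here, not something the original script does.

```python
import json
import sys


def convert_lines(lines, parse):
    """Yield one JSON string per non-blank input line.

    `parse` is any callable that turns an XML string into a dict,
    for example xmltodict.parse.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are skipped rather than parsed
        yield json.dumps(parse(line))


def main():
    # Imported here so convert_lines can be tested without the third-party module.
    import xmltodict

    for record in convert_lines(sys.stdin, xmltodict.parse):
        sys.stdout.write(record + "\n")
```

To use it as the filter, call `main()` at the bottom of the script; `convert_lines` can be exercised on its own with a small list of strings and any stand-in parser.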

Then you use --pipepart in GNU Parallel:

parallel --pipepart -a sample.xml --block -1 python my_script.py
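With --pipepart you do not have to split the file yourself: GNU Parallel reads the file given with -a, cuts it into blocks that end on a record boundary (a newline by default), and feeds each block to one instance of the script on stdin. The sketch below illustrates that idea in Python. It is not GNU Parallel's implementation, and `newline_aligned_chunks` is a hypothetical helper written only for this illustration.

```python
import os


def newline_aligned_chunks(path, n_chunks):
    """Return (start, end) byte offsets that cut `path` into at most
    n_chunks pieces, each ending just after a newline or at end of file."""
    size = os.path.getsize(path)
    if size == 0:
        return []
    # Ceiling division, so that no more than n_chunks pieces are produced.
    target = (size + n_chunks - 1) // n_chunks
    chunks = []
    start = 0
    with open(path, "rb") as f:
        while start < size:
            # Jump roughly one chunk ahead, then move to the end of that line.
            f.seek(min(start + target, size))
            f.readline()
            end = min(f.tell(), size)
            chunks.append((start, end))
            start = end
    return chunks
```

Because every piece ends on a line boundary, a filter that handles one XML element per line never sees a record that has been cut in half.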

parallel-processing

gnu

python-2.7