How to speed up SpaCy for dependency parsing?

Question

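I am using spaCy specifically to extract every amod (adjectival modifier) from a large number of files (roughly 12 GB of zipped files). I tried running it on a folder of only 2.8 MB, and it took 4 minutes to process!

Here is my code so far: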

with open("descriptions.txt", "w") as outf:
    canParse = False
    toParse = ""
    for file in getNextFile():
        # Open zip file and get text out of it
        with zipfile.ZipFile(file) as zf:
            with io.TextIOWrapper(zf.open(os.path.basename(file)[:-3]+"txt"), encoding="utf-8") as f:
                for line in f.readlines():
                    if line[0:35] == "*** START OF THIS PROJECT GUTENBERG":
                        canParse = True
                    elif line[0:33] == "*** END OF THIS PROJECT GUTENBERG":
                        break
                    if canParse:
                        if line.find(".") != -1:
                            toParse += line[0:line.find(".")+1]

                            sents = nlp(toParse)
                            for token in sents:
                                if token.dep_ == "amod":
                                    outf.write(token.head.text + "," + token.text + "\n")

                            toParse = ""
                            toParse += line[line.find(".")+1:len(line)]
                        else:
                            toParse += line

Is there any way to speed up spaCy (or my Python code in general) for this very specific use case?

Answer 1

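Tweak your code slightly to use nlp.pipe(), which processes texts in batches and is faster, and disable the components you don't need (either in nlp.pipe(), as below, or when loading the model).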

for doc in nlp.pipe(texts, disable=["tagger", "ner"]):
    # process, e.g.:
    print(doc)

For more details and examples, see: https://spacy.io/usage/processing-pipelines#processing
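To make this concrete for the code in the question: nlp.pipe() takes an iterable of texts, so the per-sentence nlp(toParse) call can be replaced with a generator that feeds it larger chunks. The following is only a rough sketch. The helper names iter_paragraphs and amod_pairs are made up for illustration, and splitting on blank lines (one paragraph per text) is an assumption about what makes a sensible chunk, not something spaCy requires. Unlike the original loop, it skips the START marker line itself.

```python
START_MARKER = "*** START OF THIS PROJECT GUTENBERG"
END_MARKER = "*** END OF THIS PROJECT GUTENBERG"


def iter_paragraphs(lines):
    """Yield each paragraph between the Gutenberg markers as a single string.

    Hypothetical helper: `lines` is any iterable of text lines, e.g. an open file.
    """
    in_body = False
    buffer = []
    for line in lines:
        if line.startswith(START_MARKER):
            in_body = True
            continue  # the marker line itself is not part of the book text
        if line.startswith(END_MARKER):
            break
        if not in_body:
            continue
        stripped = line.strip()
        if stripped:
            buffer.append(stripped)
        elif buffer:
            # A blank line closes the current paragraph.
            yield " ".join(buffer)
            buffer = []
    if buffer:
        # Flush the last paragraph if the text did not end with a blank line.
        yield " ".join(buffer)


def amod_pairs(docs):
    """Yield (head text, modifier text) for every token labelled amod."""
    for doc in docs:
        for token in doc:
            if token.dep_ == "amod":
                yield token.head.text, token.text


# Usage sketch, assuming `nlp` is an already loaded spaCy pipeline,
# `f` is the opened text file and `outf` the output file from the question:
#
#     docs = nlp.pipe(iter_paragraphs(f), disable=["tagger", "ner"])
#     for head, modifier in amod_pairs(docs):
#         outf.write(head + "," + modifier + "\n")
```

Iterating over the file object directly, instead of calling f.readlines(), also avoids loading each book into memory at once.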

You may also want to use multiprocessing via the n_process argument of nlp.pipe().
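A minimal sketch of what that could look like, with several assumptions: write_amod_csv and main are hypothetical names, the model name en_core_web_sm is only an example, and the batch_size and n_process values are placeholders to tune rather than recommendations. The csv module is used here in place of manual string concatenation so that tokens containing a comma are quoted.

```python
import csv


def write_amod_csv(nlp, texts, out_path, n_process=2, batch_size=200):
    """Parse `texts` with nlp.pipe() and write one "head,modifier" row per amod token.

    `nlp` is expected to be a loaded spaCy pipeline. Returns the number of rows written.
    """
    rows_written = 0
    with open(out_path, "w", newline="", encoding="utf-8") as outf:
        writer = csv.writer(outf)
        # n_process > 1 spreads the batches over several worker processes.
        for doc in nlp.pipe(texts, n_process=n_process, batch_size=batch_size):
            for token in doc:
                if token.dep_ == "amod":
                    writer.writerow((token.head.text, token.text))
                    rows_written += 1
    return rows_written


def main():
    # Imported here so the helper above can be reused without loading spaCy.
    import spacy

    # Assumed model name; the parser stays enabled because amod comes from it.
    nlp = spacy.load("en_core_web_sm", disable=["tagger", "ner"])
    texts = ["The quick brown fox jumps over the lazy dog."]
    write_amod_csv(nlp, texts, "descriptions.txt", n_process=2)
```

When n_process is greater than 1, call main() from under an `if __name__ == "__main__":` guard, since platforms that start worker processes by spawning re-import the script. Multiprocessing has startup overhead of its own, so it is worth timing it against a single process on a small sample first.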

python-3.x

spacy