How can I parse a large DOCX file and pick out key words/strings that appear n times in Python?

Question

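I have very large DOCX files that I would like to parse so that I can build a database showing how often each word/string occurs in a document. From what I understand, this is by no means an easy task. I'm just hoping for some pointers to libraries that could help me with this.

Here is an example of what one might look like. The structure is inconsistent, which complicates things further. Any direction would be greatly appreciated!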

Answer 1

Python-based solution

If (per your comment) you are able to do this in Python, take a look at the following snippets:

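The first thing to realize is that a DOCX file is actually a .zip archive containing a number of XML files. Most of the text content is stored in word/document.xml. Word does some complicated things with numbered lists, which will require you to load other XML parts as well, such as styles.xml.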
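
To make the archive structure concrete, here is a minimal sketch that uses only the standard-library zipfile module. The helper names (list_docx_parts, read_document_xml, make_demo_docx) are made up for this illustration, and the demo archive is a tiny stand-in built in memory, not a real Word document.

```python
import io
import zipfile


def list_docx_parts(docx_file):
    """Return the names of the members stored inside a DOCX archive."""
    with zipfile.ZipFile(docx_file) as archive:
        return archive.namelist()


def read_document_xml(docx_file):
    """Return the raw bytes of word/document.xml, the main text part."""
    with zipfile.ZipFile(docx_file) as archive:
        return archive.read("word/document.xml")


def make_demo_docx():
    """Build a tiny, hypothetical DOCX-like archive in memory for the demo."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", "<w:document/>")
        archive.writestr("word/styles.xml", "<w:styles/>")
    buffer.seek(0)
    return buffer


demo = make_demo_docx()
parts = list_docx_parts(demo)
xml_bytes = read_document_xml(demo)
print(parts)
```

Both helpers also accept a path to a .docx file on disk instead of the in-memory buffer.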

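The markup of DOCX files can be troublesome, because the document is structured as w:p (paragraph) elements containing arbitrary w:r (run) elements. A run is basically 'a bit of typing', so it can be a single letter or several words.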
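
Because a single word can be split over several runs, it usually helps to join the run texts of each paragraph before counting anything. The sketch below shows this on a hand-written XML fragment with the standard-library ElementTree; SAMPLE_XML and paragraph_texts are illustrative names, and real documents contain far more markup than this.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hand-written sample: the first paragraph spreads "Hello world" over 3 runs.
SAMPLE_XML = f"""
<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:p>
      <w:r><w:t>Hel</w:t></w:r>
      <w:r><w:t>lo wor</w:t></w:r>
      <w:r><w:t>ld</w:t></w:r>
    </w:p>
    <w:p>
      <w:r><w:t>Second paragraph</w:t></w:r>
    </w:p>
  </w:body>
</w:document>
"""


def paragraph_texts(xml_text):
    """Return one string per w:p, built by joining its w:t text nodes."""
    root = ET.fromstring(xml_text)
    texts = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        pieces = [t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t")]
        texts.append("".join(pieces))
    return texts


texts = paragraph_texts(SAMPLE_XML)
print(texts)
```

Counting tokens per run instead would see "Hel", "lo", "wor" and "ld" as separate strings.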

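We use the UpdateableZipFile helper class (the link to its source did not survive here). This is mainly because we also want to be able to edit the documents, so you may only need parts of it.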

import os

from lxml import etree
# Assumes the UpdateableZipFile helper class is saved as UpdateableZipFile.py.
from UpdateableZipFile import UpdateableZipFile

# Placeholders: point these at your own document.
path = "."
input_file = "input.docx"

source_file = UpdateableZipFile(os.path.join(path, input_file))
nsmap = {'w': "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
         'mc': "http://schemas.openxmlformats.org/markup-compatibility/2006",
        }  # you might need a few more namespace definitions if you get funky docx inputs

# read_member is a method of our helper; it returns the root of an etree
# object built from the document.xml tree.
document = source_file.read_member('word/document.xml')

# Query the XML with XPath (don't use regex). Joining the w:t text of each
# paragraph keeps words that are split across runs in one piece.
paragraph_list = ["".join(p.xpath(".//w:t/text()", namespaces=nsmap))
                  for p in document.xpath("//w:p", namespaces=nsmap)]
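
For very large files it may be worth streaming document.xml instead of loading the whole tree. The following is a rough standard-library sketch of that idea (zipfile plus ElementTree.iterparse), not the helper-based approach above; iter_paragraphs and make_demo_docx are hypothetical names, and nested paragraphs (for example inside text boxes) are not handled.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
P_TAG = f"{{{W_NS}}}p"
T_TAG = f"{{{W_NS}}}t"


def iter_paragraphs(docx_file):
    """Yield the text of each paragraph while parsing incrementally."""
    with zipfile.ZipFile(docx_file) as archive:
        with archive.open("word/document.xml") as xml_stream:
            pieces = []
            for _event, elem in ET.iterparse(xml_stream, events=("end",)):
                if elem.tag == T_TAG:
                    pieces.append(elem.text or "")
                elif elem.tag == P_TAG:
                    yield "".join(pieces)
                    pieces = []
                    elem.clear()  # drop the finished paragraph's children


def make_demo_docx():
    """Build a tiny, hypothetical DOCX-like archive in memory for the demo."""
    xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Bye</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


paragraphs = list(iter_paragraphs(make_demo_docx()))
print(paragraphs)
```

Since iter_paragraphs is a generator, each paragraph can be tokenized and counted as it arrives rather than collected into a list first.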

You can then feed the text to an NLP library such as spaCy:

from collections import Counter

import spacy

# Requires the model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
word_counts = Counter()

# Tokenize each paragraph and tally the text of every token.
for paragraph in paragraph_list:
    doc = nlp(paragraph)
    word_counts.update(token.text for token in doc)

spaCy will tokenize the text for you, and it can do much more, such as named entity recognition and part-of-speech tagging.
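
To get from raw counts to the words that appear n times and into a database, something like the sketch below could work. It reads "n times" as "at least n times", uses the standard-library sqlite3 module, and the table name word_frequency, its columns, and both function names are assumptions made for this example. The counts come from a plain whitespace split here so the snippet runs without spaCy.

```python
import sqlite3
from collections import Counter


def words_with_min_count(word_counts, n):
    """Return {word: count} for words that occur at least n times."""
    return {word: count for word, count in word_counts.items() if count >= n}


def save_counts(connection, word_counts):
    """Store the counts in a (hypothetical) word_frequency table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS word_frequency "
        "(word TEXT PRIMARY KEY, occurrences INTEGER NOT NULL)"
    )
    connection.executemany(
        "INSERT OR REPLACE INTO word_frequency (word, occurrences) "
        "VALUES (?, ?)",
        word_counts.items(),
    )
    connection.commit()


# Stand-in for the counts produced by the tokenizer.
word_counts = Counter("the cat and the dog and the bird".split())
frequent = words_with_min_count(word_counts, 2)

connection = sqlite3.connect(":memory:")
save_counts(connection, frequent)
rows = connection.execute(
    "SELECT word, occurrences FROM word_frequency "
    "ORDER BY occurrences DESC, word"
).fetchall()
print(rows)
```

Replacing ":memory:" with a file path keeps the table between runs.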
