Python-docx Extracted String Missing a Word
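I don't understand why the word "Delaware" is not extracted by the code below, while every other character is. Can anyone provide code that extracts the word "Delaware" from the Docx file below without changing the file manually?
Input: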
import docx
import io
import requests

url = 'https://github.com/python-openxml/python-docx/files/1996979/Delaware_Test.docx'
file = io.BytesIO(requests.get(url).content)
for paragraph in docx.Document(file).paragraphs:
    print(paragraph.text)
Output:
APPLICABLE LAW
This Agreement is to be construed and interpreted according to the laws of the State of , excluding its conflict of laws provisions. The provisions of the U. N. Convention on Contracts for the International Sale of Goods shall not apply to this Agreement.
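The strangest part is that if I do anything to the word "Delaware" in the document (e.g., bold/unbold it, or retype the word) and then save, "Delaware" is no longer missing the next time I run the code. Simply saving the file without touching the word does not fix the problem. You might say the solution is to edit the word by hand, but I am dealing with thousands of these documents, and editing each one manually makes no sense.
An answer to another question seems to explain why "Delaware" might not be extracted, but it does not provide a solution. Thanks.
I believe @smci is right; the answer referred to above most likely explains the behavior. However, it does not provide a solution.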
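The explanation referred to above is not reproduced in this thread. One plausible mechanism, offered as an assumption rather than a confirmed diagnosis of this file, is that the missing word sits in a run nested inside a wrapper element (a `w:smartTag` in this sketch), while the extraction only visits runs that are direct children of the paragraph. The XML fragment and both helper functions below are hypothetical and only illustrate the difference between direct-child and descendant traversal.

```python
from xml.etree.ElementTree import fromstring

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical paragraph: the middle run is wrapped in a w:smartTag element.
SAMPLE = (
    f'<w:p xmlns:w="{W}">'
    "<w:r><w:t>State of </w:t></w:r>"
    "<w:smartTag><w:r><w:t>Delaware</w:t></w:r></w:smartTag>"
    "<w:r><w:t>, excluding</w:t></w:r>"
    "</w:p>"
)


def direct_run_text(p):
    """Join text only from runs that are direct children of the paragraph."""
    return "".join(
        t.text or "" for r in p.findall("w:r", NS) for t in r.findall("w:t", NS)
    )


def all_run_text(p):
    """Join text from every w:t descendant, however deeply it is nested."""
    return "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))


paragraph = fromstring(SAMPLE)
print(repr(direct_run_text(paragraph)))  # the wrapped run is skipped
print(repr(all_run_text(paragraph)))     # the wrapped run is included
```

If this assumption holds for a given file, it would also be consistent with the word reappearing after it is edited in Word, since editing may rewrite the run without its wrapper.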
I think our only option in this case is to fall back to reading the XML directly. For example, consider this function (simplified) from the page http://etienned.github.io/posts/extract-text-from-word-docx-simply/:
from xml.etree.ElementTree import XML
import zipfile
import io
import requests


def get_docx_text(path):
    """Take the path of a docx file (or a file-like object), return its text."""
    WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
    PARA = WORD_NAMESPACE + 'p'
    TEXT = WORD_NAMESPACE + 't'
    # A .docx file is a zip archive; the body text lives in word/document.xml.
    with zipfile.ZipFile(path) as document:
        xml_content = document.read('word/document.xml')
    tree = XML(xml_content)
    paragraphs = []
    # iter() replaces getiterator(), which was removed in Python 3.9.
    # It walks all descendants, so w:t nodes nested at any depth are included.
    for paragraph in tree.iter(PARA):
        texts = [n.text for n in paragraph.iter(TEXT) if n.text]
        if texts:
            paragraphs.append(''.join(texts))
    return '\n\n'.join(paragraphs)


url = 'https://github.com/python-openxml/python-docx/files/1996979/Delaware_Test.docx'
file = io.BytesIO(requests.get(url).content)
print(get_docx_text(file))
This gives:
APPLICABLE LAW
This Agreement is to be construed and interpreted according to the laws of the State of Delaware, excluding its conflict of laws provisions. The provisions of the U. N. Convention on Contracts for the International Sale of Goods shall not apply to this Agreement.
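Since the question mentions thousands of documents, the same XML-based approach can be applied to a whole folder. The following is a minimal sketch under stated assumptions: the files are assumed to sit in one local directory, and the names `docx_text` and `extract_folder` are hypothetical helpers, not part of any library.

```python
import zipfile
from pathlib import Path
from xml.etree.ElementTree import XML, ParseError

WORD_NAMESPACE = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text(source):
    """Return the text of one .docx file (path or file-like object)."""
    with zipfile.ZipFile(source) as archive:
        tree = XML(archive.read("word/document.xml"))
    paragraphs = []
    for paragraph in tree.iter(WORD_NAMESPACE + "p"):
        texts = [node.text for node in paragraph.iter(WORD_NAMESPACE + "t") if node.text]
        if texts:
            paragraphs.append("".join(texts))
    return "\n\n".join(paragraphs)


def extract_folder(folder):
    """Map each .docx file name in `folder` to its text.

    Files that are not valid zip archives, lack word/document.xml,
    or contain malformed XML are mapped to None instead of raising.
    """
    results = {}
    for path in sorted(Path(folder).glob("*.docx")):
        try:
            results[path.name] = docx_text(path)
        except (zipfile.BadZipFile, KeyError, ParseError):
            results[path.name] = None
    return results
```

Calling `extract_folder("contracts")` (a hypothetical directory name) returns a dictionary keyed by file name, so files that failed to parse can be found by filtering for `None` values.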
I also tried using Python-docx to find email addresses, without success.
pip install docx2txt
This worked for me. There may be some unnecessary '\n' characters; replace them with spaces if needed.
import docx2txt
string = docx2txt.process("filepathandname.docx")
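Replacing the extra '\n' characters with spaces can be done with a small helper. This is only a sketch: `collapse_newlines` is a hypothetical name, and the sample string stands in for whatever `docx2txt.process` returns.

```python
import re


def collapse_newlines(text):
    """Replace each run of newlines (plus adjacent spaces/tabs) with one space."""
    return re.sub(r"[ \t]*\n+[ \t]*", " ", text).strip()


# Stand-in for extracted text; not real output from docx2txt.
sample = "APPLICABLE LAW\n\nThis Agreement is to be construed\n\n\n"
print(collapse_newlines(sample))
```

Note that this flattens paragraph breaks as well, so it suits searching for a word or pattern more than preserving layout.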