XML parsing returns HTML: how do I get the text out of it in Python?

Question

I am parsing an XBRL file with minidom. Using getElementsByTagName, I found the following:

<table xmlns="http://www.w3.org/1999/xhtml" style="border-right: 0px; border-top: 0px; border-left: 0px; width: 650px; border-bottom: 0px; border-collapse: collapse"  width="100%"><tr><td colspan="1">Independent auditor's report on the financial statements</td></tr></table><br><table xmlns="http://www.w3.org/1999/xhtml" style="border-right: 0px; border-top: 0px; border-left: 0px; width: 650px; border-bottom: 0px; border-collapse: collapse"  width="100%"><tr><td colspan="1">We have audited the financial statements of KPMG Statsautoriseret Revisionspartnerselskab for the financial year 11 December 2013 – 31 December 2014. The financial statements comprise income statement, balance sheet, statement of changes in equity, cash flow statement accounting policies and notes. The financial statements are prepared in accordance with the Danish Financial Statements Act.</td></tr></table>

Now I only want to get the text out of it. How should I proceed? Should I switch to BeautifulSoup from this point on?

The whole file can be found here; the field I am looking at is <arr:AuditorsReportOnFinancialStatements

Answer 1

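from bs4 import BeautifulSoup

# auditorsReportOnAuditedFS is the node list returned by getElementsByTagName;
# the first child of the element is a text node holding the embedded HTML.
soup = BeautifulSoup(auditorsReportOnAuditedFS[0].firstChild.data, 'html.parser')
# Collect the stripped text of every <td>; encode to UTF-8 only when writing it out.
listForString = [item.get_text().strip() for item in soup.find_all('td')]
# result is the list of output lines being built by the surrounding code.
result.append(' : '.join(['AuditorsReportOnFinancialStatements', ' - '.join(listForString)]))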

This works.
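
For comparison, here is a minimal sketch of the same idea using only the standard library, in case installing BeautifulSoup is not an option. It is written for Python 3 (the question is tagged python-2.7, where the module is named HTMLParser instead of html.parser). The names CellTextExtractor, cell_texts, report_text and SAMPLE, as well as the namespace URI in the sample document, are made up for illustration and are not part of the original file. It assumes the HTML is stored as text (escaped or CDATA) inside the XBRL element, which is what firstChild.data in the answer above implies.

```python
from html.parser import HTMLParser
from xml.dom import minidom


class CellTextExtractor(HTMLParser):
    """Collect the text of each <td> cell in an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.cells = []
        self._depth = 0      # greater than 0 while inside a <td>
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "td" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.cells.append("".join(self._buffer).strip())
                self._buffer = []

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def cell_texts(html):
    """Return the stripped text of every <td> in the HTML string."""
    parser = CellTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.cells


def report_text(xml_source, tag_name):
    """Find the first element named tag_name and join its cell texts."""
    dom = minidom.parseString(xml_source)
    node = dom.getElementsByTagName(tag_name)[0]
    # Join all text and CDATA children rather than relying on firstChild alone.
    wanted = (minidom.Node.TEXT_NODE, minidom.Node.CDATA_SECTION_NODE)
    html = "".join(c.data for c in node.childNodes if c.nodeType in wanted)
    return " - ".join(cell_texts(html))


# Hypothetical, heavily shortened stand-in for the XBRL file in the question.
SAMPLE = (
    '<root xmlns:arr="http://example.com/arr">'
    '<arr:AuditorsReportOnFinancialStatements>'
    '&lt;table&gt;&lt;tr&gt;&lt;td colspan="1"&gt;Independent auditor\'s report'
    '&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;br&gt;'
    '&lt;table&gt;&lt;tr&gt;&lt;td&gt;We have audited the financial statements.'
    '&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;'
    '</arr:AuditorsReportOnFinancialStatements>'
    '</root>'
)

if __name__ == "__main__":
    print(report_text(SAMPLE, "arr:AuditorsReportOnFinancialStatements"))
```

BeautifulSoup is still the more forgiving choice for messy markup; the sketch above only tracks td cells and ignores everything outside them.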


python

xml

html-parsing

xml-parsing

python-2.7