从 BS4 到 lxml 解析器的代码转换

Question

我正在开展一个项目，使用 BS4 从本地存储的 HTML 文件中提取特定信息。因为我确实有相当多的文件（> 100 万）速度和性能是有一个有用的代码浏览所有文件的关键。直到现在我都在使用 BS4，因为我之前在网络爬虫上工作，我认为 BS4 非常简单和方便。但是，如果涉及到大数据，BS4 就很慢了。我读到了 lxml parser 和 html.parser ，这似乎是 HTML 文档中 python 中最简单和最快的解析器。

所以我的代码现在看起来像：

from bs4 import BeautifulSoup
import glob
import os
import re
import contextlib


@contextlib.contextmanager
def stdout2file(fname):
    import sys
    f = open(fname, 'w')
    sys.stdout = f
    yield
    sys.stdout = sys.__stdout__
    f.close()


def trade_spider():
    os.chdir(r"C:\Users\XXX")
    with stdout2file("output.txt"):
        for file in glob.iglob('**/*.html', recursive=True):
            with open(file, encoding="utf8") as f:
                contents = f.read()
                soup = BeautifulSoup(contents, "html.parser")
                for item in soup.findAll("ix:nonfraction"):
                    if re.match(".*SearchTag", item['name']):
                        print(file.split(os.path.sep)[-1], end="| ")
                        print(item['name'], end="| ")
                        print(item.get_text())
                        break
trade_spider()

它打开一个文本文件，进入我设置的目录 (os.chdir(..))，搜索所有以 .html 结尾的文件，读取内容，如果找到带有名称的标签属性 "SearchTag" 它获取相关的 HTML 文本并将其打印到我打开的文本文件中。一场比赛结束后有休息时间，下一场比赛将继续进行。所以我读到的是，BS4 在内存中完成这一切，这显着增加了处理时间。

这就是为什么我想使用 lxml（首选）或 html.parser 来更改我的代码。

你们中有谁是天才并且能够在不改变我最初对此的简单想法的情况下更改我的代码以使用 lxml 解析器？

感谢任何帮助，因为我完全被困住了....

更新：

import lxml.etree as et
import os
import glob

import contextlib


@contextlib.contextmanager
def stdout2file(fname):
    import sys
    f = open(fname, 'w')
    sys.stdout = f
    yield
    sys.stdout = sys.__stdout__
    f.close()


def skip_to(fle, line):
        with open(fle) as f:
            pos = 0
            cur_line = f.readline().strip()
            while not cur_line.startswith(line):
                pos = f.tell()
                cur_line = f.readline()
            f.seek(pos)
            return et.parse(f)


def trade_spider():
    os.chdir(r"F:_Independent Auditors Report")
    with stdout2file("auditfeesexpenses.txt"):
        for file in glob.iglob('**/*.html', recursive=True):
            xml = skip_to(file, "<?xml")
            tree = xml.getroot()
            nsmap = {"ix": tree.nsmap["ix"]}
            fractions = xml.xpath("//ix:nonFraction[contains(@name, 'AuditFeesExpenses')]", namespaces=nsmap)
            for fraction in fractions:
                print(file.split(os.path.sep)[-1], end="| ")
                print(fraction.get("name"), end="| ")
                print(fraction.text, end=" \n")
                break

trade_spider()

我收到此错误消息：

Traceback (most recent call last):
  File "C:/Users/6930p/PycharmProjects/untitled/Versuch/lxmlparser.py", line 43, in <module>
    trade_spider()
  File "C:/Users/6930p/PycharmProjects/untitled/Versuch/lxmlparser.py", line 33, in trade_spider
    xml = skip_to(file, "<?xml")
  File "C:/Users/6930p/PycharmProjects/untitled/Versuch/lxmlparser.py", line 26, in skip_to
    return et.parse(f)
  File "lxml.etree.pyx", line 3427, in lxml.etree.parse (src\lxml\lxml.etree.c:79720)
  File "parser.pxi", line 1803, in lxml.etree._parseDocument (src\lxml\lxml.etree.c:116182)
  File "parser.pxi", line 1823, in lxml.etree._parseFilelikeDocument (src\lxml\lxml.etree.c:116474)
  File "parser.pxi", line 1718, in lxml.etree._parseDocFromFilelike (src\lxml\lxml.etree.c:115235)
  File "parser.pxi", line 1139, in lxml.etree._BaseParser._parseDocFromFilelike (src\lxml\lxml.etree.c:110109)
  File "parser.pxi", line 573, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:103323)
  File "parser.pxi", line 679, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:104936)
  File "lxml.etree.pyx", line 324, in lxml.etree._ExceptionContext._raise_if_stored (src\lxml\lxml.etree.c:10656)
  File "parser.pxi", line 362, in lxml.etree._FileReaderContext.copyToBuffer (src\lxml\lxml.etree.c:100828)
  File "C:\Users30p\Anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 1789: character maps to <undefined>

Answer 1

根据 pastebin 中的 html 文件整理 html 需要做一些工作，下面找到 nonFraction 名称属性包含 'AuditFeesExpenses':

import lxml.etree as et

def skip_to(fle, line):
    with open(fle) as f:
        pos = 0
        cur_line = f.readline().strip()
        while not cur_line.startswith(line):
            pos = f.tell()
            cur_line = f.readline()
        f.seek(pos)
        return et.parse(f)

xml = skip_to("/home/padraic/Downloads/sample_html_file.html","<?xml")
tree = xml.getroot()
# one mapping is None ->  None: 'http://www.w3.org/1999/xhtml'
nsmap = {k: v for k, v in tree.nsmap.items() if k}

print(xml.xpath("//ix:nonFraction[contains(@name, 'AuditFeesExpenses')]", namespaces=nsmap))

输出：

[<Element {http://www.xbrl.org/2008/inlineXBRL}nonFraction at 0x7f5b9e91c560>, <Element {http://www.xbrl.org/2008/inlineXBRL}nonFraction at 0x7f5b9e91c5a8>]

要拉取文字和名字：

fractions = xml.xpath("//ix:nonFraction[contains(@name, 'AuditFeesExpenses')]", namespaces=nsmap)

for fraction in fractions:
    print(fraction.get("name"))
    print(fraction.text)

哪个会给你：

ns19:AuditFeesExpenses
1,850
ns19:AuditFeesExpenses
2,400

此外，如果您只是使用 ix 命名空间，您可以直接拉取它

tree = xml.getroot()
nsmap = {"ix":tree.nsmap["ix"]}

fractions = xml.xpath("//ix:nonFraction[contains(@name, 'AuditFeesExpenses')]", namespaces=nsmap)

for fraction in fractions:
    print(fraction.get("name"))
    print(fraction.text)

所以完整的代码：

def trade_spider():
    os.chdir(r"C:\Users\Independent Auditors Report")
    with stdout2file("auditfeesexpenses.txt"):
        for file in glob.iglob('**/*.html', recursive=True):
            xml = skip_to(file, "<?xml")
            tree = xml.getroot()
            nsmap = {"ix": tree.nsmap["ix"]}
            fractions = xml.xpath("//ix:nonFraction[contains(@name, 'AuditFeesExpenses')]", namespaces=nsmap)
            for fraction in fractions:
                print(file.split(os.path.sep)[-1], end="| ")
                print(fraction.get("name"), end="| ")
                print(fraction.text, end="|")

代替os.chdir你还可以：

for file in glob.iglob('C:/Users/Independent Auditors Report/**/*.html', recursive=True):

从 BS4 到 lxml 解析器的代码转换

Code transformation from BS4 to lxml parser

lxml

beautifulsoup

pycharm

python-3.x