How to avoid (nested) if/else statements by using BeautifulSoup with a decorator?

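Problem description

Because BeautifulSoup returns either a soup object or None, a function needs an if/else check after every .find or .find_all call.

Question

How can a decorator (or something similar) be used to avoid this?

Example

Assume two different HTML pages (with these example snippets):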

# example with wanted class in html file
<td class='translation'>
    <span class='italiano'>ciao</span>
    <span class='french'>au revoir</span>
    <span class='polish'>cześć</span>
</td>

# example without wanted class in another html file
<td class='no_translation'>
    foo
</td>

If you run the search below, everything works for the first HTML snippet, but for the second one you get:

>>> soup.find('td', class_='translation').find('span', class_='polish')
AttributeError: 'NoneType' object has no attribute 'find'

There are two obvious ways to handle this AttributeError:

# using if-else statements for every result of .find or .find_all
def possibility_1():
    translation = soup.find('td', class_='translation')
    if translation:
        polish = translation.find('span', class_='polish')
        return polish
    return None

# use a try-except block for the problem
def possibility_2():
    try:
        translation = soup.find('td', class_='translation')
        polish = translation.find('span', class_='polish')
        return polish
    except AttributeError:
        return None

What about a third solution that uses a decorator? How could this be done?

@decorator_name
def get_desired_result():
    translation = soup.find('td', class_='translation')
    polish = translation.find('span', class_='polish')
    return polish

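Instead of using a decorator, or checking for None by hand with if/else, you could consider using your own functions in place of .find and .find_all.

Besides, returning plain old None has 2 problems:

  • You don't know where the error propagated from, so debugging will be hard.
  • After returning None, you may end up doing something like soup.find_all("a") or link["href"] on None. That doesn't help you at all.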
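To make the second point concrete, here is a small sketch. It does not use BeautifulSoup at all: `lookup` and `page` are hypothetical stand-ins that return None on a miss, the way .find does. The failure only shows up one step later, at a line that says nothing about which lookup missed.

```python
# Hypothetical stand-in for a lookup that returns None on a miss,
# the way soup.find does. Not part of BeautifulSoup.
def lookup(table, key):
    return table.get(key)


page = {"translation": {"polish": "cześć"}}

cell = lookup(page, "BADTRANSLATIONS")  # silently None, no error yet

try:
    word = cell["polish"]               # the failure surfaces here instead
except TypeError as exc:
    message = str(exc)

# The message mentions NoneType, but not the search that produced it.
print(message)
```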

So you could try something like this:

class PseudoNone(object):
    """
    You can call it.
    You can beat it with a stick.
    It will return PseudoNone!
    And you can trace where the None did come from!!"""
    debug = {}
    def __init__(self, created_at):
        PseudoNone.debug[self] = created_at
    def __getattr__(self, attr):
        return self
    def __call__(self, *args, **kwargs):
        return self
    def __getitem__(self, item):
        return self
    def __bool__(self):
        return False

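This "None" should not have those problems. In addition, every instance is created with an identifier for whatever caused the None. All "children Nones" produced by PseudoNone.__call__ or __getitem__ are actually the very same object in memory, and therefore have the same initial failure cause in PseudoNone.debug[obj]. Handy for debugging!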
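A quick way to check the "same object in memory" claim without BeautifulSoup is the sketch below. It repeats the class in compact form so that it runs on its own, and it adds one line that is my assumption rather than part of the answer: `__nonzero__ = __bool__`, so that the object is falsy on Python 2 as well as Python 3. The failure description string is made up for the example.

```python
class PseudoNone(object):
    """Compact copy of the class above so this sketch runs on its own."""
    debug = {}

    def __init__(self, created_at):
        PseudoNone.debug[self] = created_at

    def __getattr__(self, attr):
        return self

    def __call__(self, *args, **kwargs):
        return self

    def __getitem__(self, item):
        return self

    def __bool__(self):        # truth hook on Python 3
        return False

    __nonzero__ = __bool__     # assumption: same hook under its Python 2 name


missing = PseudoNone("td.translation not found")

# Attribute access, calls and indexing all hand back the same object.
child = missing.find("span", class_="polish")["href"]

print(child is missing)          # identity is preserved along the chain
print(PseudoNone.debug[child])   # so the original cause is still reachable
print(bool(child))               # and the object is falsy, like None
```

Because no new object is created along the chain, looking up any "child" in PseudoNone.debug leads back to the first failed search.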

from bs4 import BeautifulSoup

xml = """
<td class='translation'>
    <span class='italiano'>ciao</span>
    <span class='french'>au revoir</span>
    <span class='polish'>cześć</span>
</td>"""

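def find_all(soup, *args, **kwargs):
    # An earlier miss is passed through unchanged, so the original
    # failure cause in PseudoNone.debug is not overwritten.
    if isinstance(soup, PseudoNone):
        return soup
    results = soup.find_all(*args, **kwargs)
    if not results:
        return PseudoNone((soup, args, kwargs))
    return results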

def find(soup, *args, **kwargs):
    "As far as I know, BeautifulSoup.find is internally just BeautifulSoup.find_all(*args)[0]"
    results = find_all(soup, *args, **kwargs)
    return results[0]

soup = BeautifulSoup(xml)

translation = find(soup, 'td', class_='translation')

erroneous_translation = find(soup, 'td', class_='BADTRANSLATIONS')

...

print(translation)
    <td class="translation">
    <span class="italiano">ciao</span>
    <span class="french">au revoir</span>
    <span class="polish">cześć</span>
    </td>

print(erroneous_translation)
    <__main__.PseudoNone object at 0x7fd4bcc18790>

print(erroneous_translation("foo"))
    <__main__.PseudoNone object at 0x7fd4bcc18790>

print(erroneous_translation["baz"])
    <__main__.PseudoNone object at 0x7fd4bcc18790>

print(find_all(erroneous_translation, "something"))
    <__main__.PseudoNone object at 0x7fd4bcc18790>

Oh my, it's a PseudoNone! That's not what I wanted. Where did I go wrong!!?

print(PseudoNone.debug[erroneous_translation])
    (<html><body><td class="translation">
    <span class="italiano">ciao</span>
    <span class="french">au revoir</span>
    <span class="polish">cześć</span>
    </td></body></html>, ('td',), {'class_': 'BADTRANSLATIONS'})

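Notes:

  • Use isinstance(qux, PseudoNone) instead of == None. (We cannot subclass NoneType.)
  • If PseudoNone.debug grows too large for memory, consider storing a hash of *args and **kwargs as the values in PseudoNone.debug (and/or making use of @functools.lru_cache in Python 3).
  • This is probably a hack.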
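The first two notes can be illustrated with a small variant. This is only a sketch under my own assumptions: `SlimPseudoNone` and `describe` are hypothetical names, and instead of hashing the arguments it keeps a short repr of them on the instance itself, so neither the soup nor a class-level dictionary has to be kept alive.

```python
class SlimPseudoNone(object):
    """Hypothetical variant that stores a short cause on the instance."""

    def __init__(self, args, kwargs):
        # Assumption: a repr of the search arguments is enough for debugging,
        # so the (possibly large) soup is not referenced at all.
        self.cause = "args=%r kwargs=%r" % (args, kwargs)

    def __getattr__(self, attr):
        # Only reached for attributes that do not exist, so .cause still works.
        return self

    def __call__(self, *args, **kwargs):
        return self

    def __getitem__(self, item):
        return self

    def __bool__(self):
        return False

    __nonzero__ = __bool__  # Python 2 name for the same hook


def describe(result):
    # isinstance is the reliable test: the object is an ordinary instance
    # and never compares equal to None.
    if isinstance(result, SlimPseudoNone):
        return "missing: " + result.cause
    return "found: %r" % (result,)


miss = SlimPseudoNone(("td",), {"class_": "BADTRANSLATIONS"})

print(miss == None)              # an == None check does not detect it
print(describe(miss))
print(describe("<td>...</td>"))
```

The trade-off is that the cause is now a plain string, so the original soup can no longer be inspected after the fact.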

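Thanks to the comments (almost a discussion) from @jonrsharpe and to the other answer, I will stick with the decorator idea, but return None together with trace information about where the search failed.

Here is my decorator as one possible solution.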

import sys
import inspect
from functools import wraps
from bs4 import BeautifulSoup

# Decorator that returns None and prints trace info if
# soup.find or soup.find_all fails at a certain point
def robust_soup(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except AttributeError:
            # just an example without formatting
            print(inspect.getinnerframes(sys.exc_info()[2]))
            return None
    return wrapper

Now I can use it:

# a good working example
soup_good = BeautifulSoup("""
<td class='translation'>
    <span class='italiano'>ciao</span>
    <span class='french'>au revoir</span>
    <span class='polish'>cześć</span>
</td>""")

# an example which would lead to AttributeError if not handled
soup_bad = BeautifulSoup("""
<td class='no_translation'>
    something uninteresting
</td>""")

@robust_soup
def get_desired_result(soup):
    translation = soup.find('td', class_='translation')
    polish = translation.find('span', class_='polish')
    print(polish)

>>> # with a soup containing information
>>> get_desired_result(soup_good)
<span class="polish">cześć</span>

>>> # with a soup which normally fails
>>> get_desired_result(soup_bad)
# some debugging output from the inspect module (including
# where the last error occurred); the call then returns None
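For readers who prefer a formatted traceback over raw frame records, a variant of the same decorator could look like the sketch below. It is an illustration, not the decorator above: the `last_error` attribute and the `EmptyPage` stand-in class are my own hypothetical additions, and the example runs without BeautifulSoup.

```python
import traceback
from functools import wraps


def robust_soup(func):
    """Return None instead of raising when a chained lookup hits None."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except AttributeError:
            # Keep the formatted traceback so the failing line can be found.
            # (last_error is a hypothetical attribute added for this sketch.)
            wrapper.last_error = traceback.format_exc()
            return None
    wrapper.last_error = None
    return wrapper


class EmptyPage(object):
    """Hypothetical stand-in for a soup in which nothing matches."""
    def find(self, *args, **kwargs):
        return None


@robust_soup
def get_desired_result(soup):
    translation = soup.find('td', class_='translation')
    return translation.find('span', class_='polish')


result = get_desired_result(EmptyPage())
print(result)
# The last traceback line names the exception that was swallowed.
print(get_desired_result.last_error.splitlines()[-1])
```

One caveat with any decorator of this kind: it also swallows AttributeErrors that have nothing to do with a failed search, such as a typo in a method name, so keeping the traceback around matters.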