How to avoid (nested) if/else statements by using BeautifulSoup with a decorator?
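Problem description
Because BeautifulSoup returns either a soup object or None, a function needs an if/else check every time the result of .find or .find_all is used for a further lookup.
Question
How can this be avoided with a decorator (or a similar approach)?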
Example
Suppose there are two different HTML pages (represented by these example snippets):
# example with wanted class in html file
<td class='translation'>
<span class='italiano'>ciao</span>
<span class='french'>au revoir</span>
<span class='polish'>cześć</span>
</td>
# example without wanted class in another html file
<td class='no_translation'>
foo
</td>
If you run the search below, everything works for the first HTML snippet, but for the second one you get:
>>> soup.find('td', class_='translation').find('span', class_='polish')
AttributeError: 'NoneType' object has no attribute 'find'
There are two obvious ways to handle this AttributeError:
# using if-else statements for every result of .find or .find_all
def possibility_1():
    translation = soup.find('td', class_='translation')
    if translation is not None:
        polish = translation.find('span', class_='polish')
        return polish
    return None

# use a try-except block for the problem
def possibility_2():
    try:
        translation = soup.find('td', class_='translation')
        polish = translation.find('span', class_='polish')
        return polish
    except AttributeError:
        return None
What about a third solution that uses a decorator? How could that be done?
@decorator_name
def get_desired_result():
    translation = soup.find('td', class_='translation')
    polish = translation.find('span', class_='polish')
    return polish
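Instead of a decorator, and instead of checking for None by hand with if/then, you could consider using your own functions in place of .find and .find_all.
Besides, returning a plain old None has two problems.
- You don't know where the error propagated from, so debugging will be hard.
- Once None has been returned, you may end up doing things like soup.find_all("a") or link["href"] on None. That doesn't help you at all.
So you could try something like this: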
class PseudoNone(object):
    """
    You can call it.
    You can beat it with a stick.
    It will return PseudoNone!
    And you can trace where the None did come from!!"""
    # maps every PseudoNone instance to the search that produced it
    debug = {}

    def __init__(self, created_at):
        PseudoNone.debug[self] = created_at

    def __getattr__(self, attr):
        # any attribute access yields the same object again
        return self

    def __call__(self, *args, **kwargs):
        return self

    def __getitem__(self, item):
        return self

    def __bool__(self):
        return False

    # Python 2 looks for __nonzero__ instead of __bool__
    __nonzero__ = __bool__
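This "None" should not have those problems. In addition, every instance records, at creation time, some identifier of what caused the None. All the "child Nones" produced by PseudoNone.__call__ or __getitem__ are really just the same object in memory, so they share the same original failure reason in PseudoNone.debug[obj]. Handy for debugging!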
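The "same object in memory" claim can be checked without any HTML at all. The following is a minimal sketch that uses a compact copy of the class; the reason string and the chained attribute names are made up for illustration and are not taken from the answer.

```python
class PseudoNone(object):
    """Falsy placeholder that absorbs attribute access, calls and indexing."""
    debug = {}

    def __init__(self, created_at):
        PseudoNone.debug[self] = created_at

    def __getattr__(self, attr):
        return self

    def __call__(self, *args, **kwargs):
        return self

    def __getitem__(self, item):
        return self

    def __bool__(self):
        return False


# Hypothetical failure reason, chosen only for this demonstration.
missing = PseudoNone("search for td.translation came back empty")

# A chain that would raise on a real None simply keeps returning the object.
child = missing.find("span")["href"].text

same_object = child is missing       # every link in the chain is `missing`
reason = PseudoNone.debug[child]     # so the original reason is still reachable
is_falsy = not child                 # and `if child:` behaves as it would for None
```

Because nothing in the chain creates a new object, a single dictionary lookup is enough to find out which search started the failure.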
from bs4 import BeautifulSoup

xml = """
<td class='translation'>
<span class='italiano'>ciao</span>
<span class='french'>au revoir</span>
<span class='polish'>cześć</span>
</td>"""

def find_all(soup, *args, **kwargs):
    # searching an existing PseudoNone hands back that same object,
    # so the original failure reason stays attached to it
    if isinstance(soup, PseudoNone):
        return soup
    results = soup.find_all(*args, **kwargs)
    if not results:
        # remember which search came back empty
        return PseudoNone((soup, args, kwargs))
    return results

def find(soup, *args, **kwargs):
    "As far as I know, BeautifulSoup.find is internally just BeautifulSoup.find_all(*args)[0]"
    results = find_all(soup, *args, **kwargs)
    return results[0]

soup = BeautifulSoup(xml)
translation = find(soup, 'td', class_='translation')
erroneous_translation = find(soup, 'td', class_='BADTRANSLATIONS')
...
print(translation)
<td class="translation">
<span class="italiano">ciao</span>
<span class="french">au revoir</span>
<span class="polish">cześć</span>
</td>
print(erroneous_translation)
<__main__.PseudoNone object at 0x7fd4bcc18790>
print(erroneous_translation("foo"))
<__main__.PseudoNone object at 0x7fd4bcc18790>
print(erroneous_translation["baz"])
<__main__.PseudoNone object at 0x7fd4bcc18790>
print(find_all(erroneous_translation, "something"))
<__main__.PseudoNone object at 0x7fd4bcc18790>
Oh no, it's a PseudoNone! That's not what I wanted. Where did I go wrong!!?
print(PseudoNone.debug[erroneous_translation])
(<html><body><td class="translation">
<span class="italiano">ciao</span>
<span class="french">au revoir</span>
<span class="polish">cześć</span>
</td></body></html>, ('td',), {'class_': 'BADTRANSLATIONS'})
Notes:
- Use isinstance(qux, PseudoNone) rather than == None. (We can't subclass NoneType.)
- If PseudoNone.debug grows too large for memory, consider hashing *args and **kwargs in the values of PseudoNone.debug (and/or making use of @functools.lru_cache in Python 3).
- This is probably a hack.
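The first two notes can be illustrated together. This is a rough sketch only: `describe_search` is a hypothetical helper that is not part of the answer, and it stores a short text summary of the search rather than a hash, on the assumption that keeping the whole soup object alive is what makes PseudoNone.debug expensive. The class here is a trimmed-down stand-in that keeps just the parts the example needs.

```python
class PseudoNone(object):
    """Trimmed-down stand-in: only the registry and the falsy behaviour."""
    debug = {}

    def __init__(self, created_at):
        PseudoNone.debug[self] = created_at

    def __bool__(self):
        return False


def describe_search(args, kwargs):
    """Hypothetical helper: a compact, hashable summary of a failed search."""
    return (repr(args), repr(sorted(kwargs.items())))


# Store the summary instead of (soup, args, kwargs).
missing = PseudoNone(describe_search(('td',), {'class_': 'BADTRANSLATIONS'}))

detected = isinstance(missing, PseudoNone)   # the reliable check
equals_none = (missing == None)              # noqa: E711, shown only to make the point
summary = PseudoNone.debug[missing]
```

The comparison with None is False even though the object is falsy, which is why the isinstance check is the one to rely on.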
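Thanks to the comments (almost a discussion) from @jonrsharpe and to the answer above, I will stick with the decorator idea, but retrieve information about the failed search and return None.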
Here is my decorator as one possible solution.
import sys
import inspect
from functools import wraps
from bs4 import BeautifulSoup

# Decorator with returning None and trace info if
# soup.find or soup.find_all fails at a certain point
def robust_soup(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except AttributeError:
            # just an example without formatting
            print(inspect.getinnerframes(sys.exc_info()[2]))
            return None
    return wrapper
Now I can use it:
# a good working example
soup_good = BeautifulSoup("""
<td class='translation'>
<span class='italiano'>ciao</span>
<span class='french'>au revoir</span>
<span class='polish'>cześć</span>
</td>""")

# an example which would lead to AttributeError if not handled
soup_bad = BeautifulSoup("""
<td class='no_translation'>
something uninteresting
</td>""")

@robust_soup
def get_desired_result(soup):
    translation = soup.find('td', class_='translation')
    polish = translation.find('span', class_='polish')
    print(polish)
>>> # with a soup containing information
>>> get_desired_result(soup_good)
<span class="polish">cześć</span>
>>> # with a soup which normally fails
>>> get_desired_result(soup_bad)
# some debugging output from inspect module (also
# with information where last error occurred!)
None
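For readers on Python 3, the same decorator idea could be written with the standard logging and traceback modules instead of printing raw frames. This is only a sketch under a few assumptions: the logger name is arbitrary, and `EmptyPage` is a hypothetical stand-in for a soup in which nothing matches, used so the example runs without any HTML.

```python
import logging
import traceback
from functools import wraps

logger = logging.getLogger("robust_soup")  # logger name is an assumption


def robust_soup(func):
    """Return None and log the failing line when a chained lookup hits None."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except AttributeError as exc:
            # The last traceback entry is the line that touched None.
            last = traceback.extract_tb(exc.__traceback__)[-1]
            logger.warning("%s failed at line %s in %s: %s",
                           func.__name__, last.lineno, last.name, exc)
            return None
    return wrapper


class EmptyPage(object):
    """Hypothetical stand-in for a soup in which nothing matches."""
    def find(self, *args, **kwargs):
        return None


@robust_soup
def get_desired_result(soup):
    translation = soup.find('td', class_='translation')
    return translation.find('span', class_='polish')


result = get_desired_result(EmptyPage())  # logs a warning, returns None
```

One trade-off to keep in mind: catching AttributeError this broadly also swallows genuine mistakes inside the decorated function, such as a misspelled method name, so the logged line number is worth reading rather than ignoring.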