lxml / BeautifulSoup parser warning

Question

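Using Python 3, I'm trying to parse ugly HTML (which is not under my control) with lxml and BeautifulSoup, as described here: http://lxml.de/elementsoup.html

Specifically, I want to use lxml, but I need to go through BeautifulSoup because, as I said, the HTML is ugly and lxml rejects it on its own.

The link above says: "All you need to do is pass it to the fromstring() function:"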

from lxml.html.soupparser import fromstring
root = fromstring(tag_soup)

So this is what I'm doing:

import requests

URL = 'http://some-place-on-the-internet.com'
html_goo = requests.get(URL).text
root = fromstring(html_goo)

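It works, in the sense that I can manipulate the HTML just fine afterwards. My only problem is that every time I run the script, I get this annoying warning: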

/usr/lib/python3/dist-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

To get rid of this warning, change this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "html.parser")

  markup_type=markup_type))

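My problem is probably obvious: I'm not instantiating BeautifulSoup myself. I've tried adding the suggested argument to the fromstring function, but that only gives me the error: TypeError: 'str' object is not callable. Searching online has turned up nothing so far.

I'd like to get rid of that warning message. Any help is appreciated, thanks in advance.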

Answer 1

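I had to read the source code of both lxml and BeautifulSoup to figure this out.

I'm posting my own answer here in case someone else needs it in the future.

The fromstring function in question is defined like this: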

def fromstring(data, beautifulsoup=None, makeelement=None, **bsargs):

The **bsargs arguments are eventually passed to the BeautifulSoup constructor, which is called like this (in another function, _parse):

tree = beautifulsoup(source, **bsargs)

The BeautifulSoup constructor is defined like this:

def __init__(self, markup="", features=None, builder=None,
             parse_only=None, from_encoding=None, exclude_encodings=None,
             **kwargs):

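Now, back to the warning in the question: it recommends adding the argument "html.parser" to the BeautifulSoup constructor. According to the signature above, that is the parameter named features.

Since fromstring passes named arguments on to the BeautifulSoup constructor, we can specify the parser by naming the argument in the call to fromstring, like so: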

root = fromstring(clean, features='html.parser')

Poof. The warning disappears.
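
The forwarding described above can be mimicked with plain Python. This is only a sketch: FakeSoup and fromstring_sketch are hypothetical stand-ins that copy the parameter order of the two signatures quoted in this answer, not the real lxml or bs4 code. It also suggests why passing the parser name positionally, as tried in the question, fails: the string is bound to the beautifulsoup parameter and then called.

```python
# Hypothetical stand-ins that mimic the call chain; not the real lxml/bs4 code.

class FakeSoup:
    """Stand-in for bs4.BeautifulSoup; it only records the requested parser."""

    def __init__(self, markup="", features=None, **kwargs):
        self.markup = markup
        self.features = features


def fromstring_sketch(data, beautifulsoup=None, makeelement=None, **bsargs):
    """Uses the same parameter order as lxml.html.soupparser.fromstring."""
    if beautifulsoup is None:
        beautifulsoup = FakeSoup
    # Extra keyword arguments travel on to the constructor unchanged.
    return beautifulsoup(data, **bsargs)


# Keyword form: 'features' lands in **bsargs and reaches the constructor.
tree = fromstring_sketch("<p>tag soup", features="html.parser")
print(tree.features)  # html.parser

# Positional form: the string is bound to 'beautifulsoup' and then called.
try:
    fromstring_sketch("<p>tag soup", "html.parser")
except TypeError as exc:
    error_message = str(exc)
    print(error_message)  # 'str' object is not callable
```

So the parser name has to be given by keyword; the second positional slot of fromstring expects a callable.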

Answer 2

When using BeautifulSoup, we usually write something like this:

[variable] = BeautifulSoup([contents you want to analyze])

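The issue is:

If you have installed lxml, BeautifulSoup automatically detects it and uses it as the parser. This is not an error, just a notice.

So how do you get rid of it?

Like this: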

[variable] = BeautifulSoup([contents you want to analyze], features = "lxml")

"Based on the latest version of BeautifulSoup, 4.6.3"

Note that different versions of BeautifulSoup may expect a different way, or syntax, of specifying the parser, so read the notice carefully.

Good luck!

Answer 3

For anyone else who initializes it like this:

soup = BeautifulSoup(html_doc)

use

soup = BeautifulSoup(html_doc, 'html.parser')

instead.
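
A minimal way to check this, assuming the beautifulsoup4 package is installed: the snippet below records any warnings raised while the soup is built with an explicit parser and looks for the "No parser was explicitly specified" message. The markup in html_doc is made up for illustration.

```python
import warnings

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

html_doc = "<p>Hello <b>world"  # made-up, deliberately unclosed markup

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning raised in this block
    soup = BeautifulSoup(html_doc, "html.parser")

# Keep only the warnings that complain about a missing parser argument.
parser_warnings = [
    w for w in caught
    if "No parser was explicitly specified" in str(w.message)
]

print(soup.b.get_text())     # text inside the <b> tag
print(len(parser_warnings))  # number of parser warnings recorded
```

With the parser named explicitly, the list of parser warnings should stay empty.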


python

lxml

beautifulsoup

python-3.x