Is it possible to use a bs4 soup object with lxml?
I am trying to use both BS4 and lxml.
Rather than parsing the HTML page twice, is there a way to use the soup object in lxml, or vice versa?
self.soup = BeautifulSoup(open(path), "html.parser")
I tried using this object with lxml like this:
doc = html.fromstring(self.soup)
This throws the error TypeError: expected string or bytes-like object
Is there any way to do this?
I don't think there is a way to do it without going through a string object.
from bs4 import BeautifulSoup
import lxml.html
html = """
<html><body>
<div>
<p>test</p>
</div>
</body></html>
"""
soup = BeautifulSoup(html, 'html.parser')
# Soup to lxml.html
doc = lxml.html.fromstring(soup.prettify())
print(type(doc))
print(lxml.html.tostring(doc))
# lxml.html to soup
soup = BeautifulSoup(lxml.html.tostring(doc), 'html.parser')
print(type(soup))
print(soup.prettify())
Output:
<class 'lxml.html.HtmlElement'>
b'<html>\n <body>\n  <div>\n   <p>\n    test\n   </p>\n  </div>\n </body>\n</html>'
<class 'bs4.BeautifulSoup'>
<html>
 <body>
  <div>
   <p>
    test
   </p>
  </div>
 </body>
</html>
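One thing to watch in the round trip above: prettify() re-indents the markup, so the text inside <p> picks up extra whitespace (visible in the bytes output). A minimal sketch that serializes with str(soup) instead, so text nodes stay unchanged. The helper names soup_to_lxml and lxml_to_soup and the markup variable are made up for illustration and are not part of either library:

```python
from bs4 import BeautifulSoup
import lxml.html

markup = "<html><body><div><p>test</p></div></body></html>"


def soup_to_lxml(soup):
    """Serialize the soup without re-indenting, then parse it with lxml."""
    return lxml.html.fromstring(str(soup))


def lxml_to_soup(doc):
    """Serialize the lxml tree to text, then parse it with BeautifulSoup."""
    text = lxml.html.tostring(doc, encoding="unicode")
    return BeautifulSoup(text, "html.parser")


soup = BeautifulSoup(markup, "html.parser")
doc = soup_to_lxml(soup)
p_text = doc.xpath("//p/text()")   # text nodes keep their original content
round_trip = lxml_to_soup(doc)
print(p_text, round_trip.p.string)
```

Both directions still parse the document a second time; the only change is that no whitespace is added along the way.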
Updated based on the comments:
You can traverse the document object with lxml.etree:
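# Soup to lxml.etree
from lxml import etree
# etree.fromstring parses XML, so the markup must be well-formed
doc = etree.fromstring(soup.prettify())
# iter() replaces the deprecated getiterator()
for element in doc.iter():
    # element.text is None for elements that have no text
    print("%s - %s" % (element.tag, (element.text or "").strip()))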
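It may also be worth trying lxml's lxml.html.soupparser module. Its convert_tree() function takes an already-built BeautifulSoup tree and returns a list of top-level lxml elements, which might avoid the intermediate string. A minimal sketch, assuming your lxml install includes soupparser and works with bs4. The markup and variable names are only for illustration, and I have not checked how it behaves on badly broken HTML:

```python
from bs4 import BeautifulSoup
from lxml.html.soupparser import convert_tree

markup = "<html><body><div><p>test</p></div></body></html>"
soup = BeautifulSoup(markup, "html.parser")

# convert_tree builds lxml elements from the existing soup and returns
# them as a list of top-level elements rather than a single root
roots = convert_tree(soup)

# Collect the text of every <p>, whichever level the returned roots start at
p_texts = [p.text for root in roots for p in root.iter("p")]
print(p_texts)
```

Because the result is a list, iterate over it (or pick the element you need) before calling the usual lxml methods such as xpath().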