How to get "subsoups" and concatenate/join them?
I have an HTML document to process. For that I use BeautifulSoup. Now I would like to retrieve some "subsoups" from that document and join them into one soup, so that I can later pass it to a function that expects a soup object.
If that's unclear, here's an example...
from bs4 import BeautifulSoup
my_document = """
<html>
<body>
<h1>Some Heading</h1>
<div id="first">
<p>A paragraph.</p>
<a href="another_doc.html">A link</a>
<p>A paragraph.</p>
</div>
<div id="second">
<p>A paragraph.</p>
<p>A paragraph.</p>
</div>
<div id="third">
<p>A paragraph.</p>
<a href="another_doc.html">A link</a>
<a href="yet_another_doc.html">A link</a>
</div>
<p id="loner">A paragraph.</p>
</body>
</html>
"""
soup = BeautifulSoup(my_document, "html.parser")
# find the needed parts
first = soup.find("div", {"id": "first"})
third = soup.find("div", {"id": "third"})
loner = soup.find("p", {"id": "loner"})
subsoups = [first, third, loner]
# create a new (sub)soup
resulting_soup = do_some_magic(subsoups)
# use it in a function that expects a soup object and calls its methods
function_expecting_a_soup(resulting_soup)
The goal is to have an object in resulting_soup that is/behaves like a soup with the following content:
<div id="first">
<p>A paragraph.</p>
<a href="another_doc.html">A link</a>
<p>A paragraph.</p>
</div>
<div id="third">
<p>A paragraph.</p>
<a href="another_doc.html">A link</a>
<a href="yet_another_doc.html">A link</a>
</div>
<p id="loner">A paragraph.</p>
Is there a convenient way to do that? If there is a better way than find() to retrieve the "subsoups", I can use it instead. Thanks.
Update
Wondercricket suggests concatenating the strings containing the found tags and parsing them again into a new BeautifulSoup object. While that is one possible way to solve the problem, the re-parsing may take longer than I would like, especially when I want to retrieve most of a document and there are many such documents to process. find() returns a bs4.element.Tag. Is there a way to join several Tags into one soup without converting each Tag to a string and parsing that string?
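For reference, one way to join Tag objects into a fresh soup without serializing and re-parsing them is to append copies of the tags to a new, empty soup. This is a minimal sketch (the helper name join_tags is mine, and whether it actually beats re-parsing would need measuring):

```python
import copy
from bs4 import BeautifulSoup

def join_tags(tags):
    """Append copies of the given Tag objects to a fresh, empty soup."""
    combined = BeautifulSoup("", "html.parser")
    for tag in tags:
        # copy.copy() yields an independent Tag, so the original soup
        # stays intact; append() attaches the copy to the new soup
        combined.append(copy.copy(tag))
    return combined

doc = '<div id="a"><p>x</p></div><p id="b">y</p>'
soup = BeautifulSoup(doc, "html.parser")
parts = [soup.find("div", id="a"), soup.find("p", id="b")]
joined = join_tags(parts)
print(joined.find("p", id="b").text)  # the joined soup supports the usual methods
```

Note that appending without copy.copy() would *move* each tag out of the original soup, which may or may not be acceptable depending on whether the original tree is still needed.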
You can use findAll and pass in the ids of the elements you want to use.
import bs4

soup = bs4.BeautifulSoup(my_document, 'html.parser')

# EDIT -> I discovered you do not need regex; you can pass in a list of `ids`
sub = soup.findAll(attrs={'id': ['first', 'third', 'loner']})

# EDIT -> passing `html.parser` keeps `BeautifulSoup` from auto-appending `html` and `body` tags
sub = bs4.BeautifulSoup('\n\n'.join(str(s) for s in sub), 'html.parser')
print(sub)
>>> <div id="first">
<p>A paragraph.</p>
<a href="another_doc.html">A link</a>
<p>A paragraph.</p>
</div>
<div id="third">
<p>A paragraph.</p>
<a href="another_doc.html">A link</a>
<a href="yet_another_doc.html">A link</a>
</div>
<p id="loner">A paragraph.</p>
SoupStrainer
will do exactly what you are asking about, and as a bonus you get a performance boost, since it parses only what you want it to parse rather than the complete document tree:
from bs4 import BeautifulSoup, SoupStrainer
parse_only = SoupStrainer(id=["first", "third", "loner"])
soup = BeautifulSoup(my_document, "html.parser", parse_only=parse_only)
Now the soup
object will contain only the desired elements:
<div id="first">
<p>
A paragraph.
</p>
<a href="another_doc.html">
A link
</a>
<p>
A paragraph.
</p>
</div>
<div id="third">
<p>
A paragraph.
</p>
<a href="another_doc.html">
A link
</a>
<a href="yet_another_doc.html">
A link
</a>
</div>
<p id="loner">
A paragraph.
</p>
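As a self-contained check (using a trimmed-down version of the document from the question), the strained soup really does contain only the three requested elements; div#second and the heading are never parsed into the tree:

```python
from bs4 import BeautifulSoup, SoupStrainer

my_document = """
<html><body>
<h1>Some Heading</h1>
<div id="first"><p>A paragraph.</p></div>
<div id="second"><p>A paragraph.</p></div>
<div id="third"><a href="another_doc.html">A link</a></div>
<p id="loner">A paragraph.</p>
</body></html>
"""

parse_only = SoupStrainer(id=["first", "third", "loner"])
soup = BeautifulSoup(my_document, "html.parser", parse_only=parse_only)

# only the matching elements (and their children) make it into the tree
top_ids = [tag.get("id") for tag in soup.find_all(recursive=False)]
print(top_ids)  # ['first', 'third', 'loner']
```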
Is it also possible to specify not only ids but also tags? For example, if I want to filter all paragraphs with class="someclass" but not divs with the same class?
In that case, you can make a search function that joins multiple criteria for the SoupStrainer
:
from bs4 import BeautifulSoup, SoupStrainer
my_document = """
<html>
<body>
<h1>Some Heading</h1>
<div id="first">
<p>A paragraph.</p>
<a href="another_doc.html">A link</a>
<p>A paragraph.</p>
</div>
<div id="second">
<p>A paragraph.</p>
<p>A paragraph.</p>
</div>
<div id="third">
<p>A paragraph.</p>
<a href="another_doc.html">A link</a>
<a href="yet_another_doc.html">A link</a>
</div>
<p id="loner">A paragraph.</p>
<p class="myclass">test</p>
</body>
</html>
"""
def search(tag, attrs):
    if tag == "p" and "myclass" in attrs.get("class", []):
        return tag
    if attrs.get("id") in ["first", "third", "loner"]:
        return tag
parse_only = SoupStrainer(search)
soup = BeautifulSoup(my_document, "html.parser", parse_only=parse_only)
print(soup.prettify())
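As a quick self-contained check of the combined criteria (the shortened document and the name html_doc are mine), the strainer keeps a p with class="myclass" and the id-matched elements, while a div carrying the same class never enters the tree:

```python
from bs4 import BeautifulSoup, SoupStrainer

html_doc = """
<div class="myclass"><p>inside a div</p></div>
<p class="myclass">wanted</p>
<div id="first"><p>also wanted</p></div>
"""

def search(tag, attrs):
    # keep <p class="myclass">, but not a <div> with the same class
    if tag == "p" and "myclass" in attrs.get("class", []):
        return tag
    # independently, keep any element whose id is in the whitelist
    if attrs.get("id") in ["first", "third", "loner"]:
        return tag

soup = BeautifulSoup(html_doc, "html.parser", parse_only=SoupStrainer(search))
print([t.name for t in soup.find_all(recursive=False)])  # ['p', 'div']
```

The div with class="myclass" is rejected, and so is the plain paragraph inside it, since each element is tested against the strainer individually.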