Cutting/slicing an HTML document into pieces with BeautifulSoup?
I have an HTML document like this:
<h1> Name of Article </h2>
<p>First Paragraph I want</p>
<p>More Html I'm interested in</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want to pull out of the document.</p>
<h2> References </h2>
<p>Html I do not want...</p>
I don't need the article's references, so I want to cut the document at the second h2 tag.
Obviously I can get a list of the h2 tags like this:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
soupset = soup.find_all('h2')
soupset[1]  # this gets the h2 heading 'References', but not what comes before it
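I don't want a list of the h2 tags. I want to cut the document at the second h2 tag and keep everything above it in a new variable. Basically, the desired output is: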
<h1> Name of Article </h2>
<p>First Paragraph I want</p>
<p>More Html I'm interested in</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want to pull out of the document.</p>
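What is the best way to do this "slicing"/cutting of the HTML document, as opposed to simply finding the tag and outputting the tag itself?
You can remove/extract every sibling element that follows the "References" element, and then the element itself: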
import re
from bs4 import BeautifulSoup
data = """
<div>
<h1> Name of Article </h2>
<p>First Paragraph I want</p>
<p>More Html I'm interested in</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want to pull out of the document.</p>
<h2> References </h2>
<p>Html I do not want...</p>
</div>
"""
soup = BeautifulSoup(data, "lxml")
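# locate the heading that marks the cut point
references = soup.find("h2", string=re.compile("References"))
# drop everything after the heading, then the heading itself
for elm in references.find_next_siblings():
    elm.extract()
references.extract()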
print(soup)
Prints:
<div>
<h1> Name of Article</h1>
<p>First Paragraph I want</p>
<p>More Html I'm interested in</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want to pull out of the document.</p>
</div>
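Since the question asks for the wanted part in a new variable, here is a non-destructive variant that leaves the parsed tree alone and collects the preceding siblings instead. This is only a minimal sketch: the function name `html_before` is made up for illustration, the sample has no wrapping div, and `html.parser` is used in place of `lxml` so nothing extra needs installing.

```python
import re
from bs4 import BeautifulSoup

def html_before(markup, heading_pattern):
    """Return the markup of all sibling tags that precede the matching <h2>.

    `html_before` is a hypothetical helper, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(markup, "html.parser")
    stop = soup.find("h2", string=re.compile(heading_pattern))
    if stop is None:
        return markup  # no such heading, nothing to cut
    # find_previous_siblings() returns the nearest sibling first, so reverse it
    kept = reversed(stop.find_previous_siblings())
    return "\n".join(str(tag) for tag in kept)

data = """
<h1> Name of Article </h1>
<p>First Paragraph I want</p>
<p>More Html I'm interested in</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want to pull out of the document.</p>
<h2> References </h2>
<p>Html I do not want...</p>
"""

wanted = html_before(data, "References")
print(wanted)
```

This only looks at siblings of the heading, so it assumes the heading and the content to keep sit at the same nesting level.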
You can find the position of the last h2 in the string and then take the substring up to it:
last_h2_tag = str(soup.find_all("h2")[-1])
# everything before the last h2; add len(last_h2_tag) to the index to keep the heading itself
html[:html.rfind(last_h2_tag)]
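The string approach depends on the serialized tag matching the source text character for character, which is not guaranteed once the parser normalizes attributes or whitespace. A rough sketch that makes this assumption explicit could look like the following; `cut_before_last_h2` is a hypothetical name, and `html.parser` is my choice of parser.

```python
from bs4 import BeautifulSoup

def cut_before_last_h2(html):
    """Return the part of `html` that precedes the last <h2> element.

    Hypothetical helper: relies on str(tag) appearing verbatim in `html`.
    """
    soup = BeautifulSoup(html, "html.parser")
    headings = soup.find_all("h2")
    if not headings:
        return html  # no h2 at all, nothing to cut
    last_h2_tag = str(headings[-1])
    pos = html.rfind(last_h2_tag)
    if pos == -1:
        # the parser re-serialized the tag differently from the source text
        raise ValueError("last <h2> not found verbatim in the source string")
    return html[:pos]

html = """<h1> Name of Article </h1>
<p>First Paragraph I want</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want to pull out of the document.</p>
<h2> References </h2>
<p>Html I do not want...</p>
"""

result = cut_before_last_h2(html)
print(result)
```

If the source markup is messy enough that the verbatim match fails, the sibling-based approach from the first answer is probably the safer route.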