Wrap multiple tags with BeautifulSoup
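I'm writing a Python script that converts an HTML document into a reveal.js slideshow. To do this, I need to wrap multiple tags inside a <section> tag.
Wrapping one tag in another is easy with the wrap() method, but I can't figure out how to wrap several tags at once.
To illustrate, here is the original HTML: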
html_doc = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<h1 id="first-paragraph">First paragraph</h1>
<p>Some text...</p>
<p>Another text...</p>
<div>
<a href="http://link.com">Here's a link</a>
</div>
<h1 id="second-paragraph">Second paragraph</h1>
<p>Some text...</p>
<p>Another text...</p>
<script src="lib/.js"></script>
</body>
</html>
"""
I want to wrap each <h1> and the tags that follow it in a <section> tag, like this:
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<section>
<h1 id="first-paragraph">First paragraph</h1>
<p>Some text...</p>
<p>Another text...</p>
<div>
<a href="http://link.com">Here's a link</a>
</div>
</section>
<section>
<h1 id="second-paragraph">Second paragraph</h1>
<p>Some text...</p>
<p>Another text...</p>
</section>
<script src="lib/.js"></script>
</body>
</html>
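This is how I select the elements:
from bs4 import BeautifulSoup
import itertools

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html_doc, 'html.parser')
h1s = soup.find_all('h1')
for el in h1s:
    # Collect everything after this <h1> up to the next <h1> or <script>
    els = [i for i in itertools.takewhile(lambda x: x.name not in [el.name, 'script'], el.next_elements)]
    els.insert(0, el)
    print(els)
Output: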
[<h1 id="first-paragraph">First paragraph</h1>, 'First paragraph', '\n ', <p>Some text...</p>, 'Some text...', '\n ', <p>Another text...</p>, 'Another text...', '\n ', <div><a href="http://link.com">Here's a link</a> </div>, '\n ', <a href="http://link.com">Here's a link</a>, "Here's a link", '\n ', '\n\n ']
[<h1 id="second-paragraph">Second paragraph</h1>, 'Second paragraph', '\n ', <p>Some text...</p>, 'Some text...', '\n ', <p>Another text...</p>, 'Another text...', '\n\n ']
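The selection is correct, but I can't see how to wrap each selection in a <section> tag.
I finally found out how to use the wrap method in this case. What I needed to understand is that every change to the soup object is made in place.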
from bs4 import BeautifulSoup
import itertools

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html_doc, 'html.parser')

# wrap all h1 and next siblings into sections
h1s = soup.find_all('h1')
for el in h1s:
    # Build the sibling list before wrapping, because wrap() and append() modify the tree
    els = [i for i in itertools.takewhile(
        lambda x: x.name not in [el.name, 'script'],
        el.next_siblings)]
    section = soup.new_tag('section')
    el.wrap(section)
    for tag in els:
        section.append(tag)

print(soup.prettify())
This gives me the output I wanted. Hope it helps.
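The same idea can be packed into a small reusable function. This is only a sketch: the name wrap_heading_groups and its parameters are made up for illustration and are not part of BeautifulSoup, and the shortened test document is an assumption too. Note that it walks next_siblings, not next_elements as in the question, so nested descendants travel along with their parent instead of being collected separately.

```python
from itertools import takewhile
from bs4 import BeautifulSoup


def wrap_heading_groups(soup, heading="h1", wrapper="section", stop=("script",)):
    """Wrap each `heading` tag and its following siblings in a new `wrapper` tag.

    Hypothetical helper for illustration; not a BeautifulSoup API.
    """
    for el in soup.find_all(heading):
        # Snapshot the siblings first: append() moves nodes, which would
        # disturb a live next_siblings iterator.
        group = list(takewhile(
            lambda node: getattr(node, "name", None) not in (heading, *stop),
            el.next_siblings))
        box = soup.new_tag(wrapper)
        el.wrap(box)
        for node in group:
            box.append(node)
    return soup


doc = "<body><h1>A</h1><p>a1</p><p>a2</p><h1>B</h1><p>b1</p><script></script></body>"
soup = wrap_heading_groups(BeautifulSoup(doc, "html.parser"))
print(soup)
```

Each <h1> ends up in its own <section> together with the paragraphs that follow it, while the trailing <script> stays outside.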
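I thought I'd weigh in on this, since it is a bit difficult and confusing. Basically, working with the html_test string parsed by BeautifulSoup, I add a new div to the tree, anchor it at <p class="x">, and then loop, wrapping/appending every element into it, including untagged strings, until it reaches <p class="y">. The wrapTag function appends all elements to the div until it reaches that last p. The main thing to realise about the while loop is that appending to the div moves the next_sibling rather than copying it. So if we looped over a fixed list by index, the i-th position would shift as we go. Hope that helps. Lucas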
#!/usr/bin/env python3
# coding: utf-8
from platform import python_version
from bs4 import __version__ as bs_version, Tag, BeautifulSoup

html_test = '<body><div class="div" id="1">beforePa<p>line 1a</p>betweenPab<p class="x">line 1b<b>bold in 1b</b></p>betweenPbc<p>line 1c</p>betweenPcd<p>line 1d</p>betweenPde<p class="y">line 1e</p>betweenPef<p>line 1f</p>afterPf</div><div class="div" id="2"><p>line 2a</p><p>line 2b</p><p>line 2c</p></div></body>'

# Prefer lxml; fall back to the built-in parser if lxml is not installed.
try:
    parser = 'lxml'
    html = BeautifulSoup(html_test, parser)
except Exception:
    parser = 'html.parser'
    html = BeautifulSoup(html_test, parser)
print(html.prettify())
print('python: "' + python_version() + '", bs4 version: "' + bs_version + '", bs4 parser: "' + parser + '"')

def p(tag, sstr=""):
    """Debug helper: print a tag together with its text, string and contents."""
    print(sstr + " ...")
    print(tag, " ... ", type(tag))
    print("text: ", tag.text, " ... ", type(tag.text))
    print("string: ", tag.string, " ... ", type(tag.string))
    print("contents: ", tag.contents, " ... ", type(tag.contents))
    print()

def newTag(tag, attrs=None, tstr=""):
    """Create a new tag with optional attributes and string content."""
    n = html.new_tag(tag)
    for k, v in (attrs or {}).items():
        n[k] = v
    if tstr:
        n.string = tstr
    return n

def wrapTag(wrapper, fromTagInclusive, toTagExclusive):
    """Wrap fromTagInclusive in wrapper, then move the following siblings
    (tags and bare strings) into wrapper until toTagExclusive is reached."""
    fromTagInclusive.wrap(wrapper)
    n = fromTagInclusive.parent  # this is the wrapper itself
    while True:
        # append() moves x out of its old position, so n.next_sibling is a new node on every pass
        x = n.next_sibling
        # Alternative stop test: isinstance(x, Tag) and x.name == 'p' and 'y' in x.get('class', [])
        if x is None or x is toTagExclusive:
            break
        n.append(x)
    return n, toTagExclusive

n = newTag('div', {'class': "classx", 'id': "idx"}, 'here we are in classx idx')
p(n, "new div")
n, _ = wrapTag(n, html.find('p', {'class': "x"}), html.find('p', {'class': "y"}))
p(n, "wrapped div")
print(html.prettify())
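To make the "moves rather than copies" point concrete, here is a minimal sketch on a made-up three-paragraph document (the markup and the variable names are assumptions for illustration only). After append(), the node has a new parent and the wrapper's next_sibling has advanced on its own.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>a</p><p>b</p><p>c</p></div>", "html.parser")
first, second, third = soup.find_all("p")

box = soup.new_tag("section")
first.wrap(box)

# Before the move, <p>b</p> is the wrapper's next sibling.
assert box.next_sibling is second

# append() re-parents the node: it leaves the <div> and joins the <section>.
box.append(second)
assert second.parent is box
assert box.next_sibling is third

print(soup)
```

This is why the loop above re-reads next_sibling on every pass, and why the earlier answer builds its list of siblings before it starts moving anything.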