Parsing and modifying content with Beautiful Soup (bs4)
The goal is to modify only the text content of existing HTML.
For example, given the current markup:
<html lang="en" op="item">
<head>
<meta name="referrer" content="origin">
<title>The Scientific Case for Two Spaces After a Period (2018)</title>
</head>
<body>
<center>
<table class="fatitem" border="0">
<tr class='athing' id='25581282'>
<td class="title">
<a class="titlelink">The Scientific Case for Two Spaces After a Period (2018)</a>
</td>
</tr>
</table>
</center>
</body>
</html>
Suppose I want to append the string "™" to every word that is exactly six letters long.
Expected result:
<html lang="en" op="item">
<head>
<meta name="referrer" content="origin">
<title>The Scientific Case for Two Spaces™ After a Period™ (2018)</title>
</head>
<body>
<center>
<table class="fatitem" border="0">
<tr class='athing' id='25581282'>
<td class="title">
<a class="titlelink">The Scientific Case for Two Spaces™ After a Period™ (2018)</a>
</td>
</tr>
</table>
</center>
</body>
</html>
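I'm new to Python and am having trouble with this. Because the content is nested, I'm struggling to access the elements correctly and return the expected result.
Here is what I have tried so far: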
soup = BeautifulSoup(markup, 'html.parser')
new_html = []
for tags in soup.contents:
    for tag in tags:
        if type(tag) != str:
            split_tag = re.split(r"(\W+)", str(tag.string))
            for word in split_tag:
                if len(word) == 6 and word.isalpha():
                    word += "™"
            tag.string = "".join(split_tag)
        else:
            str_obj.append(tag)
        new_html.append(str(tag))
You can use .find_all(text=True) in combination with .replace_with():
import re
from bs4 import BeautifulSoup
html_doc = """
<html lang="en" op="item">
<head>
<meta name="referrer" content="origin">
<title>The Scientific Case for Two Spaces After a Period (2018)</title>
</head>
<body>
<center>
<table class="fatitem" border="0">
<tr class='athing' id='25581282'>
<td class="title">
<a class="titlelink">The Scientific Case for Two Spaces After a Period (2018)</a>
</td>
</tr>
</table>
</center>
</body>
</html>
"""
soup = BeautifulSoup(html_doc, "html.parser")
for s in soup.find_all(text=True):
    # \1 re-inserts the matched word; {6,} matches runs of six or more letters
    new_s = re.sub(r"([a-zA-Z]{6,})", r"\1™", s)
    s.replace_with(new_s)
print(soup.prettify())
# to have HTML entities:
# print(soup.prettify(formatter="html"))
Prints:
<html lang="en" op="item">
<head>
<meta content="origin" name="referrer"/>
<title>
The Scientific™ Case for Two Spaces™ After a Period™ (2018)
</title>
</head>
<body>
<center>
<table border="0" class="fatitem">
<tr class="athing" id="25581282">
<td class="title">
<a class="titlelink">
The Scientific™ Case for Two Spaces™ After a Period™ (2018)
</a>
</td>
</tr>
</table>
</center>
</body>
</html>
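Note that the output above also marks "Scientific", because the pattern matches six or more letters, while the expected result in the question marks only words of exactly six letters. A minimal sketch of a stricter variant follows, assuming that "word" means a run of ASCII letters bounded by \b. The helper name mark_six_letter_words and the decision to skip script and style text are my own assumptions and are not part of the original answer. It uses string=True, the newer spelling of text=True in Beautiful Soup.

```python
import re
from bs4 import BeautifulSoup

# Exactly six ASCII letters, with a word boundary on both sides.
SIX_LETTER_WORD = re.compile(r"\b([a-zA-Z]{6})\b")


def mark_six_letter_words(html, suffix="™"):
    """Append `suffix` to every six-letter word found in text nodes.

    Hypothetical helper for illustration; tags and attributes are untouched.
    """
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a list, so replacing nodes while looping is safe.
    for node in soup.find_all(string=True):
        # Assumption: text inside <script>/<style> should stay as-is.
        if node.parent.name in ("script", "style"):
            continue
        new_text = SIX_LETTER_WORD.sub(lambda m: m.group(1) + suffix, str(node))
        if new_text != str(node):
            node.replace_with(new_text)
    return str(soup)


sample = '<p><a class="titlelink">The Scientific Case for Two Spaces After a Period (2018)</a></p>'
print(mark_six_letter_words(sample))
# <p><a class="titlelink">The Scientific Case for Two Spaces™ After a Period™ (2018)</a></p>
```

Because the substitution runs only on text nodes, attribute values such as class="titlelink" are never touched, even if they happen to contain six-letter words.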