Parsing and modifying content with Beautiful Soup (bs4)

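The goal is to modify only the text content of existing HTML, leaving the tags and attributes untouched.

For example, given the following markup: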

<html lang="en" op="item">
  <head>
    <meta name="referrer" content="origin">  
    <title>The Scientific Case for Two Spaces After a Period (2018)</title>
  </head>
  <body>
    <center>
        <table class="fatitem" border="0">
          <tr class='athing' id='25581282'>
            <td class="title">
              <a class="titlelink">The Scientific Case for Two Spaces After a Period (2018)</a>
            </td>
          </tr>
        </table>
    </center>  
  </body> 
</html>

Suppose I want to append the string "&#x2122;" to every word that is exactly six letters long.

Expected result:

<html lang="en" op="item">
  <head>
    <meta name="referrer" content="origin">  
    <title>The Scientific Case for Two Spaces&#x2122; After a Period&#x2122; (2018)</title>
  </head>
  <body>
    <center>
        <table class="fatitem" border="0">
          <tr class='athing' id='25581282'>
            <td class="title">
              <a class="titlelink">The Scientific Case for Two Spaces&#x2122; After a Period&#x2122; (2018)</a>
            </td>
          </tr>
        </table>
    </center>  
  </body> 
</html>

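I'm new to Python and am having trouble with this. Because the content is nested, I'm struggling to access the elements correctly and produce the expected result.

Here is what I have tried so far: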

    soup = BeautifulSoup(markup, 'html.parser')
    new_html = []
    
    for tags in soup.contents:
        for tag in tags:
            if type(tag) != str:
                split_tag = re.split(r"(\W+)", str(tag.string))
                for word in split_tag:
                    if len(word) == 6 and  word.isalpha():
                        word += "&#x2122;"
                tag.string = "".join(split_tag)
            else:
                str_obj.append(tag)
            new_html.append(str(tag))
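
One likely reason the attempt above leaves the text unchanged: `word += "&#x2122;"` only rebinds the loop variable, so `split_tag` still holds the original words when it is joined. (Separately, `str_obj` is never defined, and `soup.contents` only covers the top-level nodes.) A minimal sketch of the rebinding issue on a plain string, with the variable names `parts`, `unchanged`, and `marked` chosen purely for illustration:

```python
import re

text = "The Scientific Case for Two Spaces After a Period (2018)"
# The capturing group keeps the separators, so join() can rebuild the string.
parts = re.split(r"(\W+)", text)

# Rebinding the loop variable does not modify the list.
for word in parts:
    if len(word) == 6 and word.isalpha():
        word += "\u2122"
unchanged = "".join(parts)

# Building a new sequence keeps the modified words.
marked = "".join(
    w + "\u2122" if len(w) == 6 and w.isalpha() else w
    for w in parts
)

print(unchanged == text)  # True
print(marked)  # The Scientific Case for Two Spaces™ After a Period™ (2018)
```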

You can combine .find_all(text=True) with .replace_with():

import re
from bs4 import BeautifulSoup

html_doc = """
<html lang="en" op="item">
  <head>
    <meta name="referrer" content="origin">  
    <title>The Scientific Case for Two Spaces After a Period (2018)</title>
  </head>
  <body>
    <center>
        <table class="fatitem" border="0">
          <tr class='athing' id='25581282'>
            <td class="title">
              <a class="titlelink">The Scientific Case for Two Spaces After a Period (2018)</a>
            </td>
          </tr>
        </table>
    </center>  
  </body> 
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")


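for s in soup.find_all(text=True):
    # \1 re-inserts the matched word in front of the ™ sign.
    # Note: {6,} matches six or more letters; use r"\b([a-zA-Z]{6})\b" for exactly six.
    new_s = re.sub(r"([a-zA-Z]{6,})", r"\1™", s)
    s.replace_with(new_s)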

print(soup.prettify())

# to have HTML entities:
# print(soup.prettify(formatter="html"))

Prints:

<html lang="en" op="item">
 <head>
  <meta content="origin" name="referrer"/>
  <title>
   The Scientific™ Case for Two Spaces™ After a Period™ (2018)
  </title>
 </head>
 <body>
  <center>
   <table border="0" class="fatitem">
    <tr class="athing" id="25581282">
     <td class="title">
      <a class="titlelink">
       The Scientific™ Case for Two Spaces™ After a Period™ (2018)
      </a>
     </td>
    </tr>
   </table>
  </center>
 </body>
</html>
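
The question asks for words of exactly six letters, so a variant with word boundaries may be closer to the expected result. The following is a sketch under a few assumptions: the helper name `mark_six_letter_words` and the shortened markup are illustrative only, and `string=True` is used as the newer spelling of `text=True` (available since Beautiful Soup 4.4.0).

```python
import re
from bs4 import BeautifulSoup

# Shortened version of the markup from the question (illustrative).
html_doc = """
<html lang="en" op="item">
  <head>
    <title>The Scientific Case for Two Spaces After a Period (2018)</title>
  </head>
  <body>
    <a class="titlelink">The Scientific Case for Two Spaces After a Period (2018)</a>
  </body>
</html>
"""

# \b anchors the match at word boundaries, so only runs of
# exactly six ASCII letters qualify ("Spaces", but not "Scientific").
SIX_LETTER_WORD = re.compile(r"\b([a-zA-Z]{6})\b")


def mark_six_letter_words(markup):
    """Hypothetical helper: append ™ to six-letter words in text nodes only."""
    soup = BeautifulSoup(markup, "html.parser")
    for node in soup.find_all(string=True):
        new_text = SIX_LETTER_WORD.sub("\\1\u2122", node)
        if new_text != node:
            # Attributes such as content="origin" are never visited,
            # because only text nodes are iterated.
            node.replace_with(new_text)
    return soup


soup = mark_six_letter_words(html_doc)
print(soup.title.string)
# The Scientific Case for Two Spaces™ After a Period™ (2018)

# Serialize with entity substitution instead of the raw ™ character.
print(soup.prettify(formatter="html"))
```

With `formatter="html"`, Beautiful Soup substitutes named entities where it can, so the trademark sign is expected to come out as a named entity rather than the numeric `&#x2122;` form; both render identically in a browser.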