Parsing invalid HTML and retrieving each tag's text to replace it

I need to iterate over invalid HTML and get the text value from every tag so that I can change it.

from bs4 import BeautifulSoup

html_doc = """
<div class="oxy-toggle toggle-7042 toggle-7042-expanded" data-oxy-toggle-active-class="toggle-7042-expanded" data-oxy-toggle-initial-state="closed" id="_toggle-212-142">
   <div class="oxy-expand-collapse-icon" href="#"></div>
   <div class="oxy-toggle-content">
    <h3 class="ct-headline" id="headline-213-142"><span class="ct-span" id="span-225-142">Sklizeň jahod 2019</span></h3>   </div>
  </div><div class="ct-text-block" id="text_block-214-142"><span class="ct-span" id="span-230-142"><p>Začátek sklizně: <strong>Zahájeno</strong><br>
Otevřeno: <strong>6 h – do otrhání</strong>, denně</p>
</span></div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

for tag in soup.find_all():
    print(tag.name)
    if tag.string:
        tag.string.replace_with("1")

print(soup)

The result is:

<div class="oxy-toggle toggle-7042 toggle-7042-expanded" data-oxy-toggle-active-class="toggle-7042-expanded" data-oxy-toggle-initial-state="closed" id="_toggle-212-142">
<div class="oxy-expand-collapse-icon" href="#"></div>
<div class="oxy-toggle-content">
<h3 class="ct-headline" id="headline-213-142"><span class="ct-span" id="span-225-142">1</span></h3> </div>
</div><div class="ct-text-block" id="text_block-214-142"><span class="ct-span" id="span-230-142"><p>Začátek sklizně: <strong>1</strong><br/>
Otevřeno: <strong>1</strong>, denně</p>
</span></div>

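I know how to replace the text, but bs4 does not find the text that sits directly inside the paragraph tag. The strings "Začátek sklizně:", "Otevřeno:" and ", denně" are never found, so I cannot replace them.

I tried different parsers, such as lxml and html5lib, but it made no difference. I also tried Python's built-in HTML library, but it can only iterate over the HTML, not modify it.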

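On a Tag object, .string returns a NavigableString only in a narrow case: if the tag has exactly one child and that child is a string, that string is returned (if the only child is another tag, its .string is used instead). If the tag has no children or more than one child, .string is None. Your <p> tag contains several children (text, <strong>, <br>), so tag.string is None and its text is skipped.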
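A minimal sketch of that behaviour, using a small made-up fragment rather than the markup from the question (the fragment and the variable names are illustrative assumptions):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: <strong> has one string child, <p> has three children.
soup = BeautifulSoup("<p>Start: <strong>Open</strong>, daily</p>", "html.parser")

strong_string = soup.strong.string   # exactly one string child -> that string
p_string = soup.p.string             # more than one child -> None
p_strings = list(soup.p.strings)     # .strings walks every text node below <p>

print(repr(strong_string))
print(repr(p_string))
print(p_strings)
```

This is why the loop in the question only ever reaches the text inside <span> and <strong>: those are the only tags whose .string is not None.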

The scenario is not entirely clear to me, but here is one last approach based on your comment:

I need generic code to iterate any html and find all texts so I can work with them.

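# string=True matches every text node; older bs4 releases spell this argument text=True
for text_node in soup.find_all(string=True):
    text_node.replace_with('1')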

Example

from bs4 import BeautifulSoup

html_doc = """<div class="oxy-toggle toggle-7042 toggle-7042-expanded" data-oxy-toggle-active-class="toggle-7042-expanded" data-oxy-toggle-initial-state="closed" id="_toggle-212-142">
   <div class="oxy-expand-collapse-icon" href="#"></div>
   <div class="oxy-toggle-content">
    <h3 class="ct-headline" id="headline-213-142"><span class="ct-span" id="span-225-142">Sklizeň jahod 2019</span></h3>   </div>
  </div><div class="ct-text-block" id="text_block-214-142"><span class="ct-span" id="span-230-142"><p>Začátek sklizně: <strong>Zahájeno</strong><br>
Otevřeno: <strong>6 h – do otrhání</strong>, denně</p>
</span></div>"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Each match is a NavigableString (a text node), not a tag
for text_node in soup.find_all(string=True):
    text_node.replace_with('1')

print(soup)

Output

<div class="oxy-toggle toggle-7042 toggle-7042-expanded" data-oxy-toggle-active-class="toggle-7042-expanded" data-oxy-toggle-initial-state="closed" id="_toggle-212-142">1<div class="oxy-expand-collapse-icon" href="#"></div>1<div class="oxy-toggle-content">1<h3 class="ct-headline" id="headline-213-142"><span class="ct-span" id="span-225-142">1</span></h3>1</div>1</div><div class="ct-text-block" id="text_block-214-142"><span class="ct-span" id="span-230-142"><p>1<strong>1</strong><br/>1<strong>1</strong>1</p>1</span></div>