Remove specific "span" tags while preserving the HTML object

Question

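I am scraping a website with BeautifulSoup and Python, and the page has more than 100 span tags. I want to remove every pair of consecutive span tags in which the first span contains the text "READ MORE:" and the second span contains some arbitrary string.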

<span>Two cars collided at low speed in Lurnea on February 25, 2019.</span>,
 <span>The accident killed an 11-month-old boy who was in the BMW sedan being driven by Peter Watfa.</span>,
 <span>READ MORE: </span>,
 <span>Long queues form at airports as one million Aussies set to fly this Easter</span>,
 <span>Watfa has repeatedly refused to admit the 11-month-old was sitting on his lap and is adamant the baby was restrained in the backseat when the crash occurred.</span>,
 <span>The baby boy suffered fatal injuries when the driver's airbag deployed.</span>,
 <span>A judge today slammed Watfa's actions, with the court hearing the vulnerable child was "entirely dependent upon Watfa, who owed him a duty of care".</span>,
 <span>READ MORE: </span>,
 <span>Four female backpackers killed in horror highway crash</span>,
 <span>The court also heard he had earned the title of a serial traffic offender.</span>,
 <span>In the months after the crash, Watfa was involved in a police pursuit and caught driving under the influence of drugs.</span>,
 <span>Watfa will serve at least two years and three months for manslaughter.</span>,
 <span>He will be eligible for parole in early 2024.</span>

For example, I want to remove the following 4 tags:

<span>READ MORE: </span>,
<span>Long queues form at airports as one million Aussies set to fly this Easter</span>
<span>READ MORE: </span>,
 <span>Four female backpackers killed in horror highway crash</span>

The output should be:

<span>Two cars collided at low speed in Lurnea on February 25, 2019.</span>,
 <span>The accident killed an 11-month-old boy who was in the BMW sedan being driven by Peter Watfa.</span>,
 <span>Watfa has repeatedly refused to admit the 11-month-old was sitting on his lap and is adamant the baby was restrained in the backseat when the crash occurred.</span>,
 <span>The baby boy suffered fatal injuries when the driver's airbag deployed.</span>,
 <span>A judge today slammed Watfa's actions, with the court hearing the vulnerable child was "entirely dependent upon Watfa, who owed him a duty of care".</span>,
 <span>The court also heard he had earned the title of a serial traffic offender.</span>,
 <span>In the months after the crash, Watfa was involved in a police pursuit and caught driving under the influence of drugs.</span>,
 <span>Watfa will serve at least two years and three months for manslaughter.</span>,
 <span>He will be eligible for parole in early 2024.</span>

I would appreciate it if anyone could help me understand the logic for this in Python. Cheers.

Answer 1

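Assuming you are scraping the text of each article on a news site, you should change your strategy.

Clean up the tree by calling .decompose() on the elements you do not want to scrape: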

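# Remove each "READ MORE" span together with the span that follows it
for e in soup.select('span:-soup-contains("READ MORE")'):
    nxt = e.find_next('span')
    if nxt is not None:  # a marker at the very end has no following span
        nxt.decompose()
    e.decompose()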
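As a self-contained illustration of the loop above, here is a minimal sketch that runs against a small, made-up HTML fragment (the wrapping `div` and its `article` class are assumptions for the demo, not the real page structure). It assumes a version of soupsieve recent enough to support the `:-soup-contains` pseudo-class.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the spans from the question
html = """
<div class="article">
  <span>Two cars collided at low speed in Lurnea on February 25, 2019.</span>
  <span>READ MORE: </span>
  <span>Long queues form at airports as one million Aussies set to fly this Easter</span>
  <span>The baby boy suffered fatal injuries when the driver's airbag deployed.</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# select() returns a list built up front, so removing nodes while looping is safe
for marker in soup.select('span:-soup-contains("READ MORE")'):
    teaser = marker.find_next("span")  # the span right after the marker
    if teaser is not None:
        teaser.decompose()             # drop the teaser headline
    marker.decompose()                 # drop the "READ MORE: " span itself

remaining = [s.get_text(strip=True) for s in soup.find_all("span")]
print(remaining)
```

Because `decompose()` modifies the tree in place, `soup` is still a normal BeautifulSoup object afterwards and can be queried further.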

Then select the article body and extract the text:

soup.select_one('.article__body-croppable').get_text(' ', strip=True)

This results in:

A driver has been jailed over the death of a baby boy who was sitting on his lap during a crash in Sydney's south-west . Two cars collided at low speed in Lurnea on February 25, 2019. The accident killed an 11-month-old boy who was in the BMW sedan being driven by Peter Watfa. Peter Watfa has been jailed for at least two years and three months. (9News) Watfa has repeatedly refused to admit the 11-month-old was sitting on his lap and is adamant the baby was restrained in the backseat when the crash occurred. The baby boy suffered fatal injuries when the driver's airbag deployed. A judge today slammed Watfa's actions, with the court hearing the vulnerable child was "entirely dependent upon Watfa, who owed him a duty of care". An 11-month-old boy died in the crash. (9News) The court also heard he had earned the title of a serial traffic offender. In the months after the crash, Watfa was involved in a police pursuit and caught driving under the influence of drugs. Watfa will serve at least two years and three months for manslaughter. He will be eligible for parole in early 2024.
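To see what the two arguments to `get_text` do, here is a small sketch on a made-up fragment. Only the `.article__body-croppable` class comes from the answer; the markup around it is an assumption. The first argument is the separator placed between text pieces, and `strip=True` trims each piece and skips the ones that are empty.

```python
from bs4 import BeautifulSoup

# Hypothetical article body with stray whitespace between and inside the spans
html = (
    '<div class="article__body-croppable">'
    "<span>First sentence.</span>\n"
    "<span> Second sentence. </span>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

body = soup.select_one(".article__body-croppable")
# select_one() returns None when nothing matches, so guard before calling get_text()
text = body.get_text(" ", strip=True) if body is not None else ""
print(text)
```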

You could also iterate over the ResultSet and build a new list containing all the valid <span> elements, but I don't think this is the best option:

# The i == 0 check stops results[i-1] from wrapping around to the last element
[x for i, x in enumerate(results) if 'READ MORE' not in x.text and (i == 0 or 'READ MORE' not in results[i-1].text)]
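The same filtering idea can be written as an explicit loop, which some readers find easier to follow than index arithmetic. This is only a sketch: `drop_read_more_pairs` and its `text_of` parameter are hypothetical names introduced here, and the demo uses plain strings so it runs without a live page.

```python
def drop_read_more_pairs(items, text_of=lambda item: item.text, marker="READ MORE"):
    """Return items without each marker element and the element right after it."""
    kept = []
    previous_was_marker = False
    for item in items:
        is_marker = marker in text_of(item)
        # Keep an item only if it is not a marker and does not follow one
        if not is_marker and not previous_was_marker:
            kept.append(item)
        previous_was_marker = is_marker
    return kept


# Demo on plain strings; text_of=str means each item is its own text
sample = ["Intro", "READ MORE: ", "Teaser", "Body", "READ MORE: ", "Teaser 2", "End"]
cleaned = drop_read_more_pairs(sample, text_of=str)
print(cleaned)
```

With BeautifulSoup, the default `text_of` reads each tag's `.text`, so a call such as `drop_read_more_pairs(soup.find_all('span'))` should work the same way. Unlike `decompose()`, this leaves the original tree untouched and only returns a filtered list.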


html

beautifulsoup

web-scraping

python-3.x