Reformat line/div extracted from HTML

Question

I am currently unable to reformat a div that I extracted from a website.

This is what I currently have:

<div class=" frame frame-default frame-type-textmedia frame-layout-0" id="c47903"><a id="c47904"/><div class="ce-textpic ce-left ce-above"><div class="ce-bodytext"><p>The latest data of the evolution of COVID-19 over the past 24hours <strong>in Québec</strong> reveal:</p><ul><li>87new cases, bringing the total number of infected persons to61,004;</li><li>no deaths have occurred in the past 24hours, to which are added 3deaths which occurred between August7 and12, for a total of5,718;</li><li>the number of hospitalizations increased by2 compared to the previous day, for a cumulative total of151. Of these, 25were in intensive care, an increase of2;</li><li>18,596tests were performed on August12, for a cumulative total of1,428,286.</li></ul></div></div></div>

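But I want something like this:

The latest data of the evolution of COVID-19 over the past 24 hours in Québec reveal: 87 new cases, bringing the total number of infected persons to 61,004; no deaths have occurred in the past 24 hours, to which are added 3 deaths which occurred between August 7 and 12, for a total of 5,718; the number of hospitalizations increased by 2 compared to the previous day, for a cumulative total of 151. Of these, 25 were in intensive care, an increase of 2; 18,596 tests were performed on August 12, for a cumulative total of 1,428,286.

I stripped the tags out by hand, but is there a less time-consuming way to do it?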

Answer 1

Try something like this:

soup.select_one('div[class="ce-bodytext"]').text.strip()

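This should get you close to the expected output, although the text of adjacent elements (for example the list items) is joined without any separator.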
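If the fragments need to be separated by spaces, `get_text()` accepts a separator. The following is only a minimal sketch: the `html` string is a shortened, hypothetical stand-in for the markup in the question, and it assumes the `beautifulsoup4` package is installed.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the markup in the question (assumption: the real
# page uses the same "ce-bodytext" container).
html = (
    '<div class="frame"><div class="ce-bodytext">'
    '<p>The latest data <strong>in Québec</strong> reveal:</p>'
    '<ul><li>87 new cases;</li><li>no deaths;</li></ul>'
    '</div></div>'
)

soup = BeautifulSoup(html, "html.parser")
body = soup.select_one("div.ce-bodytext")

# separator=" " joins the text fragments with a space;
# strip=True trims each fragment and drops empty ones.
flat = body.get_text(separator=" ", strip=True)
print(flat)
```

Note that the separator is inserted between every text fragment, so inline tags such as `<strong>` also produce a space on either side.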

Answer 2

Try this:

text = r'<div class=" frame frame-default frame-type-textmedia frame-layout-0" id="c47903"><a id="c47904"/><div class="ce-textpic ce-left ce-above"><div class="ce-bodytext"><p>The latest data of the evolution of COVID-19 over the past 24hours <strong>in Québec</strong> reveal:</p><ul><li>87new cases, bringing the total number of infected persons to61,004;</li><li>no deaths have occurred in the past 24hours, to which are added 3deaths which occurred between August7 and12, for a total of5,718;</li><li>the number of hospitalizations increased by2 compared to the previous day, for a cumulative total of151. Of these, 25were in intensive care, an increase of2;</li><li>18,596tests were performed on August12, for a cumulative total of1,428,286.</li></ul></div></div></div>'
import re
print(re.sub(r'<[^<>]*>', ' ', text))
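Replacing each tag with a space leaves runs of spaces behind, and any HTML entities stay encoded. A possible refinement using only the standard library is sketched below; `strip_tags` and `sample` are hypothetical names introduced for illustration. Like the regex above, it is not a real HTML parser and can misbehave on comments, scripts, or attribute values that contain `>`.

```python
import html
import re


def strip_tags(markup: str) -> str:
    """Remove tags with a regex, decode entities, and collapse whitespace."""
    no_tags = re.sub(r"<[^<>]*>", " ", markup)  # same pattern as above
    decoded = html.unescape(no_tags)            # e.g. "&nbsp;" becomes "\xa0"
    # split() with no arguments splits on any Unicode whitespace,
    # including non-breaking spaces, so the join leaves single spaces.
    return " ".join(decoded.split())


sample = "<ul><li>87&nbsp;new cases;</li><li>no deaths;</li></ul>"
result = strip_tags(sample)
print(result)
```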

Answer 3

Try:

str(bs4_obj.select('div')[0].text)

I don't know how to convert it from unicode, but it does get rid of the HTML tags.
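On the unicode point: in Python 3 the value returned by `.text` is already an ordinary `str`, so no conversion is needed. If the leftover characters are non-breaking spaces (`\xa0`), one option is Unicode normalization. In this sketch, `raw` is a hypothetical example of what the extracted text might look like, not actual output from the page.

```python
import unicodedata

# Hypothetical extracted text containing non-breaking spaces (U+00A0).
raw = "87\xa0new cases, total\xa061,004"

# NFKC normalization maps the non-breaking space to a regular space.
# It also rewrites other compatibility characters (ligatures, superscripts),
# which may or may not be wanted.
clean = unicodedata.normalize("NFKC", raw)
print(clean)
```

Accented letters such as the "é" in "Québec" are kept by this normalization.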
