Reformat a line/div extracted from HTML

I am currently unable to reformat a div that I extracted from a website.

This is what I have so far:

<div class=" frame frame-default frame-type-textmedia frame-layout-0" id="c47903"><a id="c47904"/><div class="ce-textpic ce-left ce-above"><div class="ce-bodytext"><p>The latest data of the evolution of COVID-19 over the past 24hours <strong>in Québec</strong> reveal:</p><ul><li>87new cases, bringing the total number of infected persons to61,004;</li><li>no deaths have occurred in the past 24hours, to which are added 3deaths which occurred between August7 and12, for a total of5,718;</li><li>the number of hospitalizations increased by2 compared to the previous day, for a cumulative total of151. Of these, 25were in intensive care, an increase of2;</li><li>18,596tests were performed on August12, for a cumulative total of1,428,286.</li></ul></div></div></div> 

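But I would like something like this:

The latest data of the evolution of COVID-19 over the past 24 hours in Québec reveal: 87 new cases, bringing the total number of infected persons to 61,004; no deaths have occurred in the past 24 hours, to which are added 3 deaths which occurred between August 7 and 12, for a total of 5,718; the number of hospitalizations increased by 2 compared to the previous day, for a cumulative total of 151. Of these, 25 were in intensive care, an increase of 2; 18,596 tests were performed on August 12, for a cumulative total of 1,428,286.

I removed the tags by hand, but is there a less time-consuming way?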

Try something like this:

soup.select_one('div[class="ce-bodytext"]').text.strip()

This should give you the expected output.
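For reference, here is a minimal, self-contained sketch of that approach. It assumes BeautifulSoup 4 is installed, that `soup` is built with the standard-library `html.parser` backend, and it uses a shortened copy of the div from the question (the name `html_snippet` is made up for this example). It swaps `.text.strip()` for `get_text(" ", strip=True)`, which puts a single space between the paragraph and each list item so they do not run together:

```python
from bs4 import BeautifulSoup

# Shortened copy of the div from the question (illustrative only)
html_snippet = (
    '<div class=" frame frame-default" id="c47903">'
    '<div class="ce-textpic ce-left ce-above"><div class="ce-bodytext">'
    '<p>The latest data of the evolution of COVID-19 over the past 24hours '
    '<strong>in Québec</strong> reveal:</p>'
    '<ul><li>87new cases, bringing the total number of infected persons to61,004;</li>'
    '<li>18,596tests were performed on August12.</li></ul>'
    '</div></div></div>'
)

soup = BeautifulSoup(html_snippet, "html.parser")

# Select the inner div by class, then join all text fragments with one space
body = soup.select_one("div.ce-bodytext")
plain_text = body.get_text(" ", strip=True)
print(plain_text)
```

The selector `div.ce-bodytext` matches any div that has that class, whereas `div[class="ce-bodytext"]` matches only when the class attribute is exactly that string; either should work for the markup shown in the question.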

Try this:

text = r'<div class=" frame frame-default frame-type-textmedia frame-layout-0" id="c47903"><a id="c47904"/><div class="ce-textpic ce-left ce-above"><div class="ce-bodytext"><p>The latest data of the evolution of COVID-19 over the past 24hours <strong>in Québec</strong> reveal:</p><ul><li>87new cases, bringing the total number of infected persons to61,004;</li><li>no deaths have occurred in the past 24hours, to which are added 3deaths which occurred between August7 and12, for a total of5,718;</li><li>the number of hospitalizations increased by2 compared to the previous day, for a cumulative total of151. Of these, 25were in intensive care, an increase of2;</li><li>18,596tests were performed on August12, for a cumulative total of1,428,286.</li></ul></div></div></div>'
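import re
# Replace every tag with a space, then collapse the leftover runs of whitespace
print(re.sub(r'\s+', ' ', re.sub(r'<[^<>]*>', ' ', text)).strip())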
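A regex like this works for simple markup, but it can misfire when a `<` or `>` appears inside an attribute value, a comment, or a script. As a hedged alternative that still needs no third-party package, the sketch below uses `html.parser.HTMLParser` from the standard library to collect only the text nodes. The names `TextExtractor`, `strip_tags`, and `sample` are made up for this example, and `sample` is a shortened stand-in for the div in the question:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the non-empty text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


def strip_tags(html):
    """Return the text content of `html`, fragments joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Shortened stand-in for the div in the question (illustrative only)
sample = (
    '<div class="ce-bodytext"><p>The latest data <strong>in Québec</strong> reveal:</p>'
    '<ul><li>87new cases;</li><li>18,596tests were performed.</li></ul></div>'
)
plain = strip_tags(sample)
print(plain)
```

Note that this sketch also keeps the text of `<script>` and `<style>` elements if any are present, so it is only a drop-in replacement for fragments like the one in the question.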

Try:

str(bs4_obj.select('div')[0].text)

I don't know how to convert it from unicode, but it gets rid of the HTML tags.
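On the unicode point: in Python 3 the `.text` of a BeautifulSoup element is already an ordinary `str`, so no conversion should be needed. If the concern is stray characters such as non-breaking spaces (`\xa0`) in the extracted text, which is only a guess about what the page contains, one hedged option is to normalize them with the standard-library `unicodedata` module. The string `raw` below is a made-up example, not data from the site:

```python
import unicodedata

# Made-up example: text containing a non-breaking space (U+00A0)
raw = "past 24\xa0hours in Québec"

# NFKC normalization maps the non-breaking space to a regular space
# while leaving accented letters such as "é" intact
cleaned = unicodedata.normalize("NFKC", raw)
print(cleaned)
```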