Beautiful Soup: How to extract data from HTML tags when the markup is inconsistent

Question

I want to extract data from tags that appear in two forms:

<td><div><font> Something else</font></div></td>

and

<td><div><font> Something <br/>else</font></div></td>

I am using the .string attribute. In the first case it gives me the string I need (Something else), but in the second case it gives me None.

Is there a better or alternative way to do this?

Answer 1

You can always use a regular expression for this kind of thing!

import re
result = re.search('font>(.*?)</font',  str(scrapped_html))
print(result[1])

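This will work for your case. To avoid capturing the tag, you need to manipulate the string.

Check with print("<br/>" in result[1]): if the string contains a <br/> tag, this prints True, and in that case you need to remove the tag.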

result = str(result[1]).split("<br/>") gives you a list, [' Something ', 'else']. Join its items to get your answer: result = (" ").join(result)

Here is the complete snippet:

import re

result = re.search('font>(.*?)</font',  str(scrapped_html))

if "<br/>" in result[1]:
    result = str(result[1]).split("<br/>")
    result = (" ").join(result)
    print(result)
else:
    print(result[1])

I know this is a pretty crude solution, but it will work for you!
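The regex idea above can be made a little more tolerant. The following is only a sketch, not part of the original answer: the helper name font_text and the two compiled patterns are hypothetical, and it assumes the text of interest always sits inside the first <font> element. It accepts <br>, <br/> and <br />, and collapses the whitespace that is left behind.

```python
import re

# Matches <br>, <br/> and <br /> in any letter case.
BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)
# Captures everything between <font ...> and </font>; DOTALL lets it span newlines.
FONT_BODY = re.compile(r"<font\b[^>]*>(.*?)</font>", re.IGNORECASE | re.DOTALL)


def font_text(html):
    """Return the text inside the first <font> element, or None if there is none."""
    match = FONT_BODY.search(html)
    if match is None:
        return None
    # Replace each <br> tag with a space, then collapse runs of whitespace.
    without_breaks = BR_TAG.sub(" ", match.group(1))
    return " ".join(without_breaks.split())


print(font_text("<td><div><font> Something else</font></div></td>"))
print(font_text("<td><div><font> Something <br/>else</font></div></td>"))
```

Both calls print Something else. Any other tag nested inside <font> would still end up in the result, which is one reason an HTML parser is usually the safer choice.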

Answer 2

Try using the .text attribute instead of .string:

from bs4 import BeautifulSoup

html1 = '<td><div><font> Something else</font></div></td>'
html2 = '<td><div><font> Something <br/>else</font></div></td>'

if __name__ == '__main__':
    soup1 = BeautifulSoup(html1, 'html.parser')
    div1 = soup1.select_one('div')
    print(div1.text.strip())

    soup2 = BeautifulSoup(html2, 'html.parser')
    div2 = soup2.select_one('div')
    print(div2.text.strip())

Output:

Something else
Something else
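As a possible follow-up, here is a short sketch of why .string returns None in the second case and how get_text() can control the spacing around the <br/>. The helper name extract_font_text is hypothetical, and targeting the <font> tag directly is an assumption based on the markup in the question.

```python
from bs4 import BeautifulSoup

html1 = '<td><div><font> Something else</font></div></td>'
html2 = '<td><div><font> Something <br/>else</font></div></td>'


def extract_font_text(html):
    """Return the stripped text of the first <font> tag, or None if it is missing."""
    font = BeautifulSoup(html, 'html.parser').find('font')
    if font is None:
        return None
    # Strip each text fragment and join the fragments with a single space.
    return font.get_text(' ', strip=True)


font2 = BeautifulSoup(html2, 'html.parser').find('font')
# .string is only set when a tag has exactly one child; here <font> has
# three children (text, <br/>, text), so it is None.
print(font2.string)

print(extract_font_text(html1))
print(extract_font_text(html2))
```

Passing a separator to get_text() keeps words apart even when the source has no whitespace next to the <br/>, for example Something<br/>else.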

python

beautifulsoup

html-parsing

python-3.x