Extract Links from HTML In Line with Text with Python/BeautifulSoup

Question

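There are many answers about how to convert HTML to text using BeautifulSoup.

There are also many answers about how to extract links from HTML using BeautifulSoup.

What I need is a way to turn HTML into a plain-text version while keeping each link inline, next to the text it belongs to. For example, if I have some HTML that looks like this: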

<div>Click <a href="www.google.com">Here</a> to receive a quote</div>

Ideally, it would be converted to "Click Here (www.google.com) to receive a quote."

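The use case is that I need to convert the HTML of emails into a plain-text version, and it would be nice if each link stayed where it sits semantically in the HTML rather than being collected at the bottom. This exact syntax is not required. I'd appreciate any guidance on how to do this. Thanks!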

Answer 1

import html2text

data = """
<div>Click <a href="www.google.com">Here</a> to receive a quote</div>
"""


print(html2text.html2text(data))

Output:

Click [Here](www.google.com) to receive a quote
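
If the Markdown link syntax is not wanted, the output above could be post-processed into the "text (url)" form from the question. The following is only a rough sketch using the standard `re` module: the name `markdown_links_to_inline` and the pattern are my own assumptions, not part of html2text, and the pattern does not handle URLs that contain parentheses or whitespace.

```python
import re

# Matches Markdown inline links such as [Here](www.google.com),
# but skips images such as ![alt](src) via the negative lookbehind.
MD_LINK = re.compile(r"(?<!!)\[([^\]]+)\]\(([^)\s]+)\)")


def markdown_links_to_inline(text):
    """Rewrite '[label](url)' as 'label (url)' (hypothetical helper)."""
    return MD_LINK.sub(r"\1 (\2)", text)


print(markdown_links_to_inline("Click [Here](www.google.com) to receive a quote"))
```

For the sample string this should print the sentence with the link rewritten as `Here (www.google.com)`. Real email HTML may produce other Markdown constructs that this sketch leaves untouched.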

Answer 2

If you want a BeautifulSoup solution, you can start from this example (it will probably need more tuning for real-world data):

data = '<div>Click <a href="www.google.com">Here</a> to receive a quote.</div>'

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'html.parser')

# append the URL, in parentheses, after each link's text
for a in soup.select('a[href]'):
    a.append(' ({})'.format(a['href']))

# unwrap() all tags
for tag in soup.select('*'):
    tag.unwrap()

print(soup)

Prints:

Click Here (www.google.com) to receive a quote.
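
For comparison, the same idea can be sketched without third-party packages. This variant swaps BeautifulSoup for the standard library's `html.parser.HTMLParser`. The names `InlineLinkText` and `html_to_inline_text` are hypothetical, and the sketch makes no attempt to skip `<script>` or `<style>` content or to normalize whitespace.

```python
from html.parser import HTMLParser


class InlineLinkText(HTMLParser):
    """Collect text and append ' (href)' after each link's text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []   # text fragments in document order
        self.hrefs = []   # stack of hrefs for currently open <a> tags

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Remember the href (may be None) until the closing </a>.
            self.hrefs.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "a" and self.hrefs:
            href = self.hrefs.pop()
            if href:
                self.parts.append(" ({})".format(href))

    def handle_data(self, data):
        self.parts.append(data)


def html_to_inline_text(html):
    """Return the text of `html` with link targets kept inline."""
    parser = InlineLinkText()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


data = '<div>Click <a href="www.google.com">Here</a> to receive a quote.</div>'
print(html_to_inline_text(data))
```

With the sample `data` this should give the same line as the BeautifulSoup version. Links without an `href` are passed through as plain text.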

