How can I extract the href and title from this HTML?

Question

I have a bs4.element.ResultSet in this format:

    [<h3 class="foo1">
    <a href="someLink" title="someTitle">SomeTitle</a>
    </h3>,
    <h3 class="foo1">
    <a href="OtherLink" title="OtherTitle">OtherTitle</a>
    </h3>]

I want to extract the values and store them as a list of tuples, [(title, href), (title2, href2)], but I can't seem to get it working.

The closest I have come is

    link = soup.find('h3',class_='foo1').find('a').get('title')
    print(link)

But this only returns one of the elements. How can I successfully extract every href and title?

Answer 1

Select your elements more specifically, e.g. with CSS selectors, and iterate over the ResultSet to get the attributes as a list of tuples:

    [(a.get('title'), a.get('href')) for a in soup.select('h3 a[href][title]')]

Example

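    from bs4 import BeautifulSoup

    html = '''
    <h3 class="foo1">
        <a href="someLink" title="someTitle">SomeTitle</a>
    </h3>
    <h3 class="foo1">
        <a href="OtherLink" title="OtherTitle">OtherTitle</a>
    </h3>
    '''
    # Name the parser explicitly to avoid the "no parser specified" warning
    soup = BeautifulSoup(html, 'html.parser')

    # Every <a> inside an <h3> that carries both attributes
    [(a.get('title'), a.get('href')) for a in soup.select('h3 a[href][title]')]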

Output

    [('someTitle', 'someLink'), ('OtherTitle', 'OtherLink')]
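If you would rather stay close to the attempt in the question, a rough equivalent is to swap `find` for `find_all` on the `h3` elements and loop over the results. The sketch below is only an illustration under that assumption; `extract_links` is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

html = '''
<h3 class="foo1">
    <a href="someLink" title="someTitle">SomeTitle</a>
</h3>
<h3 class="foo1">
    <a href="OtherLink" title="OtherTitle">OtherTitle</a>
</h3>
'''
soup = BeautifulSoup(html, 'html.parser')


def extract_links(soup):
    """Return (title, href) tuples for the first <a> in each h3.foo1.

    Hypothetical helper for illustration only.
    """
    pairs = []
    # find_all returns every match, whereas find stops after one
    for h3 in soup.find_all('h3', class_='foo1'):
        a = h3.find('a')
        if a is None:
            # Skip headings that contain no link at all
            continue
        # .get returns None instead of raising if an attribute is missing
        pairs.append((a.get('title'), a.get('href')))
    return pairs


pairs = extract_links(soup)
print(pairs)
```

The CSS selector version is shorter, but the explicit loop leaves room for per-element handling, such as skipping headings without a link.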

Answer 2

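Code:

    list(map(lambda link: (link.get("href"), link.get("title")), soup.select('h3.foo1>a[href][title]')))

Explanation:

    soup.select('h3.foo1>a[href][title]')

Selects all a elements that have both an href and a title attribute and that are direct children of an h3 element with the class foo1.

    map(lambda link: ..., ...)

Applies the lambda to each a element in turn. A ResultSet is a list subclass and has no .map method, so the built-in map() is used here and wrapped in list() to materialize the result.

    (link.get("href"), link.get("title"))

Creates a tuple containing the link's href and title. Swap the two calls if you need (title, href) order, as in the question.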
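To make the selector's behaviour concrete, here is a small sketch that compares the child combinator (`>`) with the plain descendant combinator (a space). The sample HTML and the `to_pair` helper are made up for illustration and are not from the question. Since a ResultSet is a list subclass, the built-in `map()` is applied to it here.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one direct-child link, one nested link, one under another class
html = '''
<h3 class="foo1"><a href="direct" title="Direct">Direct</a></h3>
<h3 class="foo1"><span><a href="nested" title="Nested">Nested</a></span></h3>
<h3 class="bar"><a href="other" title="Other">Other</a></h3>
'''
soup = BeautifulSoup(html, 'html.parser')


def to_pair(link):
    """Build an (href, title) tuple from an <a> tag."""
    return (link.get('href'), link.get('title'))


# '>' matches only <a> tags that are direct children of h3.foo1
child_only = list(map(to_pair, soup.select('h3.foo1 > a[href][title]')))

# A space matches <a> tags at any depth inside h3.foo1
any_depth = list(map(to_pair, soup.select('h3.foo1 a[href][title]')))

print(child_only)  # [('direct', 'Direct')]
print(any_depth)   # [('direct', 'Direct'), ('nested', 'Nested')]
```

Which combinator fits depends on the real page: `>` is stricter and avoids picking up links buried in nested markup, while the descendant form is more forgiving if the site wraps the link in extra tags.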
