
Filter out href in a list instead of soup.find_all

Hello, I am using the following script to filter out some announcements on a website:

    gdata_even = soup.find_all("li", {"class": "list2Col even "})
    gdata_odd = soup.find_all("li", {"class": "list2Col odd "})

Finally, depending on whether an item contains certain words, I keep only part of the announcements in gdata:

    for l in range(len_data):
        if _checkDate(gdata_even[l].text):
            if _checkwordsV2(gdata_even[l].text):
                pass
            else:
                initial_list.append(gdata_even[l].text.encode("utf-8"))

        if _checkDate(gdata_odd[l].text):
            if _checkwordsV2(gdata_odd[l].text):
                pass
            else:
                initial_list.append(gdata_odd[l].text.encode("utf-8"))

The problem I am facing now is that gdata_even[l] and gdata_odd[l] contain output like this:

<li class="list2Col even "><div class="indexCol"><span class="date">25 Aug 2015 12:00:06 AM CEST</span></div><div class="contentCol"><div class="categories">Frankfurt</div><h3><a href="/xetra-en/newsroom/xetra-newsboard/FRA-Deletion-of-Instruments-from-XETRA---25.08.2015-001/1913134">FRA:Deletion of Instruments from XETRA - 25.08.2015-001</a></h3></div></li>

Here I want to get the link embedded in the item's href, but it does not work:

    h3Url = gdata[l].find("a").get("href")
    print h3Url

Can someone help? Thank you.

Perhaps the way you are getting gdata is wrong, because your code should work:

>>> from bs4 import BeautifulSoup
>>> doc='<li class="list2Col even "><div class="indexCol"><span class="date">25 Aug 2015 12:00:06 AM CEST</span></div><div class="contentCol"><div class="categories">Frankfurt</div><h3><a href="/xetra-en/newsroom/xetra-newsboard/FRA-Deletion-of-Instruments-from-XETRA---25.08.2015-001/1913134">FRA:Deletion of Instruments from XETRA - 25.08.2015-001</a></h3></div></li>'
>>> soup = BeautifulSoup(doc)
>>> h3Url = soup.find('a').get('href')
>>> print h3Url
/xetra-en/newsroom/xetra-newsboard/FRA-Deletion-of-Instruments-from-XETRA---25.08.2015-001/1913134
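
The same find("a").get("href") call works for every item in your lists. Here is a minimal sketch of your loop with the link extraction added, assuming the gdata_even/gdata_odd lists and the _checkDate/_checkwordsV2 helpers from your question:

    # Sketch only: collect both the announcement text and its link.
    # gdata_even, gdata_odd, _checkDate and _checkwordsV2 are assumed
    # to be the ones defined in the question.
    initial_list = []
    links = []
    for item in gdata_even + gdata_odd:
        if _checkDate(item.text) and not _checkwordsV2(item.text):
            initial_list.append(item.text.encode("utf-8"))
            a = item.find("a")  # the <a> nested inside the <h3>
            if a is not None:   # guard against items without a link
                links.append(a.get("href"))

The check against None avoids an AttributeError for any list item that happens not to contain a link.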