Filter out href in a list instead of soup.find_all
Hi, I use the following script to filter out some announcements from a website:
gdata_even=soup.find_all("li", {"class":"list2Col even "})
gdata_odd=soup.find_all("li", {"class":"list2Col odd "})
Finally, depending on whether an item contains a certain word, only some of the announcements from gdata are kept:
for l in range(len_data):
    if _checkDate(gdata_even[l].text):
        if _checkwordsV2(gdata_even[l].text):
            pass
        else:
            initial_list.append(gdata_even[l].text.encode("utf-8"))
    if _checkDate(gdata_odd[l].text):
        if _checkwordsV2(gdata_odd[l].text):
            pass
        else:
            initial_list.append(gdata_odd[l].text.encode("utf-8"))
The problem I'm facing now is that gdata_even[l] and gdata_odd[l] have the following output:
<li class="list2Col even "><div class="indexCol"><span class="date">25 Aug 2015 12:00:06 AM CEST</span></div><div class="contentCol"><div class="categories">Frankfurt</div><h3><a href="/xetra-en/newsroom/xetra-newsboard/FRA-Deletion-of-Instruments-from-XETRA---25.08.2015-001/1913134">FRA:Deletion of Instruments from XETRA - 25.08.2015-001</a></h3></div></li>
Here I want to get the link embedded in the item's href, but it doesn't work:
h3Url = gdata[l].find("a").get("href")
print h3Url
Can anyone help? Thanks.
Maybe the way you obtain gdata is wrong, because your code should work.
>>> from BeautifulSoup import BeautifulSoup
>>> doc='<li class="list2Col even "><div class="indexCol"><span class="date">25 Aug 2015 12:00:06 AM CEST</span></div><div class="contentCol"><div class="categories">Frankfurt</div><h3><a href="/xetra-en/newsroom/xetra-newsboard/FRA-Deletion-of-Instruments-from-XETRA---25.08.2015-001/1913134">FRA:Deletion of Instruments from XETRA - 25.08.2015-001</a></h3></div></li>'
>>> soup = BeautifulSoup(doc)
>>> h3Url = soup.find('a').get('href')
>>> print h3Url
/xetra-en/newsroom/xetra-newsboard/FRA-Deletion-of-Instruments-from-XETRA---25.08.2015-001/1913134
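For reference, here is a minimal sketch of how the same href extraction could be folded into the filtering loop from the question. It assumes BeautifulSoup 4 (bs4, matching the find_all calls in the question, not the BeautifulSoup 3 import used in the demo above) and a variable html holding the page source; the _checkDate / _checkwordsV2 filters are left out because their definitions are not shown.

from bs4 import BeautifulSoup  # assumes bs4 is installed

soup = BeautifulSoup(html, "html.parser")  # `html` is assumed to hold the page source

gdata_even = soup.find_all("li", {"class": "list2Col even "})
gdata_odd = soup.find_all("li", {"class": "list2Col odd "})

initial_list = []
for item in gdata_even + gdata_odd:
    a_tag = item.find("a")            # the <a> nested inside the <h3>
    if a_tag is None:
        continue                      # skip list entries without a link
    h3Url = a_tag.get("href")         # e.g. /xetra-en/newsroom/xetra-newsboard/...
    initial_list.append((item.text.encode("utf-8"), h3Url))

Calling find("a") on each li element (rather than on the whole soup) keeps every href paired with the announcement text it belongs to.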