Get href tags properly in BeautifulSoup

Question

I am trying to get the href links from a URL. The markup looks like this:

<div class="srTitleFull pcLink"><a style="display:block" name="000 Plus system requirements" title="000 Plus System Requirements" href="../games/index.php?g_id=21580&game=000 Plus">000 Plus</a></div><div class="srDescFull"><td>000+ is a bite-sized hardcore platformer. Its mini...</td></div><div class="srDateFull">Feb-10-2015</div>

<div class="srTitleFull pcLink"><a style="display:block" name="0RBITALIS system requirements" title="0RBITALIS System Requirements" href="../games/index.php?g_id=23521&game=0RBITALIS">0RBITALIS</a></div><div class="srDescFull"><td>0RBITALIS is a satellite launching simulator with ...</td></div><div class="srDateFull">May-28-2015</div><div class="srGenreFull">Sim</div><br /></div><div class="srRowFull"><div class="srTitleFull pcLink"><a style="display:block" name="10 Years After system requirements" title="10 Years After System Requirements" href="../games/index.php?g_id=22220&game=10 Years After">10 Years After</a></div>

So I am trying to get those links, for example ../games/index.php?g_id=21580&game=000 Plus and ../games/index.php?g_id=22220&game=10 Years After. This is what I tried:

from bs4 import BeautifulSoup
import urllib.request

r = urllib.request.Request('http://www.game-debate.com/games/index.php?year=2015',headers={'User-Agent': 'Mozilla/5.0'})
rr = urllib.request.urlopen(r).read()
soup = BeautifulSoup(rr)


url_list = []
for x in soup.find_all("div",attrs={'class':['srTitleFull']}):
   for y in soup.find_all("a", href = True):
        url_list.append(y['href'])
for x in url_list:
    print (x)

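This gets the links, but the printing goes on forever. Probably because of the 2 for loops, I am adding the links to the list more than once. I can't figure out how to get each of these links once and add it to the list.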

Answer 1

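The problem with your nested loop is that you call soup.find_all() in both the outer and the inner loop, which asks BeautifulSoup to search the whole tree every time. As a result, every link on the page is appended once for each matching div. You meant to use the x loop variable to search for the inner links, making the search "context-specific", that is: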

url_list = []
for x in soup.find_all("div", attrs={'class': ['srTitleFull']}):
    for y in x.find_all("a", href=True):  # < FIX applied here
        url_list.append(y['href'])
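
To make the difference concrete, here is a minimal sketch that runs both variants on a trimmed copy of the question's markup. The html string and the extra /about link are made up for illustration, and "html.parser" is just one parser choice; the real page will contain far more links, which is why the original list grew so large.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical markup modelled on the snippet in the question.
html = """
<div class="srTitleFull pcLink"><a href="../games/index.php?g_id=21580&game=000 Plus">000 Plus</a></div>
<div class="srTitleFull pcLink"><a href="../games/index.php?g_id=23521&game=0RBITALIS">0RBITALIS</a></div>
<a href="/about">About</a>
"""
soup = BeautifulSoup(html, "html.parser")

divs = soup.find_all("div", attrs={"class": ["srTitleFull"]})

# Original approach: the inner search ignores x and scans the whole tree,
# so every link on the page is collected once per matching div.
whole_tree = [y["href"] for x in divs for y in soup.find_all("a", href=True)]

# Fixed approach: the inner search is limited to the current div.
scoped = [y["href"] for x in divs for y in x.find_all("a", href=True)]

print(len(whole_tree))  # 6 (2 divs * 3 links)
print(len(scoped))      # 2
```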

There is a better way, though.

I would use a CSS selector to locate the links:

url_list = [a['href'] for a in soup.select(".srTitleFull > a")]

where .srTitleFull > a matches all a elements located directly inside an element with the srTitleFull class.

This way you don't need the nested loop at all.
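
As a small sketch of what "directly inside" means, the example below compares the child combinator (>) with the plain descendant selector on hypothetical markup. The /nested and /other links and the wrapping span are invented for illustration and do not come from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one direct child link, one link nested in a span,
# and one link inside a div with a different class.
html = """
<div class="srTitleFull pcLink"><a href="../games/index.php?g_id=21580">000 Plus</a></div>
<div class="srTitleFull pcLink"><span><a href="/nested">nested</a></span></div>
<div class="srDescFull"><a href="/other">other</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

# "> a" only matches a elements that are immediate children of .srTitleFull.
direct = [a["href"] for a in soup.select(".srTitleFull > a")]

# A space matches a elements at any depth below .srTitleFull.
descendants = [a["href"] for a in soup.select(".srTitleFull a")]

print(direct)       # ['../games/index.php?g_id=21580']
print(descendants)  # ['../games/index.php?g_id=21580', '/nested']
```

For the markup shown in the question, where each a sits immediately inside its srTitleFull div, both selectors should return the same links.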

Tags: python, beautifulsoup, html-parsing, web-scraping, python-3.4