Get href tags properly in BeautifulSoup
I am trying to get the href links from a page. The markup looks like this:
<div class="srTitleFull pcLink"><a style="display:block" name="000 Plus system requirements" title="000 Plus System Requirements" href="../games/index.php?g_id=21580&game=000 Plus">000 Plus</a></div><div class="srDescFull"><td>000+ is a bite-sized hardcore platformer. Its mini...</td></div><div class="srDateFull">Feb-10-2015</div>
<div class="srTitleFull pcLink"><a style="display:block" name="0RBITALIS system requirements" title="0RBITALIS System Requirements" href="../games/index.php?g_id=23521&game=0RBITALIS">0RBITALIS</a></div><div class="srDescFull"><td>0RBITALIS is a satellite launching simulator with ...</td></div><div class="srDateFull">May-28-2015</div><div class="srGenreFull">Sim</div><br /></div><div class="srRowFull"><div class="srTitleFull pcLink"><a style="display:block" name="10 Years After system requirements" title="10 Years After System Requirements" href="../games/index.php?g_id=22220&game=10 Years After">10 Years After</a></div>
So I am trying to get those links, for example ../games/index.php?g_id=21580&game=000 Plus and ../games/index.php?g_id=22220&game=10 Years After. Here is what I tried:
from bs4 import BeautifulSoup
import urllib.request

r = urllib.request.Request('http://www.game-debate.com/games/index.php?year=2015', headers={'User-Agent': 'Mozilla/5.0'})
rr = urllib.request.urlopen(r).read()
soup = BeautifulSoup(rr)
url_list = []
for x in soup.find_all("div", attrs={'class': ['srTitleFull']}):
    for y in soup.find_all("a", href=True):
        url_list.append(y['href'])
for x in url_list:
    print(x)
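This does get the links, but the printing seems to go on forever. It is probably because of the two for loops: I am adding the links to the list more than once. I cannot figure out how to get each link once and add it to the list.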
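The problem with your nested loops is that you call soup.find_all() in both the outer and the inner loop, which asks BeautifulSoup to search the whole tree each time. As a result, every link on the page is appended once per matching div. You meant to use the x loop variable to search for the inner links, making a "context-specific" search, that is: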
url_list = []
for x in soup.find_all("div", attrs={'class': ['srTitleFull']}):
    for y in x.find_all("a", href=True):  # < FIX applied here
        url_list.append(y['href'])
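To make the difference concrete, here is a minimal sketch that runs both variants against an inline snippet instead of the live page. The markup is trimmed down from the question, and the extra `div class="other"` with its `/about` link is a hypothetical addition to stand in for unrelated links elsewhere on the page. The `"html.parser"` argument is also an assumption; any parser installed alongside BeautifulSoup should behave the same here.

```python
from bs4 import BeautifulSoup

# Trimmed-down sample markup; the last div and its "/about" link are hypothetical.
html = """
<div class="srTitleFull pcLink"><a href="../games/index.php?g_id=21580&game=000 Plus">000 Plus</a></div>
<div class="srTitleFull pcLink"><a href="../games/index.php?g_id=23521&game=0RBITALIS">0RBITALIS</a></div>
<div class="other"><a href="/about">About</a></div>
"""

soup = BeautifulSoup(html, "html.parser")
divs = soup.find_all("div", attrs={"class": ["srTitleFull"]})

# Original approach: the inner search starts from `soup`, so every link
# in the document is collected once per matching div.
global_hits = [a["href"] for div in divs for a in soup.find_all("a", href=True)]

# Fixed approach: the inner search starts from the current div only.
scoped_hits = [a["href"] for div in divs for a in div.find_all("a", href=True)]

print(len(global_hits))  # 6: 2 matching divs x 3 links in the whole document
print(len(scoped_hits))  # 2: one link per matching div
```

On a real page with hundreds of title divs and hundreds of links, the first variant grows as the product of the two counts, which would explain why the printing appears never to finish.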
There is a better way, though. I would use a CSS selector to locate the links:
url_list = [a['href'] for a in soup.select(".srTitleFull > a")]
where .srTitleFull > a matches all a elements located directly inside an element with the srTitleFull class.
This way you don't need a nested loop at all.
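As a rough illustration of what "directly inside" means, the sketch below compares the child combinator (`>`) with a plain descendant selector. The markup is made up for the purpose: the `/nested` and `/elsewhere` links and the `span` wrapper are hypothetical, and `"html.parser"` is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one direct child link, one link wrapped in a span,
# and one link outside any srTitleFull element.
html = """
<div class="srRowFull">
  <div class="srTitleFull pcLink"><a href="../games/index.php?g_id=21580">000 Plus</a></div>
  <div class="srTitleFull pcLink"><span><a href="/nested">nested</a></span></div>
  <div class="srDescFull"><a href="/elsewhere">elsewhere</a></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Child combinator: only <a> elements whose parent has the srTitleFull class.
direct = [a["href"] for a in soup.select(".srTitleFull > a")]

# Descendant combinator: <a> elements at any depth below srTitleFull.
descendants = [a["href"] for a in soup.select(".srTitleFull a")]

print(direct)       # ['../games/index.php?g_id=21580']
print(descendants)  # ['../games/index.php?g_id=21580', '/nested']
```

If some anchors might lack an href attribute, the selector `.srTitleFull > a[href]` should skip them, which mirrors the `href=True` filter used in the loop version and avoids a KeyError on `a['href']`.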