How to get the proper link from a website using python beautifulsoup?
When I try to scrape the roster links, I get https://gwsports.com/roster.aspx?path=wpolo, but when I open it in Chrome it changes to https://gwsports.com/sports/mens-water-polo/roster. I want to scrape it in the proper format, like the second one (https://gwsports.com/sports/mens-water-polo/roster).
pip install -U gazpacho
from gazpacho import get, Soup
url = 'https://gwsports.com'
html = get(url)
soup = Soup(html)
links = soup.find('a', {'href': "roster"}, partial=True)
s = [link.attrs['href'] for link in links]
print(s)
This isn't a scraping problem: the URL you get is the one that is actually on the page. It is that URL which then redirects you to the final URL, which is the one you want.
You can use the requests library to get the final URL:
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; '
           'Intel Mac OS X 10.6; rv:16.0) Gecko/20100101 Firefox/16.0'}
url = 'https://gwsports.com/roster.aspx?path=wpolo'

r = requests.get(url, allow_redirects=True, headers=headers)
if r.status_code == 200:
    print(r.url)  # URL after redirections
else:
    print('Request failed')
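If you want to see what the redirect actually looks like, requests records each intermediate hop in r.history. Here is a self-contained sketch using a throwaway local server as a stand-in for gwsports.com (the real redirect depends on the live site):

```python
import http.server
import threading
import requests

# A tiny local server whose /old path 302-redirects to /final,
# so the redirect behaviour can be observed without the live site.
class Redirector(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/old':
            self.send_response(302)
            self.send_header('Location', '/final')
            self.end_headers()
        else:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b'ok')

    def log_message(self, *args):  # silence per-request logging
        pass

server = http.server.HTTPServer(('127.0.0.1', 0), Redirector)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f'http://127.0.0.1:{server.server_port}'

r = requests.get(base + '/old', allow_redirects=True)
print(r.url)                        # final URL after the redirect
print([h.url for h in r.history])   # each intermediate hop, in order
server.shutdown()
```

An empty r.history means no redirect happened and the URL on the page was already the final one.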
That makes your code look like this:
from gazpacho import get, Soup
import requests

def get_final_url(path, root):
    # Note: this function assumes path is relative and always prepends root.
    # You may want to extend it to detect absolute URLs.
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; '
               'Intel Mac OS X 10.6; rv:16.0) Gecko/20100101 Firefox/16.0'}
    r = requests.get(root + path, allow_redirects=True, headers=headers)
    if r.status_code == 200:
        return r.url  # URL after redirections
    raise requests.HTTPError(f'{r.status_code} for {root + path}')

url = 'https://gwsports.com'
root = 'https://gwsports.com'

html = get(url)
soup = Soup(html)
links = soup.find('a', {'href': "roster"}, partial=True)
s = [get_final_url(link.attrs['href'], root) for link in links]
print(s)
Output:
['https://gwsports.com/sports/baseball/roster', 'https://gwsports.com/sports/mens-basketball/roster', 'https://gwsports.com/sports/mens-golf/roster', 'https://gwsports.com/sports/mens-soccer/roster', 'https://gwsports.com/sports/mens-swimming-and-diving/roster', 'https://gwsports.com/sports/mens-cross-country/roster', 'https://gwsports.com/sports/mens-water-polo/roster', 'https://gwsports.com/sports/womens-basketball/roster', 'https://gwsports.com/sports/womens-gymnastics/roster', 'https://gwsports.com/sports/womens-lacrosse/roster', 'https://gwsports.com/sports/womens-rowing/roster', 'https://gwsports.com/sports/womens-soccer/roster', 'https://gwsports.com/sports/softball/roster', 'https://gwsports.com/sports/womens-swimming-and-diving/roster', 'https://gwsports.com/sports/womens-tennis/roster', 'https://gwsports.com/sports/womens-cross-country/roster', 'https://gwsports.com/sports/womens-volleyball/roster']
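The comment in get_final_url mentions extending it to detect absolute URLs. One way to do that (a sketch, not part of the answer above) is urllib.parse.urljoin from the standard library, which resolves relative hrefs against a base and passes absolute URLs through unchanged; the hrefs below are illustrative:

```python
from urllib.parse import urljoin

root = 'https://gwsports.com'

# urljoin resolves relative paths against root and
# leaves absolute URLs untouched
hrefs = ['/roster.aspx?path=wpolo',
         'https://example.com/roster.aspx?path=base']
resolved = [urljoin(root, h) for h in hrefs]
print(resolved)
# → ['https://gwsports.com/roster.aspx?path=wpolo',
#    'https://example.com/roster.aspx?path=base']
```

With this, get_final_url could accept the href exactly as it appears on the page, whether relative or absolute.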