How to get the proper link from a website using python beautifulsoup?
When I try to scrape the roster links, I get https://gwsports.com/roster.aspx?path=wpolo, but when I open it in Chrome it changes to https://gwsports.com/sports/mens-water-polo/roster. I want to scrape it in the proper format, like the second one (https://gwsports.com/sports/mens-water-polo/roster).
pip install -U gazpacho
from gazpacho import get, Soup
url = 'https://gwsports.com'
html = get(url)
soup = Soup(html)
links = soup.find('a', {'href': "roster"}, partial=True)
s = [link.attrs['href'] for link in links]
print(s)
This isn't a scraping problem: the URL you get is the one that is actually on the page. It is that URL which then redirects you to the final URL, which is the one you want.
You can use the requests library to get the final URL:
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; '
           'Intel Mac OS X 10.6; rv:16.0) Gecko/20100101 Firefox/16.0'}
url = 'https://gwsports.com/roster.aspx?path=wpolo'

r = requests.get(url, allow_redirects=True, headers=headers)
if r.status_code == 200:
    print(r.url)  # URL after redirections
else:
    print('Request failed')
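If you want to see what the redirect actually looks like, requests records each intermediate hop in r.history. Here is a self-contained sketch using a throwaway local server as a stand-in for gwsports.com (the real redirect depends on the live site):

```python
import http.server
import threading
import requests

# A tiny local server whose /old path 302-redirects to /final,
# so the redirect behaviour can be observed without the live site.
class Redirector(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/old':
            self.send_response(302)
            self.send_header('Location', '/final')
            self.end_headers()
        else:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b'ok')

    def log_message(self, *args):  # silence per-request logging
        pass

server = http.server.HTTPServer(('127.0.0.1', 0), Redirector)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f'http://127.0.0.1:{server.server_port}'

r = requests.get(base + '/old', allow_redirects=True)
print(r.url)                        # final URL after the redirect
print([h.url for h in r.history])   # each intermediate hop, in order
server.shutdown()
```

An empty r.history means no redirect happened and the URL on the page was already the final one.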
That makes your code look like this:
from gazpacho import get, Soup
import requests

def get_final_url(path, root):
    # Note: this function assumes path is relative and always prepends root.
    # You may want to extend it to detect absolute URLs.
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; '
               'Intel Mac OS X 10.6; rv:16.0) Gecko/20100101 Firefox/16.0'}
    r = requests.get(root + path, allow_redirects=True, headers=headers)
    if r.status_code == 200:
        return r.url  # URL after redirections
    raise requests.HTTPError(f'{r.status_code} for {root + path}')

url = 'https://gwsports.com'
root = 'https://gwsports.com'

html = get(url)
soup = Soup(html)
links = soup.find('a', {'href': "roster"}, partial=True)
s = [get_final_url(link.attrs['href'], root) for link in links]
print(s)
Output:
['https://gwsports.com/sports/baseball/roster', 'https://gwsports.com/sports/mens-basketball/roster', 'https://gwsports.com/sports/mens-golf/roster', 'https://gwsports.com/sports/mens-soccer/roster', 'https://gwsports.com/sports/mens-swimming-and-diving/roster', 'https://gwsports.com/sports/mens-cross-country/roster', 'https://gwsports.com/sports/mens-water-polo/roster', 'https://gwsports.com/sports/womens-basketball/roster', 'https://gwsports.com/sports/womens-gymnastics/roster', 'https://gwsports.com/sports/womens-lacrosse/roster', 'https://gwsports.com/sports/womens-rowing/roster', 'https://gwsports.com/sports/womens-soccer/roster', 'https://gwsports.com/sports/softball/roster', 'https://gwsports.com/sports/womens-swimming-and-diving/roster', 'https://gwsports.com/sports/womens-tennis/roster', 'https://gwsports.com/sports/womens-cross-country/roster', 'https://gwsports.com/sports/womens-volleyball/roster']
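The comment in get_final_url mentions extending it to detect absolute URLs. One way to do that (a sketch, not part of the answer above) is urllib.parse.urljoin from the standard library, which resolves relative hrefs against a base and passes absolute URLs through unchanged; the hrefs below are illustrative:

```python
from urllib.parse import urljoin

root = 'https://gwsports.com'

# urljoin resolves relative paths against root and
# leaves absolute URLs untouched
hrefs = ['/roster.aspx?path=wpolo',
         'https://example.com/roster.aspx?path=base']
resolved = [urljoin(root, h) for h in hrefs]
print(resolved)
# → ['https://gwsports.com/roster.aspx?path=wpolo',
#    'https://example.com/roster.aspx?path=base']
```

With this, get_final_url could accept the href exactly as it appears on the page, whether relative or absolute.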