Grabbing links inside the td

Question

The script below works, but I would like to add each item's href link to get better data output. Any help is appreciated. Thanks.

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:92.0) Gecko/20100101 Firefox/92.0"}
url = "https://bscscan.com/token/generic-tokenholders2?m=normal&a=0x0D0b63b32595957ae58D4dD60aa5409E79A5Aa96"

s = requests.Session()
r = s.get(url,headers=headers, timeout=5)
soupblockdetails = BeautifulSoup(r.content, 'html.parser')

for row in soupblockdetails.select("tr:has(td)")[:3]:  #max value is 50
   item1 = row.find_all("td")[0].text[0:].strip()
   item2 = row.find_all("td")[1].text[0:].strip()
   item3 = row.find_all("td")[2].text[0:].strip()
   print ("{:<2} {:<43}   {:>25}".format(item1, item2, item3))

Current output:

1  KIPS: Locked Wallet                            1,870.828693386970691791
2  0xe72d1910c07420a99a2649f40910f692cd87309e         6.849012043043023775
3  0x138fe04c8f7da181765bde237ef5e78546677f5f         2.153134069327832213

Desired output:

1  KIPS: Locked Wallet                            1,870.828693386970691791      0x81e0ef68e103ee65002d3cf766240ed1c070334d      
2  0xe72d1910c07420a99a2649f40910f692cd87309e         6.849012043043023775      0xe72d1910c07420a99a2649f40910f692cd87309e      
3  0x138fe04c8f7da181765bde237ef5e78546677f5f         2.153134069327832213      0x138fe04c8f7da181765bde237ef5e78546677f5f

Answer 1

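Select the <a> inside the second <td> and use .get('href') to extract the href value. To get only the parameter value, split the URL on 'a=' and take the last piece: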

item4 = row.find_all("td")[1].a.get('href').split('a=')[-1]
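If the href ever contains 'a=' somewhere other than the address parameter, or lacks it entirely, the split will quietly return the wrong text. As a hedged alternative, the standard library can parse the query string properly. The helper name address_from_href and the sample href below are assumptions for illustration; the real links on the page may be shaped differently.

```python
from urllib.parse import urlsplit, parse_qs

def address_from_href(href, param="a"):
    """Return the value of query parameter `param` in href, or None if absent."""
    query = urlsplit(href).query           # text after '?', without the fragment
    values = parse_qs(query).get(param)    # list of values, or None if missing
    return values[0] if values else None

# Hypothetical href, shaped like the links the one-liner above splits on 'a='.
sample = "/token/0x0D0b63b32595957ae58D4dD60aa5409E79A5Aa96?a=0x81e0ef68e103ee65002d3cf766240ed1c070334d"
print(address_from_href(sample))
```

With this, item4 could be written as address_from_href(row.find_all("td")[1].a.get('href')).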

In your loop:

for row in soupblockdetails.select("tr:has(td)")[:3]:  #max value is 50
    item1 = row.find_all("td")[0].text[0:].strip()
    item2 = row.find_all("td")[1].text[0:].strip()
    item3 = row.find_all("td")[2].text[0:].strip()
    item4 = row.find_all("td")[1].a.get('href').split('a=')[-1]
    print ("{:<2} {:<43}   {:>25} {}".format(item1, item2, item3, item4))

Output:

1  KIPS: Locked Wallet                            1,870.828693386970691791 0x81e0ef68e103ee65002d3cf766240ed1c070334d
2  0xe72d1910c07420a99a2649f40910f692cd87309e         6.849012043043023775 0xe72d1910c07420a99a2649f40910f692cd87309e
3  0x138fe04c8f7da181765bde237ef5e78546677f5f         2.153134069327832213 0x138fe04c8f7da181765bde237ef5e78546677f5f
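One thing to watch: row.find_all("td")[1].a is None when a cell has no <a>, so .get('href') would raise AttributeError on such a row. Below is a minimal sketch of a more defensive loop body that also calls find_all only once per row. The HTML snippet, the row_fields helper, and the choice to fall back to the cell text are all assumptions for illustration, not the actual page markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second data row has no <a> in its second cell.
html = """
<table>
  <tr><th>Rank</th><th>Address</th><th>Quantity</th></tr>
  <tr><td>1</td><td><a href="/token/0xTOKEN?a=0xAAA">Some Wallet</a></td><td>10.5</td></tr>
  <tr><td>2</td><td>0xBBB</td><td>2.0</td></tr>
</table>
"""

def row_fields(row):
    """Return (rank, holder, quantity, address) for one table row."""
    cells = row.find_all("td")  # look the cells up once instead of per item
    rank, holder, quantity = (c.get_text(strip=True) for c in cells[:3])
    link = cells[1].find("a")
    if link is not None and link.get("href"):
        address = link["href"].split("a=")[-1]
    else:
        address = holder  # assumed fallback: reuse the visible cell text
    return rank, holder, quantity, address

soup = BeautifulSoup(html, "html.parser")
rows = [row_fields(r) for r in soup.select("tr:has(td)")]
for fields in rows:
    print("{:<2} {:<43}   {:>25} {}".format(*fields))
```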
