Scraping Data with Beautiful Soup Issues
I'm trying to scrape astronauts' countries from this site: https://www.supercluster.com/astronauts?ascending=false&limit=72&list=true&sort=launch%20order. I'm using BeautifulSoup for the task, but I'm running into some issues. Here is my code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
data = []
url = 'https://www.supercluster.com/astronauts?ascending=false&limit=72&list=true&sort=launch%20order'
r = requests.get(url)
soup = BeautifulSoup(r.content,'html.parser')
tags = soup.find_all('div', class_ ='astronaut_index__content container--xl mxa f fr fw aifs pl15 pr15 pt0')
for item in tags:
    name = item.select_one('bau astronaut_cell__title bold mr05')
    country = item.select_one('mouseover__contents rel py05 px075 bau caps small ac').get_text(strip = True)
    data.append([name, country])
df = pd.DataFrame(data)
df
df is coming back empty. Not sure what's going on. When I take the code out of the for loop, it can't seem to find the select_one function. The function should come from bs4 - not sure why it isn't working. Also, am I missing a repeatable pattern for web scraping? Every time I tackle one of these problems it seems to be a different beast.
Any help would be greatly appreciated! Thanks!
The page is loaded dynamically with JavaScript, so requests can't access the data directly. The data is loaded from another address and received in JSON format. You can get it like this:
import json
import requests

url = "https://supercluster-iadb.s3.us-east-2.amazonaws.com/adb_mobile.json"
req = requests.get(url)
data = json.loads(req.text)
Once it's loaded, you can iterate over it and retrieve the relevant information. For example:
for astro in data['astronauts']:
    print(astro['astroNumber'], astro['firstName'], astro['lastName'], astro['rank'])
Output:
1 Yuri Gagarin Colonel
10 Walter Schirra Captain
100 Georgi Ivanov Major General
101 Leonid Popov Major General
102 Bertalan Farkas Brigadier General
etc.
You can then load the output into a pandas dataframe or whatever else you like.
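For instance, assuming the JSON has the structure suggested by the loop above (a top-level 'astronauts' list of records with 'astroNumber', 'firstName', 'lastName', and 'rank' keys - an assumption based on the printed fields, not a documented schema), a list of dicts can go straight into a DataFrame. The sketch below uses a hand-made stub in place of the live download so it runs offline:

```python
import pandas as pd

# Stub mimicking the assumed structure of the live JSON; in real use,
# substitute data['astronauts'] from the requests/json snippet above.
astronauts = [
    {"astroNumber": 1, "firstName": "Yuri", "lastName": "Gagarin", "rank": "Colonel"},
    {"astroNumber": 10, "firstName": "Walter", "lastName": "Schirra", "rank": "Captain"},
]

# pd.DataFrame accepts a list of dicts directly; columns= fixes the order
# and silently drops any extra keys the real feed might carry.
df = pd.DataFrame(astronauts, columns=["astroNumber", "firstName", "lastName", "rank"])
print(df)
```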
The url's data is generated dynamically by JavaScript, and BeautifulSoup can't scrape dynamic data. So you can use an automation tool like Selenium together with BeautifulSoup. Here I applied Selenium with BeautifulSoup; please just run the code.
Script:
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import time
data = []
url = 'https://www.supercluster.com/astronauts?ascending=false&limit=300&list=true&sort=launch%20order'
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.maximize_window()
time.sleep(5)
driver.get(url)
time.sleep(5)
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.close()
tags = soup.select('.astronaut_cell.x')
for item in tags:
    name = item.select_one('.bau.astronaut_cell__title.bold.mr05').get_text()
    #print(name)
    country = item.select_one('.mouseover__contents.rel.py05.px075.bau.caps.small.ac')
    if country:
        country = country.get_text()
    #print(country)
    data.append([name, country])
cols=['name','country']
df = pd.DataFrame(data,columns=cols)
print(df)
Output:
name country
0 Bess, Cameron United States of America
1 Bess, Lane United States of America
2 Dick, Evan United States of America
3 Taylor, Dylan United States of America
4 Strahan, Michael United States of America
.. ... ...
295 Jones, Thomas United States of America
296 Sega, Ronald United States of America
297 Usachov, Yury Russia
298 Fettman, Martin United States of America
299 Wolf, David United States of America
[300 rows x 2 columns]
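A side note on the original error: select_one takes a CSS selector, not a bare class string, so each class needs a leading dot and the classes are joined without spaces, as in the script above. A minimal illustration on a static snippet (the div and name below are made up for demonstration):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like one astronaut cell on the page.
html = '<div class="bau astronaut_cell__title bold mr05">Gagarin, Yuri</div>'
soup = BeautifulSoup(html, 'html.parser')

# A bare class string is parsed as tag-name selectors, so it matches nothing.
assert soup.select_one('bau astronaut_cell__title bold mr05') is None

# With dots, each token is a class selector; an element matching all of them is found.
tag = soup.select_one('.bau.astronaut_cell__title.bold.mr05')
print(tag.get_text())  # -> Gagarin, Yuri
```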