Web scraping search with Python BeautifulSoup or Selenium

I'm trying to create a web scraper that collects battle win/loss data for different superheroes from the website https://www.superherodb.com/battle/create/#close.

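I have already collected the names of all the superheroes, and I want to add each character individually and collect their battle data against every other character: Superman vs. everyone, Thor vs. everyone, and so on.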

For example, https://www.superherodb.com/superman-vs-thor/90-103/ contains the stats for Superman versus Thor.

If possible, how can I scrape the data in an organized, clean way so that I can collect all of it in dictionary form, for example: {"Superman_vs_Thor": [46, 2, 52]}, {"Superman_vs_Spiderman": [98, 2]}?

I wasn't able to convert the information you need into a dictionary, but I was able to scrape it.

Here is the code:

from bs4 import BeautifulSoup
import requests

r = requests.get('https://www.superherodb.com/superman-vs-thor/90-103/')
soup = BeautifulSoup(r.text, 'lxml')

# The page marks each result block with a class: win, lose, or draw.
# On this particular page Superman is the losing side and Thor the winner.
battle = soup.find('h1', class_='h1-battle')
superman = soup.find('div', class_='battle-team-result lose')
thor = soup.find('div', class_='battle-team-result win')
draw = soup.find('div', class_='battle-team-result draw')

print('Battle:', battle.text)
print('Superman stats:', superman.text)
print('Thor stats:', thor.text)
print('Draw:', draw.text)

Try this:

from selenium.webdriver.common.by import By
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.superherodb.com/superman-vs-thor/90-103/")
title = driver.find_element(By.CLASS_NAME,"h1-battle").text
characters = title.split("vs")
results = driver.find_elements(By.CLASS_NAME,"battle-team-result")

print('Title: ', title)

print(characters[0] + ': ' + results[0].text)
print('Draw: ', results[1].text)
print(characters[1] + ': ' + results[2].text)

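You can still clean up the .text of the win, loss, and draw parts, but to get the values you want, you can append them to a list while checking whether each part exists on the page, and then store that list in a dictionary under the name you mentioned, looked up by index.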

# Assumes the imports listed at the end of this answer and an existing
# Selenium driver, e.g. driver = webdriver.Chrome().
wait = WebDriverWait(driver, 10)
urls = ['https://www.superherodb.com/superman-vs-thor/90-103/']
names = ['Superman_vs_Thor']
complete_list = {}
for indx, url in enumerate(urls):
    driver.get(url)
    battles = []
    # A result block may be missing from a page, so try each one in turn
    # and skip any that does not become visible within the timeout.
    for outcome in ('win', 'draw', 'lose'):
        xpath = f"//div[@class='battle-team-result {outcome}']"
        try:
            element = wait.until(EC.visibility_of_element_located((By.XPATH, xpath)))
            battles.append(element.text)
        except Exception:
            continue
    complete_list[names[indx]] = battles

print(complete_list)

So far this gives:

{'Superman_vs_Thor': ['912 wins (52%)', '35 (2%)', '806 wins (46%)']}
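
To get from these strings to the plain numbers asked for in the question, one option is to pull the percentage out of each string with a regular expression. This is only a sketch: the helper name to_percentages is made up for illustration, and it assumes every scraped string contains a whole-number percentage in parentheses, as in the single sample output above.

```python
import re

def to_percentages(battles):
    """Pull the integer inside '(NN%)' out of each scraped string."""
    percentages = []
    for text in battles:
        match = re.search(r"\((\d+)%\)", text)
        if match:  # skip strings that do not follow the assumed format
            percentages.append(int(match.group(1)))
    return percentages

# Sample taken from the output shown above.
scraped = {'Superman_vs_Thor': ['912 wins (52%)', '35 (2%)', '806 wins (46%)']}
result = {name: to_percentages(texts) for name, texts in scraped.items()}
print(result)  # {'Superman_vs_Thor': [52, 2, 46]}
```

Note that the values come out in win, draw, loss order, so the winner's percentage is first. If you need the first-named character's percentage first, as in your [46, 2, 52] example, you would have to reorder them yourself.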

Imports:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
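
To cover every character against every other character, the names and urls lists could be generated rather than typed by hand. The sketch below is a guess at how that might look: it assumes battle URLs follow the pattern of the one example in the question (slug-vs-slug/id-id/), which I have not verified beyond that URL. The Superman and Thor slugs and ids are read off that URL; the Spiderman entry is a made-up placeholder, and build_battles is a hypothetical helper name.

```python
from itertools import combinations

BASE = "https://www.superherodb.com"

# name -> (URL slug, numeric id). Superman and Thor come from the example
# URL in the question; the Spiderman values are placeholders, not real data.
HEROES = {
    "Superman": ("superman", 90),
    "Thor": ("thor", 103),
    "Spiderman": ("spiderman", 0),  # hypothetical slug and id
}

def build_battles(heroes):
    """Return (names, urls) for every unordered pair of heroes."""
    names, urls = [], []
    for (a, (slug_a, id_a)), (b, (slug_b, id_b)) in combinations(heroes.items(), 2):
        names.append(f"{a}_vs_{b}")
        # Assumed URL pattern, inferred from a single example.
        urls.append(f"{BASE}/{slug_a}-vs-{slug_b}/{id_a}-{id_b}/")
    return names, urls

names, urls = build_battles(HEROES)
print(names[0], urls[0])
```

combinations gives each pairing once (Superman vs Thor, but not Thor vs Superman as well). If you want both orders, itertools.permutations(heroes.items(), 2) would be the swap, though I don't know whether the site serves the reversed URL.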