Web scraping search with Python BeautifulSoup or Selenium
I am trying to build a web scraper that collects battle win/loss data for different superheroes from https://www.superherodb.com/battle/create/#close.
I have already collected all of the superhero names, and now I want to add each character individually and collect their battle data against every other character: Superman vs everyone, Thor vs everyone, and so on, gathering the battle data for every character against all the others.
For example, https://www.superherodb.com/superman-vs-thor/90-103/ contains the stats for Superman vs Thor.
If possible, how can I scrape the data in an organized, clean way so that I can collect it all in dictionary form, e.g. {"Superman_vs_Thor": [46, 2, 52]}, {"Superman_vs_Spiderman": [98, 2]}?
I couldn't convert the information you need into a dictionary, but I was able to scrape it.
The code is below:
from bs4 import BeautifulSoup
import requests

r = requests.get('https://www.superherodb.com/superman-vs-thor/90-103/')
soup = BeautifulSoup(r.text, 'lxml')

# The page tags each side's result box with a win/lose/draw class
battle = soup.find('h1', class_='h1-battle')
superman = soup.find('div', class_='battle-team-result lose')
thor = soup.find('div', class_='battle-team-result win')
average = soup.find('div', class_='battle-team-result draw')

print('Battle:', battle.text)
print('Superman stats:', superman.text)
print('Thor stats:', thor.text)
print('Average:', average.text)
Give this a try:
from selenium.webdriver.common.by import By
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.superherodb.com/superman-vs-thor/90-103/")
title = driver.find_element(By.CLASS_NAME,"h1-battle").text
characters = [c.strip() for c in title.split("vs")]
results = driver.find_elements(By.CLASS_NAME,"battle-team-result")
print('Title: ', title)
print(characters[0] + ': ' + results[0].text)
print('Draw: ', results[1].text)
print(characters[1] + ': ' + results[2].text)
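The question also asks about scraping every pairing of characters. Assuming you have each hero's URL slug and numeric id (the 90-103 part of the example URL appears to encode the two ids), the battle URLs and dictionary keys could be generated with itertools.combinations. A minimal sketch with a made-up hero list and ids — the real ids would have to come from the site:

```python
from itertools import combinations

# Hypothetical (name, slug, id) data -- only Superman/Thor match the example URL
heroes = [('Superman', 'superman', 90),
          ('Thor', 'thor', 103),
          ('Spiderman', 'spiderman', 133)]

urls, names = [], []
for (n1, s1, i1), (n2, s2, i2) in combinations(heroes, 2):
    urls.append(f'https://www.superherodb.com/{s1}-vs-{s2}/{i1}-{i2}/')
    names.append(f'{n1}_vs_{n2}')

print(names)  # ['Superman_vs_Thor', 'Superman_vs_Spiderman', 'Thor_vs_Spiderman']
```

The two lists stay index-aligned, so they can be fed directly into a loop like the one in the answer below that uses `enumerate(urls)`.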
You can fix the .text parts for the win, loss, and draw sections, but to get the values you want, append each value to a list while checking that the section actually exists on the page, then key the lists into a dictionary by index using the names you mentioned.
wait = WebDriverWait(driver, 10)
urls = ['https://www.superherodb.com/superman-vs-thor/90-103/']
names = ['Superman_vs_Thor']
complete_list = {}

for indx, url in enumerate(urls):
    driver.get(url)
    battles = []
    # Each section may be missing on a given page, so check before appending
    try:
        win = wait.until(EC.visibility_of_element_located((By.XPATH, "//div[@class='battle-team-result win']"))).text
        battles.append(win)
    except Exception:
        pass
    try:
        draw = wait.until(EC.visibility_of_element_located((By.XPATH, "//div[@class='battle-team-result draw']"))).text
        battles.append(draw)
    except Exception:
        pass
    try:
        loss = wait.until(EC.visibility_of_element_located((By.XPATH, "//div[@class='battle-team-result lose']"))).text
        battles.append(loss)
    except Exception:
        pass
    complete_list[names[indx]] = battles

print(complete_list)
So far this gives:
{'Superman_vs_Thor': ['912 wins (52%)', '35 (2%)', '806 wins (46%)']}
Imports:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
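To turn the scraped strings into the numeric form the question asks for ({"Superman_vs_Thor": [46, 2, 52]}), the percentages can be pulled out of each string after scraping. A minimal sketch, assuming the strings keep the '(NN%)' format shown in the output above:

```python
import re

# Sample scraped data in the format shown in the output above
complete_list = {'Superman_vs_Thor': ['912 wins (52%)', '35 (2%)', '806 wins (46%)']}

# Keep only the percentage inside the parentheses of each result string
numeric = {
    name: [int(m.group(1)) for s in stats
           if (m := re.search(r'\((\d+)%\)', s))]
    for name, stats in complete_list.items()
}
print(numeric)  # {'Superman_vs_Thor': [52, 2, 46]}
```

Note the order here is win/draw/loss, matching the order the values were appended in the loop above, so rearrange the list if you want a fixed [loss, draw, win] layout.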