How to scrape data from Stacker
I want to scrape data from this domain: https://stacker.com/stories/1587/100-best-movies-all-time
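I'm a beginner like you. I tried BeautifulSoup first, but the request did not seem to succeed, possibly because of some kind of protection on the site. I then tried to do what you want with Selenium, and it worked. Take a look: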
from selenium import webdriver
from selenium.webdriver.common.by import By

website = "https://www.the-numbers.com/movie/Avengers-Endgame-(2019)#tab=cast-and-crew"

# Suppress Chrome's console logging noise
chrome_options = webdriver.ChromeOptions()
chrome_options.add_experimental_option("excludeSwitches", ["enable-logging"])
driver = webdriver.Chrome(options=chrome_options)
driver.get(website)

# Selenium 4 removed the find_element_by_* helpers; use find_element(By..., ...) instead
box = driver.find_element(By.CLASS_NAME, "cast_new")
# An XPath starting with // is evaluated against the whole document, not just `box`
matches = box.find_elements(By.XPATH, '//*[@id="cast-and-crew"]/div[5]/table/tbody/tr[1]/td[1]/b/a')
for match in matches:
    print(match.text)

driver.quit()
The data is only returned when a User-Agent header is added:
from bs4 import BeautifulSoup as BS
import requests

# Browser-like User-Agent; without it the server does not return the data
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0',
}
url = 'https://www.the-numbers.com/movie/Avengers-Endgame-(2019)#tab=cast-and-crew'
response = requests.get(url, headers=headers)

# --- response ---
#print(response.status_code)
#print(response.text[:1000])

soup = BS(response.text, 'html.parser')
# Collect the text of the cast-and-crew section, one string per line
all_items = soup.find_all('div', id="cast-and-crew")
for item in all_items:
    print(item.get_text(strip=True, separator='\n'))
Result:
Lead Ensemble Members
Robert Downey, Jr.
Tony Stark/Iron Man
Chris Evans
Steve Rogers/Captain America
Mark Ruffalo
Bruce Banner/Hulk
Chris Hemsworth
Thor
Scarlett Johansson
Natasha Romanoff/Black Widow
Jeremy Renner
Clint Barton/Hawkeye
Don Cheadle
...
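If you need structured data rather than a flat block of text, the lines above could be grouped into pairs. This is only a rough sketch under assumptions: it assumes actor and role lines strictly alternate after the section heading, which holds for the excerpt shown but may not hold for the whole page. The function name `pair_cast_lines` and the `headings` parameter are made up for illustration, and the sample is copied from the first lines of the result above.

```python
from typing import Iterable, List, Tuple

def pair_cast_lines(
    lines: Iterable[str],
    headings: frozenset = frozenset({"Lead Ensemble Members"}),
) -> List[Tuple[str, str]]:
    """Group alternating actor/role lines into (actor, role) tuples.

    Assumption: after dropping blank lines and known section headings,
    the remaining lines alternate actor, role, actor, role, ...
    """
    cleaned = [line.strip() for line in lines]
    cleaned = [line for line in cleaned if line and line not in headings]
    # Even positions are actors, odd positions are roles; zip() drops a
    # trailing actor that has no role line after it.
    return list(zip(cleaned[0::2], cleaned[1::2]))

# Sample copied from the first lines of the output above
sample = """Lead Ensemble Members
Robert Downey, Jr.
Tony Stark/Iron Man
Chris Evans
Steve Rogers/Captain America
Mark Ruffalo
Bruce Banner/Hulk"""

pairs = pair_cast_lines(sample.splitlines())
for actor, role in pairs:
    print(f"{actor} -> {role}")
```

On the real page you would probably get more reliable results by selecting the table cells directly, because other section headings or entries without a role would break the alternation.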