Beautiful Soup unable to scrape website contents
Hi, I want to do some simple web scraping on this site: https://www.sayurbox.com/p/Swallow%20Tepung%20Agar%20Agar%20Tinggi%20Serat%207%20gram
My code looks like this:
from fake_useragent import UserAgent
import requests
import pandas as pd
from bs4 import BeautifulSoup

def userAgent(URL):
    ua = UserAgent()
    USER_AGENT = ua.random
    headers = {"User-Agent": str(USER_AGENT), "Accept-Encoding": "*", "Connection": "keep-alive"}
    resp = requests.get(URL, headers=headers)
    if resp.status_code == 200:
        soup = BeautifulSoup(resp.content, "html.parser")
        print(f'{URL}')
    else:
        print(f'error {resp.status_code}: {URL}')
        urlError = pd.DataFrame({'url': [URL],
                                 'date': [dateNow]})
        urlError.to_csv('errorUrl/errorUrl.csv', mode='a', index=False, header=False)
    return soup

soup = userAgent(url)
productTitle = soup.find_all('div', {"class": "InfoProductDetail__shortDesc"})
But it doesn't work. Is there something wrong with my code? I tried adding time.sleep to wait for the page to load, but it still doesn't work. Any help would be appreciated.
Your code is fine, but the URL is dynamic: the data is generated by JavaScript after the page loads, which requests and BeautifulSoup cannot emulate. You need an automation tool such as Selenium. You can run the code below.
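To see why the requests approach fails, remember that BeautifulSoup only parses the HTML the server actually sends. For a JavaScript-rendered page, that response is typically just an empty application shell, so the product markup is simply not there. A minimal, self-contained illustration (the HTML string below is a made-up stand-in for such a shell, not the real response from sayurbox.com):

```python
from bs4 import BeautifulSoup

# A stand-in for what the server returns for a JS-rendered page:
# an empty application root with no product markup in it.
server_html = '<html><body><div id="root"></div></body></html>'

soup = BeautifulSoup(server_html, 'html.parser')

# The product element is created client-side by JavaScript, so it is
# absent from the raw HTML and select_one returns None.
print(soup.select_one('.InfoProductDetail__shortDesc'))  # None
```

This is why no amount of time.sleep on the requests side helps: the delay happens in your script, not in a browser that could execute the page's JavaScript.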
from bs4 import BeautifulSoup
import time
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

url = 'https://www.sayurbox.com/p/Swallow%20Tepung%20Agar%20Agar%20Tinggi%20Serat%207%20gram'

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.maximize_window()
driver.get(url)
time.sleep(5)  # wait for the JavaScript to render; an explicit WebDriverWait would be more robust

soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.close()

title = soup.select_one('.InfoProductDetail__shortDesc').text
price = soup.select_one('span.InfoProductDetail__price').text
print(title)
print(price)
Output:
Swallow Tepung Agar Agar Tinggi Serat 7 gram
7.900
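As a side note on your userAgent helper: the error branch appends to errorUrl/errorUrl.csv, and pandas raises FileNotFoundError if the errorUrl/ directory does not exist yet. A small guard fixes that (a sketch with a hypothetical example row; in your code the URL and dateNow would come from the surrounding scraper):

```python
import os
import pandas as pd

# Hypothetical example row standing in for the failed URL and dateNow.
urlError = pd.DataFrame({'url': ['https://example.com/some-product'],
                         'date': ['2023-01-01']})

# Create the log directory if it is missing, then append without header/index,
# exactly as in the original snippet.
os.makedirs('errorUrl', exist_ok=True)
urlError.to_csv('errorUrl/errorUrl.csv', mode='a', index=False, header=False)
```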