Beautiful Soup unable to scrape website contents
Hi, I want to do some simple web scraping on this site: https://www.sayurbox.com/p/Swallow%20Tepung%20Agar%20Agar%20Tinggi%20Serat%207%20gram
My code looks like this:
from fake_useragent import UserAgent
import requests
import pandas as pd
from bs4 import BeautifulSoup

def userAgent(URL):
    ua = UserAgent()
    USER_AGENT = ua.random
    headers = {"User-Agent": str(USER_AGENT), "Accept-Encoding": "*", "Connection": "keep-alive"}
    resp = requests.get(URL, headers=headers)
    if resp.status_code == 200:
        soup = BeautifulSoup(resp.content, "html.parser")
        print(f'{URL}')
    else:
        print(f'error {resp.status_code}: {URL}')
        urlError = pd.DataFrame({'url': [URL],
                                 'date': [dateNow]})
        urlError.to_csv('errorUrl/errorUrl.csv', mode='a', index=False, header=False)
    return soup

soup = userAgent(url)
productTitle = soup.find_all('div', {"class": "InfoProductDetail__shortDesc"})
But it doesn't work. Is there something wrong with my code? I tried adding time.sleep to wait for the page to load, but it still doesn't work. Any help would be appreciated.
Your code is fine, but the URL is dynamic: the data is generated by JavaScript after the page loads, which requests and BeautifulSoup cannot emulate. You need an automation tool such as Selenium. You can run the code below.
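To see why the requests approach fails, remember that BeautifulSoup only parses the HTML the server actually sends. For a JavaScript-rendered page, that response is typically just an empty application shell, so the product markup is simply not there. A minimal, self-contained illustration (the HTML string below is a made-up stand-in for such a shell, not the real response from sayurbox.com):

```python
from bs4 import BeautifulSoup

# A stand-in for what the server returns for a JS-rendered page:
# an empty application root with no product markup in it.
server_html = '<html><body><div id="root"></div></body></html>'

soup = BeautifulSoup(server_html, 'html.parser')

# The product element is created client-side by JavaScript, so it is
# absent from the raw HTML and select_one returns None.
print(soup.select_one('.InfoProductDetail__shortDesc'))  # None
```

This is why no amount of time.sleep on the requests side helps: the delay happens in your script, not in a browser that could execute the page's JavaScript.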
from bs4 import BeautifulSoup
import time
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

url = 'https://www.sayurbox.com/p/Swallow%20Tepung%20Agar%20Agar%20Tinggi%20Serat%207%20gram'

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.maximize_window()
driver.get(url)
time.sleep(5)  # wait for the JavaScript to render; an explicit WebDriverWait would be more robust

soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.close()

title = soup.select_one('.InfoProductDetail__shortDesc').text
price = soup.select_one('span.InfoProductDetail__price').text
print(title)
print(price)
Output:
Swallow Tepung Agar Agar Tinggi Serat 7 gram
7.900
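As a side note on your userAgent helper: the error branch appends to errorUrl/errorUrl.csv, and pandas raises FileNotFoundError if the errorUrl/ directory does not exist yet. A small guard fixes that (a sketch with a hypothetical example row; in your code the URL and dateNow would come from the surrounding scraper):

```python
import os
import pandas as pd

# Hypothetical example row standing in for the failed URL and dateNow.
urlError = pd.DataFrame({'url': ['https://example.com/some-product'],
                         'date': ['2023-01-01']})

# Create the log directory if it is missing, then append without header/index,
# exactly as in the original snippet.
os.makedirs('errorUrl', exist_ok=True)
urlError.to_csv('errorUrl/errorUrl.csv', mode='a', index=False, header=False)
```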