抓取 Python 脚本 returns None
Scraping Python script returns None
我正在尝试从亚马逊抓取数据,尤其是产品标题,但 运行 只有我的脚本 returns None
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.com/Dell-Inspiron-5570-Touchscreen-Laptop/dp/B07FKRFTYW/ref=sxbs_sxwds-deals?keywords=laptops&pd_rd_i=B07FKRFTYW&pd_rd_r=38a464f1-5fc2-4e1e-91a3-c209f68e2b8c&pd_rd_w=IbLEX&pd_rd_wg=l5Ewu&pf_rd_p=8ea1b18a-72f9-4e02-9dad-007df8eca556&pf_rd_r=SWJJFWF3WM0ZQZGMN8XA&qid=1562328911&s=computers-intl-ship&smid=A19N59FKNWHX7C'
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/75.0.3770.100 Safari/537.36' }
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find(id="productTitle")
print(title)
预期结果应该是 div 包含产品标题,但 None 是输出
更改解析器:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.com/Dell-Inspiron-5570-Touchscreen-Laptop/dp/B07FKRFTYW/ref=sxbs_sxwds-deals?keywords=laptops&pd_rd_i=B07FKRFTYW&pd_rd_r=38a464f1-5fc2-4e1e-91a3-c209f68e2b8c&pd_rd_w=IbLEX&pd_rd_wg=l5Ewu&pf_rd_p=8ea1b18a-72f9-4e02-9dad-007df8eca556&pf_rd_r=SWJJFWF3WM0ZQZGMN8XA&qid=1562328911&s=computers-intl-ship&smid=A19N59FKNWHX7C'
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/75.0.3770.100 Safari/537.36' }
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'lxml')
title = soup.find(id="productTitle")
print(title.text)
您还可以从其中一个元标记的 content
属性中提取
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.com/Dell-Inspiron-5570-Touchscreen-Laptop/dp/B07FKRFTYW/ref=sxbs_sxwds-deals?keywords=laptops&pd_rd_i=B07FKRFTYW&pd_rd_r=38a464f1-5fc2-4e1e-91a3-c209f68e2b8c&pd_rd_w=IbLEX&pd_rd_wg=l5Ewu&pf_rd_p=8ea1b18a-72f9-4e02-9dad-007df8eca556&pf_rd_r=SWJJFWF3WM0ZQZGMN8XA&qid=1562328911&s=computers-intl-ship&smid=A19N59FKNWHX7C'
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/75.0.3770.100 Safari/537.36' }
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.select_one('[name=description]')['content']
print(title)
您应该先安装 lxml
(如果您还没有),您可以使用以下 pip
命令来安装:
pip install lxml
安装后替换为:
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find(id="productTitle")
print(title)
与:
soup = BeautifulSoup(page.content, 'lxml')
title = soup.find(id = "productTitle")
print(title.getText().strip())
希望这对您有所帮助
我无法发表评论,但我想在@Fozoro 所说的内容上留言,以防将来有人遇到我遇到的同样问题。成功执行 pip install lxml 运行 但是当我尝试将它用作我的应用程序的解析器时,它仍然给我关于找不到请求的功能的错误。然而,做:
python3 -m pip install lxml
让我可以使用 lxml 解析器。
我正在尝试从亚马逊抓取数据,尤其是产品标题,但 运行 只有我的脚本 returns None
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.com/Dell-Inspiron-5570-Touchscreen-Laptop/dp/B07FKRFTYW/ref=sxbs_sxwds-deals?keywords=laptops&pd_rd_i=B07FKRFTYW&pd_rd_r=38a464f1-5fc2-4e1e-91a3-c209f68e2b8c&pd_rd_w=IbLEX&pd_rd_wg=l5Ewu&pf_rd_p=8ea1b18a-72f9-4e02-9dad-007df8eca556&pf_rd_r=SWJJFWF3WM0ZQZGMN8XA&qid=1562328911&s=computers-intl-ship&smid=A19N59FKNWHX7C'
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/75.0.3770.100 Safari/537.36' }
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find(id="productTitle")
print(title)
预期结果应该是 div 包含产品标题,但 None 是输出
更改解析器:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.com/Dell-Inspiron-5570-Touchscreen-Laptop/dp/B07FKRFTYW/ref=sxbs_sxwds-deals?keywords=laptops&pd_rd_i=B07FKRFTYW&pd_rd_r=38a464f1-5fc2-4e1e-91a3-c209f68e2b8c&pd_rd_w=IbLEX&pd_rd_wg=l5Ewu&pf_rd_p=8ea1b18a-72f9-4e02-9dad-007df8eca556&pf_rd_r=SWJJFWF3WM0ZQZGMN8XA&qid=1562328911&s=computers-intl-ship&smid=A19N59FKNWHX7C'
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/75.0.3770.100 Safari/537.36' }
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'lxml')
title = soup.find(id="productTitle")
print(title.text)
您还可以从其中一个元标记的 content
属性中提取
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.com/Dell-Inspiron-5570-Touchscreen-Laptop/dp/B07FKRFTYW/ref=sxbs_sxwds-deals?keywords=laptops&pd_rd_i=B07FKRFTYW&pd_rd_r=38a464f1-5fc2-4e1e-91a3-c209f68e2b8c&pd_rd_w=IbLEX&pd_rd_wg=l5Ewu&pf_rd_p=8ea1b18a-72f9-4e02-9dad-007df8eca556&pf_rd_r=SWJJFWF3WM0ZQZGMN8XA&qid=1562328911&s=computers-intl-ship&smid=A19N59FKNWHX7C'
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/75.0.3770.100 Safari/537.36' }
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.select_one('[name=description]')['content']
print(title)
您应该先安装 lxml
(如果您还没有),您可以使用以下 pip
命令来安装:
pip install lxml
安装后替换为:
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find(id="productTitle")
print(title)
与:
soup = BeautifulSoup(page.content, 'lxml')
title = soup.find(id = "productTitle")
print(title.getText().strip())
希望这对您有所帮助
我无法发表评论,但我想在@Fozoro 所说的内容上留言,以防将来有人遇到我遇到的同样问题。成功执行 pip install lxml 运行 但是当我尝试将它用作我的应用程序的解析器时,它仍然给我关于找不到请求的功能的错误。然而,做:
python3 -m pip install lxml
让我可以使用 lxml 解析器。