CNN news web scraper returns an empty list [] with no information
So far I have written this code:
from urllib import request
from bs4 import BeautifulSoup
import requests
import csv
import re
serch_term = input('What News are you looking for today? ')
url = f'https://edition.cnn.com/search?q={serch_term}'
page = requests.get(url).text
doc = BeautifulSoup(page, "html.parser")
page_text = doc.find_all('<h3 class="cnn-search__result-headline">')
print(page_text)
But when I print page_text, I get an empty list []. Can anyone help me?
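There are several issues:
The content is provided dynamically by JavaScript, so you cannot get it with requests.
We do not know your search term, so there may simply be no results for it.
BeautifulSoup cannot use <h3 class="cnn-search__result-headline"> as a selector. find_all expects a tag name (plus optional attribute filters), not raw markup.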
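To illustrate the last point only, here is a minimal sketch that runs against a small hard-coded HTML snippet instead of the live site. The snippet and its two headlines are made up for the demonstration (an assumption, not real CNN output); the class name is the one from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for an already-rendered results page.
html = """
<div>
  <h3 class="cnn-search__result-headline"><a href="//www.cnn.com/a/index.html">First headline</a></h3>
  <h3 class="cnn-search__result-headline"><a href="//www.cnn.com/b/index.html">Second headline</a></h3>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Raw markup as the argument: BeautifulSoup looks for a tag whose *name*
# is that whole string, finds none, and returns an empty ResultSet.
by_markup = soup.find_all('<h3 class="cnn-search__result-headline">')

# Tag name plus class filter, or the equivalent CSS selector, both match.
by_name_and_class = soup.find_all("h3", class_="cnn-search__result-headline")
by_css = soup.select("h3.cnn-search__result-headline")

print(len(by_markup), len(by_name_and_class), len(by_css))
```

Even with the selector fixed, the original requests call would still come back empty on the real page, because of the JavaScript issue listed first.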
How to fix it? Use selenium, which works like a browser, also renders the JavaScript, and gives you the page_source as expected.
Example
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
service = Service(executable_path='YOUR PATH TO CHROMEDRIVER')
driver = webdriver.Chrome(service=service)
driver.get('https://edition.cnn.com/search?q=python')
soup = BeautifulSoup(driver.page_source, 'html.parser')
soup.select('h3.cnn-search__result-headline')
Output
[<h3 class="cnn-search__result-headline">
<a href="//www.cnn.com/travel/article/airasia-malaysia-snake-plane-rerouted-intl-hnk/index.html">AirAsia flight in Malaysia rerouted after snake found on board plane</a>
</h3>,
<h3 class="cnn-search__result-headline">
<a href="//www.cnn.com/2021/11/19/cnn-underscored/athleta-gift-shop-holiday/index.html">With gift options under plus splurge-worthy seasonal staples, Athleta's Gift Shop is a holiday shopping haven</a></h3>,...]
To get the headline text and the href value, iterate over the ResultSet, read .text on each element, and use ['href'] on the <a> it contains.
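The iteration described above could look roughly like the sketch below. To keep it runnable without a browser, it parses a hard-coded string in place of driver.page_source; the markup is copied from the first entry of the output shown earlier. Resolving the scheme-relative href with urljoin against https://edition.cnn.com is an extra step assumed here for convenience, not part of the original answer.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Stand-in for driver.page_source (taken from the sample output above).
page_source = """
<h3 class="cnn-search__result-headline">
<a href="//www.cnn.com/travel/article/airasia-malaysia-snake-plane-rerouted-intl-hnk/index.html">AirAsia flight in Malaysia rerouted after snake found on board plane</a>
</h3>
"""
soup = BeautifulSoup(page_source, "html.parser")

results = []
for h3 in soup.select("h3.cnn-search__result-headline"):
    link = h3.a  # first <a> inside the headline, or None
    if link is None or not link.has_attr("href"):
        continue  # skip headlines without a usable link
    title = h3.get_text(strip=True)
    # The href values start with "//", so borrow the scheme from a base URL.
    url = urljoin("https://edition.cnn.com", link["href"])
    results.append((title, url))

for title, url in results:
    print(title, url)
```

With selenium in place, passing driver.page_source to BeautifulSoup instead of the hard-coded string should be the only change needed.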