Error when web scraping news from CNN using Selenium and bs4 to get links and titles from articles

Question

I wrote this code to scrape news on a specific topic from CNN:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

serch_term = input('What News are you looking for today? ')

service = Service(executable_path='chromedriver.exe')
driver = webdriver.Chrome(service=service)
driver.get(f'https://edition.cnn.com/search?q={serch_term}')

soup = BeautifulSoup(driver.page_source,'html.parser' )
soup.select('h3.cnn-search__result-headline')

But it does not work. After Chrome opens with the CNN site, I get these error messages:

DevTools listening on ws://127.0.0.1:65095/devtools/browser/05c3af16-cb5a-423c-af0b-c6cc96af980d
[11496:15920:0314/183947.010:ERROR:ssl_client_socket_impl.cc(995)] handshake failed; returned -1, SSL error code 1, net_error -200
PS C:\Users\user\Desktop\Informatik\Praktik\Projekte\Python\stiil_working_on\news_automation> [3408:22012:0314/183950.356:ERROR:device_event_log_impl.cc(214)] [18:39:50.360] USB: usb_device_handle_win.cc:1049 Failed to read descriptor from node connection: Ein an das System angeschlossenes Gerät funktioniert nicht. (0x1F)
[3408:22012:0314/183950.356:ERROR:device_event_log_impl.cc(214)] [18:39:50.362] USB: usb_device_handle_win.cc:1049 Failed to read descriptor from node connection: Ein an das System angeschlossenes Gerät funktioniert nicht. (0x1F)
[11496:15920:0314/183953.096:ERROR:ssl_client_socket_impl.cc(995)] handshake failed; returned -1, SSL error code 1, net_error -200
[15208:11512:0314/184146.206:ERROR:gpu_init.cc(440)] Passthrough is not supported, GL is disabled, ANGLE is

Answer 1

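With the search term read through input(), the search results were not found and errors were raised, but a search with a hard-coded term works fine. Just run the code below.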

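from bs4 import BeautifulSoup
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

serch_term = 'News'

url = f'https://edition.cnn.com/search?q={serch_term}'
print(url)

# Selenium 4 takes the driver path through a Service object
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)
driver.maximize_window()

driver.get(url)
time.sleep(4)  # give the search results time to load

soup = BeautifulSoup(driver.page_source, 'html.parser')
#driver.close()

# each result headline is an <a> directly inside the <h3>
for a in soup.select('h3.cnn-search__result-headline > a'):
    title = a.text
    href = a.get('href')
    # the hrefs start with '//', so prepend the scheme
    abs_url = 'https:' + href
    print(abs_url)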

Output:

https://www.cnn.com/europe/live-news/ukraine-russia-putin-news-03-14-22/index.html
https://www.cnn.com/2022/03/14/energy/india-russia-oil/index.html
https://www.cnn.com/2022/03/14/us/new-york-city-washington-dc-homeless-shootings/index.html
https://www.cnn.com/2022/03/14/politics/breonna-taylor-mother-federal-charges-officers/index.html
https://www.cnn.com/2022/03/14/politics/biden-possible-european-trip/index.html
https://www.cnn.com/2022/03/07/world/what-we-know-brittney-griner-arrest-russia/index.html
https://www.cnn.com/2022/03/14/middleeast/mideast-summary-03-14-2022-intl/index.html
https://www.cnn.com/2022/03/14/energy/oil-prices/index.html
https://www.cnn.com/2022/03/14/tech/pete-davidson-blue-origin-launch-scn/index.html
https://www.cnn.com/2022/03/14/politics/donald-trump-south-carolina-speech/index.html
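
As a hedged variation on the code above, the sketch below percent-encodes the search term (a term typed at an input() prompt may contain spaces or `&`), waits for the result links explicitly instead of sleeping for a fixed time, and returns the titles together with the links. The helper names `build_search_url`, `absolutize`, and `scrape_headlines` are made up for this illustration, and the sketch assumes Selenium 4.6 or newer (so that Selenium Manager can locate a driver) and that CNN still uses the `h3.cnn-search__result-headline` markup from the answer.

```python
from urllib.parse import urlencode, urljoin

SEARCH_URL = "https://edition.cnn.com/search"          # URL from the question
HEADLINE_LINKS = "h3.cnn-search__result-headline > a"  # selector from the answer


def build_search_url(term: str) -> str:
    """Percent-encode the term so spaces and '&' stay inside the q parameter."""
    return f"{SEARCH_URL}?{urlencode({'q': term})}"


def absolutize(href: str, base: str = SEARCH_URL) -> str:
    """Resolve protocol-relative ('//host/...') or relative hrefs against base."""
    return urljoin(base, href)


def scrape_headlines(term: str, timeout: float = 10.0) -> list:
    """Return (title, absolute_url) pairs for one search results page.

    Hypothetical helper: needs Chrome, selenium>=4.6 and beautifulsoup4.
    """
    # Third-party imports are local so the two helpers above work without them.
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumes Selenium Manager finds a driver
    try:
        driver.get(build_search_url(term))
        # Block until at least one result link exists (or raise TimeoutException).
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, HEADLINE_LINKS))
        )
        soup = BeautifulSoup(driver.page_source, "html.parser")
    finally:
        driver.quit()  # always close the browser, even if the wait times out

    return [
        (a.get_text(strip=True), absolutize(a["href"]))
        for a in soup.select(HEADLINE_LINKS)
        if a.get("href")
    ]
```

Calling `scrape_headlines('News')` and looping over the returned pairs would print each title next to its link. If the wait times out, the selector most likely no longer matches the page, which is worth checking in the browser's developer tools before changing anything else.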


Tags: python, selenium, beautifulsoup, web-scraping, python-3.x