
Beautiful Soup find_all returns None

I wrote the code below to scrape the paper title of each item.

import requests
from bs4 import BeautifulSoup
headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36',
        'Connection': 'close'
}

url = 'https://www.sciencedirect.com/journal/journal-of-econometrics/vol/225/issue/1'
response = requests.get(url=url, headers=headers)
content = response.text

soup = BeautifulSoup(content, 'lxml')

# Method 1: find the outer list first, then each item inside it
books_contents = soup.find('ol', class_='js-article-list article-list-items')
hotel_texts = soup.find_all('li', class_='js-article-list-item article-item u-padding-xs-top u-margin-l-bottom')

# Method 2: find each item directly
hotel_texts = soup.find_all('li', class_='js-article-list-item article-item u-padding-xs-top u-margin-l-bottom')

I tried two approaches: Method 1 first finds the outer container and then each element inside it; Method 2 finds each element directly. I can see every element in the page source, but both return [].

The site is protected by Cloudflare, so I used cloudscraper instead of requests. Here is a working example.
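When requests is blocked, the body served is typically a challenge page rather than the article list, so every find_all on it returns []. A small offline check makes this visible (a sketch; the container class name is taken from the question's code, and `html.parser` is used so no lxml install is needed):

```python
from bs4 import BeautifulSoup

def has_article_list(html: str) -> bool:
    """Return True if the article-list container is present in the given HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    return soup.find('ol', class_='js-article-list article-list-items') is not None

# Markup that actually contains the container is found...
print(has_article_list('<ol class="js-article-list article-list-items"></ol>'))  # True
# ...but a challenge page has no such element, hence the empty result.
print(has_article_list('<html><title>Just a moment...</title></html>'))  # False
```

Printing `response.text` (or running it through a check like this) before parsing is a quick way to tell a wrong selector apart from blocked content.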

from bs4 import BeautifulSoup
import cloudscraper

scraper = cloudscraper.create_scraper(delay=10, browser={'custom': 'ScraperBot/1.0'})

url = 'https://www.sciencedirect.com/journal/journal-of-econometrics/vol/225/issue/1'
response = scraper.get(url)
#print(response)
content = response.text

soup = BeautifulSoup(content, 'lxml')


# Each article entry sits in its own <dl class="js-article article-content"> block
hotel_texts = soup.find_all('dl', class_='js-article article-content')
for txt in hotel_texts:
    h3 = txt.select_one('.anchor-text').get_text()
    print(h3)

Output:

Editorial Board
Editorial for Special Issue: Vector Autoregressions
Detecting groups in large vector autoregressions
Identification of structural vector autoregressions through higher unconditional moments  
Using time-varying volatility for identification in Vector Autoregressions: An application to endogenous uncertainty
Inference in Structural Vector Autoregressions identified with an external instrument     
Inference in Bayesian Proxy-SVARs
Impulse response analysis for structural dynamic models with nonlinear regressors
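The extraction step itself can also be verified offline against a minimal fragment shaped like one article entry (the markup below is an assumption about the page structure, not a copy of it, and `html.parser` stands in for lxml):

```python
from bs4 import BeautifulSoup

# A minimal stand-in for one entry of the article list (assumed structure).
sample = (
    '<dl class="js-article article-content">'
    '<dt><h3><a><span class="anchor-text">Editorial Board</span></a></h3></dt>'
    '</dl>'
)

soup = BeautifulSoup(sample, 'html.parser')
for txt in soup.find_all('dl', class_='js-article article-content'):
    print(txt.select_one('.anchor-text').get_text())  # Editorial Board
```

If this loop prints the title but the live page yields nothing, the problem is the served HTML (blocking), not the selectors.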