AttributeError: 'NoneType' object has no attribute 'find' when scraping an array of URLs

Question

I have the following code:

from bs4 import BeautifulSoup
import requests

root = 'https://br.investing.com'
website = f'{root}/news/latest-news'

result = requests.get(website, headers={"User-Agent": "Mozilla/5.0"})
content = result.text
soup = BeautifulSoup(content, 'lxml')

box = soup.find('section', id='leftColumn')
links = [link['href'] for link in box.find_all('a', href=True)]

for link in links:
  result = requests.get(f'{root}/{link}', headers={"User-Agent": "Mozilla/5.0"})
  content = result.text
  soup = BeautifulSoup(content, 'lxml')

  box = soup.find('section', id='leftColumn')
  title = box.find('h1').get_text()

  with open('headlines.txt', 'w') as file:
    file.write(title)

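I intend to use this code to scrape the news URLs from the site, visit each of those URLs, get its headline (the h1 header), and write the headlines to a text file. With this code I only get one headline in the file, and I receive AttributeError: 'NoneType' object has no attribute 'find'. What can be done about this?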

Answer 1

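In your for loop, at this line: title = box.find('h1').get_text(), box is None (that is, of type NoneType). That is why you are told that a 'NoneType' object has no attribute 'find'.

This is probably because, at some point in the loop, this line: box = soup.find('section', id='leftColumn') returns None. soup.find returns None when no matching element exists on the page.

If box is None, the next line raises the error.

You can fix this by checking that box is not None before calling find on it. So this: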

box = soup.find('section', id='leftColumn')
title = box.find('h1').get_text()

would change to:

box = soup.find('section', id='leftColumn')
if box is not None:
    title = box.find('h1').get_text()
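
Note that with this check alone, title keeps its value from the previous iteration (or is undefined on the first one) whenever box is None, and box.find('h1') can itself return None. Below is a minimal sketch of a helper that covers both cases, assuming the same page structure as in the question. extract_title is a hypothetical name, not part of BeautifulSoup:

```python
def extract_title(soup):
    """Return the headline text, or None if the page lacks the expected markup.

    `soup` is expected to be a BeautifulSoup object; `extract_title` is a
    hypothetical helper name used only for this sketch.
    """
    box = soup.find('section', id='leftColumn')
    if box is None:
        # The page has no <section id="leftColumn">, e.g. a non-article link.
        return None
    h1 = box.find('h1')
    if h1 is None:
        # The section exists but contains no <h1>.
        return None
    return h1.get_text(strip=True)
```

Inside the loop you would then call title = extract_title(soup) and skip the page with continue when the result is None, so nothing stale gets written to the file.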

Edit:

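The reason you only see one headline is that you open the file in 'w' mode here: with open('headlines.txt', 'w')

'w' mode overwrites the file every time it is opened, and here it is opened once per loop iteration. I can't read the content, but my guess is that the output is the last headline.

Fix: replace 'w' with 'a'. That appends each title to the existing file contents. You can read about it here: https://www.w3schools.com/python/python_file_write.asp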
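
Keep in mind that 'a' also keeps adding to headlines.txt across separate runs of the script, and write() does not add a line break on its own. An alternative sketch, assuming the titles are collected into a list first, is to open the file once before looping and write one title per line. save_headlines is a hypothetical name used only for illustration:

```python
def save_headlines(titles, path='headlines.txt'):
    """Write one headline per line (hypothetical helper for this sketch)."""
    # Opening the file once, outside the loop, means 'w' truncates it only
    # at the start of the run instead of on every iteration.
    with open(path, 'w', encoding='utf-8') as file:
        for title in titles:
            # write() adds no line break by itself, so append one explicitly.
            file.write(title + '\n')
```

Which variant fits better depends on whether you want each run to start from an empty file ('w', opened once) or to accumulate headlines over time ('a').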

html

python

beautifulsoup

web-scraping