Error: TypeError: must be str, not NoneType while Scrapping list Links from website using BeautifulSoup
I want to scrape https://ens.dk/en/our-services/oil-and-gas-related-data/monthly-and-yearly-production.
On that page there are 2 groups of links: SI units and Oil Field units.
I am trying to scrape the list of SI units links, using a function named get_gas_links:
import io
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs, SoupStrainer
import re

url = "https://ens.dk/en/our-services/oil-and-gas-related-data/monthly-and-yearly-production"
first_page = requests.get(url)
soup = bs(first_page.content)

def pasrse_page(link):
    print(link)
    df = pd.read_html(link, skiprows=1, headers=1)
    return df

def get_gas_links():
    glinks = []
    gas_links = soup.find_all("a", href=re.compile("si.htm"))
    for i in gas_links:
        glinks.append("https://ens.dk/" + i.get("herf"))
    return glinks

get_gas_links()
The main motive is to scrape 3 tables from every link, but before scraping the tables I am trying to scrape the list of links. However, it shows the error: TypeError: must be str, not NoneType
You are using the regex in the wrong way, which is why soup does not find the links you expect. The TypeError itself comes from i.get("herf"): "herf" is a typo for "href", so .get() returns None, and "https://ens.dk/" + None fails.
You can use the code below and validate extracted_link however you need.
def get_gas_links():
    glinks = []
    gas_links = soup.find('table').find_all('a')
    for i in gas_links:
        extracted_link = i['href']
        # you can validate the extracted link however you want
        glinks.append("https://ens.dk/" + extracted_link)
    return glinks
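To see the failure mode without hitting the live site, here is a self-contained sketch using an inline HTML snippet (the file names in it are made up, not the real ones on ens.dk). It shows why `tag.get("herf")` silently returns `None`, which is what blows up the string concatenation:

```python
from bs4 import BeautifulSoup

# Minimal stand-in for the real page: one table with two links.
html = """
<table>
  <tr><td><a href="/sites/ens.dk/files/si_prod.htm">SI units</a></td></tr>
  <tr><td><a href="/sites/ens.dk/files/field_prod.htm">Oil Field units</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
links = soup.find("table").find_all("a")

# .get() with a misspelled attribute name returns None instead of raising...
print(links[0].get("herf"))  # None

# ...and "https://ens.dk/" + None is exactly the reported
# "TypeError: must be str, not NoneType".

# With the correct spelling the concatenation works:
glinks = ["https://ens.dk" + a.get("href") for a in links]
print(glinks)
```

Using `a["href"]` instead of `a.get("href")` would also work and would raise a `KeyError` on a missing attribute rather than returning `None`, which surfaces typos earlier.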
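For the next step, reading the 3 tables per link: note that pd.read_html takes header= (singular); the headers=1 keyword in pasrse_page above is not a valid argument. A minimal sketch on a local HTML string (the column names here are invented, the real tables on ens.dk will differ):

```python
import io
import pandas as pd

# Hypothetical stand-in for one of the production pages.
html = """
<table>
  <tr><th>Year</th><th>Gas production</th></tr>
  <tr><td>2019</td><td>3.2</td></tr>
  <tr><td>2020</td><td>2.9</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> found.
# header=0 treats the first row as column names.
dfs = pd.read_html(io.StringIO(html), header=0)
df = dfs[0]
print(df)
```

For the real pages you would pass each URL from get_gas_links() instead of the inline string, and keep skiprows= only if the live tables actually have extra rows to skip.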