Removing '#' elements from hrefs in Python

Question

I want to remove some href elements from the output of the following code. When I run it, it returns results, but it does not remove '#' and '#contents' from the list of URLs.

from bs4 import BeautifulSoup
import requests

url = 'https://www.census.gov/programs-surveys/popest.html'
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'html.parser')
links_with_text = []

for a in soup.find_all('a', href=True): 
      if a.text: 
          links_with_text.append(a['href'])
      elif a.text:
          links_with_text.decompose(a['#content','#'])

print(links_with_text)

Answer 1

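You can use str.startswith either to blacklist any link that starts with "#" or to whitelist any link that starts with "http" or "https". Since your data also contains hrefs like "/", I would use the second option.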

import requests
from bs4 import BeautifulSoup

url = 'https://www.census.gov/programs-surveys/popest.html'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
links_with_text = []

for a in soup.find_all('a', href=True):
    # 'http' is also a prefix of 'https', so one check covers both schemes
    if a.text and a['href'].startswith('http'):
        links_with_text.append(a['href'])

print(links_with_text)

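Note that list has no decompose method, so links_with_text.decompose(...) would raise an AttributeError if it ever ran. It never does, though: the elif tests the same condition as the if, so that branch is unreachable.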
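
To make the difference between the two options concrete, here is a minimal sketch that applies both filters to a hand-written list of hrefs. The sample list and the helper names drop_fragments and keep_absolute are made up for illustration; they are not taken from the census.gov page.

```python
# Hypothetical sample of href values, similar in shape to what the page returns
hrefs = ['#', '#content', '/', 'https://www.census.gov/data.html', 'http://example.com/']

def drop_fragments(hrefs):
    """Blacklist: remove every href that starts with '#'."""
    return [h for h in hrefs if not h.startswith('#')]

def keep_absolute(hrefs):
    """Whitelist: keep only hrefs that start with http:// or https://."""
    # str.startswith accepts a tuple of prefixes
    return [h for h in hrefs if h.startswith(('http://', 'https://'))]

print(drop_fragments(hrefs))  # the relative link '/' survives the blacklist
print(keep_absolute(hrefs))   # only the absolute URLs remain
```

The blacklist keeps relative links such as "/", which is why the whitelist is the better fit if only full URLs are wanted.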

Answer 2

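If you only want http/https links, use the built-in CSS selector support: an attribute selector on href combined with the starts-with operator (^=) does the filtering for you. If it is installed, 'lxml' is also a faster parser.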

import requests
from bs4 import BeautifulSoup

url = 'https://www.census.gov/programs-surveys/popest.html'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
links = [i['href'] for i in soup.select('[href^=http]')]
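
The selector can be tried without a network request. The sketch below runs it against a small inline HTML snippet, which is invented for illustration, and uses 'html.parser' so that lxml is not required. It also shows a :not() variant that drops only the '#' links, assuming a Beautiful Soup version with select() backed by soupsieve (4.7 or later).

```python
from bs4 import BeautifulSoup

# Made-up snippet standing in for the real page
html = """
<a href="#">top</a>
<a href="#content">skip to content</a>
<a href="/">home</a>
<a href="https://www.census.gov/data.html">data</a>
<a href="http://example.com/">example</a>
"""

soup = BeautifulSoup(html, 'html.parser')

# Whitelist: anchors whose href starts with "http" (matches https as well)
absolute = [a['href'] for a in soup.select('a[href^="http"]')]

# Blacklist: anchors with an href that does not start with "#"
no_fragments = [a['href'] for a in soup.select('a[href]:not([href^="#"])')]

print(absolute)
print(no_fragments)
```

Restricting the selector to a elements avoids picking up href attributes on other tags such as link.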
