Problem with HTML tags and classes in a small, simple scrape with BeautifulSoup
I'm new to this and struggling to get BeautifulSoup working. I'm having trouble with the HTML when retrieving classes and tags. I got close, but something is wrong: I'm using the wrong tags and classes to scrape the title, time, link and text of a news item.
I want to scrape all the headlines in the vertical list, and then the date, title, link and content of each one.
Can you help me find the correct HTML classes and tags?
I don't get any errors, but the Python console stays empty:
>>>
Code
import requests
from bs4 import BeautifulSoup

site = requests.get('url')
beautify = BeautifulSoup(site.content, 'html5lib')
news = beautify.find_all('div', {'class', '[=14=]'})
arti = []
for each in news:
    time = each.find('span', {'class', 'hh serif'}).text
    title = each.find('span', {'class', 'title'}).text
    link = each.a.get('href')
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html5lib')
    content = soup.find('div', class_="read__content").text.strip()
    print(" ")
    print(time)
    print(title)
    print(link)
    print(" ")
    print(content)
    print(" ")
Here is a solution you can try,
import requests
from bs4 import BeautifulSoup

# mock browser request
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
}

site = requests.get('https://www.tuttomercatoweb.com/atalanta/', headers=headers)
soup = BeautifulSoup(site.content, 'html.parser')
news = soup.find_all('div', attrs={"class": "tcc-list-news"})
for each in news:
    for div in each.find_all("div"):
        print("-- Time ", div.find('span', attrs={'class': 'hh serif'}).text)
        print("-- Href ", div.find("a")['href'])
        print("-- Text ", " ".join([span.text for span in div.select("a > span")]))
        print("-" * 30)  # separator between news items
-- Time 11:36
-- Href https://www.tuttomercatoweb.com/atalanta/?action=read&idtmw=1661241
-- Text focus Serie A, punti nel 2022: Juve prima, ma un solo punto in più rispetto a Milan e Napoli
------------------------------
-- Time 11:24
-- Href https://www.tuttomercatoweb.com/atalanta/?action=read&idtmw=1661233
-- Text focus Serie A, chi più in forma? Le ultime 5 gare: Sassuolo e Juve in vetta, crisi Venezia
------------------------------
-- Time 11:15
-- Href https://www.tuttomercatoweb.com/atalanta/?action=read&idtmw=1661229
-- Text Le pagelle di Cissé: come nelle migliori favole. Dalla seconda categoria al gol in serie A
------------------------------
...
...
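To also get the article content the question asks for, each extracted href can be fetched with a second request (reusing the same headers) and parsed the same way. Below is a minimal sketch of just the parsing step, run on an inline HTML sample instead of a live request; the read__content class name is taken from the question's code and may differ on the real article pages:

from bs4 import BeautifulSoup

def extract_article(html):
    # Parse an article page and return the body text, or None if
    # no matching div is found (avoids an AttributeError on .text)
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find("div", class_="read__content")
    return body.get_text(strip=True) if body else None

# Inline sample standing in for requests.get(link, headers=headers).text
sample = """
<html><body>
  <div class="read__content"><p>Article text here.</p></div>
</body></html>
"""
print(extract_article(sample))

Note the None check: if the class name is wrong, find() returns None, and calling .text on it raises AttributeError instead of printing nothing.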
Edit:
Why are the headers needed here? Many sites reject requests that carry the default python-requests User-Agent, so the request pretends to come from a browser; see: How to use Python requests to fake a browser visit a.k.a and generate User Agent?