Problem with HTML tags and classes in a small, simple scrape with BeautifulSoup
I'm new to this and struggling to get BeautifulSoup working. I'm having trouble with the HTML when retrieving classes and tags. I got close, but something is wrong: I'm using the wrong tags and classes to scrape the title, time, link and text of a news item.
I want to scrape all the headlines in the vertical list, and then the date, title, link and content of each one.
Can you help me find the correct HTML classes and tags?
I don't get any errors, but the Python console stays empty:
>>>
Code
import requests
from bs4 import BeautifulSoup

site = requests.get('url')
beautify = BeautifulSoup(site.content, 'html5lib')
news = beautify.find_all('div', {'class', '[=14=]'})
arti = []
for each in news:
    time = each.find('span', {'class', 'hh serif'}).text
    title = each.find('span', {'class', 'title'}).text
    link = each.a.get('href')
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html5lib')
    content = soup.find('div', class_="read__content").text.strip()
    print(" ")
    print(time)
    print(title)
    print(link)
    print(" ")
    print(content)
    print(" ")
Here is a solution you can try,
import requests
from bs4 import BeautifulSoup

# mock browser request
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
}

site = requests.get('https://www.tuttomercatoweb.com/atalanta/', headers=headers)
soup = BeautifulSoup(site.content, 'html.parser')
news = soup.find_all('div', attrs={"class": "tcc-list-news"})
for each in news:
    for div in each.find_all("div"):
        print("-- Time ", div.find('span', attrs={'class': 'hh serif'}).text)
        print("-- Href ", div.find("a")['href'])
        print("-- Text ", " ".join([span.text for span in div.select("a > span")]))
        print("-" * 30)  # separator between news items
-- Time 11:36
-- Href https://www.tuttomercatoweb.com/atalanta/?action=read&idtmw=1661241
-- Text focus Serie A, punti nel 2022: Juve prima, ma un solo punto in più rispetto a Milan e Napoli
------------------------------
-- Time 11:24
-- Href https://www.tuttomercatoweb.com/atalanta/?action=read&idtmw=1661233
-- Text focus Serie A, chi più in forma? Le ultime 5 gare: Sassuolo e Juve in vetta, crisi Venezia
------------------------------
-- Time 11:15
-- Href https://www.tuttomercatoweb.com/atalanta/?action=read&idtmw=1661229
-- Text Le pagelle di Cissé: come nelle migliori favole. Dalla seconda categoria al gol in serie A
------------------------------
...
...
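To also get the article content the question asks for, each extracted href can be fetched with a second request (reusing the same headers) and parsed the same way. Below is a minimal sketch of just the parsing step, run on an inline HTML sample instead of a live request; the read__content class name is taken from the question's code and may differ on the real article pages:

from bs4 import BeautifulSoup

def extract_article(html):
    # Parse an article page and return the body text, or None if
    # no matching div is found (avoids an AttributeError on .text)
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find("div", class_="read__content")
    return body.get_text(strip=True) if body else None

# Inline sample standing in for requests.get(link, headers=headers).text
sample = """
<html><body>
  <div class="read__content"><p>Article text here.</p></div>
</body></html>
"""
print(extract_article(sample))

Note the None check: if the class name is wrong, find() returns None, and calling .text on it raises AttributeError instead of printing nothing.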
Edit:
Why are the headers needed here? Many sites reject requests that carry the default python-requests User-Agent, so the request pretends to come from a browser; see: How to use Python requests to fake a browser visit a.k.a and generate User Agent?