How to scrape a page with BeautifulSoup and Python?
I'm trying to extract information from the BBC Good Food website, but I'm having some trouble narrowing down the data I collect.
Here is what I have so far:
from bs4 import BeautifulSoup
import requests

webpage = requests.get('http://www.bbcgoodfood.com/search/recipes?query=tomato')
soup = BeautifulSoup(webpage.content, 'html.parser')
links = soup.find_all("a")
for anchor in links:
    print(anchor.get('href'), anchor.text)
This returns all the links on the page in question plus each link's text description, but I only want the links that come from the 'article' objects on the page. These are the links to specific recipes.
Through some experimentation I've managed to return the text from the articles, but I can't seem to extract the links.
The only two things I can see associated with the article tag are href and img.src:
from bs4 import BeautifulSoup
import requests

webpage = requests.get('http://www.bbcgoodfood.com/search/recipes?query=tomato')
soup = BeautifulSoup(webpage.content, 'html.parser')
links = soup.find_all("article")
for ele in links:
    print(ele.a["href"])
    print(ele.img["src"])
The links are in the "node-title" class:
from bs4 import BeautifulSoup
import requests

webpage = requests.get('http://www.bbcgoodfood.com/search/recipes?query=tomato')
soup = BeautifulSoup(webpage.content, 'html.parser')
links = soup.find("div", {"class": "main row grid-padding"}).find_all("h2", {"class": "node-title"})
for l in links:
    print(l.a["href"])
/recipes/681646/tomato-tart
/recipes/4468/stuffed-tomatoes
/recipes/1641/charred-tomatoes
/recipes/tomato-confit
/recipes/1575635/roast-tomatoes
/recipes/2536638/tomato-passata
/recipes/2518/cherry-tomatoes
/recipes/681653/stuffed-tomatoes
/recipes/2852676/tomato-sauce
/recipes/2075/tomato-soup
/recipes/339605/tomato-sauce
/recipes/2130/essence-of-tomatoes-
/recipes/2942/tomato-tarts
/recipes/741638/fried-green-tomatoes-with-ripe-tomato-salsa
/recipes/3509/honey-and-thyme-tomatoes
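As an aside, the same extraction can be written as a single CSS selector with select(); this is a minimal sketch assuming the markup above, where the recipe titles are h2 elements with the node-title class inside the div with the main class:

from bs4 import BeautifulSoup
import requests

webpage = requests.get('http://www.bbcgoodfood.com/search/recipes?query=tomato')
soup = BeautifulSoup(webpage.content, 'html.parser')
# select() takes a CSS selector; '.main' matches the div whose classes
# include 'main', and 'h2.node-title a' drills down to each recipe link
for anchor in soup.select("div.main h2.node-title a"):
    print(anchor["href"])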
To visit them, you need to prepend http://www.bbcgoodfood.com:
for l in links:
    print(requests.get("http://www.bbcgoodfood.com{}".format(l.a["href"])).status_code)
200
200
200
200
200
200
200
200
200
200
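Rather than concatenating strings, the standard library's urllib.parse.urljoin is a slightly more robust way to build the absolute URLs, since it also copes with hrefs that are already absolute. A small sketch building on the links list from the snippet above:

from urllib.parse import urljoin
import requests

base = "http://www.bbcgoodfood.com"
for l in links:  # 'links' as built in the answer above
    # urljoin handles relative paths like /recipes/2075/tomato-soup
    # and leaves already-absolute URLs untouched
    full_url = urljoin(base, l.a["href"])
    print(full_url, requests.get(full_url).status_code)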
The structure of the BBC Good Food pages has since changed.
I've managed to adapt the code like this; it's not perfect, but it's something to build on:
import requests
from bs4 import BeautifulSoup
import numpy as np

# Create empty list
listofurls = []
pages = np.arange(1, 10, 1)
ingredientlist = ['milk', 'eggs', 'flour']

for ingredient in ingredientlist:
    for page in pages:
        page = requests.get('https://www.bbcgoodfood.com/search/recipes/page/' + str(page) + '/?q=' + ingredient + '&sort=-relevance')
        soup = BeautifulSoup(page.content, 'html.parser')
        for link in soup.find_all(class_="standard-card-new__article-title"):
            listofurls.append("https://www.bbcgoodfood.com" + link.get('href'))
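If you crawl several pages per ingredient like this, it may also be worth reusing a single requests.Session (so the underlying connection is reused) and de-duplicating the collected URLs, since the same recipe can match more than one ingredient. A sketch along those lines, under the same assumed page range and ingredient list:

import requests
from bs4 import BeautifulSoup

session = requests.Session()  # reuses the connection across requests
ingredientlist = ['milk', 'eggs', 'flour']
seen = set()       # avoids storing the same recipe twice
listofurls = []

for ingredient in ingredientlist:
    for page_number in range(1, 10):
        response = session.get(
            'https://www.bbcgoodfood.com/search/recipes/page/'
            + str(page_number) + '/?q=' + ingredient + '&sort=-relevance')
        soup = BeautifulSoup(response.content, 'html.parser')
        for link in soup.find_all(class_="standard-card-new__article-title"):
            url = "https://www.bbcgoodfood.com" + link.get('href')
            if url not in seen:
                seen.add(url)
                listofurls.append(url)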