Crawl multiple pages from a website (BeautifulSoup, Requests, Python 3)
I would like to know how to crawl multiple different pages from one website with Beautiful Soup / requests without having to repeat my code over and over.
Here is my current code, which scrapes the tourist attractions for a set of cities:
from bs4 import BeautifulSoup
import requests

RegionIDArray = [187147, 187323, 186338]
dict = {187147: 'Paris', 187323: 'Berlin', 186338: 'London'}
already_printed = set()

for reg in RegionIDArray:
    for page in range(1, 700, 30):
        r = requests.get("https://www.tripadvisor.de/Attractions-c47-g" + str(reg) + "-oa" + str(page) + ".html")
        soup = BeautifulSoup(r.content)
        g_data = soup.find_all("div", {"class": "element_wrap"})
        for item in g_data:
            header = item.find_all("div", {"class": "property_title"})
            item = (header[0].text.strip())
            if item not in already_printed:
                already_printed.add(item)
                print("POI: " + str(item) + " | " + "Location: " + str(dict[reg]) + " | " + "Art: Museum ")
So far, everything works fine. As a next step, I would like to scrape the most popular museums of these cities in addition to the tourist attractions.
To do that, I have to modify the request by changing the c parameter so that I get all the museums I need:
r = requests.get("https://www.tripadvisor.de/Attractions-c" + str(museumIDArray) +"-g" + str(reg) + "-oa" + str(page) + ".html")
My code would then look like this:
from bs4 import BeautifulSoup
import requests

RegionIDArray = [187147, 187323, 186338]
museumIDArray = [47, 49]
dict = {187147: 'Paris', 187323: 'Berlin', 186338: 'London'}
already_printed = set()

for reg in RegionIDArray:
    for page in range(1, 700, 30):
        r = requests.get("https://www.tripadvisor.de/Attractions-c" + str(museumIDArray) + "-g" + str(reg) + "-oa" + str(page) + ".html")
        soup = BeautifulSoup(r.content)
        g_data = soup.find_all("div", {"class": "element_wrap"})
        for item in g_data:
            header = item.find_all("div", {"class": "property_title"})
            item = (header[0].text.strip())
            if item not in already_printed:
                already_printed.add(item)
                print("POI: " + str(item) + " | " + "Location: " + str(dict[reg]) + " | " + "Art: Museum ")
This does not seem to be entirely correct: the output I get does not include all the museums and tourist attractions for some of the cities.
Can anybody help me with that? Any feedback is appreciated.
All the names are inside anchor tags within the divs with the property_title class.
for reg in RegionIDArray:
    for page in range(1, 700, 30):
        r = requests.get("https://www.tripadvisor.de/Attractions-c" + str(museumIDArray) + "-g" + str(reg) + "-oa" + str(page) + ".html")
        soup = BeautifulSoup(r.content)
        for item in (a.text for a in soup.select("div.property_title a")):
            if item not in already_printed:
                already_printed.add(item)
                print("POI: " + str(item) + " | " + "Location: " + str(dct[reg]) + " | " + "Art: Museum ")
It is better to get the links to the next pages from the pagination div:
from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin

RegionIDArray = [187147, 187323, 186338]
museumIDArray = [47, 49]
dct = {187147: 'Paris', 187323: 'Berlin', 186338: 'London'}
already_printed = set()

def get_names(soup):
    for item in (a.text for a in soup.select("div.property_title a")):
        if item not in already_printed:
            already_printed.add(item)
            print("POI: {} | Location: {} | Art: Museum ".format(item, dct[reg]))

base = "https://www.tripadvisor.de"

for reg in RegionIDArray:
    r = requests.get("https://www.tripadvisor.de/Attractions-c[47,49]-g{}-oa.html".format(reg))
    soup = BeautifulSoup(r.content)
    # get links to all next pages.
    all_pages = (urljoin(base, a["href"]) for a in soup.select("div.unified.pagination a.pageNum.taLnk")[1:])
    # use helper function to print the names.
    get_names(soup)
    # visit all remaining pages.
    for url in all_pages:
        soup = BeautifulSoup(requests.get(url).content)
        get_names(soup)
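If both categories really matter (attractions, c47, and museums, c49), a possible variation is to request each c value separately instead of embedding the list, and still follow the pagination links. This is only a sketch built on the same URL pattern, selectors and helper as above; it assumes each category ID has its own listing page:

from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin

base = "https://www.tripadvisor.de"
RegionIDArray = [187147, 187323, 186338]
museumIDArray = [47, 49]
dct = {187147: 'Paris', 187323: 'Berlin', 186338: 'London'}
already_printed = set()

def get_names(soup, reg):
    # pull every name from the property_title anchors, skipping duplicates
    for item in (a.text.strip() for a in soup.select("div.property_title a")):
        if item not in already_printed:
            already_printed.add(item)
            print("POI: {} | Location: {} | Art: Museum ".format(item, dct[reg]))

for reg in RegionIDArray:
    for cat in museumIDArray:  # one request per category instead of str([47, 49])
        r = requests.get("https://www.tripadvisor.de/Attractions-c{}-g{}-oa.html".format(cat, reg))
        soup = BeautifulSoup(r.content, "html.parser")
        get_names(soup, reg)
        # follow the links from the pagination div, as in the answer above
        all_pages = (urljoin(base, a["href"]) for a in soup.select("div.unified.pagination a.pageNum.taLnk")[1:])
        for url in all_pages:
            get_names(BeautifulSoup(requests.get(url).content, "html.parser"), reg)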