How can I loop scraping data for multiple pages in a website using python and beautifulsoup4
I am trying to scrape data from the PGA.com website to get a table of all of the golf courses in the United States. In my CSV table I want to include the name of the golf course, the address, ownership, website, and phone number. With this data I would like to geocode it, put it into a map, and have a local copy on my computer.
I used Python and Beautiful Soup4 to extract my data. I have gotten as far as extracting the data and importing it into a CSV, but I am now having a problem scraping data from multiple pages on the PGA website. I want to extract ALL of the golf courses, but my script is limited to one page; I want to loop it so that it captures all of the data for golf courses from all of the pages found on the PGA site. There are about 18000 golf courses and 900 pages of data to capture.
Attached below is my script. I need help creating code that will capture all of the data from the PGA website, not just one page but many. This way it will give me all of the data for golf courses in the United States.
Here is my script:
import csv
import requests
from bs4 import BeautifulSoup

url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"
r = requests.get(url)
soup = BeautifulSoup(r.content)

g_data1 = soup.find_all("div", {"class": "views-field-nothing-1"})
g_data2 = soup.find_all("div", {"class": "views-field-nothing"})

courses_list = []

for item in g_data2:
    try:
        name = item.contents[1].find_all("div", {"class": "views-field-title"})[0].text
    except:
        name = ''
    try:
        address1 = item.contents[1].find_all("div", {"class": "views-field-address"})[0].text
    except:
        address1 = ''
    try:
        address2 = item.contents[1].find_all("div", {"class": "views-field-city-state-zip"})[0].text
    except:
        address2 = ''
    try:
        website = item.contents[1].find_all("div", {"class": "views-field-website"})[0].text
    except:
        website = ''
    try:
        Phonenumber = item.contents[1].find_all("div", {"class": "views-field-work-phone"})[0].text
    except:
        Phonenumber = ''

    course = [name, address1, address2, website, Phonenumber]
    courses_list.append(course)

with open('filename5.csv', 'wb') as file:
    writer = csv.writer(file)
    for row in courses_list:
        writer.writerow(row)

# for item in g_data1:
#     try:
#         print item.contents[1].find_all("div", {"class": "views-field-counter"})[0].text
#     except:
#         pass
#     try:
#         print item.contents[1].find_all("div", {"class": "views-field-course-type"})[0].text
#     except:
#         pass

# for item in g_data2:
#     try:
#         print item.contents[1].find_all("div", {"class": "views-field-title"})[0].text
#     except:
#         pass
#     try:
#         print item.contents[1].find_all("div", {"class": "views-field-address"})[0].text
#     except:
#         pass
#     try:
#         print item.contents[1].find_all("div", {"class": "views-field-city-state-zip"})[0].text
#     except:
#         pass
This script only captures 20 results at a time, and I want to capture everything in one script that accounts for all 18000 golf courses and 900 pages to scrape.
The link you put points to a single page; it will not iterate through each page on its own.
Page 1:
url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"
Page 2:
http://www.pga.com/golf-courses/search?page=1&searchbox=Course%20Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0
Page 907:
http://www.pga.com/golf-courses/search?page=906&searchbox=Course%20Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0
Since you are only running through page 1, you only get 20 results. You need to create a loop that runs through every page.
You can start by creating a function that does one page, and then iterate that function.
Right after search? in the url, page=1 appears starting from page 2 and keeps increasing until page=906 for page 907.
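A minimal sketch of that approach, assuming the page parameter simply increments and the markup matches the original script (the helper name scrape_page and the parsing placeholder are mine, not part of the answer):

import requests
from bs4 import BeautifulSoup

BASE = ("http://www.pga.com/golf-courses/search?page={}"
        "&searchbox=Course+Name&searchbox_zip=ZIP&distance=50"
        "&price_range=0&course_type=both&has_events=0")

def scrape_page(page_number):
    # Fetch one results page and return its parsed soup.
    r = requests.get(BASE.format(page_number))
    return BeautifulSoup(r.content, "html.parser")

# Iterate the single-page function over every numbered page (0 through 906).
for page_number in range(907):
    soup = scrape_page(page_number)
    # ... extract the course fields from soup here, as in the original script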
The search on the PGA website has multiple pages, and the url follows this pattern:
http://www.pga.com/golf-courses/search?page=1 # Additional info after page parameter here
This means you can read the content of the page, then change the value of page by 1, read the next page, and so on.
import csv
import requests
from bs4 import BeautifulSoup

for i in range(907):  # Number of pages plus one
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    # Your code for each individual page here
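To make that placeholder concrete, here is one possible way the field extraction from the question could sit inside the loop, collecting every row first and writing the CSV once at the end (a sketch; the helper text_or_blank is hypothetical and not part of this answer):

import csv
import requests
from bs4 import BeautifulSoup

def text_or_blank(item, css_class):
    # Return the stripped text of a child div, or '' when it is missing.
    tag = item.find("div", {"class": css_class})
    return tag.get_text(strip=True) if tag else ""

courses_list = []
for i in range(907):
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    for item in soup.find_all("div", {"class": "views-field-nothing"}):
        courses_list.append([
            text_or_blank(item, "views-field-title"),
            text_or_blank(item, "views-field-address"),
            text_or_blank(item, "views-field-city-state-zip"),
            text_or_blank(item, "views-field-website"),
            text_or_blank(item, "views-field-work-phone"),
        ])

# Write everything once, after all pages have been visited (Python 3 text mode).
with open("filename5.csv", "w", newline="") as f:
    csv.writer(f).writerows(courses_list)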
If you are still reading this post, you can also try this code....
from urllib.request import urlopen
from bs4 import BeautifulSoup

file = "Details.csv"
f = open(file, "w")
Headers = "Name,Address,City,Phone,Website\n"
f.write(Headers)

for page in range(1, 5):
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course%20Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(page)
    html = urlopen(url)
    soup = BeautifulSoup(html, "html.parser")
    Title = soup.find_all("div", {"class": "views-field-nothing"})
    for i in Title:
        try:
            name = i.find("div", {"class": "views-field-title"}).get_text()
            address = i.find("div", {"class": "views-field-address"}).get_text()
            city = i.find("div", {"class": "views-field-city-state-zip"}).get_text()
            phone = i.find("div", {"class": "views-field-work-phone"}).get_text()
            website = i.find("div", {"class": "views-field-website"}).get_text()
            print(name, address, city, phone, website)
            f.write("{}".format(name).replace(",", "|") + ",{}".format(address) + ",{}".format(city).replace(",", " ") + ",{}".format(phone) + ",{}".format(website) + "\n")
        except AttributeError:
            pass  # skip results that are missing one of the fields

f.close()
Where it says range(1,5), just change that to run from 0 to the last page and you will get all of the details in CSV. I tried very hard to get your data in the proper format, but it is hard :).
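If the commas inside the fields are what makes the formatting painful, a variant that leans on csv.writer (already imported in the question) would handle the quoting automatically. This is only a sketch under the same assumptions about the page markup:

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

with open("Details.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Name", "Address", "City", "Phone", "Website"])
    for page in range(1, 5):  # widen the range to cover every page
        url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course%20Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(page)
        soup = BeautifulSoup(urlopen(url), "html.parser")
        for i in soup.find_all("div", {"class": "views-field-nothing"}):
            try:
                writer.writerow([
                    i.find("div", {"class": "views-field-title"}).get_text(strip=True),
                    i.find("div", {"class": "views-field-address"}).get_text(strip=True),
                    i.find("div", {"class": "views-field-city-state-zip"}).get_text(strip=True),
                    i.find("div", {"class": "views-field-work-phone"}).get_text(strip=True),
                    i.find("div", {"class": "views-field-website"}).get_text(strip=True),
                ])
            except AttributeError:
                continue  # skip results that are missing one of the fields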
I noticed that the first solution had a repeat of the first instance; that is because page 0 and page 1 are the same page. This is resolved by specifying the starting page in the range function. Example below...
for i in range(1, 907):  # Number of pages plus one
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html5lib")  # Can use whichever parser you prefer
    # Your code for each individual page here
I ran into the exact same problem, and none of the solutions above worked for me. I solved mine by accounting for cookies. A requests session helps: create a session, and it will pull all of the pages you need by carrying the cookies across all of the numbered pages.
import csv
import requests
from bs4 import BeautifulSoup
url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"
s = requests.Session()
r = s.get(url)
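The snippet stops after the first request; presumably the same session is then reused for every numbered page so the cookies carry over. A minimal sketch of that continuation, borrowing the page range and the parsing placeholder from the earlier answers:

import requests
from bs4 import BeautifulSoup

base = ("http://www.pga.com/golf-courses/search?page={}"
        "&searchbox=Course+Name&searchbox_zip=ZIP&distance=50"
        "&price_range=0&course_type=both&has_events=0")

s = requests.Session()  # one session, so the cookies are reused on every request
for page in range(907):
    r = s.get(base.format(page))
    soup = BeautifulSoup(r.content, "html.parser")
    # ... extract the course fields from soup as in the other answers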
The PGA website has changed since this question was asked.
It seems they now organize all of the courses by: State > City > Course.
Given that change and the popularity of this question, here's how I'd solve this problem today.
Step 1 - Import everything we need:
import time
import random
from gazpacho import Soup # https://github.com/maxhumber/gazpacho
from tqdm import tqdm # to keep track of progress
Step 2 - Scrape all of the state URL endpoints:
URL = "https://www.pga.com"

def get_state_urls():
    soup = Soup.get(URL + "/play")
    a_tags = soup.find("ul", {"data-cy": "states"}, mode="first").find("a")
    state_urls = [URL + a.attrs['href'] for a in a_tags]
    return state_urls
state_urls = get_state_urls()
Step 3 - Write a function to scrape all of the city links:
def get_state_cities(state_url):
    soup = Soup.get(state_url)
    a_tags = soup.find("ul", {"data-cy": "city-list"}).find("a")
    state_cities = [URL + a.attrs['href'] for a in a_tags]
    return state_cities
state_url = state_urls[0]
city_links = get_state_cities(state_url)
Step 4 - Write a function to scrape all of the courses:
def get_courses(city_link):
    soup = Soup.get(city_link)
    courses = soup.find("div", {"class": "MuiGrid-root MuiGrid-item MuiGrid-grid-xs-12 MuiGrid-grid-md-6"}, mode="all")
    return courses
city_link = city_links[0]
courses = get_courses(city_link)
Step 5 - Write a function to parse out all of the useful information about a course:
def parse_course(course):
    return {
        "name": course.find("h5", mode="first").text,
        "address": course.find("div", {'class': "jss332"}, mode="first").strip(),
        "url": course.find("a", mode="first").attrs["href"]
    }
course = courses[0]
parse_course(course)
Step 6 - Iterate over everything and save:
all_courses = []
for state_url in tqdm(state_urls):
    city_links = get_state_cities(state_url)
    time.sleep(random.uniform(1, 10) / 10)
    for city_link in city_links:
        courses = get_courses(city_link)
        time.sleep(random.uniform(1, 10) / 10)
        for course in courses:
            info = parse_course(course)
            all_courses.append(info)
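Since the question ultimately wants a local CSV copy, a final step along these lines could dump the results to disk (a sketch; the filename is arbitrary and all_courses comes from Step 6):

import csv

with open("pga_courses.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "address", "url"])
    writer.writeheader()  # column names match the keys produced by parse_course
    writer.writerows(all_courses)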