Crawl through all golf course pages with Scrapy
I am trying to scrape golf course details for all the golf courses in the world, using this directory:
https://www.golfadvisor.com/course-directory
So far I have written the parse function that scrapes an individual course page:
from urllib.parse import urljoin

def parse_filter_course(self, response):
    # check whether this is an actual course page; excluded from the final run, not fully tested:
    # exists = response.css('.CoursePageSidebar-map').get()
    # if exists:
    # The page is split into multiple sections, each specifying a different set of details.
    # I decided to use a nested for loop (for section in sections, for detail in section) to retrieve the data.
    about_section = response.css('.CourseAbout-information-item')
    details_section = response.css('.CourseAbout-details-item')
    rental_section = response.css('.CourseAbout-rentalsServices-item')
    practice_section = response.css('.CourseAbout-practiceInstruction-item')
    policies_section = response.css('.CourseAbout-policies-item')
    sections = [
        about_section,
        details_section,
        rental_section,
        practice_section,
        policies_section,
    ]
    # details NOT taken from the for-loop sections are hard-coded here with CSS and XPath selectors;
    # the nested loop below fills in the rest
    course = {
        'link': response.url,
        'Name': response.css('.CoursePage-pageLeadHeading::text').get('').strip(),
        'Review Rating': response.css('.CoursePage-stars .RatingStarItem-stars-value::text').get('').strip(),
        'Number of Reviews': response.css('.CoursePage-stars .desktop::text').get('').strip().replace(' Reviews', ''),
        '% Recommend this course': response.css('.RatingRecommendation-percentValue::text').get('').strip().replace('%', ''),
        'Address': response.css('.CoursePageSidebar-addressFirst::text').get('').strip(),
        'Phone Number': response.css('.CoursePageSidebar-phoneNumber::text').get('').strip(),
        # the website field is a redirect link; I have not figured out how to resolve the real URL while scraping
        'Website': urljoin('https://www.golfadvisor.com/', response.css('.CoursePageSidebar-courseWebsite .Link::attr(href)').get('')),
        'Latitude': response.css('.CoursePageSidebar-map::attr(data-latitude)').get('').strip(),
        'Longitude': response.css('.CoursePageSidebar-map::attr(data-longitude)').get('').strip(),
        'Description': response.css('.CourseAbout-description p::text').get('').strip(),
        # here I was advised to use XPath to retrieve the text. Should it be used for the fields above as well, and why?
        'Food & Beverage': response.xpath('//h3[.="Food & Beverage"]/following-sibling::text()[1]').get('').strip(),
        'Available Facilities': response.xpath('//h3[.="Available Facilities"]/following-sibling::text()[1]').get('').strip(),
        # another example of using XPath, here for microdata
        'Country': response.xpath("(//meta[@itemprop='addressCountry'])/@content").get(''),
    }
    # the nested for loop mentioned above
    for section in sections:
        for item in section:
            label = item.css('.CourseValue-label::text').get('').strip()
            if label:
                course[label] = item.css('.CourseValue-value::text').get('').strip()
    yield course
What I am struggling with is crawling through all the golf courses in the directory.
Here are the approaches I tried:
- Based on my previous scraping experience, I used a scrapy.Spider with multiple parse functions, one per step: first, scrape all the country links from the world directory; second, scrape all the state/region links from each country directory; third, scrape the course links from each region directory.
But I got stuck right away, because some countries have no states/regions directory and list courses directly, and I did not know how to skip a parse function and start scraping course details instead of state/region links (one possible workaround is sketched right below).
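For that "skipping a level" problem, one option is a single recursive callback that branches on what the page actually contains, instead of one parse function per directory level. A minimal sketch, assuming hypothetical CSS selectors for the course and directory links (the real class names would have to be taken from the site's markup):

import scrapy

class DirectorySpider(scrapy.Spider):
    name = 'golfdirectory'
    start_urls = ['https://www.golfadvisor.com/course-directory']

    def parse(self, response):
        # If this directory page already links to courses, parse them;
        # otherwise descend one more directory level (country -> region).
        course_links = response.css('a[href*="/courses/"]::attr(href)').getall()  # assumed selector
        if course_links:
            for href in course_links:
                yield response.follow(href, callback=self.parse_course)
        else:
            # assumed selector for the country/region directory links
            for href in response.css('.DirectoryList a::attr(href)').getall():
                yield response.follow(href, callback=self.parse)

    def parse_course(self, response):
        ...  # extract course details as in parse_filter_course above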
- So I came across CrawlSpider and wrote a rule with a link extractor to visit only pages with 'courses/' in the path, while ignoring pages with 'page=' in the path, since those are duplicate links that caused the same golf course to be scraped multiple times.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class GolfCourseSpider(CrawlSpider):
    name = 'golfadvisor'
    allowed_domains = ['golfadvisor.com']
    start_urls = ['https://www.golfadvisor.com/course-directory']
    # visit only pages with 'courses/' in the path and exclude pages with 'page=1', 'page=2', etc.,
    # since those are duplicate links to the same course
    rules = [
        Rule(LinkExtractor(allow=('courses/',), deny=('page=',)), callback='parse_filter_course', follow=True),
    ]
With this approach I was able to scrape about 20k of the 36k courses.
- I then extracted the country URLs and used them as start URLs instead of the single directory start URL (see the sketch after this item).
That got me 26k of the 36k courses.
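A minimal sketch of that seeding, assuming the pre-extracted country URLs sit one per line in a hypothetical country_urls.txt; requests yielded without an explicit callback still go through the CrawlSpider rules as usual:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class GolfCourseSpider(CrawlSpider):
    name = 'golfadvisor'
    allowed_domains = ['golfadvisor.com']
    rules = [
        Rule(LinkExtractor(allow=('courses/',), deny=('page=',)),
             callback='parse_filter_course', follow=True),
    ]

    def start_requests(self):
        # country_urls.txt is a hypothetical file of pre-extracted country URLs
        with open('country_urls.txt') as f:
            for line in f:
                url = line.strip()
                if url:
                    yield scrapy.Request(url)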
Could you suggest a better way to crawl all the course pages?
Thank you
For this case I would suggest using a SitemapSpider.
According to the site's robots.txt, its sitemaps contain all 36K course links.
import scrapy

class GolfAdvisorComSpider(scrapy.spiders.SitemapSpider):
    name = "golfadvisorcom"
    custom_settings = {"DOWNLOAD_DELAY": 1}
    sitemap_urls = [
        'https://www.golfadvisor.com/sitemap1.xml',
        'https://www.golfadvisor.com/sitemap2.xml',
    ]

    def sitemap_filter(self, entries):
        for entry in entries:
            if "/courses/" in entry.get("loc"):
                yield entry

    def parse(self, response):
        ...
        # parse course data
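As a side note, the same routing can be expressed with SitemapSpider's built-in sitemap_rules attribute instead of overriding sitemap_filter; a sketch under the same assumptions:

import scrapy

class GolfAdvisorComSpider(scrapy.spiders.SitemapSpider):
    name = "golfadvisorcom"
    custom_settings = {"DOWNLOAD_DELAY": 1}
    sitemap_urls = [
        'https://www.golfadvisor.com/sitemap1.xml',
        'https://www.golfadvisor.com/sitemap2.xml',
    ]
    # only sitemap entries whose URL matches '/courses/' are requested,
    # and each response is routed to parse_course
    sitemap_rules = [('/courses/', 'parse_course')]

    def parse_course(self, response):
        ...  # parse course data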