Scrapy spider can't find URLs that load on click

Question

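I am trying to scrape data from this page: http://catalog.umassd.edu/content.php?catoid=45&navoid=3554

I want to expand each section using the 'Display courses for this department' link and then get the course information (text) for every course on that page.

I wrote the following script: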

from scrapy.spiders import CrawlSpider, Rule, BaseSpider, Spider
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

from courses.items import Course


class EduSpider(CrawlSpider):
    name = 'umassd.edu'
    allowed_domains = ['umassd.edu']
    start_urls = ['http://catalog.umassd.edu/content.php']

    rules = (
        Rule(LxmlLinkExtractor(
            allow=(r'.*/http://catalog.umassd.edu/preview_course.php?catoid=[0-9][0-9]&coid=[0-9][0-9][0-9][0-9][0-9][0-9]', ),
        ), callback='parse_item'),
    )

    def parse_item(self, response):
        item = Course()
        print(response)

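Now, no matter what start_url I supply, the spider never seems to reach the preview_course.php links. I have tried a few variations. The script exits without crawling any /content.php pages at all.

This is for educational purposes only.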

Answer 1

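The URL you are looking for is retrieved through an AJAX request. If you open your browser's developer tools and go to the "Network" tab, you will see the request that is made when you click the link. It looks like this: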

http://catalog.umassd.edu/ajax/preview_filter_show_hide_data.php?show_hide=show&cat_oid=45&nav_oid=3554&ent_oid=2027&type=c&link_text=this%20department

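This URL is generated by JavaScript; its content is then downloaded and injected into the page.
Since Scrapy does not execute any JavaScript, you need to recreate this URL yourself. Fortunately, in your case it is easy to reverse engineer.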
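
To see exactly what has to be recreated, it can help to split the captured URL into its path and query parameters. This is only a small sketch using the standard library; the names AJAX_URL, parts and params are made up for the illustration.

```python
from urllib.parse import urlsplit, parse_qs

# The request URL copied from the browser's Network tab.
AJAX_URL = (
    "http://catalog.umassd.edu/ajax/preview_filter_show_hide_data.php"
    "?show_hide=show&cat_oid=45&nav_oid=3554&ent_oid=2027&type=c"
    "&link_text=this%20department"
)

parts = urlsplit(AJAX_URL)
# parse_qs returns a list of values per key; every key occurs once here,
# so keep only the first value of each.
params = {key: values[0] for key, values in parse_qs(parts.query).items()}

print(parts.path)
for key, value in params.items():
    print(f"{key} = {value}")
```

The six parameters printed here are the values the spider has to supply for each department.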

If you inspect the HTML source, you can see something interesting on the "Display courses for this department" link node:

<a href="#"
target="_blank"
onclick="showHideFilterData(this, 'show', '45', '3554', '2027', 'c', 'this department'); return false;">
Display courses for this department.</a>

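We can see that clicking the link calls a JavaScript function, and if you compare its arguments with the URL above, the correspondence is clear: each quoted argument becomes one query parameter.

Now we can recreate this URL using that data: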
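
The correspondence can be checked outside Scrapy first. The sketch below is not part of the site's or Scrapy's API: build_ajax_url and PARAM_NAMES are hypothetical helper names, and the parameter order is assumed from the one example request shown above.

```python
import re
from urllib.parse import quote, urlencode

AJAX_ENDPOINT = "http://catalog.umassd.edu/ajax/preview_filter_show_hide_data.php"
# Query parameter names, in the order the quoted arguments appear in the
# onclick call (assumed from the single request observed in the browser).
PARAM_NAMES = ["show_hide", "cat_oid", "nav_oid", "ent_oid", "type", "link_text"]


def build_ajax_url(onclick):
    """Turn one onclick attribute value into the matching AJAX URL."""
    # Grab every single-quoted argument; the unquoted `this` is skipped.
    args = re.findall(r"'(.+?)'", onclick)
    if len(args) != len(PARAM_NAMES):
        raise ValueError(
            f"expected {len(PARAM_NAMES)} quoted arguments, got {len(args)}"
        )
    # quote_via=quote encodes the space as %20, as in the observed request.
    query = urlencode(dict(zip(PARAM_NAMES, args)), quote_via=quote)
    return f"{AJAX_ENDPOINT}?{query}"


onclick = (
    "showHideFilterData(this, 'show', '45', '3554', '2027', 'c', "
    "'this department'); return false;"
)
print(build_ajax_url(onclick))
```

For the link shown above this reproduces the URL seen in the browser. Encoding the query with urlencode also keeps the space in 'this department' from ending up raw in the URL.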

import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://catalog.umassd.edu/content.php?catoid=45&navoid=3554']

    def parse(self, response):
        # get the "onclick" JavaScript call of every "show more" link
        # and extract the arguments passed to it with a regular expression
        links = response.xpath("//a/@onclick[contains(.,'showHide')]")
        for link in links:
            args = link.re("'(.+?)'")
            # build the url by putting the arguments from the page source
            # into a url template
            url = 'http://catalog.umassd.edu/ajax/preview_filter_show_hide_data.php?show_hide={}&cat_oid={}&nav_oid={}&ent_oid={}&type={}&link_text={}'.format(*args)
            yield scrapy.Request(url, self.parse_more)

    def parse_more(self, response):
        # here you'll get the page source with all of the links
        self.logger.info('Fetched %s', response.url)
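
What parse_more then does with the returned page source could look roughly like the sketch below, which keeps only the preview_course.php links the question was after. It uses the standard library's html.parser rather than Scrapy selectors so that it runs without a crawl. The sample fragment is hypothetical (the real markup returned by the AJAX endpoint may differ), and CourseLinkParser and BASE_URL are names invented for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "http://catalog.umassd.edu/"


class CourseLinkParser(HTMLParser):
    """Collect absolute URLs of <a> tags whose href mentions preview_course.php."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and "preview_course.php" in href:
            # Relative hrefs are resolved against the catalog's base URL.
            self.links.append(urljoin(BASE_URL, href))


# Hypothetical fragment standing in for the AJAX response body.
sample = (
    "<ul>"
    '<li><a href="preview_course.php?catoid=45&amp;coid=123456">Course A</a></li>'
    '<li><a href="content.php?catoid=45&amp;navoid=3554">Back</a></li>'
    "</ul>"
)

parser = CourseLinkParser()
parser.feed(sample)
print(parser.links)
```

Inside the spider itself the same filter would more naturally be written with response.xpath and response.urljoin, yielding one scrapy.Request per course link.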

python

scrapy

web-scraping

scrapy-spider