
How to update the new page source everytime in scrapy xpath while using selenium?

This selenium integration with scrapy works fine, with just one problem:

I need to update sites = response.xpath() with the new page source generated on each iteration; otherwise it returns the same duplicate results over and over.

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.http import TextResponse
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin
from selenium import webdriver
import time


class Product(scrapy.Item):
    title = scrapy.Field()


class FooSpider(CrawlSpider):
    name = 'foo'

    start_urls = ["https://www.example.com"]

    def __init__(self, *args, **kwargs):
        super(FooSpider, self).__init__(*args, **kwargs)
        self.download_delay = 0.25
        self.browser = webdriver.Chrome(executable_path=r"C:\chrm\chromedriver.exe")  # raw string so backslashes aren't treated as escapes
        self.browser.implicitly_wait(60)  # wait up to 60 seconds when locating elements

    def parse(self,response):
        self.browser.get(response.url)
        sites = response.xpath('//div[@class="single-review"]/div[@class="review-header"]')

        for i in range(0,200):
            items = []
            time.sleep(20)
            button = self.browser.find_element_by_xpath("/html/body/div[4]/div[6]/div[1]/div[2]/div[2]/div[1]/div[2]/button[1]/div[2]/div/div")
            button.click()
            self.browser.implicitly_wait(30)

            for site in sites:
                item = Product()

                item['title'] = site.xpath('.//div[@class="review-info"]/span[@class="author-name"]/a/text()').extract()
                yield item

You need to create a new Selector instance inside the loop after the click, passing it the current page source from .page_source:

from scrapy.selector import Selector

self.browser.implicitly_wait(30)

for i in range(0,200):
    time.sleep(20)  # TODO: a delay like this doesn't look good

    button = self.browser.find_element_by_xpath("/html/body/div[4]/div[6]/div[1]/div[2]/div[2]/div[1]/div[2]/button[1]/div[2]/div/div")
    button.click()

    sel = Selector(text=self.browser.page_source)
    sites = sel.xpath('//div[@class="single-review"]/div[@class="review-header"]')

    for site in sites:
        item = Product()

        item['title'] = site.xpath('.//div[@class="review-info"]/span[@class="author-name"]/a/text()').extract()
        yield item

Note that you only need to call implicitly_wait() once - it doesn't add an immediate delay - it just instructs selenium to wait up to X seconds when searching for elements.

Also, I doubt you really need that time.sleep(20) call. Instead, you may want to start using Explicit Waits.
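As a rough sketch of what that could look like, here is the loop body rewritten with an explicit wait instead of time.sleep(20). This is an untested illustration, not a drop-in fix: it reuses the button XPath from the question and assumes the button becomes clickable once the next batch of reviews has loaded.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# XPath taken verbatim from the question
BUTTON_XPATH = ("/html/body/div[4]/div[6]/div[1]/div[2]/div[2]"
                "/div[1]/div[2]/button[1]/div[2]/div/div")

browser = webdriver.Chrome(executable_path=r"C:\chrm\chromedriver.exe")
# Poll the DOM for up to 20 seconds instead of sleeping unconditionally
wait = WebDriverWait(browser, 20)

# Blocks only as long as needed; raises TimeoutException after 20 seconds
button = wait.until(EC.element_to_be_clickable((By.XPATH, BUTTON_XPATH)))
button.click()
```

Unlike time.sleep(20), the explicit wait returns as soon as the condition is met, so fast page loads aren't penalized with a fixed 20-second pause.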