如何使用 Scrapy 在一个 POST 请求上捕获多个响应?

How to capture multiple responses on one single POST request using Scrapy?

我正在尝试网络抓取 this 网站并在您完成该网站的整个生命周期后下载可用的 pdf 文件。我正在为此使用 Scrapy。 在正确的时间捕获验证码

我遇到了一些问题

此网站是一个 ASPX 网页,使用“Viewstates”来跟踪每个 POST 请求。现在,如果你浏览这个网站,你就会明白,每当你填写任何下拉字段时,它都会向某个 URL 路径发送带有 'Viewstate' 值的 POST 请求,你可以看到在浏览器控制台中。但与此同时,它向另一个 URL 发送另一个 GET 请求以获取“CAPTCHA”图像。但我无法得到这个回应。我不知道使用Scrapy是否可以同时捕获多个请求多个响应。

现在,我试图找到解决此问题的方法。我几乎遵循了此 Whosebug post 中提到的所有内容,但作为响应,我收到 HTML 代码,其中 javascript 警报代码提到“ 插入了错误的文本, 请输入图像文本框中显示的新字符”。所以,这个解决方案也不适合我。

这是我的爬虫代码:

# -*- coding: utf-8 -*-
import scrapy
import cv2
import pytesseract
from PIL import Image
from io import BytesIO
from election_data.items  import ElectionDataItem

class ElectionSpider(scrapy.Spider):
    name = 'election'
    allowed_domains = ['ceo.maharashtra.gov.in']
    start_urls = ['https://ceo.maharashtra.gov.in/searchlist/SearchRollPDF.aspx']
    dist_dict = []

    def parse(self, response):
        district = response.css('select#Content_DistrictList > option::attr(value)')[1].extract()
        data = {
            '__EVENTTARGET' : response.css('select#Content_DistrictList::attr(name)').extract_first(),
            '__EVENTARGUMENT' : '',
            '__LASTFOCUS' : '', 
            '__VIEWSTATE' : response.css('input#__VIEWSTATE::attr(value)').extract_first(),
            '__EVENTVALIDATION' : response.css('input#__EVENTVALIDATION::attr(value)').extract_first(),
            'ctl00$Content$DistrictList' : district,
            'ctl00$Content$txtcaptcha' : ''
        }
        meta = {'handle_httpstatus_all': True}
        request = scrapy.FormRequest(url=self.start_urls[0], method='POST', formdata=data, meta=meta, callback=self.parse_assembly)
        request.meta['district'] = district
        yield request

    def parse_assembly(self, response):
        print('parse_assembly')
        assembly = response.css('select#Content_AssemblyList > option::attr(value)')[1].extract()
        data = {
            '__EVENTTARGET' : response.css('select#Content_AssemblyList::attr(name)').extract_first(),
            '__EVENTARGUMENT' : '',
            '__LASTFOCUS' : '', 
            '__VIEWSTATE' : response.css('input#__VIEWSTATE::attr(value)').extract_first(),
            '__EVENTVALIDATION' : response.css('input#__EVENTVALIDATION::attr(value)').extract_first(),
            'ctl00$Content$DistrictList' : response.meta['district'],
            'ctl00$Content$AssemblyList' : assembly,
            'ctl00$Content$txtcaptcha' : ''
        }
        meta = {'handle_httpstatus_all': True}
        request = scrapy.FormRequest(url=self.start_urls[0], method='POST', formdata=data, meta=meta, callback=self.parse_part)
        request.meta['district'] = response.meta['district']
        request.meta['assembly'] = assembly
        yield request

    def parse_part(self, response):
        print('parse_part')
        part = response.css('select#Content_PartList > option::attr(value)')[1].extract()
        data = {
            '__EVENTTARGET' : response.css('select#Content_PartList::attr(name)').extract_first(),
            '__EVENTARGUMENT' : '',
            '__LASTFOCUS' : '', 
            '__VIEWSTATE' : response.css('input#__VIEWSTATE::attr(value)').extract_first(),
            '__EVENTVALIDATION' : response.css('input#__EVENTVALIDATION::attr(value)').extract_first(),
            'ctl00$Content$DistrictList' : response.meta['district'],
            'ctl00$Content$AssemblyList' : response.meta['assembly'],
            'ctl00$Content$PartList' : part,
            'ctl00$Content$txtcaptcha' : ''
        }
        meta = {'handle_httpstatus_all': True}
        request = scrapy.FormRequest(url=self.start_urls[0], method='POST', formdata=data, meta=meta, callback=self.parse_captcha)
        request.meta['__VIEWSTATE'] = response.css('input#__VIEWSTATE::attr(value)').extract_first()
        request.meta['__EVENTVALIDATION'] = response.css('input#__EVENTVALIDATION::attr(value)').extract_first()
        request.meta['district'] = response.meta['district']
        request.meta['assembly'] = response.meta['assembly']
        request.meta['part'] = part
        yield request

    def parse_captcha(self, response):
        data_for_later = response
        request = scrapy.Request(url='https://ceo.maharashtra.gov.in/searchlist/Captcha.aspx', callback=self.store_image)
        request.meta['__VIEWSTATE'] = response.meta['__VIEWSTATE']
        request.meta['__EVENTVALIDATION'] = response.meta['__EVENTVALIDATION']
        request.meta['district'] = response.meta['district']
        request.meta['assembly'] = response.meta['assembly']
        request.meta['part'] = response.meta['part']
        request.meta['data_for_later'] = data_for_later
        yield request

    def store_image(self, response):
        captcha_target_filename = 'filename.png'
        # save the image for processing
        i = Image.open(BytesIO(response.body))
        i.save(captcha_target_filename)
        captcha_text = self.solve_captcha(captcha_target_filename)
        print(captcha_text)
        data = {
            '__EVENTTARGET' : '',
            '__EVENTARGUMENT' : '',
            '__LASTFOCUS' : '', 
            '__VIEWSTATE' : response.meta['__VIEWSTATE'],
            '__EVENTVALIDATION' : response.meta['__EVENTVALIDATION'],
            'ctl00$Content$DistrictList' : response.meta['district'],
            'ctl00$Content$AssemblyList' : response.meta['assembly'],
            'ctl00$Content$PartList' : response.meta['part'],
            'ctl00$Content$txtcaptcha' : captcha_text,
            'ctl00$Content$OpenButton': 'Open PDF'
        }
        captcha_form = response.meta['data_for_later']
        meta = {'handle_httpstatus_all': True}
        request = scrapy.FormRequest.from_response(captcha_form, method='POST', formdata=data, meta=meta, callback=self.get_pdfs)
        yield request

    def get_pdfs(self, response):
        # THIS IS WHERE FINAL RESPONSE IS CAPTURED
        print(response.text)

    def solve_captcha(self, image):
        image = cv2.imread(image,0)
        thresh = cv2.threshold(image, 220, 255, cv2.THRESH_BINARY)[1]

        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3,3))
        close = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel)

        result = 255 - close
        cv2.imshow('thresh', thresh)
        cv2.imshow('close', close)
        cv2.imshow('result', result)

        return pytesseract.image_to_string(result)

如果您浏览上述站点并填写所有表单详细信息,监控浏览器控制台的网络选项卡,您就会对这个问题有所了解。

请指导我如何解决这个问题。谢谢。

(还)不是答案,但有几点建议:

  1. 您是否启用了 Cookie?此站点上的每个请求都会传递一个 ASP.NET_SessionID cookie。

  2. 您请求验证码时的回复是否正常?

  3. 这一长串请求很难理解,并且可能包含难以发现的错误。建议您首先专注于解决验证码:

    • 如果我 select 仅区并填写错误或正确的验证码解决方案,我要么收到 "wrong captcha" 消息,要么收到 "select correct details"。
    • 所以进入 "select correct details" 更容易(更少的 requests/moving 部分)但是已经向你展示了你是否正确地解决了验证码,所以我建议你先尝试这个然后再继续结果。

除此之外,您的方法看起来不错,没有明显的问题。

顺便说一句:最后可能会发现模拟完整的请求序列是不必要的,跳到最后两个获取最终验证码和发送最终表单提交的请求可能没问题……但是这对我们没有帮助,只是为了以后重构和简化代码。

这就是我讨厌 ASP.NET 应用程序的原因,它只会让你在抓取时抓狂。无论如何,除了一件事,你的一切都近乎完美

def parse_captcha(self, response):
    data_for_later = response
    request = scrapy.Request(url='https://ceo.maharashtra.gov.in/searchlist/Captcha.aspx', callback=self.store_image)
    request.meta['__VIEWSTATE'] = response.meta['__VIEWSTATE']
    request.meta['__EVENTVALIDATION'] = response.meta['__EVENTVALIDATION']
    request.meta['district'] = response.meta['district']
    request.meta['assembly'] = response.meta['assembly']
    request.meta['part'] = response.meta['part']
    request.meta['data_for_later'] = data_for_later
    yield request

这来自您设置 part 的响应,但您所做的是在设置部分之前复制 __VIEWSTATE__EVENTVALIDATION。所以你需要确保你捕捉到正确的状态

def parse_captcha(self, response):
    data_for_later = response
    request = scrapy.Request(url='https://ceo.maharashtra.gov.in/searchlist/Captcha.aspx', callback=self.store_image)
    request.meta['__VIEWSTATE'] = response.css('input#__VIEWSTATE::attr(value)').extract_first()
    request.meta['__EVENTVALIDATION'] = response.css('input#__EVENTVALIDATION::attr(value)').extract_first()
    request.meta['district'] = response.meta['district']
    request.meta['assembly'] = response.meta['assembly']
    request.meta['part'] = response.meta['part']
    request.meta['data_for_later'] = data_for_later
    yield request