How to read xml directly from URLs with scrapy/python
In Scrapy you have to define start_urls. But how can I also crawl other URLs?
So far I have a login script that logs into a web page. After logging in, I want to extract XML from several different URLs.
import scrapy

class LoginSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['login page']
    urls = ['url', 'url']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'UserName': '', 'Password': ''},
            callback=self.check_login_response
        )

    def check_login_response(self, response):
        # check that the login succeeded before going on
        if "incorrect" in response.text:
            self.logger.error("Login failed")
            return
How can I scrape data from the URLs defined in the urls array?
You need to yield Request instances for the other URLs:
    def check_login_response(self, response):
        # check that the login succeeded before going on
        if "incorrect" in response.text:
            self.logger.error("Login failed")
            return
        # follow every URL defined on the spider (the `urls` attribute above)
        for url in self.urls:
            yield scrapy.Request(url, callback=self.parse_other_url)

    def parse_other_url(self, response):
        # ...
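For the XML extraction itself, here is a minimal sketch of what parse_other_url could look like. The //item and name/text() paths are hypothetical placeholders for whatever structure your feeds actually have; Scrapy's response objects support XPath queries on XML directly:

    def parse_other_url(self, response):
        # The response is the XML document fetched from one of the extra URLs.
        # "item" and "name" are placeholder element names for this sketch;
        # adjust the XPath expressions to match your real XML structure.
        for item in response.xpath('//item'):
            yield {
                'name': item.xpath('./name/text()').get(),
                'source_url': response.url,
            }

If the feed declares XML namespaces, you may need to call response.selector.remove_namespaces() before querying, or register the namespaces explicitly in your XPath expressions.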