How to read xml directly from URLs with scrapy/python
In Scrapy you have to define start_urls. But how can I also crawl other URLs?
So far I have a login script that logs into a web page. After logging in, I want to extract XML from several different URLs.
import scrapy

class LoginSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['login page']
    urls = ['url', 'url']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'UserName': '', 'Password': ''},
            callback=self.check_login_response
        )

    def check_login_response(self, response):
        # check that the login succeeded before going on
        if "incorrect" in response.text:
            self.logger.error("Login failed")
            return
How can I scrape data from the URLs defined in the urls array?
You need to yield Request instances for the other URLs:
    def check_login_response(self, response):
        # check that the login succeeded before going on
        if "incorrect" in response.text:
            self.logger.error("Login failed")
            return
        # follow every URL defined on the spider (the `urls` attribute above)
        for url in self.urls:
            yield scrapy.Request(url, callback=self.parse_other_url)

    def parse_other_url(self, response):
        # ...
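For the XML extraction itself, here is a minimal sketch of what parse_other_url could look like. The //item and name/text() paths are hypothetical placeholders for whatever structure your feeds actually have; Scrapy's response objects support XPath queries on XML directly:

    def parse_other_url(self, response):
        # The response is the XML document fetched from one of the extra URLs.
        # "item" and "name" are placeholder element names for this sketch;
        # adjust the XPath expressions to match your real XML structure.
        for item in response.xpath('//item'):
            yield {
                'name': item.xpath('./name/text()').get(),
                'source_url': response.url,
            }

If the feed declares XML namespaces, you may need to call response.selector.remove_namespaces() before querying, or register the namespaces explicitly in your XPath expressions.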