Scrapy request not passing to callback when 301?
I'm trying to update a database that holds links to external websites. For some reason, whenever the request headers/website/whatever come back with a moved/301 flag, the callback is skipped.
def start_requests(self):
    # ... database stuff
    for x in xrange(0, numrows):
        row = cur.fetchone()
        item = exampleItem()
        item['real_id'] = row[0]
        item['product_id'] = row[1]
        url = "http://www.example.com/a/-" + item['real_id'] + ".htm"
        log.msg("item %d request URL is %s" % (item['product_id'], url), log.INFO)  # shows the right URL
        request = scrapy.Request(url, callback=self.parse_url)
        request.meta['item'] = item
        yield request

def parse_url(self, response):
    item = response.meta['item']
    item['real_url'] = response.url
    log.msg("item %d new URL is %s" % (item['product_id'], item['real_url']), log.INFO)  # doesn't even show the items that have redirected
The Scrapy version is 0.24. What can I do?
Fun fact: it only happens on some of the broken links, even though they come from the same website with exactly the same kind of URL, etc.
Had to pass the dont_filter=True argument when building the Request. By default the duplicate filter drops the follow-up request that the redirect middleware creates for a 301 when the target URL has already been seen, so the callback never runs; dont_filter=True is carried over to the redirected request and keeps it from being discarded.
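
A minimal sketch of the fix (hypothetical spider; the hard-coded rows stand in for the database cursor and item class from the question). The only functional change is the dont_filter=True keyword on scrapy.Request:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        # Stand-in for the database cursor; the real code would use cur.fetchone()
        rows = [("123", 1), ("456", 2)]
        for real_id, product_id in rows:
            url = "http://www.example.com/a/-" + real_id + ".htm"
            # dont_filter=True: the redirect middleware rebuilds the request with
            # request.replace(), which keeps this flag, so the redirected request
            # is not silently dropped by the duplicate filter.
            request = scrapy.Request(url, callback=self.parse_url, dont_filter=True)
            request.meta['item'] = {'real_id': real_id, 'product_id': product_id}
            yield request

    def parse_url(self, response):
        # Now fires even when the response arrived via a 301/302 redirect.
        item = response.meta['item']
        item['real_url'] = response.url
        self.log("item %d new URL is %s" % (item['product_id'], item['real_url']))

Broken links that all 301 to the same target (for example a common "not found" page) resolve to one already-seen URL, which would explain why only some of them appeared to skip the callback.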