What is the difference between response.xpath() and Selector(text=response.text).xpath()?

Question

>>> print(response.text)
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
 <sitemap>
  <loc>https://cargadgetss.com/sitemap-product.xml</loc>
 </sitemap>
 <sitemap>
  <loc>https://cargadgetss.com/sitemap-category.xml</loc>
 </sitemap>
 <sitemap>
  <loc>https://cargadgetss.com/sitemap-page.xml</loc>
 </sitemap>
</sitemapindex>

>>> response.xpath('//loc')
[]
>>> Selector(text=response.text).xpath('//loc')[0].extract()
'<loc>https://cargadgetss.com/sitemap-product.xml</loc>'
>>>

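I want to extract tag data from XML text. I have only just started learning to extract data with Scrapy, and until now I have always used response.xpath. This time it didn't work, so I tried Selector instead and, luckily, got the data I needed. But I still don't understand why Selector can extract the data when response.xpath alone cannot.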

Answer 1

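That's because of the XML namespace (xmlns). The root element declares a default namespace, so every element under it, including loc, belongs to that namespace, and an unprefixed //loc matches nothing when the document is parsed as XML. Selector(text=...) most likely works because it parses the text as HTML by default, and the HTML parser ignores namespaces. Another way to extract those URLs is to bind a prefix to the namespace and use it in the selector. For example: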

>>> response.xpath("//x:loc/text()", namespaces={"x": "http://www.sitemaps.org/schemas/sitemap/0.9"}).getall()                  
['https://cargadgetss.com/sitemap-product.xml',
 'https://cargadgetss.com/sitemap-category.xml',
 'https://cargadgetss.com/sitemap-page.xml']

(More info about namespaces and parsel)

However, if you want to extract links from a sitemap, I recommend using Scrapy's SitemapSpider. For example:

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']
    sitemap_rules = [
        ('/product/', 'parse_product'),
        ('/category/', 'parse_category'),
    ]

    def parse_product(self, response):
        pass # ... scrape product ...

    def parse_category(self, response):
        pass # ... scrape category ...

