Why is xpath selecting only the last <li> inside the <ul>?

Question

I am trying to scrape this site: http://www.kaymu.com.ng/.

The part of the HTML I am scraping looks like this:

<ul id="navigation-menu">
    <li> some content </li>
    <li> some content </li>
    ...
    <li> some content </li>
</ul>

This is my spider:

from scrapy import Spider


class KaymuSpider(Spider):
    name = "kaymu"
    allowed_domains = ["kaymu.com.ng"]
    start_urls = [
        "http://www.kaymu.com.ng"
    ]

    def parse(self, response):
        sel = response.selector
        # Select every <li> that is a direct child of the navigation menu
        menu = sel.xpath('//ul[@id="navigation-menu"]/li')

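menu ends up containing only the last li element in the list. I'm not sure why this happens, since the syntax for selecting all the li elements looks correct. Please point out what is wrong. Thanks!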

Answer 1

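The problem is that the menu is built dynamically by JavaScript executed in the browser. Scrapy is not a browser and has no built-in JavaScript engine, so it only sees the HTML as the server sent it.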
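To illustrate the point, here is a minimal sketch showing that the same XPath returns different results depending on whether it runs against the served markup or the browser-rendered markup. Both HTML strings are hypothetical stand-ins, not the real page, and the standard library's ElementTree is swapped in for Scrapy's selector only to keep the example dependency-free (it needs well-formed markup and supports just a subset of XPath).

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: what a JavaScript-free client might receive,
# versus what the browser shows after scripts have filled in the menu.
RAW_HTML = '<html><body><ul id="navigation-menu"><li>Other</li></ul></body></html>'
RENDERED_HTML = (
    '<html><body><ul id="navigation-menu">'
    '<li>Fashion</li><li>Books</li><li>Other</li>'
    '</ul></body></html>'
)


def menu_items(markup):
    """Apply the equivalent of //ul[@id="navigation-menu"]/li and return the item texts."""
    root = ET.fromstring(markup)
    return [li.text for li in root.findall(".//ul[@id='navigation-menu']/li")]


print(menu_items(RAW_HTML))       # one item: only what the server sent
print(menu_items(RENDERED_HTML))  # three items: what the browser built
```

The expression itself is fine in both cases; it can only match the elements that are present in the document it is given.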

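Fortunately, the page has a script tag containing a JavaScript array of menu objects. We can locate that script tag, extract the JavaScript array, load it into a Python list with the json module, and print the menu item names.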

Demo from the Scrapy shell:

$ scrapy shell http://www.kaymu.com.ng/

In [1]: script = response.xpath("//script[contains(., 'categoryData')]/text()").extract()[0]

In [2]: import re

In [3]: pattern = re.compile(r'var categoryData = (.*?);\n')

In [4]: data = pattern.search(script).group(1)

In [5]: import json

In [6]: data = json.loads(data)

In [7]: for item in data:
   ....:     print(item['name'])
   ....:     
Fashion
Jewelry & Watches
Health & Beauty
Sporting Goods
Mobile Phones & Tablets
Audio, Video & Gaming
Computers, Laptops & Accessories
Appliances, Furniture & Decor
Books & Media
Babies & Kids
Food & Beverages
Other
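The shell steps above can be folded into one reusable function. This is a sketch only: the function name extract_category_names and the SAMPLE_SCRIPT string are hypothetical, and it assumes the categoryData array is valid JSON and ends with ";" followed by a newline, as the pattern in the shell session does.

```python
import json
import re

# Same pattern as in the shell session: capture everything between
# "var categoryData = " and the first ";" that is followed by a newline.
CATEGORY_DATA_RE = re.compile(r'var categoryData = (.*?);\n')


def extract_category_names(script_text):
    """Return the 'name' of every categoryData entry, or [] if the variable is absent."""
    match = CATEGORY_DATA_RE.search(script_text)
    if match is None:
        return []
    # Assumption: the JavaScript array literal is also valid JSON
    categories = json.loads(match.group(1))
    return [category['name'] for category in categories]


# Hypothetical stand-in for the script body; the real page's data is larger.
SAMPLE_SCRIPT = (
    'var categoryData = [{"name": "Fashion"}, {"name": "Books & Media"}];\n'
    'var other = 1;\n'
)

print(extract_category_names(SAMPLE_SCRIPT))
```

Inside the spider's parse method, the script text passed to this function would be the string obtained from the same response.xpath(...) call used in In [1].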

python

scrapy

web-scraping

scrapy-spider