scrapy: how to scrape both <ul> and <li> one by one?

I am using Python Scrapy to get content from a web page, and I want to get it in the following order:

Living room, Chair link

Living room, Sofa link

...

Bed room, Bed link

Bed room, Mirror link

...

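So far the URLs are correct, but every sub_cat printed in parse_item_info is Living room. When I print sub_cat inside parse_item, I do get all the sub-categories.

I think the problem is that the <li> tags inside the <ul> are being fetched twice. How can I scrape them one by one, correctly? Thanks.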

html:

  <div class="row margin-b2">
                    <div class="col">
                        <ul class="list-unstyled">
                                <li class="font-size-3 color-yellow main-list-li">
                                    Living room
                                </li>
                                <ul class="list-inline main-list-ul">   
                                    <li class="list-inline-item main-list-li-w align-text-top">
                                            <a href="https://www.website.com/furiture/536" class="text-light">Chair</a>
                                    </li>
                                    ...
                                </ul>   
                                <ul class="list-inline main-list-ul">   
                                    <li class="list-inline-item main-list-li-w align-text-top">
                                            <a href="https://www.website.com/furiture/537" class="text-light">Sofa</a>
                                    </li>
                                    ...
                                </ul>


                                <li class="font-size-3 color-yellow main-list-li">
                                    Bed room
                                </li>

                                <ul class="list-inline main-list-ul">   
                                    <li class="list-inline-item main-list-li-w align-text-top">
                                            <a href="https://www.website.com/furiture/538" class="text-light">Bed</a>
                                    </li>
                                    ...
                                </ul>   

                                <ul class="list-inline main-list-ul">   
                                    <li class="list-inline-item main-list-li-w align-text-top">
                                            <a href="https://www.website.com/furiture/539" class="text-light">Mirror</a>
                                    </li>
                                    ...
                                </ul>       

                                ...                                                                                                                                                                                                                                                       </ul>

                        </ul>
                    </div>
                </div>

Python:

    def parse_item(self, response):
        cat = response.meta["cat"]
        out_box = response.xpath('//div[@class="row margin-b2"]')

        # get all sub categories first
        sub_cat_arr = []
        for box in out_box.xpath('//li[@class="font-size-3 color-yellow main-list-li"]'):
            sub_cat = box.xpath('./text()').extract()[0].strip()
            sub_cat_arr.append(sub_cat)

        i = 0
        for box in out_box.xpath('//ul[@class="list-inline main-list-ul"]'):
            sub_cat = sub_cat_arr[i]
            i += 1
            print("in......")
            print(sub_cat)
            for url_box in box.xpath('//li[@class="list-inline-item main-list-li-w align-text-top"]//a'):
                new_url = url_box.xpath('.//@href').extract()[0]
                yield scrapy.Request(new_url, meta={"url": new_url, "cat": cat, "sub_cat": sub_cat}, callback=self.parse_item_info)


    def parse_item_info(self, response):
        cat = response.meta["cat"]
        sub_cat = response.meta["sub_cat"]
        url = response.meta["url"]
        print(sub_cat)
        print(url)
        ...

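In your code the inner XPath expressions start with //, which searches the whole document rather than the current <ul>, so every iteration of the outer loop yields every link on the page. Scrapy's default duplicate filter then drops the repeated requests, so only the ones created in the first iteration (Living room) survive. To avoid processing the same tags twice, you only need a single loop: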

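    def parse_item(self, response):
        cat = response.meta["cat"]
        # Sub-category of the most recent heading <li>; None until one is seen.
        current_sub_cat = None
        # One combined selector returns headings and links in document order.
        for tag in response.css("ul.list-unstyled li.font-size-3.color-yellow.main-list-li, ul.list-inline.main-list-ul li a"):
            if tag.root.tag == "li":
                # Heading: remember it for the links that follow.
                current_sub_cat = tag.css("*::text").extract_first("").strip("\n ")
            elif tag.root.tag == "a":
                new_url = tag.css("*::attr(href)").extract_first()
                # Use the "url" key, because parse_item_info reads response.meta["url"].
                yield scrapy.Request(url=new_url, meta={"url": new_url, "sub_cat": current_sub_cat, "cat": cat}, callback=self.parse_item_info)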
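
If you want to check the single-loop idea without running a spider, here is a minimal sketch that uses only the standard library's html.parser instead of Scrapy selectors. It is not the Scrapy code above, just the same logic: walk the tags once in document order, remember the latest heading, and attach it to every link that follows. The names SubCategoryPairer and pair_links_with_sub_categories are hypothetical helpers made up for this illustration, the HTML is a trimmed copy of the markup from the question, and treating any <li> with the color-yellow class as a heading is an assumption.

```python
from html.parser import HTMLParser

# Trimmed copy of the markup from the question (the "..." parts are left out).
HTML = """
<ul class="list-unstyled">
  <li class="font-size-3 color-yellow main-list-li">
    Living room
  </li>
  <ul class="list-inline main-list-ul">
    <li class="list-inline-item main-list-li-w align-text-top">
      <a href="https://www.website.com/furiture/536" class="text-light">Chair</a>
    </li>
  </ul>
  <ul class="list-inline main-list-ul">
    <li class="list-inline-item main-list-li-w align-text-top">
      <a href="https://www.website.com/furiture/537" class="text-light">Sofa</a>
    </li>
  </ul>
  <li class="font-size-3 color-yellow main-list-li">
    Bed room
  </li>
  <ul class="list-inline main-list-ul">
    <li class="list-inline-item main-list-li-w align-text-top">
      <a href="https://www.website.com/furiture/538" class="text-light">Bed</a>
    </li>
  </ul>
  <ul class="list-inline main-list-ul">
    <li class="list-inline-item main-list-li-w align-text-top">
      <a href="https://www.website.com/furiture/539" class="text-light">Mirror</a>
    </li>
  </ul>
</ul>
"""


class SubCategoryPairer(HTMLParser):
    """Hypothetical helper: pairs each link with the latest heading seen."""

    def __init__(self):
        super().__init__()
        self.pairs = []              # (sub_cat, url) tuples in document order
        self.current_sub_cat = None  # text of the most recent heading <li>
        self._in_heading = False     # True while inside a heading <li>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li":
            # Assumption: heading items carry the "color-yellow" class.
            classes = (attrs.get("class") or "").split()
            self._in_heading = "color-yellow" in classes
        elif tag == "a":
            href = attrs.get("href")
            # Links that appear before any heading are skipped.
            if href and self.current_sub_cat is not None:
                self.pairs.append((self.current_sub_cat, href))

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_heading = False

    def handle_data(self, data):
        text = data.strip()
        if self._in_heading and text:
            self.current_sub_cat = text


def pair_links_with_sub_categories(html):
    """Return (sub_cat, url) pairs found in one pass over the markup."""
    parser = SubCategoryPairer()
    parser.feed(html)
    parser.close()
    return parser.pairs


if __name__ == "__main__":
    for sub_cat, url in pair_links_with_sub_categories(HTML):
        print(sub_cat, url)
```

On the trimmed markup this pairs the 536 and 537 links with Living room and the 538 and 539 links with Bed room. Each link is visited exactly once, so no URL is ever requested under the wrong sub-category.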