scrapy: how to scrape both <ul> and <li> one by one?
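I am using Python and Scrapy to scrape a web page, and I want to get the content in the following order:
Living room, chair link
Living room, sofa link
...
Bed room, bed link
Bed room, mirror link
...
So far the URLs are correct, but every sub_cat printed in parse_item_info is Living room. When I print sub_cat in parse_item, I get all the sub-categories.
I think the problem is that the <li> tags inside the <ul> are fetched twice. How can I scrape them correctly, one by one?
Thanks.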
html:
<div class="row margin-b2">
<div class="col">
<ul class="list-unstyled">
<li class="font-size-3 color-yellow main-list-li">
Living room
</li>
<ul class="list-inline main-list-ul">
<li class="list-inline-item main-list-li-w align-text-top">
<a href="https://www.website.com/furiture/536" class="text-light">Chair</a>
</li>
...
</ul>
<ul class="list-inline main-list-ul">
<li class="list-inline-item main-list-li-w align-text-top">
<a href="https://www.website.com/furiture/537" class="text-light">Sofa</a>
</li>
...
</ul>
<li class="font-size-3 color-yellow main-list-li">
Bed room
</li>
<ul class="list-inline main-list-ul">
<li class="list-inline-item main-list-li-w align-text-top">
<a href="https://www.website.com/furiture/538" class="text-light">Bed</a>
</li>
...
</ul>
<ul class="list-inline main-list-ul">
<li class="list-inline-item main-list-li-w align-text-top">
<a href="https://www.website.com/furiture/539" class="text-light">Mirror</a>
</li>
...
</ul>
... </ul>
</ul>
</div>
</div>
Python:
def parse_item(self, response):
    cat = response.meta["cat"]
    out_box = response.xpath('//div[@class="row margin-b2"]')
    # get all sub categories first
    sub_cat_arr = []
    for box in out_box.xpath('//li[@class="font-size-3 color-yellow main-list-li"]'):
        sub_cat = box.xpath('./text()').extract()[0].strip()
        sub_cat_arr.append(sub_cat)
    i = 0
    for box in out_box.xpath('//ul[@class="list-inline main-list-ul"]'):
        sub_cat = sub_cat_arr[i]
        i += 1
        print("in......")
        print(sub_cat)
        for url_box in box.xpath('//li[@class="list-inline-item main-list-li-w align-text-top"]//a'):
            new_url = url_box.xpath('.//@href').extract()[0]
            yield scrapy.Request(new_url, meta={"url": new_url, "cat": cat, "sub_cat": sub_cat}, callback=self.parse_item_info)

def parse_item_info(self, response):
    cat = response.meta["cat"]
    sub_cat = response.meta["sub_cat"]
    url = response.meta["url"]
    print(sub_cat)
    print(url)
    ...
To avoid processing the same tags twice, you need to use a single loop:
def parse_item(self, response):
    cat = response.meta["cat"]
    # Most recent sub-category heading seen so far, in document order
    current_sub_cat = None
    for tag in response.css("ul.list-unstyled li.font-size-3.color-yellow.main-list-li, ul.list-inline.main-list-ul li a"):
        if tag.root.tag == "li":
            # A heading <li>: remember it for the links that follow
            current_sub_cat = tag.css("*::text").extract_first("").strip()
        elif tag.root.tag == "a":
            new_url = tag.css("*::attr(href)").extract_first()
            # Use the "url" key so parse_item_info can read response.meta["url"]
            yield scrapy.Request(url=new_url, meta={"url": new_url, "sub_cat": current_sub_cat, "cat": cat}, callback=self.parse_item_info)