Scrapy extracting <li> with span inside
I am trying to extract text from this HTML structure:
<div class="col-6 col-lg-3">
<span class="font-weight-bold">List of Birds</span>
<ul class="bird-forms">
<li>Crow <span class="color">Black</span></li>
<li>Peacock <span class="color">Multicolored</span></li>
<li>Dove <span class="color">Multicolored</span></li>
<li>Sparrow <span class="color">Brown</span></li>
<li>Goose <span class="color">Multicolored</span></li>
<li>Ostrich <span class="color">Multicolored</span></li>
</ul>
</div>
In the Scrapy shell I ran:
response.css('ul.bird-forms li ::text').extract()
I would like the result to look like this:
['Crow Black',
'Peacock Multicolored',
'Dove Multicolored',
'Sparrow Brown',
'Goose Multicolored',
'Ostrich Multicolored']
but instead I get this:
['Crow',
'Black',
'Peacock',
'Multicolored',
'Dove',
'Multicolored',
'Sparrow',
'Brown',
'Goose',
'Multicolored',
'Ostrich',
'Multicolored']
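We can extract the names and the colors separately and then merge them:
li_tags = response.xpath(".//ul[@class='bird-forms']//li/text()").extract()
color_tags = response.xpath(".//ul[@class='bird-forms']//span[@class='color']/text()").extract()
# The li text nodes keep their trailing space ('Crow '), so strip before joining.
[" ".join(part.strip() for part in entry) for entry in zip(li_tags, color_tags)]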
['Crow Black',
'Peacock Multicolored',
'Dove Multicolored',
'Sparrow Brown',
'Goose Multicolored',
'Ostrich Multicolored']
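You need to select the li tags first, then select the text within each li tag:
data = []
for li_tag in response.css("ul.bird-forms li"):
    texts = li_tag.css("*::text").extract()  # e.g. ['Crow ', 'Black']
    data.append(" ".join(" ".join(texts).split()))  # collapse repeated whitespace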
The same as a Python list comprehension:
data = [" ".join(" ".join(x.css("*::text").extract()).split()) for x in response.css("ul.bird-forms li")]
print(data)
# Output: ['Crow Black', 'Peacock Multicolored',
# 'Dove Multicolored', 'Sparrow Brown', 'Goose Multicolored', 'Ostrich Multicolored']
Just use XPath string():
birds = []
for li in response.xpath('//ul[@class="bird-forms"]/li'):
    # string(.) concatenates all text nodes under the li, e.g. 'Crow Black'
    bird = li.xpath('string(.)').get()
    birds.append(bird)
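To see what string(.) computes without starting a Scrapy shell, the same concatenation can be approximated with Python's standard library. This is only an illustration, not Scrapy's implementation: it assumes the fragment is well-formed XML (true for a shortened copy of the markup in the question, not for HTML in general), and html_fragment and li_text are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's markup; it happens to be well-formed XML.
html_fragment = """
<div class="col-6 col-lg-3">
  <span class="font-weight-bold">List of Birds</span>
  <ul class="bird-forms">
    <li>Crow <span class="color">Black</span></li>
    <li>Peacock <span class="color">Multicolored</span></li>
    <li>Sparrow <span class="color">Brown</span></li>
  </ul>
</div>
"""


def li_text(li):
    """Concatenate every text node under one element, like XPath string(.)."""
    raw = "".join(li.itertext())
    # Collapse runs of whitespace, similar in spirit to XPath normalize-space(.).
    return " ".join(raw.split())


root = ET.fromstring(html_fragment)
birds = [li_text(li) for li in root.findall(".//ul[@class='bird-forms']/li")]
print(birds)
```

In Scrapy itself, li.xpath('normalize-space(.)').get() should give the same kind of whitespace cleanup on top of string(.).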