How to extract the text of an a element matched by its href with Scrapy
I want to extract text using selectors like these:
SUBTHEME_SELECTOR = '.subtheme::text'
YEAR_SELECTOR = '.year::text'
But I don't know how to extract the theme. Can you help me?
THEME_SELECTOR = '//a[contains(@href, "/sets/theme-")]/@href' ???
<div class='tags floatleft'>
<a href='/sets/10251-1/Brick-Bank'>10251-1</a>
<a href='/sets/theme-Creator-Expert'>Creator Expert</a>
<a class='subtheme' href='/sets/theme-Creator-Expert/subtheme-Modular-Buildings'>Modular Buildings</a>
<a class='year' href='/sets/theme-Creator-Expert/year-2016'>2016</a>
</div>
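Your XPath is right. It returns the link URL; to get the link text instead, end it with /text() rather than /@href. You can test this very simply without actually scraping the site: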
import scrapy
TEXT = """
<div class='tags floatleft'>
<a href='/sets/10251-1/Brick-Bank'>10251-1</a>
<a href='/sets/theme-Creator-Expert'>Creator Expert</a>
<a class='subtheme' href='/sets/theme-Creator-Expert/subtheme-Modular-Buildings'>Modular Buildings</a>
<a class='year' href='/sets/theme-Creator-Expert/year-2016'>2016</a>
</div>
"""
s = scrapy.Selector(text=TEXT)
link = s.xpath('//a[contains(@href,"/sets/theme-")]/@href').extract_first()
text = s.xpath('//a[contains(@href,"/sets/theme-")]/text()').extract_first()
print(link)
print(text)
Output:
/sets/theme-Creator-Expert
Creator Expert