Scrapy - Trouble with <TD> parsing alignment
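I am trying to parse data from only the Item and Skill Cap columns of the HTML table at: http://ffxi.allakhazam.com/dyn/guilds/Alchemy.html
When I run the spider I get an alignment problem: my script picks up data from other columns.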
import scrapy

class parser(scrapy.Spider):
    name = "recipe_table"
    start_urls = ['http://ffxi.allakhazam.com/dyn/guilds/Alchemy.html']

    def parse(self, response):
        for row in response.xpath('//*[@class="datatable sortable"]//tr'):
            data = row.xpath('td//text()').extract()
            if not data:  # skip empty row
                continue
            yield {
                'name': data[0],
                'cap': data[1],
                # 'misc': data[2]
            }
Result: scrapy runspider cap.py -t json
When it reaches the third row, it parses data from an unexpected column. I'm not sure what is going on with the selection.
2019-05-09 19:41:28 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://ffxi.allakhazam.com/dyn/guilds/Alchemy.html> (referer: None)
2019-05-09 19:41:28 [scrapy.core.scraper] DEBUG: Scraped from <200 http://ffxi.allakhazam.com/dyn/guilds/Alchemy.html>
{'item_name': u'Banquet Set', 'cap': u'0'}
2019-05-09 19:41:28 [scrapy.core.scraper] DEBUG: Scraped from <200 http://ffxi.allakhazam.com/dyn/guilds/Alchemy.html>
{'item_name': u'Banquet Table', 'cap': u'0'}
2019-05-09 19:41:28 [scrapy.core.scraper] DEBUG: Scraped from <200 http://ffxi.allakhazam.com/dyn/guilds/Alchemy.html>
{'item_name': u'Cermet Kilij', 'cap': u'Cermet Kilij +1'}
How about setting the source columns explicitly with XPath:
for row in response.xpath('//*[@class="datatable sortable"]//tr'):
    yield {
        'name': row.xpath('./td[1]/text()').extract_first(),
        'cap': row.xpath('./td[3]/text()').extract_first(),
        # 'misc': etc.
    }
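To see why the flattened list drifts, here is a small sketch that uses only the standard library (xml.etree.ElementTree as a stand-in for Scrapy selectors, which is a swap for illustration and not what the spider runs). The table fragment is hypothetical: it assumes the second column is empty for some rows and filled for others, which would be consistent with the log output in the question, but the real page's markup may differ. The names TABLE, flat_texts and by_column are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, NOT copied from the real page.
# Assumption: column 2 is empty in some rows and filled in others.
# "CAP" is a placeholder; the real value is unknown.
TABLE = """
<table>
  <tr><td>Banquet Set</td><td></td><td>0</td></tr>
  <tr><td>Cermet Kilij</td><td>Cermet Kilij +1</td><td>CAP</td></tr>
</table>
"""

def flat_texts(row):
    """Roughly what 'td//text()' yields: every text node, cells merged."""
    texts = []
    for td in row.findall("td"):
        # An empty cell has no text node, so it contributes nothing
        # and every later index shifts left by one.
        texts.extend(td.itertext())
    return texts

def by_column(row):
    """Address each cell by position, like './td[1]' and './td[3]'."""
    return {
        "name": row.find("td[1]").text,
        "cap": row.find("td[3]").text,
    }

rows = ET.fromstring(TABLE).findall("tr")
for row in rows:
    print(flat_texts(row), by_column(row))
```

In the first row the flat list has only two entries, so index 1 happens to be the cap; in the second row index 1 is the second column instead. Selecting by cell position avoids that. One caveat, again an assumption about markup: './td[1]/text()' only returns text sitting directly inside the cell, so if a name were wrapped in a link or other tag you would need to select the descendant text of that one cell.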