scrapy spider not returning any results
This is my first attempt at writing a spider, so please bear with me.
Here is the link to the site I am trying to extract data from: http://www.4icu.org/in/. I want the full list of universities displayed on the page, but when I run the spider below I get back an empty JSON file.
My items.py:
import scrapy

class CollegesItem(scrapy.Item):
    # define the fields for your item here like:
    link = scrapy.Field()
And here is the spider, colleges.py:
import scrapy
from scrapy.spider import Spider
from scrapy.http import Request

class CollegesItem(scrapy.Item):
    # define the fields for your item here like:
    link = scrapy.Field()

class CollegesSpider(Spider):
    name = 'colleges'
    allowed_domains = ["4icu.org"]
    start_urls = ('http://www.4icu.org/in/',)

    def parse(self, response):
        return Request(
            url="http://www.4icu.org/in/",
            callback=self.parse_fixtures
        )

    def parse_fixtures(self, response):
        sel = response.selector
        for div in sel.css("col span_2_of_2>div>tbody>tr"):
            item = Fixture()
            item['university.name'] = tr.xpath('td[@class="i"]/span/a/text()').extract()
            yield item
As stated in the comments on the question, there are a few problems with your code.
First of all, you do not need two methods: in the parse method you request the same URL that is already in start_urls, so Scrapy ends up filtering the duplicate request and your second callback never fires.
To get some information from the site, try the following code:
def parse(self, response):
    for tr in response.xpath('//div[@class="section group"][5]/div[@class="col span_2_of_2"][1]/table//tr'):
        if tr.xpath(".//td[@class='i']"):
            name = tr.xpath('./td[1]/a/text()').extract()[0]
            location = tr.xpath('./td[2]//text()').extract()[0]
            print name, location
and adjust it as needed to populate your item (or items).
As you can see, your browser displays an extra tbody inside the table which is not present in the HTML that Scrapy actually downloads. This means you often have to be critical of what you see in the browser's element inspector.
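The tbody mismatch can be seen without Scrapy at all. Below is a minimal sketch using only the standard library's xml.etree.ElementTree on a made-up HTML fragment (the table content here is invented for illustration, not the real 4icu.org markup): the raw markup contains no tbody, yet the same row-filtering logic the answer uses still finds the data rows.

```python
import xml.etree.ElementTree as ET

# Stand-in for the raw HTML Scrapy downloads: note there is NO <tbody> --
# the browser inserts it when rendering, the server never sends it.
raw_html = """
<table>
  <tr><th>University</th><th>Town</th></tr>
  <tr><td class="i"><a href="#">IIT Bombay</a></td><td>Mumbai</td></tr>
  <tr><td class="i"><a href="#">Anna University</a></td><td>Chennai</td></tr>
</table>
"""

table = ET.fromstring(raw_html)
rows = []
for tr in table.findall('.//tr'):
    # same filter as the spider: keep only rows that have a <td class="i">
    if tr.find("./td[@class='i']") is not None:
        name = tr.find('./td[1]/a').text
        location = tr.find('./td[2]').text
        rows.append((name, location))

for name, location in rows:
    print(name, location)
```

The header row is skipped because it has no td with class="i", which is exactly why the answer's if-check works without referencing tbody.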
Here is the working code:
import scrapy
from scrapy.spider import Spider
from scrapy.http import Request

class CollegesItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    location = scrapy.Field()

class CollegesSpider(Spider):
    name = 'colleges'
    allowed_domains = ["4icu.org"]
    start_urls = ('http://www.4icu.org/in/',)

    def parse(self, response):
        for tr in response.xpath('//div[@class="section group"][5]/div[@class="col span_2_of_2"][1]/table//tr'):
            if tr.xpath(".//td[@class='i']"):
                item = CollegesItem()
                item['name'] = tr.xpath('./td[1]/a/text()').extract()[0]
                item['location'] = tr.xpath('./td[2]//text()').extract()[0]
                yield item
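One caveat: .extract()[0] raises IndexError whenever an XPath matches nothing, which would abort the row. Later Scrapy releases added SelectorList.extract_first() (and eventually .get()) for exactly this case. As a plain-Python sketch, the hypothetical helper below mirrors what extract_first does, so you can see the guard it adds:

```python
def extract_first(values, default=None):
    # hypothetical helper mirroring Scrapy's SelectorList.extract_first():
    # return the first extracted value, or a default instead of raising IndexError
    return values[0] if values else default

print(extract_first(['IIT Bombay']))   # first value
print(extract_first([], 'unknown'))    # default for an empty match
```

With this pattern a malformed row yields a default value rather than crashing the spider mid-crawl.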
After running the spider with the command

>>scrapy crawl colleges -o mait.json

here is a snippet of the results:
[{"name": "Indian Institute of Technology Bombay", "location": "Mumbai"},
{"name": "Indian Institute of Technology Madras", "location": "Chennai"},
{"name": "University of Delhi", "location": "Delhi"},
{"name": "Indian Institute of Technology Kanpur", "location": "Kanpur"},
{"name": "Anna University", "location": "Chennai"},
{"name": "Indian Institute of Technology Delhi", "location": "New Delhi"},
{"name": "Manipal University", "location": "Manipal ..."},
{"name": "Indian Institute of Technology Kharagpur", "location": "Kharagpur"},
{"name": "Indian Institute of Science", "location": "Bangalore"},
{"name": "Panjab University", "location": "Chandigarh"},
{"name": "National Institute of Technology, Tiruchirappalli", "location": "Tiruchirappalli"}, .........
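The feed export is a single JSON array, so it can be post-processed with the standard json module. A small sketch, with two records copied inline from the snippet above for illustration (in practice you would read them from the mait.json file the crawl produced):

```python
import json

# two records from the output above, inlined so the example is self-contained
sample = '''[
  {"name": "Indian Institute of Technology Bombay", "location": "Mumbai"},
  {"name": "Indian Institute of Technology Madras", "location": "Chennai"}
]'''

colleges = json.loads(sample)
for c in colleges:
    print(c["name"], "-", c["location"])
```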