Crawler not producing any output
Building my first web scraper. I just want to grab a list of player names and append them to a CSV file. The scraper seems to run, but it doesn't do what I expect: the output file only ever contains a single name, and it is always the last one scraped. When I re-run the crawler it is a different name each time. On this run, the name written to the CSV file was Ola Aina.
import scrapy
from scrapy.crawler import CrawlerProcess

# Create the spider class
class premSpider(scrapy.Spider):
    name = "premSpider"

    def start_requests(self):
        # Create a list of URLs we wish to scrape
        urls = ['https://www.premierleague.com/players']
        # Iterate through each URL and send it to be parsed
        for url in urls:
            # yield acts somewhat like return here
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Extract links to player pages
        plinks = response.xpath('//tr').css('a::attr(href)').extract()
        # Follow links to specific player pages
        for plink in plinks:
            yield response.follow(url=plink, callback=self.parse2)

    def parse2(self, response):
        plinks2 = response.xpath('//a[@href="stats"]').css('a::attr(href)').extract()
        for link2 in plinks2:
            yield response.follow(url=link2, callback=self.parse3)

    def parse3(self, response):
        names = response.xpath('//div[@class="name t-colour"]/text()').extract()
        filepath = 'playerlinks.csv'
        with open(filepath, 'w') as f:
            f.writelines([name + '\n' for name in names])

process = CrawlerProcess()
process.crawl(premSpider)
process.start()
You could also use Scrapy's own FEEDS export. The reason your version only ever keeps one name is that parse3 opens the output file in 'w' mode for every response, truncating it each time, so only the last write survives. Instead of writing the file by hand, add this below your spider's name:

custom_settings = {'FEEDS': {'results1.csv': {'format': 'csv'}}}

and modify parse3 as follows:
def parse3(self, response):
    names = response.xpath('.//div[@class="name t-colour"]/text()').get()
    yield {'names': names}
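
For reference, here is a minimal sketch of the full spider with the FEEDS setting in place, assuming Scrapy >= 2.1 (where the FEEDS setting was introduced). The URL, the selectors, and the results1.csv filename are taken from the code above; everything else is standard Scrapy:

import scrapy
from scrapy.crawler import CrawlerProcess

class premSpider(scrapy.Spider):
    name = "premSpider"
    start_urls = ['https://www.premierleague.com/players']
    # FEEDS tells Scrapy to collect every yielded item into results1.csv
    custom_settings = {'FEEDS': {'results1.csv': {'format': 'csv'}}}

    def parse(self, response):
        # Follow links to the individual player pages
        for plink in response.xpath('//tr').css('a::attr(href)').extract():
            yield response.follow(url=plink, callback=self.parse2)

    def parse2(self, response):
        # Follow the link to each player's stats page
        for link2 in response.xpath('//a[@href="stats"]').css('a::attr(href)').extract():
            yield response.follow(url=link2, callback=self.parse3)

    def parse3(self, response):
        # Yield an item instead of writing a file by hand;
        # the feed exporter appends each item to the CSV as a row.
        name = response.xpath('.//div[@class="name t-colour"]/text()').get()
        yield {'names': name}

process = CrawlerProcess()
process.crawl(premSpider)
process.start()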