Scraping Multiple Pages Scrapy
I'm trying to scrape the Billboard Top 100 for every year. I have a file that works for one year at a time, but I want it to crawl through all the years and collect that data. Here's my current code:
from scrapy import Spider
from scrapy.selector import Selector
from Billboard.items import BillboardItem
from scrapy.exceptions import CloseSpider
from scrapy.http import Request

URL = "http://www.billboard.com/archive/charts/%/hot-100"

class BillboardSpider(Spider):
    name = 'Billboard_spider'
    allowed_urls = ['http://www.billboard.com/']
    start_urls = [URL % 1958]

    def _init_(self):
        self.page_number=1958

    def parse(self, response):
        print self.page_number
        print "----------"
        rows = response.xpath('//*[@id="block-system-main"]/div/div/div[2]/table/tbody/tr').extract()
        for row in rows:
            IssueDate = Selector(text=row).xpath('//td[1]/a/span/text()').extract()
            Song = Selector(text=row).xpath('//td[2]/text()').extract()
            Artist = Selector(text=row).xpath('//td[3]/a/text()').extract()
            item = BillboardItem()
            item['IssueDate'] = IssueDate
            item['Song'] = Song
            item['Artist'] = Artist
            yield item

        self.page_number += 1
        yield Request(URL % self.page_number)
But I get the error message:

start_urls = [URL % 1958]
ValueError: unsupported format character '/' (0x2f) at index 41

Any ideas? I want the code to automatically change the year in the original "URL" link to 1959 and continue year by year until it stops finding a table, then close.
The error you're getting is because you're not using the correct string formatting syntax. You can look here for the details of how it works.
The reason it doesn't work in your particular case is that your URL is missing an 's':
URL = "http://www.billboard.com/archive/charts/%/hot-100"
should be
URL = "http://www.billboard.com/archive/charts/%s/hot-100"
In any case, it's better to use new-style string formatting:
URL = "http://www.billboard.com/archive/charts/{}/hot-100"
start_urls = [URL.format(1958)]
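As a quick sanity check, the difference between the broken pattern and the two working ones can be seen in a few lines (the variable names here are just illustrative):

```python
URL_BROKEN = "http://www.billboard.com/archive/charts/%/hot-100"
URL_OLD = "http://www.billboard.com/archive/charts/%s/hot-100"
URL_NEW = "http://www.billboard.com/archive/charts/{}/hot-100"

try:
    # '%' is not followed by a conversion type like 's', so this raises
    URL_BROKEN % 1958
except ValueError as e:
    print(e)  # unsupported format character '/' (0x2f) at index 41

print(URL_OLD % 1958)        # old-style %-formatting
print(URL_NEW.format(1958))  # new-style str.format
```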
Moving on, there are a few other issues with your code:
def _init_(self):
    self.page_number=1958
If you want to use an init function, it should be named __init__ (two underscores on each side), and since you're extending Spider, you need to pass along *args and **kwargs so you can call the parent constructor:
def __init__(self, *args, **kwargs):
    super(MySpider, self).__init__(*args, **kwargs)
    self.page_number = 1958
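A minimal runnable sketch of that constructor pattern, using a stand-in base class so it runs without Scrapy installed (with the real scrapy.Spider the super() call works the same way):

```python
class Spider(object):  # stand-in for scrapy.Spider, purely for illustration
    def __init__(self, *args, **kwargs):
        self.initialized = True

class BillboardSpider(Spider):
    def __init__(self, *args, **kwargs):
        # forward *args/**kwargs so the parent constructor still runs
        super(BillboardSpider, self).__init__(*args, **kwargs)
        self.page_number = 1958

spider = BillboardSpider()
print(spider.initialized, spider.page_number)  # True 1958
```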
That said, it sounds like you'd be better off not using __init__ at all and just generating all the URLs up front with a list comprehension:
start_urls = ["http://www.billboard.com/archive/charts/{year}/hot-100".format(year=year)
for year in range(1958, 2017)]
start_urls will then look like this:
['http://www.billboard.com/archive/charts/1958/hot-100',
'http://www.billboard.com/archive/charts/1959/hot-100',
'http://www.billboard.com/archive/charts/1960/hot-100',
'http://www.billboard.com/archive/charts/1961/hot-100',
...
 'http://www.billboard.com/archive/charts/2016/hot-100']
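The comprehension can be checked on its own; note that range(1958, 2017) stops at 2016, so the list covers 59 years:

```python
start_urls = ["http://www.billboard.com/archive/charts/{year}/hot-100".format(year=year)
              for year in range(1958, 2017)]

print(len(start_urls))    # 59 years: 1958 through 2016
print(start_urls[0])      # .../1958/hot-100
print(start_urls[-1])     # .../2016/hot-100
```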
You're also not populating the BillboardItem correctly, since objects (by default) don't support item assignment:
item = BillboardItem()
item['IssueDate'] = IssueDate
item['Song'] = Song
item['Artist'] = Artist
should be:
item = BillboardItem()
item.IssueDate = IssueDate
item.Song = Song
item.Artist = Artist
although it's usually better to do this in the class's __init__ function:
class BillboardItem(object):
    def __init__(self, issue_date, song, artist):
        self.issue_date = issue_date
        self.song = song
        self.artist = artist
and then create the item via item = BillboardItem(IssueDate, Song, Artist)
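Putting those two pieces together, with made-up field values purely for illustration:

```python
# Plain-object version of the item; the values below are invented examples.
class BillboardItem(object):
    def __init__(self, issue_date, song, artist):
        self.issue_date = issue_date
        self.song = song
        self.artist = artist

item = BillboardItem("1958-08-04", "Poor Little Fool", "Ricky Nelson")
print(item.song)  # attributes are read back directly, not via dict-style keys
```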
Updated:
Anyway, I cleaned up your code (and created a BillboardItem, since I don't know exactly what yours looks like):
from scrapy import Spider, Item, Field
from scrapy.selector import Selector
from scrapy.exceptions import CloseSpider
from scrapy.http import Request

class BillboardItem(Item):
    issue_date = Field()
    song = Field()
    artist = Field()

class BillboardSpider(Spider):
    name = 'billboard'
    allowed_urls = ['http://www.billboard.com/']
    start_urls = ["http://www.billboard.com/archive/charts/{year}/hot-100".format(year=year)
                  for year in range(1958, 2017)]

    def parse(self, response):
        print(response.url)
        print("----------")
        rows = response.xpath('//*[@id="block-system-main"]/div/div/div[2]/table/tbody/tr').extract()
        for row in rows:
            issue_date = Selector(text=row).xpath('//td[1]/a/span/text()').extract()
            song = Selector(text=row).xpath('//td[2]/text()').extract()
            artist = Selector(text=row).xpath('//td[3]/a/text()').extract()
            item = BillboardItem(issue_date=issue_date, song=song, artist=artist)
            yield item
Hope this helps. :)