Adding string to scraped url (scrapy)
I made a spider that crawls through the posts in a forum and saves all the links posted by users. The problem is that the forum uses a "do you really want to leave the site" redirect page, which makes the links I scrape incomplete, like this:

/leave.php?u=http%3A%2F%2Fwww.lonestatistik.se%2Floner.asp%2Fyrke%2FUnderskoterska-1242

To be usable, each link needs the site's domain prepended. Is there any way to add it? Or, alternatively, to scrape the target url directly.
def parse(self, response):
    next_link = response.xpath("//a[contains(., '>')]//@href").extract()[0]
    if len(next_link):
        yield self.make_requests_from_url(urljoin(response.url, next_link))

    posts = Selector(response).xpath('//div[@class="post_message"]')
    for post in posts:
        i = TextPostItem()
        i['url'] = post.xpath('a/@href').extract()
        yield i
-- EDIT --
So, based on eLRuLL's answer, I did this:
def parse(self, response):
    next_link = response.xpath("//a[contains(., '>')]//@href").extract()[0]
    if len(next_link):
        yield self.make_requests_from_url(urljoin(response.url, next_link))

    posts = Selector(response).xpath('//div[@class="post_message"]')
    for post in posts:
        i = TextPostItem()
        url = post.xpath('./a/@href').extract_first()
        i['new_url'] = urljoin(response.url, url)
        yield i
Which works. Except that I now scrape a url for every post, even when that post has no link in it.
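One way to avoid yielding an item for link-less posts is to check the result of `extract_first()`, which returns `None` when the XPath matches nothing. A minimal sketch, with a made-up helper name, using Python 3's `urllib.parse`:

```python
from urllib.parse import urljoin

def post_url_or_none(base_url, href):
    # extract_first() yields None when the post contains no <a> tag;
    # only join and return a full url when a link was actually found.
    if href is None:
        return None
    return urljoin(base_url, href)

# In the spider loop this becomes:
#     url = post.xpath('./a/@href').extract_first()
#     if url:
#         i['new_url'] = urljoin(response.url, url)
#         yield i
```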
It seems that you need to add the domain url at the start of the new url. You can try using response.url to join the relative link onto that base, for example:

from urlparse import urljoin
...
url = post.xpath('./a/@href').extract_first()
new_url = urljoin(response.url, url)  # someurl.com/leave.php?...
yield Request(new_url, ...)
...
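To illustrate the two routes the question asks about (prepending the domain with urljoin, or pulling the real target out of the leave.php redirect's `u` query parameter), here is a short sketch using Python 3's `urllib.parse`; the answer's `urlparse` import is the Python 2 name, and the base url below is made up:

```python
from urllib.parse import urljoin, urlparse, parse_qs

# A scraped href: relative, with the real target wrapped in the "u" parameter.
scraped = "/leave.php?u=http%3A%2F%2Fwww.lonestatistik.se%2Floner.asp%2Fyrke%2FUnderskoterska-1242"

# Option 1: prepend the forum's domain (hypothetical base url).
base = "http://someforum.example/threads/123"
full = urljoin(base, scraped)
# full == "http://someforum.example/leave.php?u=http%3A%2F%2F..."

# Option 2: skip the redirect page and take the percent-decoded target directly.
target = parse_qs(urlparse(scraped).query)["u"][0]
# target == "http://www.lonestatistik.se/loner.asp/yrke/Underskoterska-1242"
```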