Scrapy: need to iterate over items and pages
I have a list of Instagram accounts in a .txt file.
This is the URL I have to scrape: https://brandfollowers.io/kol/all-post?uid=$INSTAGRAM$&page_num=$PAGENUMBER$
(Note that I put $INSTAGRAM$ and $PAGENUMBER$ where the variables I need to change go.)
For example: https://brandfollowers.io/kol/all-post?uid=philipppleinofficial&page_num=1
I'm new to this, but I did manage to get all the items on page 1 for every Instagram account in my list. However, I can't figure out how to iterate over all the pages for each account.
Could you give me some tips? I'm very new to this topic.
This is what I have so far:
# -*- coding: utf-8 -*-
import scrapy
import json


class ContenidoSpider(scrapy.Spider):
    name = 'BACKUP_contenido'
    allowed_domains = ['brandfollowers.io']
    start_urls = ['http://brandfollowers.io/']
    base_url = 'http://brandfollowers.io/kol/all-post?uid='

    def parse(self, response):
        with open('list.txt', 'r') as f:
            lines = f.readlines()
        instagrams = []
        for line in lines:
            # Strip the prefix and the trailing newline, which would
            # otherwise end up inside the request URL
            new_line = line.replace('https://www.instagram.com/', '').strip()
            instagrams.append(new_line)
        for instagram in instagrams:
            posts_url = self.base_url + instagram
            yield scrapy.Request(posts_url, callback=self.parse_json)

    def parse_json(self, response):
        current_page = 0
        json_response = json.loads(response.text)
        path = json_response["data"]["models"]
        # Iterate over however many posts the page actually returned,
        # rather than assuming a fixed six per page
        pagesize = len(path)
        while current_page < pagesize:
            brand = path[current_page]["author"]["platform_unique_id"]
            date = path[current_page]["platform_create_time"]
            comments = path[current_page]["comment_count"]
            likes = path[current_page]["like_count"]
            engagement_rate = path[current_page]["share_count"]
            description = path[current_page]["description"]
            url_post = path[current_page]["post_url"]
            picture_link = path[current_page]["picture_link"]
            yield {
                'BRAND': brand,
                'DATE': date,
                'COMMENTS': comments,
                'LIKES': likes,
                'ENGAGEMENT RATE': engagement_rate,
                'DESCRIPTION': description,
                'URL': url_post,
                'PICTURE LINK': picture_link,
            }
            current_page += 1
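For reference, the extraction step above could also be written as a small standalone helper that loops over whatever the page returns, which is easier to test outside the spider. This is only a sketch; the JSON keys are the ones already used in the code above:

```python
def extract_items(json_response):
    """Yield one item dict per post in a parsed page of the API response."""
    for post in json_response["data"]["models"]:
        yield {
            'BRAND': post["author"]["platform_unique_id"],
            'DATE': post["platform_create_time"],
            'COMMENTS': post["comment_count"],
            'LIKES': post["like_count"],
            'ENGAGEMENT RATE': post["share_count"],
            'DESCRIPTION': post["description"],
            'URL': post["post_url"],
            'PICTURE LINK': post["picture_link"],
        }
```

Inside the spider, parse_json would then just be `yield from extract_items(json.loads(response.text))`.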
Assuming there are 10 pages:

amount_of_pages = 10
pages = [f"https://brandfollowers.io/kol/all-post?uid=philipppleinofficial&page_num={n+1}" for n in range(amount_of_pages)]
for page in pages:
    # Do something
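The sketch above can be combined with the account loop: build one URL per (account, page) pair, then yield one Request per URL from parse(). This is a minimal sketch; the fixed page count of 10 is an assumption, since the question doesn't say how the API signals the last page (if the JSON reports a total page count, use that instead):

```python
BASE_URL = 'https://brandfollowers.io/kol/all-post?uid={uid}&page_num={page}'

def build_page_urls(instagrams, amount_of_pages=10):
    """Return every page URL for every account in the list.

    amount_of_pages is assumed fixed here; replace it with the real
    page count per account if the API exposes one.
    """
    urls = []
    for uid in instagrams:
        for n in range(1, amount_of_pages + 1):
            urls.append(BASE_URL.format(uid=uid, page=n))
    return urls

# In the spider's parse(), you would then yield one Request per URL:
#     for url in build_page_urls(instagrams):
#         yield scrapy.Request(url, callback=self.parse_json)
```

An alternative, more Scrapy-idiomatic design is to have parse_json itself yield a Request for page_num + 1 (carried in the request's meta) whenever the current page's "models" list is non-empty, so the spider stops naturally at the last page instead of guessing a count.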