How do I scrape a full Instagram page in Python?

Question

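Long story short, I'm trying to write an Instagram scraper in Python that loads an entire page and grabs all the image links. I have it working; the only problem is that it only loads the first 12 photos that Instagram displays. Is there any way I can tell requests to load the entire page?

Working code: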

import json
import requests
from bs4 import BeautifulSoup
import sys

r = requests.get('https://www.instagram.com/accountName/')
soup = BeautifulSoup(r.text, 'lxml')

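# Guard against <script> tags with no text (t is None), which would otherwise raise AttributeError
script = soup.find('script', text=lambda t: t and t.startswith('window._sharedData'))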
page_json = script.text.split(' = ', 1)[1].rstrip(';')
data = json.loads(page_json)
non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd)

for post in data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges']:
    image_src = post['node']['display_url']
    print(image_src)

Answer 1

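As Scratch already mentioned, Instagram uses "infinite scrolling", which doesn't let you load the entire page at once. You can, however, read the total number of posts at the top of the page (in the first span with the _fd86t class) and check whether the page already contains all of them. If it doesn't, you'll have to issue a GET request to obtain a new JSON response. The upside is that this request contains a first field, which appears to let you change how many posts you get back. You can change it from the default 12 to fetch all the remaining posts (hopefully).

The necessary request looks similar to the following (I've anonymized the actual entries, and had some help from the comments):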

https://www.instagram.com/graphql/query/?query_hash=472f257a40c653c64c666ce877d59d2b&variables={"id":"XXX","first":12,"after":"XXX"}
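A minimal sketch of how such a URL could be assembled with a larger first value. The helper name build_query_url is hypothetical, the "XXX" placeholders are kept from the request above, and whether Instagram still accepts this query_hash or an arbitrary first value is an assumption, not something verified here.

```python
import json
from urllib.parse import urlencode

GRAPHQL_URL = "https://www.instagram.com/graphql/query/"


def build_query_url(query_hash, user_id, first, after=None):
    """Build the GraphQL query URL (hypothetical helper, not an official API)."""
    variables = {"id": user_id, "first": first}
    if after is not None:
        # Cursor returned by the previous response; omitted for the first page.
        variables["after"] = after
    params = {
        "query_hash": query_hash,
        # The variables are sent as a compact JSON string, percent-encoded by urlencode.
        "variables": json.dumps(variables, separators=(",", ":")),
    }
    return GRAPHQL_URL + "?" + urlencode(params)


# "XXX" stands in for the anonymized user id and cursor from the request above.
url = build_query_url("472f257a40c653c64c666ce877d59d2b", "XXX", 50, after="XXX")
print(url)
```

Passing the result to requests.get(url) and decoding it with .json() would then return the next batch, assuming the endpoint answers unauthenticated requests.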

Answer 2

parse_ig.py

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
from InstagramAPI import InstagramAPI
import time

c = webdriver.Chrome()
# load IG page here, whether a hashtag or a public user's page using c.get(url)

for i in range(10):
    c.send_keys(Keys.END)
    time.sleep(1)

soup = BeautifulSoup(c.page_source, 'html.parser')
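# href=True skips <a> tags without an href attribute, which would otherwise raise KeyError
ids = [a['href'].split('/') for a in soup.find_all('a', href=True) if 'tagged' in a['href']]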

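Once you have the IDs, you can use Instagram's old API to fetch the data for them. I'm not sure whether it still works, but I used an API that is limited by FB gradually deprecating parts of the old API. Here's a link, in case you don't want to access the Instagram API yourself :)

You can also improve on this simple code. For example, instead of a "for" loop you could use a "while" loop (i.e., keep pressing the END key while the page is still loading).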
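A rough sketch of that "while" loop idea. Instead of pressing END, it scrolls with JavaScript through the driver's execute_script method and stops once the document height no longer grows; that swap, the function name scroll_until_stable, and the pause and max_rounds defaults are all assumptions made for illustration.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=100):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `driver` is any object exposing Selenium's execute_script(script) method.
    Returns the number of scroll rounds performed.
    """
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    rounds = 0
    while rounds < max_rounds:  # hard cap so a never-ending feed cannot loop forever
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load the next batch
        new_height = driver.execute_script(get_height)
        rounds += 1
        if new_height == last_height:
            break  # nothing new was loaded, assume the end was reached
        last_height = new_height
    return rounds
```

With the driver from the code above, calling scroll_until_stable(c) would replace the fixed range(10) loop. A slow connection may need a longer pause, since an unchanged height after one pause is treated as the end of the page.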

Answer 3

@zero's answer is incomplete (at least as of January 15, 2019). c.send_keys is not a valid method. Instead, this is what I did:

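from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import time

c = webdriver.Chrome()
c.get(some_url)  # some_url: the profile or hashtag page to scrape

# Selenium 4 removed find_element_by_tag_name; find_element(By.TAG_NAME, ...) is the replacement
element = c.find_element(By.TAG_NAME, 'body')  # or whatever tag you're looking to scrape from

for i in range(10):
    element.send_keys(Keys.END)  # keys are sent to an element, not to the driver
    time.sleep(1)  # give the page time to load more posts

soup = BeautifulSoup(c.page_source, 'html.parser')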

Answer 4

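Here's a link to a good tutorial on scraping Instagram profile info and posts; it also handles pagination and works as of 2022: Scraping Instagram

In summary, you have to use the Instagram GraphQL API endpoint, which requires the user identifier and the cursor from the previous page's response: https://instagram.com/graphql/query/?query_id=17888483320059182&id={user_id}&first=24&after={end_cursor}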
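The cursor-based pagination could be sketched roughly as follows. The function names fetch_page and iter_image_urls are hypothetical, and the response layout (data.user.edge_owner_to_timeline_media with a page_info object holding has_next_page and end_cursor) is an assumption modeled on the JSON structure used in the question, so check it against a real response before relying on it.

```python
URL_TEMPLATE = (
    "https://instagram.com/graphql/query/"
    "?query_id=17888483320059182&id={user_id}&first={first}&after={end_cursor}"
)


def fetch_page(user_id, end_cursor, first=24):
    """Fetch one page of posts as a dict (hypothetical helper)."""
    import requests  # third-party; imported here so the paging logic below runs without it

    url = URL_TEMPLATE.format(user_id=user_id, first=first, end_cursor=end_cursor)
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.json()


def iter_image_urls(user_id, end_cursor, fetch=fetch_page, max_pages=10):
    """Yield display_url for each post, following cursors page by page."""
    for _ in range(max_pages):  # cap the number of requests
        payload = fetch(user_id, end_cursor)
        # Assumed response shape, see the note above.
        media = payload["data"]["user"]["edge_owner_to_timeline_media"]
        for edge in media["edges"]:
            yield edge["node"]["display_url"]
        page_info = media["page_info"]
        if not page_info["has_next_page"]:
            break
        end_cursor = page_info["end_cursor"]  # cursor for the next request
```

The user_id and the first end_cursor would come from the profile page's initial response; passing a different fetch callable makes it possible to add headers, sessions, or rate limiting without touching the loop.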


python

python-3.x

instagram

python-requests