Can requests be used to print all elements in the scroll at the top of a Google search?

The goal is to print the text of every neighborhood in the scroll at the top of a Google search when entering a query such as "New York City neighborhoods".

Using requests like this causes no encoding problems...

googleSearch = BeautifulSoup(requests.get('https://www.google.com/search?q=new+york+city+neighborhoods').content, "html.parser")

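...but it does not return all of the response HTML I expect (only a few of the scroll items are present, whereas the Postman and Chrome responses show all of them) [1]. That is why I tried the following approach instead, which gives me encoding problems: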

url = "https://www.google.com/search"

querystring = {"q":"New York City neighborhoods"}

headers = {
    'upgrade-insecure-requests': "1",
    'user-agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36",
    'x-chrome-uma-enabled': "1",
    'x-client-data': "CIy2yQEIo7bJAQjEtskBCIuZygEI+pzKAQipncoB",
    'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    'accept-encoding': "gzip, deflate, sdch, br",
    'avail-dictionary': "MC9c6ZtH",
    'accept-language': "en-US,en;q=0.8",
    'cookie': "HSID=AQGYffYcWgUgwoIGG; SSID=AsyTtOTpG3P0TWe_e; APISID=DZOqFSNpfZmThOP6/A15eY85jEZTDT47_j; SAPISID=4jqCaE3zLEcO8GG4/ANI8HEy3etCmKfit2; SID=4AMk07dZM5wKaFcBAD7PgfLgMV1imGkqULwEdE9VI3lwoNRghaVTGT4ZT0mCGgzehY3mFg.; OGPC=5062210-7:765334528-2:699960320-1:961419264-9:; NID=97=bZNps3TAJFPAppe9EQbLyUDwXDbEFN57lT_capK2DQMWMVo7nEnYlPV-_g5OkOCERrN6MS5PxJXuVUOhjHeZGhCkS4FubcEapEzyuSQVS9rJM99rPzwE98ra47eP-ay0YTR-TawjFJ-0hAqT_j7SI7vQGVIU6yj4awM0hEt4ZXTd4k0RnH6kJPb0qVCc8AnQQLg4VZ0Kc1s83vJo6k7jFm-GCEoi; HSID=AQGYffYcWgUgwoIGG; SSID=AsyTtOTpG3P0TWe_e; APISID=DZOqFSNpfZmThOP6/A15eY85jEZTDT47_j; SAPISID=4jqCaE3zLEcO8GG4/ANI8HEy3etCmKfit2; SID=4AMk07dZM5wKaFcBAD7PgfLgMV1imGkqULwEdE9VI3lwoNRghaVTGT4ZT0mCGgzehY3mFg.; OGPC=5062210-7:765334528-2:699960320-1:961419264-9:; NID=97=bZNps3TAJFPAppe9EQbLyUDwXDbEFN57lT_capK2DQMWMVo7nEnYlPV-_g5OkOCERrN6MS5PxJXuVUOhjHeZGhCkS4FubcEapEzyuSQVS9rJM99rPzwE98ra47eP-ay0YTR-TawjFJ-0hAqT_j7SI7vQGVIU6yj4awM0hEt4ZXTd4k0RnH6kJPb0qVCc8AnQQLg4VZ0Kc1s83vJo6k7jFm-GCEoi; DV=Qg7Cq8EJDPcYvgxe_quK9y6d3FXJtAI",
    'cache-control': "no-cache",
    'postman-token': "e6cec459-250e-1795-0e78-c450e5dfd56b"
    }

When retrieving the response (the status code is 200):

googleSearch = BeautifulSoup(requests.request("GET", url, headers=headers, params=querystring).content, "html.parser")

googleSearch.text prints as:

No handlers could be found for logger "bs4.dammit"
��������[��#ٕ ֑��RK=��V��i$��YU��$����+Y��j2H&��L>”��R*^ $��gDefukz0��j����|��ax���1��k�a��6y=��X����X�þ��`ɬ.MK;pgoĽ�{��{��D5�gLJ�� ...

...followed by many more strange characters.

Can requests be used for Google searches, or is another module needed?

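[1] Expected HTML: the HTML shown in the responses from the Postman app and Chrome contains div[class="kltat"] elements (one for each item in the scroll at the top of the page, in this case the neighborhoods, even those not yet visible in the scroll), whereas the other response contains HTML with only some of the scroll items and no div[class="kltat"] elements.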

Including this line tells Google's servers that they may respond using HTTP compression:

'accept-encoding': "gzip, deflate, sdch, br"

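My guess is that the compression used is gzip, although the header also allows deflate, Brotli (br), and Google's Shared Dictionary Compression (sdch). Keep in mind that requests decompresses gzip and deflate responses on its own, so bytes that arrive still compressed may well be Brotli or sdch instead.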

You can remove the accept-encoding line from the headers, or import the gzip library and decompress the content yourself.
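As a rough sketch of the second option, the helper below decompresses a body by hand with the standard library. The name decode_body and its arguments are made up for illustration. It assumes you already have the raw bytes and the value of the response's Content-Encoding header, and it only covers gzip and deflate; Brotli and sdch would need support from outside the standard library.

```python
import gzip
import zlib


def decode_body(raw: bytes, content_encoding: str, charset: str = "utf-8") -> str:
    """Decompress raw bytes according to Content-Encoding, then decode them to text."""
    encoding = content_encoding.strip().lower()
    if encoding == "gzip":
        raw = gzip.decompress(raw)
    elif encoding == "deflate":
        try:
            raw = zlib.decompress(raw)  # zlib-wrapped deflate
        except zlib.error:
            raw = zlib.decompress(raw, -zlib.MAX_WBITS)  # raw deflate stream
    elif encoding not in ("", "identity"):
        # br and sdch end up here: the standard library cannot decode them
        raise ValueError(f"unsupported content-encoding: {encoding}")
    return raw.decode(charset, errors="replace")


# Demonstration with locally compressed data, so no network access is needed
sample = "<div>Harlem</div>".encode("utf-8")
print(decode_body(gzip.compress(sample), "gzip"))
```

Removing the accept-encoding line is still the simpler route, since the server then has no reason to send a body that requests cannot decode.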

You really only need to send the user-agent; as far as I know, the rest of the headers are superfluous.
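A minimal sketch of that trimming, assuming the headers start out as a plain dict like the one copied from Postman. The helper name minimal_headers is hypothetical, and the dict below is an abbreviated copy of the question's headers.

```python
def minimal_headers(headers: dict) -> dict:
    """Keep only the User-Agent header, matching the name case-insensitively."""
    return {name: value for name, value in headers.items() if name.lower() == "user-agent"}


# Abbreviated copy of the headers from the question
postman_headers = {
    "upgrade-insecure-requests": "1",
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36",
    "accept-encoding": "gzip, deflate, sdch, br",
    "cache-control": "no-cache",
}

headers = minimal_headers(postman_headers)
print(headers)

# The trimmed dict can then be passed to requests, for example:
# requests.get("https://www.google.com/search", headers=headers,
#              params={"q": "New York City neighborhoods"})
```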

The strange (binary) characters appear because you are reading the body through .content, which returns raw bytes, while also sending the "accept-encoding": "gzip, deflate, sdch, br" header. Use .text instead, which decodes the body to a string for you, and drop that header so that the strange characters go away.

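The code below retrieves 33 of the ~40+ elements in this case. To get more results, you need to use selenium or another browser automation tool to click the right-arrow button, load the remaining elements, and scrape them.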


Code that also scrapes the thumbnails, with an example in the online IDE:

from bs4 import BeautifulSoup
import requests, lxml, re, json

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  'q': 'new york city neighborhoods',
  'gl': 'us',
}

def bs4_get_top_carousel():

  html = requests.get('https://www.google.com/search', headers=headers, params=params)
  soup = BeautifulSoup(html.text, 'lxml')

  carousel_name = soup.select_one('.F0gfrd+ .z4P7Tc').text

  data = {f"{carousel_name}": []}

  all_script_tags = soup.select('script')

  # https://regex101.com/r/NYdrL5/1
  thumbnails = re.findall(r"<script nonce=\".*?\">\(\w+\(\)\{\w+\s?\w+='(.*?)';\w+\s?\w+=\['\w+'\];\w+\(\w+,\w+\);\}\)\(\);<\/script>", str(all_script_tags))

  for result, thumbnail in zip(soup.select('.ct5Ked'), thumbnails):
    title = result["aria-label"]
    link = f"https://www.google.com{result['href']}"
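    # select_one() returns None when the element is missing, so .text raises AttributeError
    try:
      extensions = result.select_one(".cp7THd .FozYP").text
    except AttributeError:
      extensions = None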
    
    decoded_thumbnail = bytes(thumbnail, 'ascii').decode('unicode-escape')
    # print(f'{title}\n{link}\n{extensions}\n{decoded_thumbnail}\n')

    data[carousel_name].append({
      'title': title,
      'link': link,
      'extensions': [extensions],
      'thumbnail': decoded_thumbnail
    })
  
  print(json.dumps(data, indent=2, ensure_ascii=False))


bs4_get_top_carousel()

---------------------
'''
[
 ...
   {
      "title": "Lower East Side",
      "link": "https://www.google.com/search?gl=us&q=Lower+East+Side&stick=H4sIAAAAAAAAAONgFuLUz9U3MIo3sjBTAjMNKy2NzbUUspOt9HPykxNLMvPz9AtyEpNTrfJSM9MzkvKLMvLzU4ofMfpxC7z8cU9YynXSmpPXGO25CGoREudic80rySypFOKV4uZCWGzFpMHEs4iV3ye_PLVIwTWxuEQhODMldQIbIwAs7VbHoAAAAA&sa=X&ved=2ahUKEwimh-H-q93zAhXCZc0KHXj3DAYQ-BZ6BQgBEI4B",
      "extensions": [
        null
      ],
      "thumbnail": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/4QAqRXhpZgAASUkqAAgAAAABADEBAgAHAAAAGgAAAAAAAABHb29nbGUAAP/bAIQAAwICAw0CCg0LAwgLDhAKDgsKCg4RChANCg4OCxAIDQsIEAgLCgkLDQoIDQ0LCgoKCgoLCgoLDQ0QCgsNCwoJCgEDBAQGBQYKBgYKEA0LDhAQEBAQEA8QEA8PDxAPDg8QDw0PDw8ODw0ODxAPDQ0QDg8NDQ4PDw8QEA8NDQ0NDw0O..."
    }
  ...
]
'''
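The decoded_thumbnail step may look odd at first, so here is a small standalone sketch of what it does. It assumes the string captured by the regex contains escape sequences such as \x3d in place of "="; the sample string below is invented for illustration and is not real Google output, and decode_thumbnail is a hypothetical helper name.

```python
def decode_thumbnail(escaped: str) -> str:
    """Turn escape sequences such as \\x3d back into the characters they stand for."""
    return bytes(escaped, "ascii").decode("unicode-escape")


# Invented sample: a data URI whose base64 padding is written as \x3d\x3d
escaped = r"data:image/jpeg;base64,AAAA\x3d\x3d"
print(decode_thumbnail(escaped))  # data:image/jpeg;base64,AAAA==
```

Without this step the trailing padding stays as literal backslash sequences, and the data URI will not render as an image.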

If you need more information on this topic, I wrote a dedicated blog post about how to scrape Google Carousel results.


Alternatively, you can use the Google Knowledge Graph API from SerpApi to achieve the same thing. It's a paid API with a free plan.

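The difference in your case is that you don't have to handle the extraction process, figure out which CSS selectors to use, or deal with other such details. Instead, you only need to iterate over structured JSON and pick out the data you want.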

Code to integrate:

from serpapi import GoogleSearch
import os, json

def serpapi_get_top_carousel():
    params = {
      "api_key": os.getenv("API_KEY"),
      "engine": "google",
      "q": "new york city neighborhoods",
      "hl": "en"
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    for result in results['knowledge_graph']['neighborhoods']:
        print(json.dumps(result, indent=2, ensure_ascii=False))


serpapi_get_top_carousel()


---------------
'''
"neighborhoods": [
  {
  "name":"Harlem",
  "link": "https://www.google.com/search?q=Harlem&stick=H4sIAAAAAAAAAONgFuLUz9U3MIo3sjBT4gAx0yxNSrQUspOt9HPykxNLMvPz9AtyEpNTrfJSM9MzkvKLMvLzU4ofMfpxC7z8cU9YynXSmpPXGO25CGoREudic80rySypFOKV4uZC2GvFpMHEs4iVzSOxKCc1dwIbIwBbfLXHlgAAAA&sa=X&ved=2ahUKEwiipvO2q93zAhUylGoFHcyFA4wQ-BZ6BAgBEDQ",
  "image": "https://serpapi.com/searches/61725341e4a23d51edb9dabf/images/d59e4f2f273f964cdd7164417183fc3f42a0f8724e78a4815f5e934903209df6acfc73bbca024b9a.jpeg"
  }
...
]
'''
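If only the neighborhood names are needed, iterating the structured JSON can be as short as the sketch below. It runs on a hand-written stand-in for the parsed response rather than a live API call: the "Harlem" entry mirrors the output above with its link shortened, the second entry is invented for illustration, and neighborhood_names is a hypothetical helper.

```python
def neighborhood_names(results: dict) -> list:
    """Collect the 'name' of every item under knowledge_graph -> neighborhoods."""
    items = results.get("knowledge_graph", {}).get("neighborhoods", [])
    return [item["name"] for item in items]


# Stand-in for the dict returned by search.get_dict()
sample_results = {
    "knowledge_graph": {
        "neighborhoods": [
            {"name": "Harlem", "link": "https://www.google.com/search?q=Harlem"},
            {"name": "Lower East Side", "link": "https://www.google.com/search?q=Lower+East+Side"},
        ]
    }
}

for name in neighborhood_names(sample_results):
    print(name)
```

Using .get() with defaults means a response without a knowledge graph yields an empty list instead of raising a KeyError.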

Disclaimer, I work for SerpApi.