Can requests be used to print all elements in the scroll at the top of a Google search?

The goal is to print the text of every neighborhood in the scroll at the top of a Google search when entering terms such as "New York City neighborhoods".

Although there is no encoding problem when using requests like this...

googleSearch = BeautifulSoup(requests.get('https://www.google.com/search?q=new+york+city+neighborhoods').content, "html.parser")

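...it does not return all of the response HTML I expect (only a few of the scroll items are present, whereas the Postman and Chrome responses show all of them) [1]. That is why I tried the following approach, which however gives me an encoding problem: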

url = "https://www.google.com/search"

querystring = {"q":"New York City neighborhoods"}

headers = {
    'upgrade-insecure-requests': "1",
    'user-agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36",
    'x-chrome-uma-enabled': "1",
    'x-client-data': "CIy2yQEIo7bJAQjEtskBCIuZygEI+pzKAQipncoB",
    'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    'accept-encoding': "gzip, deflate, sdch, br",
    'avail-dictionary': "MC9c6ZtH",
    'accept-language': "en-US,en;q=0.8",
    'cookie': "HSID=AQGYffYcWgUgwoIGG; SSID=AsyTtOTpG3P0TWe_e; APISID=DZOqFSNpfZmThOP6/A15eY85jEZTDT47_j; SAPISID=4jqCaE3zLEcO8GG4/ANI8HEy3etCmKfit2; SID=4AMk07dZM5wKaFcBAD7PgfLgMV1imGkqULwEdE9VI3lwoNRghaVTGT4ZT0mCGgzehY3mFg.; OGPC=5062210-7:765334528-2:699960320-1:961419264-9:; NID=97=bZNps3TAJFPAppe9EQbLyUDwXDbEFN57lT_capK2DQMWMVo7nEnYlPV-_g5OkOCERrN6MS5PxJXuVUOhjHeZGhCkS4FubcEapEzyuSQVS9rJM99rPzwE98ra47eP-ay0YTR-TawjFJ-0hAqT_j7SI7vQGVIU6yj4awM0hEt4ZXTd4k0RnH6kJPb0qVCc8AnQQLg4VZ0Kc1s83vJo6k7jFm-GCEoi; HSID=AQGYffYcWgUgwoIGG; SSID=AsyTtOTpG3P0TWe_e; APISID=DZOqFSNpfZmThOP6/A15eY85jEZTDT47_j; SAPISID=4jqCaE3zLEcO8GG4/ANI8HEy3etCmKfit2; SID=4AMk07dZM5wKaFcBAD7PgfLgMV1imGkqULwEdE9VI3lwoNRghaVTGT4ZT0mCGgzehY3mFg.; OGPC=5062210-7:765334528-2:699960320-1:961419264-9:; NID=97=bZNps3TAJFPAppe9EQbLyUDwXDbEFN57lT_capK2DQMWMVo7nEnYlPV-_g5OkOCERrN6MS5PxJXuVUOhjHeZGhCkS4FubcEapEzyuSQVS9rJM99rPzwE98ra47eP-ay0YTR-TawjFJ-0hAqT_j7SI7vQGVIU6yj4awM0hEt4ZXTd4k0RnH6kJPb0qVCc8AnQQLg4VZ0Kc1s83vJo6k7jFm-GCEoi; DV=Qg7Cq8EJDPcYvgxe_quK9y6d3FXJtAI",
    'cache-control': "no-cache",
    'postman-token': "e6cec459-250e-1795-0e78-c450e5dfd56b"
    }

When trying to retrieve the response (the status code is 200):

googleSearch = BeautifulSoup(requests.request("GET", url, headers=headers, params=querystring).content, "html.parser")

googleSearch.text prints as:

No handlers could be found for logger "bs4.dammit" ��������[��#ٕ ֑��RK=��V��i$��YU��$����+Y��j2H&��L>”��R*^ $��gDefukz0��j����|��ax���1��k�a��6y=��X����X�þ��`ɬ.MK;pgoĽ�{��{��D5�gLJ�� ...

...and many more strange characters.

Can requests be used for Google searches, or is another module needed?

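[1] Expected HTML: the HTML shown in the responses in the Postman app and in Chrome contains div[class="kltat"] elements (one for every item in the scroll at the top of the page, in this case neighborhoods, including those not yet displayed in the scroll), whereas the HTML from the other request contains only some of the scroll items and no div[class="kltat"] elements.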

Including this line tells Google's servers that they may use HTTP compression for the response:

'accept-encoding': "gzip, deflate, sdch, br"

My guess is that the compression used is gzip, although the header also allows deflate, Brotli (br), and Google's Shared Dictionary Compression (sdch).

You can either remove the accept-encoding line from the headers, or import the gzip library and decompress the content yourself.
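A minimal sketch of both options, under some assumptions: the helper names `drop_accept_encoding`, `decode_body`, and `fetch` are made up for illustration, and `decode_body` only covers gzip and deflate from the standard library. As far as I know, requests already decompresses gzip and deflate transparently, so unreadable bytes may well mean the server picked br or sdch; in that case manual gzip decoding would not help, and dropping the header is the simpler route.

```python
import gzip
import zlib


def drop_accept_encoding(headers):
    """Return a copy of `headers` without any Accept-Encoding entry (case-insensitive)."""
    return {k: v for k, v in headers.items() if k.lower() != "accept-encoding"}


def decode_body(raw, content_encoding):
    """Decompress `raw` bytes according to a Content-Encoding header value.

    Only gzip and deflate are handled here; br and sdch need third-party code.
    """
    encoding = (content_encoding or "identity").strip().lower()
    if encoding == "identity":
        return raw
    if encoding == "gzip":
        return gzip.decompress(raw)
    if encoding == "deflate":
        try:
            return zlib.decompress(raw)  # zlib-wrapped deflate
        except zlib.error:
            return zlib.decompress(raw, -zlib.MAX_WBITS)  # raw deflate stream
    raise ValueError(f"unsupported Content-Encoding: {encoding!r}")


def fetch(url, headers, params):
    """Hypothetical helper: send the request without a custom Accept-Encoding."""
    import requests  # third-party; imported here so the helpers above work without it

    response = requests.get(url, headers=drop_accept_encoding(headers), params=params)
    # Shows which compression, if any, the server actually used.
    print(response.headers.get("Content-Encoding"))
    return response.text
```

Calling `fetch("https://www.google.com/search", headers, querystring)` with the question's `headers` and `querystring` would then let requests fall back to its own default Accept-Encoding, which it knows how to decode.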

You really only need to send the user-agent; as far as I know, the rest of the headers are redundant.

The strange (binary) characters appear because you are reading the body through the .content attribute, which returns raw bytes, while also sending the "accept-encoding": "gzip, deflate, sdch, br" header. Use the .text attribute instead, which decodes the content into a string for you, so you do not get the strange characters.
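To make the bytes-versus-string distinction concrete, here is a rough sketch of the decoding step that turns bytes into text. The function `to_text` is hypothetical, and the UTF-8 fallback is an assumption of this sketch; requests itself tries to guess the charset when the server does not declare one. Note that this step only maps bytes to characters. It does not undo compression, so a body that is still compressed stays unreadable either way.

```python
def to_text(content, encoding=None):
    """Decode response bytes to str, roughly what reading `.text` amounts to.

    Assumption: fall back to UTF-8 when no encoding is given.
    Undecodable bytes are replaced rather than raising an error.
    """
    return content.decode(encoding or "utf-8", errors="replace")


# Made-up sample body standing in for response.content
body = "Upper West Side, café".encode("utf-8")

print(type(body).__name__)             # the raw body is a bytes object
print(type(to_text(body)).__name__)    # the decoded body is a str
print(to_text(body, "utf-8"))
```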

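The code below returns 33 of the ~40+ elements in this case. To get more results, you need to use selenium or another browser automation tool, click the right-arrow button to load the remaining elements, and scrape them.

Code that also scrapes the thumbnails, with an example in the online IDE: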

from bs4 import BeautifulSoup
import requests, lxml, re, json

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  'q': 'new york city neighborhoods',
  'gl': 'us',
}

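def bs4_get_top_carousel():
  # Fetch the results page; .text gives the HTML as a decoded string
  html = requests.get('https://www.google.com/search', headers=headers, params=params)
  soup = BeautifulSoup(html.text, 'lxml')

  # Carousel heading text, used as the key of the output dict
  carousel_name = soup.select_one('.F0gfrd+ .z4P7Tc').text

  data = {carousel_name: []}

  # Thumbnails live in inline <script> tags, so collect those for the regex below
  all_script_tags = soup.select('script')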

  # https://regex101.com/r/NYdrL5/1
  thumbnails = re.findall(r"<script nonce=\".*?\">\(\w+\(\)\{\w+\s?\w+='(.*?)';\w+\s?\w+=\['\w+'\];\w+\(\w+,\w+\);\}\)\(\);<\/script>", str(all_script_tags))

  for result, thumbnail in zip(soup.select('.ct5Ked'), thumbnails):
    title = result["aria-label"]
    link = f"https://www.google.com{result['href']}"
    try:
      extensions = result.select_one(".cp7THd .FozYP").text
    except AttributeError:
      # select_one() returned None: this item has no extensions text
      extensions = None
    
    decoded_thumbnail = bytes(thumbnail, 'ascii').decode('unicode-escape')
    # print(f'{title}\n{link}\n{extensions}\n{decoded_thumbnail}\n')

    data[carousel_name].append({
      'title': title,
      'link': link,
      'extensions': [extensions],
      'thumbnail': decoded_thumbnail
    })
  
  print(json.dumps(data, indent=2, ensure_ascii=False))

bs4_get_top_carousel()

---------------------
'''
[
 ...
   {
      "title": "Lower East Side",
      "link": "https://www.google.com/search?gl=us&q=Lower+East+Side&stick=H4sIAAAAAAAAAONgFuLUz9U3MIo3sjBTAjMNKy2NzbUUspOt9HPykxNLMvPz9AtyEpNTrfJSM9MzkvKLMvLzU4ofMfpxC7z8cU9YynXSmpPXGO25CGoREudic80rySypFOKV4uZCWGzFpMHEs4iV3ye_PLVIwTWxuEQhODMldQIbIwAs7VbHoAAAAA&sa=X&ved=2ahUKEwimh-H-q93zAhXCZc0KHXj3DAYQ-BZ6BQgBEI4B",
      "extensions": [
        null
      ],
      "thumbnail": ""
    }
  ...
]
'''
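The `bytes(thumbnail, 'ascii').decode('unicode-escape')` step in the code above may look cryptic, so here is a small isolated sketch of what it does. The function name `decode_js_escapes` and the sample string are made up for illustration; the assumption is that the thumbnail strings found in the inline scripts contain escapes such as `\x3d` in place of `=`.

```python
def decode_js_escapes(value):
    r"""Turn escapes such as \x3d into the characters they stand for."""
    return bytes(value, "ascii").decode("unicode-escape")


# Made-up sample: a data URI whose base64 padding was escaped as \x3d
sample = r"data:image/jpeg;base64,AAAA\x3d\x3d"

print(decode_js_escapes(sample))  # data:image/jpeg;base64,AAAA==
```

Because the input is encoded as ASCII first, a string containing non-ASCII characters would raise `UnicodeEncodeError`; that matches the behavior of the original line.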

If you need more information on this topic, I wrote a dedicated blog post about how to scrape Google Carousel results.


Alternatively, you can use the Google Knowledge Graph API from SerpApi to achieve the same thing. It is a paid API with a free plan.

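The difference in your case is that you don't have to deal with the extraction process, figure out which CSS selectors to use, or handle other edge cases. Instead, you pretty much only need to iterate over structured JSON and get the data you want.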

Code to integrate:

from serpapi import GoogleSearch
import os, json

def serpapi_get_top_carousel():
    params = {
      "api_key": os.getenv("API_KEY"),
      "engine": "google",
      "q": "new york city neighborhoods",
      "hl": "en"
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    for result in results['knowledge_graph']['neighborhoods']:
        print(json.dumps(result, indent=2, ensure_ascii=False))

serpapi_get_top_carousel()


---------------
'''
"neighborhoods": [
  {
  "name":"Harlem",
  "link": "https://www.google.com/search?q=Harlem&stick=H4sIAAAAAAAAAONgFuLUz9U3MIo3sjBT4gAx0yxNSrQUspOt9HPykxNLMvPz9AtyEpNTrfJSM9MzkvKLMvLzU4ofMfpxC7z8cU9YynXSmpPXGO25CGoREudic80rySypFOKV4uZC2GvFpMHEs4iVzSOxKCc1dwIbIwBbfLXHlgAAAA&sa=X&ved=2ahUKEwiipvO2q93zAhUylGoFHcyFA4wQ-BZ6BAgBEDQ",
  "image": "https://serpapi.com/searches/61725341e4a23d51edb9dabf/images/d59e4f2f273f964cdd7164417183fc3f42a0f8724e78a4815f5e934903209df6acfc73bbca024b9a.jpeg"
  }
...
]
'''
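Since the original goal was just to print the neighborhood names, iterating the structured JSON could look like the sketch below. It assumes the response dict has the `knowledge_graph` → `neighborhoods` shape shown in the sample output above; the function name `neighborhood_names` and the shortened sample data are made up for illustration.

```python
def neighborhood_names(results):
    """Collect the "name" of every entry under knowledge_graph -> neighborhoods.

    Missing keys yield an empty list instead of raising KeyError.
    """
    neighborhoods = results.get("knowledge_graph", {}).get("neighborhoods", [])
    return [item["name"] for item in neighborhoods if "name" in item]


# Hypothetical, shortened stand-in for search.get_dict()
sample_results = {
    "knowledge_graph": {
        "neighborhoods": [
            {"name": "Harlem", "link": "https://www.google.com/search?q=Harlem"},
        ]
    }
}

for name in neighborhood_names(sample_results):
    print(name)
```

Using `.get()` with defaults is a deliberate choice here: some queries may not return a carousel at all, and an empty list is easier to handle than an exception.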

Disclaimer: I work for SerpApi.