How to use BeautifulSoup to parse google search results in Python

I am trying to parse the first page of google search results. Specifically, the title and the small summary that is provided. Here is what I have so far:

from urllib.request import urlretrieve
import urllib.parse
from urllib.parse import urlencode, urlparse, parse_qs
import webbrowser
from bs4 import BeautifulSoup
import requests

address = 'https://google.com/#q='
# Default Google search address start
file = open( "OCR.txt", "rt" )
# Open text document that contains the question
word = file.read()
file.close()

myList = [item for item in word.split('\n')]
newString = ' '.join(myList)
# The question is on multiple lines so this joins them together with proper spacing

print(newString)

qstr = urllib.parse.quote_plus(newString)
# Encode the string

newWord = address + qstr
# Combine the base and the encoded query

print(newWord)

source = requests.get(newWord)

soup = BeautifulSoup(source.text, 'lxml')
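(As a side note on the encoding step: quote_plus percent-encodes unsafe characters and turns spaces into +, for example:)

```python
from urllib.parse import quote_plus

# spaces become '+', other unsafe characters are percent-encoded
print(quote_plus('what is the capital of France?'))
# what+is+the+capital+of+France%3F
```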

The part I am stuck on now is going down the HTML path to parse the specific data that I want. Everything I've tried so far has just thrown an error saying it has no attribute, or it just gives back "[]".

I'm new to Python and BeautifulSoup, so I'm not sure of the syntax to get to where I want to be. I've found that these are the individual search results in the page:

https://ibb.co/jfRakR

Any help on what to add to parse the title and summary of each search result would be massively appreciated.

Thank you!

Your url doesn't work for me, but with https://google.com/search?q= I get results.

import urllib
from bs4 import BeautifulSoup
import requests
import webbrowser

text = 'hello world'
text = urllib.parse.quote_plus(text)

url = 'https://google.com/search?q=' + text

response = requests.get(url)

#with open('output.html', 'wb') as f:
#    f.write(response.content)
#webbrowser.open('output.html')

soup = BeautifulSoup(response.text, 'lxml')
for g in soup.find_all(class_='g'):
    print(g.text)
    print('-----')

Read the Beautiful Soup Documentation.
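For example, a minimal sketch of drilling into each result block, using a stand-in HTML snippet (Google's real markup and class names change frequently, so treat the `h3`/`st` selectors here as illustrative assumptions):

```python
from bs4 import BeautifulSoup

# Stand-in HTML mimicking Google's 'g' result blocks; the real markup differs
html = """
<div class="g"><h3>First result title</h3><span class="st">First snippet...</span></div>
<div class="g"><h3>Second result title</h3><span class="st">Second snippet...</span></div>
"""

soup = BeautifulSoup(html, 'html.parser')
for g in soup.find_all(class_='g'):
    title = g.find('h3')            # title is typically inside an <h3>
    snippet = g.find(class_='st')   # snippet class is an assumption here
    print(title.text, '|', snippet.text)
```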

  1. The default Google search address does not start with the # symbol. Instead, it should have the ?q= query after the /search pathname:
---> https://google.com/#q=                  (wrong)
---> https://www.google.com/search?q=cake    (right)
  2. Make sure you're passing a user-agent into the HTTP request headers, because the default requests user-agent is python-requests. Websites can recognize from it that the request comes from a bot and block it, so you receive different HTML with different elements/selectors, which is why you got an empty result.

Check what your user-agent is, and see a list of user-agents for mobile, tablet, etc.

Pass the user-agent in the request headers:

headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
requests.get('YOUR_URL', headers=headers)
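You can verify what requests sends when no headers are given:

```python
import requests

# requests identifies itself as python-requests/<version> unless overridden
print(requests.utils.default_user_agent())
# e.g. python-requests/2.31.0
```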

Code and example in the online IDE:

from bs4 import BeautifulSoup
import requests, json, lxml

headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}

params = {
  'q': 'tesla',  # query 
  'gl': 'us',    # country to search from
  'hl': 'en',    # language
}

html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

data = []

for result in soup.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    link = result.select_one('.yuRUbf a')['href']

    # sometimes there's no description, so handle that case
    try:
        snippet = result.select_one('#rso .lyLwlc').text
    except AttributeError:
        snippet = None

    data.append({
        'title': title,
        'link': link,
        'snippet': snippet,
    })

print(json.dumps(data, indent=2, ensure_ascii=False))

-------------
'''
[
  {
    "title": "Tesla: Electric Cars, Solar & Clean Energy",
    "link": "https://www.tesla.com/",
    "snippet": "Tesla is accelerating the world's transition to sustainable energy with electric cars, solar and integrated renewable energy solutions for homes and ..."
  },
  {
    "title": "Tesla, Inc. - Wikipedia",
    "link": "https://en.wikipedia.org/wiki/Tesla,_Inc.",
    "snippet": "Tesla, Inc. is an American electric vehicle and clean energy company based in Palo Alto, California, United States. Tesla designs and manufactures electric ..."
  },
  {
    "title": "Nikola Tesla - Wikipedia",
    "link": "https://en.wikipedia.org/wiki/Nikola_Tesla",
    "snippet": "Nikola Tesla was a Serbian-American inventor, electrical engineer, mechanical engineer, and futurist best known for his contributions to the design of the ..."
  }
]
'''

Alternatively, you can achieve the same thing by using the Google Organic Results API from SerpApi. It's a paid API with a free plan, just for testing the API.

The difference in your case is that you don't have to figure out why the output is empty and what caused it, bypass blocks from Google or other search engines, or maintain the parser over time. Instead, you only need to quickly get the data from structured JSON.

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
  "engine": "google",
  "q": "tesla",
  "hl": "en",
  "gl": "us",
  "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(f"Title: {result['title']}\nSummary: {result['snippet']}\nLink: {result['link']}\n")

----------
'''
Title: Tesla: Electric Cars, Solar & Clean Energy
Summary: Tesla is accelerating the world's transition to sustainable energy with electric cars, solar and integrated renewable energy solutions for homes and ...
Link: https://www.tesla.com/

Title: Tesla, Inc. - Wikipedia
Summary: Tesla, Inc. is an American electric vehicle and clean energy company based in Palo Alto, California, United States. Tesla designs and manufactures electric ...
Link: https://en.wikipedia.org/wiki/Tesla,_Inc.
'''

Disclaimer, I work for SerpApi.