Web scraping Google Scholar with BeautifulSoup and Selenium in Python

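I am trying to scrape Google Scholar profiles. I only need profiles that match criteria I specify, for example professors at a given university who work on certain subjects. I am using BeautifulSoup and Selenium in Python. Do you have any ideas?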

My approach is slow because it has to visit every individual profile page to check my criteria. If there is a faster or better way to do this, please let me know.

You can add the topics you need to the URL like this:

https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:computer_vision+label:machine_learning

Here I am searching for authors in two fields, computer vision and machine learning.
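As a rough sketch, the same URL can be built from a list of topics with the standard library. The helper name `author_search_url` is hypothetical, and the label format (lowercase, spaces replaced by underscores) is an assumption based on the example URL above. `urlencode` percent-encodes the colon as `%3A`, which decodes to the same query.

```python
from urllib.parse import urlencode


def author_search_url(labels):
    """Build a Google Scholar author-search URL from topic names (hypothetical helper)."""
    # Assumed label format: "Computer vision" -> "label:computer_vision"
    query = " ".join("label:" + "_".join(label.lower().split()) for label in labels)
    params = {
        "hl": "en",                   # language
        "view_op": "search_authors",  # author results
        "mauthors": query,            # search query
    }
    return "https://scholar.google.com/citations?" + urlencode(params)


print(author_search_url(["Computer vision", "Machine learning"]))
# https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label%3Acomputer_vision+label%3Amachine_learning
```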

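You can add the university name in double quotes ("<univ. name>") after the label, for example: label:computer_vision "Michigan State University". This way you will only get authors from Michigan State University, matched by their affiliation or email (e.g. msu.edu), whose interests include computer vision.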

Note: authors sometimes write a short abbreviation of the university name, e.g. University of Michigan -> U.Michigan, as Honglak Lee does.

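To cover this case as well, you can use the pipe | symbol, which I believe stands for OR. The search query then becomes: label:computer_vision "Michigan State University"|"U.Michigan", which translates to Michigan State University or U.Michigan.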
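A minimal sketch of composing such a query string programmatically. The function name `build_mauthors` is hypothetical, and treating `|` as OR is the assumption described above (it was found by trial and error, not from documentation).

```python
def build_mauthors(labels, universities=()):
    """Compose a `mauthors` query: topic labels plus quoted university names.

    Hypothetical helper. University name variants are joined with "|",
    which is assumed to act as OR.
    """
    parts = ["label:" + "_".join(label.lower().split()) for label in labels]
    if universities:
        # Each variant is wrapped in double quotes so it is matched as a phrase.
        parts.append("|".join(f'"{name}"' for name in universities))
    return " ".join(parts)


query = build_mauthors(["computer vision"], ["Michigan State University", "U.Michigan"])
print(query)
# label:computer_vision "Michigan State University"|"U.Michigan"
```

The resulting string can be passed as the `mauthors` parameter of the request.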

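The only place I found that gives an idea of how to make such search queries is the Google Scholar Search Tips page, under "How do I search by title?". However, it says nothing about searching for authors who work at a particular university. The results shown here were obtained by trial and error and appear to work.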


Code and example in the online IDE:

from parsel import Selector
import requests, json

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "mauthors": 'label:computer_vision "Michigan State University"|"U.Michigan"', # search query 
    "hl": "en",                  # language
    "view_op": "search_authors"  # author results
}

# https://requests.readthedocs.io/en/master/user/quickstart/#custom-headers
# Make sure you're using your user-agent: https://www.whatismybrowser.com/detect/what-is-my-user-agent
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}

html = requests.get("https://scholar.google.com/citations", params=params, headers=headers, timeout=30)
selector = Selector(html.text)

profiles = []

for profile in selector.css(".gs_ai_chpr"):
    profile_name = profile.css(".gs_ai_name a::text").get()
    profile_link = f'https://scholar.google.com{profile.css(".gs_ai_name a::attr(href)").get()}'
    profile_affiliation = profile.css('.gs_hlt::text').get()  # selects only university name without additional affiliation, e.g: Assistant Professor
    profile_email = profile.css(".gs_ai_eml::text").get()
    profile_interests = profile.css(".gs_ai_one_int::text").getall()

    profiles.append({
        "profile_name": profile_name,
        "profile_link": profile_link,
        "profile_affiliations": profile_affiliation,
        "profile_email": profile_email,
        "profile_interests": profile_interests
    })

print(json.dumps(profiles, indent=2))


# part of the output:
'''
[
  {
    "profile_name": "Anil K. Jain",
    "profile_link": "https://scholar.google.com/citations?hl=en&user=g-_ZXGsAAAAJ",
    "profile_affiliations": "Michigan State University",
    "profile_email": "Verified email at cse.msu.edu",
    "profile_interests": [
      "Biometrics",
      "Computer vision",
      "Pattern recognition",
      "Machine learning",
      "Image processing"
    ]
  } # ...other profiles
]
'''
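Since the search page already returns the email domain and interests for each author, the criteria can be checked locally without opening each profile page. Below is a minimal sketch, assuming records shaped like the `profiles` list built above. The `matches` helper and the second sample record are hypothetical and only for illustration.

```python
def matches(profile, email_domain, required_interests):
    """Return True if the profile's verified email is on `email_domain`
    (or a subdomain) and it lists every required interest (case-insensitive)."""
    email_text = (profile.get("profile_email") or "").lower()
    # "Verified email at cse.msu.edu" -> "cse.msu.edu"
    domain = email_text.rsplit(" ", 1)[-1] if email_text else ""
    domain_ok = domain == email_domain or domain.endswith("." + email_domain)
    interests = {i.lower() for i in profile.get("profile_interests") or []}
    return domain_ok and all(r.lower() in interests for r in required_interests)


sample_profiles = [
    {
        "profile_name": "Anil K. Jain",
        "profile_email": "Verified email at cse.msu.edu",
        "profile_interests": ["Biometrics", "Computer vision", "Pattern recognition"],
    },
    {   # made-up record for illustration only
        "profile_name": "Example Author",
        "profile_email": "Verified email at example.org",
        "profile_interests": ["Computer vision"],
    },
]

selected = [p for p in sample_profiles if matches(p, "msu.edu", ["Computer vision"])]
print([p["profile_name"] for p in selected])
# ['Anil K. Jain']
```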

Note: I am using the Parsel library instead of BeautifulSoup, the most popular parsing library. It is very similar, supports XPath, and has its own CSS pseudo-elements such as ::text and ::attr(<attribute>).


Alternatively, you can achieve the same thing with the Google Scholar Profiles API from SerpApi. It is a paid API with a free plan.

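The difference in this case is that you don't have to figure out the scraping part yourself: choosing the right CSS selectors/XPath to extract the data, bypassing blocks from search engines, or scaling the number of requests.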

Example code to integrate:

from serpapi import GoogleSearch
import os, json

params = {
    "api_key": os.getenv("API_KEY"),     # SerpApi API key
    "engine": "google_scholar_profiles", # SerpApi profiles parsing engine
    "hl": "en",                          # language
    "mauthors": 'label:computer_vision "Michigan State University"|"U.Michigan"' # search query
}

search = GoogleSearch(params)
results = search.get_dict()

for profile in results["profiles"]:
    print(json.dumps(profile, indent=2))

# part of the output:
'''
{
  "name": "Anil K. Jain",
  "link": "https://scholar.google.com/citations?hl=en&user=g-_ZXGsAAAAJ",
  "serpapi_link": "https://serpapi.com/search.json?author_id=g-_ZXGsAAAAJ&engine=google_scholar_author&hl=en",
  "author_id": "g-_ZXGsAAAAJ",
  "affiliations": "Michigan State University",
  "email": "Verified email at cse.msu.edu",
  "cited_by": 233876,
  "interests": [
    {
      "title": "Biometrics",
      "serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Abiometrics",
      "link": "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:biometrics"
    },
    {
      "title": "Computer vision",
      "serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Acomputer_vision",
      "link": "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:computer_vision"
    },
    {
      "title": "Pattern recognition",
      "serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Apattern_recognition",
      "link": "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:pattern_recognition"
    },
    {
      "title": "Machine learning",
      "serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Amachine_learning",
      "link": "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:machine_learning"
    },
    {
      "title": "Image processing",
      "serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Aimage_processing",
      "link": "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:image_processing"
    }
  ],
  "thumbnail": "https://scholar.googleusercontent.com/citations?view_op=small_photo&user=g-_ZXGsAAAAJ&citpid=1"
} ... other profiles

'''
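Each interest in the response is a nested object, so for filtering it may be handy to flatten a profile to just the fields of interest. This is a small sketch, assuming the response shape shown in the output above. The `summarize_profile` helper is hypothetical, and the sample dict is a trimmed copy of that output.

```python
def summarize_profile(profile):
    """Reduce a profile dict to a few flat fields (hypothetical helper)."""
    return {
        "name": profile.get("name"),
        "affiliations": profile.get("affiliations"),
        "cited_by": profile.get("cited_by"),
        # Keep only the interest titles, dropping the per-interest links.
        "interests": [item["title"] for item in profile.get("interests", [])],
    }


sample = {
    "name": "Anil K. Jain",
    "affiliations": "Michigan State University",
    "email": "Verified email at cse.msu.edu",
    "cited_by": 233876,
    "interests": [
        {"title": "Biometrics"},
        {"title": "Computer vision"},
    ],
}

print(summarize_profile(sample))
# {'name': 'Anil K. Jain', 'affiliations': 'Michigan State University', 'cited_by': 233876, 'interests': ['Biometrics', 'Computer vision']}
```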

Disclaimer: I work for SerpApi.