Web scraping Google Scholar with BeautifulSoup and Selenium in Python
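I am trying to scrape Google Scholar profiles. I need the profiles that match criteria I specify, for example the professors at a given university who work on certain subjects. I am using BeautifulSoup and Selenium in Python. Do you have any ideas?
My approach is very slow because it has to visit every profile page to check my criteria. If there is a faster or better way to do this, please tell me.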
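You can add the topics you need directly to the search URL; in my case I am searching for authors in two fields, computer vision and machine learning.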
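You can restrict the results to a university by adding its name in double quotes after the label, "<univ. name>", for example: label:computer_vision "Michigan State University". This way you get only authors from Michigan State University, matched by workplace or by email domain (e.g. msu.edu), whose interests are directly related to computer vision.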
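Note: authors sometimes write a shortened university name, e.g. University of Michigan -> U.Michigan, as Honglak Lee does. To cover this case as well, you can use the pipe symbol |, which I believe stands for "or". The search query then becomes: label:computer_vision "Michigan State University"|"U.Michigan", which translates to Michigan State University or U.Michigan.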
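The only place I found that gives an idea of how to write such search queries is the Google Scholar Search Tips page, under "How do I search by title?", but it says nothing about searching for authors who work at a particular university. The queries shown here were found by trial and error and seem to work.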
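A small helper can assemble this kind of query string so that the label and the university spellings are not typed by hand each time. This is only a sketch: `build_mauthors_query` is a hypothetical name, and the `label:` / quoted name / `|` syntax is the trial-and-error behaviour described above, not documented Google Scholar syntax.

```python
from urllib.parse import urlencode


def build_mauthors_query(label, universities):
    """Build a query such as: label:computer_vision "A"|"B" (hypothetical helper)."""
    # Labels in the examples above use lowercase words joined by underscores.
    label_part = "label:" + label.strip().lower().replace(" ", "_")
    # Each university spelling is quoted; alternatives are joined with "|".
    university_part = "|".join(f'"{name}"' for name in universities)
    return f"{label_part} {university_part}".strip()


query = build_mauthors_query(
    "computer vision", ["Michigan State University", "U.Michigan"]
)
print(query)

# The same search as a full URL, percent-encoded by the standard library.
url = "https://scholar.google.com/citations?" + urlencode(
    {"view_op": "search_authors", "hl": "en", "mauthors": query}
)
print(url)
```

Passing the query through `urlencode` (or through the `params` argument of `requests.get`) avoids having to escape the quotes and the pipe by hand.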
Code and example in the online IDE:
from parsel import Selector
import requests, json

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "mauthors": 'label:computer_vision "Michigan State University"|"U.Michigan"',  # search query
    "hl": "en",                                                                    # language
    "view_op": "search_authors"                                                    # author results
}

# https://requests.readthedocs.io/en/master/user/quickstart/#custom-headers
# Make sure you're using your user-agent: https://www.whatismybrowser.com/detect/what-is-my-user-agent
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}

html = requests.get("https://scholar.google.com/citations", params=params, headers=headers, timeout=30)
selector = Selector(html.text)

profiles = []

# each .gs_ai_chpr element is one author card on the results page
for profile in selector.css(".gs_ai_chpr"):
    profile_name = profile.css(".gs_ai_name a::text").get()
    profile_link = f'https://scholar.google.com{profile.css(".gs_ai_name a::attr(href)").get()}'
    profile_affiliation = profile.css('.gs_hlt::text').get()  # selects only university name without additional affiliation, e.g: Assistant Professor
    profile_email = profile.css(".gs_ai_eml::text").get()
    profile_interests = profile.css(".gs_ai_one_int::text").getall()

    profiles.append({
        "profile_name": profile_name,
        "profile_link": profile_link,
        "profile_affiliations": profile_affiliation,
        "profile_email": profile_email,
        "profile_interests": profile_interests
    })

print(json.dumps(profiles, indent=2))
# part of the output:
'''
[
  {
    "profile_name": "Anil K. Jain",
    "profile_link": "https://scholar.google.com/citations?hl=en&user=g-_ZXGsAAAAJ",
    "profile_affiliations": "Michigan State University",
    "profile_email": "Verified email at cse.msu.edu",
    "profile_interests": [
      "Biometrics",
      "Computer vision",
      "Pattern recognition",
      "Machine learning",
      "Image processing"
    ]
  } # ...other profiles
]
'''
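Because the interests are already present on the search results page, the criteria from the question can be checked on the scraped list without opening each profile page. A minimal sketch, assuming the `profiles` list has the shape built by the code above; `filter_by_interests` is a hypothetical helper, and the second sample entry is a placeholder rather than a real result.

```python
def filter_by_interests(profiles, required):
    """Keep profiles that list every required interest (case-insensitive)."""
    wanted = {interest.lower() for interest in required}
    kept = []
    for profile in profiles:
        # A profile may have no interests at all, so fall back to an empty list.
        listed = {interest.lower() for interest in profile.get("profile_interests") or []}
        if wanted <= listed:  # every wanted interest appears on the profile
            kept.append(profile)
    return kept


sample_profiles = [
    {
        "profile_name": "Anil K. Jain",
        "profile_interests": [
            "Biometrics", "Computer vision", "Pattern recognition",
            "Machine learning", "Image processing",
        ],
    },
    # Placeholder entry for illustration only, not a real search result.
    {"profile_name": "Example Author", "profile_interests": ["Computer vision"]},
]

matches = filter_by_interests(sample_profiles, ["computer vision", "machine learning"])
print([profile["profile_name"] for profile in matches])
```

Filtering locally like this costs one request per results page instead of one request per author.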
Note: I am using the Parsel library instead of BeautifulSoup, the most popular parsing library. It is very similar, but it also supports XPath and has its own CSS pseudo-elements, such as ::text or ::attr(<attribute>).
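Alternatively, you can achieve the same thing with the Google Scholar Profiles API from SerpApi. It is a paid API with a free plan.
The difference in this case is that you don't have to figure out the scraping part yourself: picking the right CSS selector or XPath to extract the data, bypassing blocks from the search engine, or scaling the number of requests.
Example code to integrate: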
from serpapi import GoogleSearch
import os, json

params = {
    "api_key": os.getenv("API_KEY"),      # SerpApi API key
    "engine": "google_scholar_profiles",  # SerpApi profiles parsing engine
    "hl": "en",                           # language
    "mauthors": 'label:computer_vision "Michigan State University"|"U.Michigan"'  # search query
}

search = GoogleSearch(params)
results = search.get_dict()

for profile in results["profiles"]:
    print(json.dumps(profile, indent=2))
# part of the output:
'''
{
  "name": "Anil K. Jain",
  "link": "https://scholar.google.com/citations?hl=en&user=g-_ZXGsAAAAJ",
  "serpapi_link": "https://serpapi.com/search.json?author_id=g-_ZXGsAAAAJ&engine=google_scholar_author&hl=en",
  "author_id": "g-_ZXGsAAAAJ",
  "affiliations": "Michigan State University",
  "email": "Verified email at cse.msu.edu",
  "cited_by": 233876,
  "interests": [
    {
      "title": "Biometrics",
      "serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Abiometrics",
      "link": "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:biometrics"
    },
    {
      "title": "Computer vision",
      "serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Acomputer_vision",
      "link": "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:computer_vision"
    },
    {
      "title": "Pattern recognition",
      "serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Apattern_recognition",
      "link": "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:pattern_recognition"
    },
    {
      "title": "Machine learning",
      "serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Amachine_learning",
      "link": "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:machine_learning"
    },
    {
      "title": "Image processing",
      "serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Aimage_processing",
      "link": "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:image_processing"
    }
  ],
  "thumbnail": "https://scholar.googleusercontent.com/citations?view_op=small_photo&user=g-_ZXGsAAAAJ&citpid=1"
} ... other profiles
'''
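In this response each interest is an object rather than a plain string, so a little flattening can help before filtering or storing the results. A minimal sketch, assuming the response shape shown in the output above; `summarize_profile` is a hypothetical helper, and the sample dict is trimmed from that output.

```python
def summarize_profile(profile):
    """Reduce one profile dict from the response to a flat record (hypothetical helper)."""
    return {
        "name": profile.get("name"),
        "affiliations": profile.get("affiliations"),
        "cited_by": profile.get("cited_by"),
        # Each interest is a dict with a "title" key in the output above.
        "interests": [
            item["title"] for item in profile.get("interests", []) if "title" in item
        ],
    }


# Trimmed copy of the profile shown in the output above.
sample = {
    "name": "Anil K. Jain",
    "affiliations": "Michigan State University",
    "cited_by": 233876,
    "interests": [
        {"title": "Biometrics"},
        {"title": "Computer vision"},
        {"title": "Pattern recognition"},
    ],
}

summary = summarize_profile(sample)
print(summary)
```

Using `.get()` keeps the helper from raising if a profile lacks a field such as `cited_by` or `interests`.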
Disclaimer, I work for SerpApi.