GET request to Google Search
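I am trying to get the HTML of Google's search results, for example by sending a GET request to:
https://www.google.ru/?q=1111
Everything works fine in a browser, but when I try it with curl, or use "View source" on the Google page, I get only some JavaScript code and no search results. Is that some kind of protection? What can I do?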
You now have to use the Google Search API to make GET requests.
All other methods have been blocked.
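You can load the page in a browser and then scrape the results with JavaScript.
Or you can use the Google API, but it seems you have to pay if you make more than 100 requests per day.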
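Adding to the existing answers, since they are incorrect and do not even answer your question.
First, scraping Google is completely legal as long as you do not harm their service by doing so (DoS-like behavior).
The methods are not blocked either; it is just not that simple.
Speed depends on your approach, and it does not have to be slow.
If needed, you can scrape tens of thousands of keyword pages within a minute.
You will find a better answer on that topic here: Is it ok to scrape data from Google results?
Your curl problem does come from protection: Google does not allow automated access and has a very sophisticated set of detection algorithms.
They range from simple user-agent checks (which is what blocked you right away) to artificial intelligence that tries to detect unusual or related queries.
The page in your question is the Google search page with the input field.
The search results page is this one: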
https://www.google.ru/search?q=1111
Rotate proxies and user agents, and add delays between similar requests, to get the HTML of Google search results pages with fewer bans.
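The rotation and delay idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the User-Agent strings, the helper names build_search_request and planned_requests, and the delay bounds are all made up for the example, and nothing here guarantees that Google will not block the requests. The sketch only prepares the requests and picks a delay for each one; it does not send anything.

```python
import itertools
import random
import urllib.parse
import urllib.request

# Hypothetical pool of browser-like User-Agent strings (examples only).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def build_search_request(query, user_agent, base="https://www.google.ru/search"):
    """Build a GET request for the results page (/search), not the home page."""
    url = base + "?" + urllib.parse.urlencode({"q": query})
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def planned_requests(queries, min_delay=2.0, max_delay=6.0, rng=random):
    """Yield (request, delay_seconds) pairs, cycling through USER_AGENTS.

    The delay bounds are arbitrary assumptions, not values known to be safe.
    """
    agents = itertools.cycle(USER_AGENTS)
    for query in queries:
        request = build_search_request(query, next(agents))
        yield request, rng.uniform(min_delay, max_delay)
```

To actually send a planned request, one could call time.sleep(delay) and then urllib.request.urlopen(request, timeout=10). Proxy rotation could be layered on top with urllib.request.ProxyHandler and build_opener, using proxy addresses you control.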
Or use SerpApi, which fetches the HTML and returns the data extracted from it. It has a free trial.
curl -s 'https://serpapi.com/search?q=coffee'
Output:
{
  // Omitted
  "organic_results": [
    {
      "position": 1,
      "title": "Coffee - Wikipedia",
      "link": "https://en.wikipedia.org/wiki/Coffee",
      "displayed_link": "en.wikipedia.org › wiki › Coffee",
      "snippet": "Coffee is a brewed drink prepared from roasted coffee beans, the seeds of berries from certain Coffea species. When coffee berries turn from green to bright red ...",
      "sitelinks": {
        "expanded": [
          {
            "title": "History",
            "link": "https://en.wikipedia.org/wiki/History_of_coffee",
            "snippet": "The history of coffee dates back to the 15th century, and possibly ..."
          },
          {
            "title": "International Coffee Day",
            "link": "https://en.wikipedia.org/wiki/International_Coffee_Day",
            "snippet": "International Coffee Day (1 October) is an occasion that is ..."
          },
          {
            "title": "List of coffee drinks",
            "link": "https://en.wikipedia.org/wiki/List_of_coffee_drinks",
            "snippet": "Milk coffee - Nitro cold brew coffee - List of coffee dishes - ..."
          },
          {
            "title": "Portal:Coffee",
            "link": "https://en.wikipedia.org/wiki/Portal:Coffee",
            "snippet": "Coffee is a brewed drink prepared from roasted coffee beans, the ..."
          },
          {
            "title": "Coffee bean",
            "link": "https://en.wikipedia.org/wiki/Coffee_bean",
            "snippet": "A coffee bean is a seed of the Coffea plant and the source for ..."
          },
          {
            "title": "Geisha",
            "link": "https://en.wikipedia.org/wiki/Geisha_(coffee)",
            "snippet": "Geisha coffee, sometimes referred to as Gesha coffee, is a type of ..."
          }
        ],
        "list": [
          {
            "date": "Color: Black, dark brown, light brown, beige"
          }
        ]
      },
      "rich_snippet": {
        "bottom": {
          "detected_extensions": {
            "introduced_th_century": 15
          },
          "extensions": [
            "Introduced: 15th century",
            "Color: Black, dark brown, light brown, beige"
          ]
        }
      },
      "cached_page_link": "https://webcache.googleusercontent.com/search?q=cache:U6oJMnF-eeUJ:https://en.wikipedia.org/wiki/Coffee+&cd=2&hl=sv&ct=clnk&gl=se",
      "related_pages_link": "https://www.google.se/search?gl=se&hl=sv&q=related:https://en.wikipedia.org/wiki/Coffee+coffee&sa=X&ved=2ahUKEwjJ9p2p_KXuAhVlRN8KHf22D8wQHzABegQIAhAJ"
    }
  ],
  // ...
}
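Once a response like this has been parsed, pulling out the organic results takes only a few lines. The sketch below is a minimal illustration: the helper name organic_links is hypothetical, and the sample is a hand-trimmed copy of the output with only the fields it reads (the "// Omitted" comment lines are dropped because comments are not valid JSON).

```python
import json


def organic_links(payload):
    """Return (position, title, link) for each entry in "organic_results"."""
    results = payload.get("organic_results", [])
    return [(r.get("position"), r.get("title"), r.get("link")) for r in results]


# Hand-trimmed sample; only the keys read by organic_links are kept.
sample = json.loads("""
{
  "organic_results": [
    {
      "position": 1,
      "title": "Coffee - Wikipedia",
      "link": "https://en.wikipedia.org/wiki/Coffee"
    }
  ]
}
""")

for position, title, link in organic_links(sample):
    print(position, title, link)
```

Using .get with defaults keeps the helper from raising if a response has no organic results or an entry lacks a field.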
Disclaimer: I work at SerpApi.