How to detect captchas when scraping Google?
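I am using the requests package and BeautifulSoup to scrape Google News for the number of search results a query returns. I get two kinds of IndexError, and I want to tell them apart:
- The first occurs when there is no result count. Here the #resultStats selector returns an empty list, []. What seems to happen is that when the query string is too long, Google does not even say "0 search results"; it says nothing at all.
- The second IndexError occurs when Google serves me a captcha.
I need to distinguish these cases because I want my scraper to wait five minutes when Google sends me a captcha, but not when the result string is merely empty.
I currently use a jury-rigged approach: I send a second query that is known to have a non-zero number of search results, which lets me tell the two IndexErrors apart. I am wondering whether there is a more elegant and direct way to do this using BeautifulSoup.
Here is my code: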
import requests, bs4, lxml, re, time, random
import pandas as pd
import numpy as np

URL = 'https://www.google.com/search?tbm=nws&q={query}&tbs=cdr%3A1%2Ccd_min%3A{year}%2Ccd_max%3A{year}&authuser=0'
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
}

def tester():  # test for captcha
    test = requests.get('https://www.google.ca/search?q=donald+trump&safe=off&client=ubuntu&espv=2&biw=1910&bih=969&source=lnt&tbs=cdr%3A1%2Ccd_min%3A2016%2Ccd_max%3A&tbm=nws', headers=headers)
    dump = bs4.BeautifulSoup(test.text, "lxml")
    result = dump.select('#resultStats')
    num = result[0].getText()  # raises IndexError when #resultStats is missing
    num = re.search(r"\b\d[\d,.]*\b", num).group()  # first number in the stats text
    num = int(num.replace(',', ''))
    num = (num > 0)
    return num

def search(**params):
    response = requests.get(URL.format(**params), headers=headers)
    print(response.content, response.status_code)  # check this for google requiring Captcha
    soup = bs4.BeautifulSoup(response.text, "lxml")
    elems = soup.select('#resultStats')
    try:  # want code to flag if I get a Captcha
        hits = elems[0].getText()
        hits = re.search(r"\b\d[\d,.]*\b", hits).group()  # first number in the stats text
        hits = int(hits.replace(',', ''))
        print(hits)
        return hits
    except IndexError:
        try:
            tester() > 0  # if captcha, this will throw up another IndexError
            print("Empty results!")
            hits = 0
            return hits
        except IndexError:
            print("Captcha'd!")
            time.sleep(120)  # should make it rotate IP when captcha'd
            hits = 0
            return hits

for qry in list:  # list: the asker's collection of query strings (not shown)
    hits = search(query=qry, year=2016)
I would just search for the "captcha" element. For example, if this is Google reCAPTCHA, you can search for the hidden input that contains the token:
is_captcha_on_page = soup.find("input", id="recaptcha-token") is not None
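Building on that check, the two cases from the question could be separated in one pass over the same page, without sending a second "known good" query. The following is only a minimal sketch: classify_page and its return labels are hypothetical names, the recaptcha-token input is the marker suggested above, and the assumption that a page with neither that input nor #resultStats is an empty result has not been verified against Google's current markup.

```python
import re
from bs4 import BeautifulSoup


def classify_page(html):
    """Classify a results page as ("captcha", None), ("empty", 0) or ("ok", hits).

    Hypothetical helper; the markers it looks for are assumptions about the page.
    """
    soup = BeautifulSoup(html, "html.parser")

    # Captcha marker from the answer: the hidden input holding the reCAPTCHA token.
    if soup.find("input", id="recaptcha-token") is not None:
        return "captcha", None

    # No captcha marker: look for the result counter the question scrapes.
    stats = soup.select_one("#resultStats")
    if stats is None:
        # Assumption: a page with no counter and no captcha is an empty result.
        return "empty", 0

    match = re.search(r"\d[\d,.]*", stats.get_text())
    if match is None:
        return "empty", 0

    # Drop thousands separators before converting to an integer.
    return "ok", int(re.sub(r"[,.]", "", match.group()))
```

Inside search(), the scraper could then call classify_page(response.text) and sleep only when the first element of the result is "captcha". The status code that search() already prints may serve as an additional signal, although that would need to be checked against real responses.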