Web scraping Google search results

Question

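I am scraping Google Scholar search results page by page. After a certain number of pages, a CAPTCHA pops up and interrupts my code. I read that Google limits the number of requests I can make per hour. Is there any way around this limit? I have read a bit about APIs, but I am not sure whether that would help.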

Answer 1

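I feel your pain, since I have scraped Google in the past. To get my work done, I tried the following approaches. The list is ordered from the simplest to the hardest technique.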

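Throttle your requests per second: Google and many other websites detect a large number of requests per second coming from the same machine and block them automatically as a defense against denial-of-service attacks. All you need to do is be gentle and make, for example, only 1 request every 1-5 seconds to avoid being banned quickly.
Randomize your sleep time: Making your code sleep for exactly 1 second is too easy to detect as a script. Make it sleep for a random amount of time on every iteration. The answer originally linked to an example of how to randomize it.
Use a scraping library with cookies enabled: If you write scraping code from scratch, Google will notice that your requests do not send back the cookies they received. Use a good library, such as Scrapy, to avoid this problem.
Use multiple IP addresses: Throttling will definitely reduce your scraping throughput. If you really need to scrape your data fast, you will need several IP addresses to avoid being banned. Several companies on the Internet provide this kind of service for a fee. I have used ProxyMesh and really like their quality, documentation, and customer support.
Use a real browser: Some websites will identify your scraper if it does not process JavaScript or has no graphical interface. Using a real browser with Selenium, for example, solves this problem.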
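
The first two points in the list (throttling and randomized sleep) can be combined in a few lines. The following is only a minimal sketch: the class name `RandomThrottle` and its parameters are made up for illustration, and the 1-5 second range is simply the figure mentioned above, not a documented Google limit.

```python
import random
import time


class RandomThrottle:
    """Enforce a random gap between consecutive requests (hypothetical helper)."""

    def __init__(self, min_delay=1.0, max_delay=5.0,
                 sleep=time.sleep, clock=time.monotonic):
        if min_delay < 0 or max_delay < min_delay:
            raise ValueError("need 0 <= min_delay <= max_delay")
        self.min_delay = min_delay
        self.max_delay = max_delay
        # sleep and clock are injectable so the class can be tested without waiting.
        self._sleep = sleep
        self._clock = clock
        self._last = None  # time of the previous request; None before the first one

    def wait(self):
        """Sleep until a freshly drawn random delay has passed since the last call."""
        delay = random.uniform(self.min_delay, self.max_delay)
        if self._last is not None:
            remaining = delay - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()
        return delay
```

Call `throttle.wait()` immediately before every request. The first call returns at once; each later call sleeps just long enough that the gap since the previous request is a new random value in the configured range.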

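You can also have a look at my crawler project, written for the Web Search Engines course at New York University. It does not scrape Google per se, but it contains some of the techniques above, such as throttling and randomizing sleep time.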
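
For the cookie and multiple-IP points in the list, here is a rough sketch using the `requests` library rather than Scrapy. A `requests.Session` stores the cookies it receives and sends them back on later requests, and it accepts a per-session proxy mapping. The helper names `make_session` and `rotating_sessions` are hypothetical, and any proxy URL you pass in is a placeholder for whatever your proxy provider gives you.

```python
import itertools

import requests


def make_session(user_agent, proxy_url=None):
    """Create a cookie-keeping session, optionally routed through one proxy."""
    session = requests.Session()  # stores and resends cookies automatically
    session.headers.update({"User-Agent": user_agent})
    if proxy_url:
        # Send both plain and TLS traffic through the same proxy.
        session.proxies.update({"http": proxy_url, "https": proxy_url})
    return session


def rotating_sessions(user_agent, proxy_urls):
    """Return an endless round-robin iterator over one session per proxy."""
    if not proxy_urls:
        raise ValueError("at least one proxy URL is required")
    sessions = [make_session(user_agent, url) for url in proxy_urls]
    return itertools.cycle(sessions)
```

Each session keeps its own cookie jar, so the cookies seen through one proxy are never replayed through another. Take `next(pool)` before each request to spread the traffic across the proxies.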

Answer 2

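From personal experience scraping Google Scholar: waiting 45 seconds between requests is enough to avoid the CAPTCHA and bot detection. I have had a scraper running for more than 3 days without being detected. If you do get flagged, waiting about 2 hours is enough to start again. Here is an extract from my code.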

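import logging
import re
import time
import urllib.parse

import requests
from bs4 import BeautifulSoup

logger = logging.getLogger(__name__)


class ScholarScrape():
    def __init__(self):
        self.page = None
        self.last_url = None
        self.last_time = time.time()
        # ConfigFile is the author's own configuration helper; it is not part of this extract.
        # Minimum number of seconds to leave between two requests.
        self.min_time_between_scrape = int(ConfigFile.instance().config.get('scholar','bot_avoidance_time'))
        self.header = {'User-Agent':ConfigFile.instance().config.get('scholar','user_agent')}
        # A Session keeps cookies between requests.
        self.session = requests.Session()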

    def search(self, query=None, year_lo=None, year_hi=None, title_only=False, publication_string=None, author_string=None, include_citations=True, include_patents=True):
        url = self.get_url(query, year_lo, year_hi, title_only, publication_string, author_string, include_citations, include_patents)
        # Wait until the minimum interval since the previous request has elapsed.
        wait_time = self.min_time_between_scrape - (time.time() - self.last_time)
        if wait_time > 0:
            logger.info("Delaying search by {} seconds to avoid bot detection.".format(wait_time))
            time.sleep(wait_time)
        self.last_time = time.time()
        logger.info("SCHOLARSCRAPE: " + url)
        self.page = BeautifulSoup(self.session.get(url, headers=self.header).text, 'html.parser')
        self.last_url = url

        if "Our systems have detected unusual traffic from your computer network" in str(self.page):
            raise BotDetectionException("Google has blocked this computer for a short time because it has detected this scraping script.")

    def get_url(self, query=None, year_lo=None, year_hi=None, title_only=False, publication_string=None, author_string=None, include_citations=True, include_patents=True):
        base_url = "https://scholar.google.com.au/scholar?"
        url = base_url + "as_q=" + urllib.parse.quote(query)

        if year_lo is not None and bool(re.match(r'.*([1-3][0-9]{3})', str(year_lo))):
            url += "&as_ylo=" + str(year_lo)

        if year_hi is not None and bool(re.match(r'.*([1-3][0-9]{3})', str(year_hi))):
            url += "&as_yhi=" + str(year_hi)

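        # as_occt selects where the query terms must occur; reusing as_yhi here
        # would overwrite the upper year bound set above.
        if title_only:
            url += "&as_occt=title"
        else:
            url += "&as_occt=any"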

        if publication_string is not None:
            url += "&as_publication=" + urllib.parse.quote('"' + str(publication_string) + '"')

        if author_string is not None:
            url += "&as_sauthors=" + urllib.parse.quote('"' + str(author_string) + '"')

        if include_citations:
            url += "&as_vis=0"
        else:
            url += "&as_vis=1"

        if include_patents:
            url += "&as_sdt=0"
        else:
            url += "&as_sdt=1"

        return url

    def get_results_count(self):
        e = self.page.find_all("div", {"class": "gs_ab_mdw"})
        try:
            item = e[1].text.strip()
        except IndexError:
            # A missing results bar usually means we were served the block page.
            if "Our systems have detected unusual traffic from your computer network" in str(self.page):
                raise BotDetectionException("Google has blocked this computer for a short time because it has detected this scraping script.")
            else:
                raise

        if self.has_numbers(item):
            return self.get_results_count_from_soup_string(item)
        for item in e:
            item = item.text.strip()
            if self.has_numbers(item):
                return self.get_results_count_from_soup_string(item)
        return 0

    @staticmethod
    def get_results_count_from_soup_string(element):
        if "About" in element:
            num = element.split(" ")[1].strip().replace(",","")
        else:
            num = element.split(" ")[0].strip().replace(",","")
        return num

    @staticmethod
    def has_numbers(input_string):
        return any(char.isdigit() for char in input_string)


class BotDetectionException(Exception):
    pass

if __name__ == "__main__":
    s = ScholarScrape()
    s.search(
        query='"policy shaping"',
        # publication_string="JMLR",
        author_string="gilboa",
        year_lo="1995",
        year_hi="2005",
    )
    x = s.get_results_count()
    print(x)
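
The "wait about 2 hours if you get flagged" advice can be wrapped around any search call. This is only a sketch under assumptions: `search_with_cooldown` and its parameters are hypothetical names, the 2-hour default is the figure quoted in this answer, and `BotDetectionException` is redefined here only so the snippet runs on its own.

```python
import time


class BotDetectionException(Exception):
    """Stand-in for the exception class defined in the extract."""


def search_with_cooldown(do_search, cooldown_seconds=2 * 60 * 60,
                         max_attempts=3, sleep=time.sleep):
    """Call do_search(); if it reports a block, cool down and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return do_search()
        except BotDetectionException:
            if attempt == max_attempts:
                raise  # give up and let the caller see the block
            sleep(cooldown_seconds)  # wait out the block before retrying
```

With the class from the extract, this could be used as `search_with_cooldown(lambda: s.search(query='"policy shaping"'))`, followed by `s.get_results_count()`.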

Tags: python, beautifulsoup, web-crawler, web-scraping