Speed up web-scraping

I have a project in which I have to scrape all the ratings for 50 actors/actresses, which means I have to visit and scrape roughly 3,500 web pages. This is taking longer than I expected, and I'm looking for a way to speed it up. I know there are frameworks like scrapy, but I'd like to do this without any additional modules. Is there a quick and easy way to rewrite my code, or would that take too much time? My code is below:

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup


    def getMovieRatingDf(movie_links):
        counter = -1
        movie_name = []
        movie_rating = []
        movie_year = []

        for movie in movie_links.tolist()[0]:
            counter += 1

            request = requests.get('http://www.imdb.com/' + movie_links.tolist()[0][counter])
            film_soup = BeautifulSoup(request.text, 'html.parser')

            if (film_soup.find('div', {'class': 'title_wrapper'}).find('a').text).isdigit():
                movie_year.append(int(film_soup.find('div', {'class': 'title_wrapper'}).find('a').text))

                # scrape the name and year of the current film
                movie_name.append(list(film_soup.find('h1'))[0])

                try:
                    movie_rating.append(float(film_soup.find('span', {'itemprop': 'ratingValue'}).text))
                except AttributeError:
                    movie_rating.append(-1)
            else:
                continue

        rating_df = pd.DataFrame(data={"movie name": movie_name, "movie rating": movie_rating, "movie year": movie_year})
        rating_df = rating_df.sort_values(['movie rating'], ascending=False)

        return rating_df

The main bottleneck is easy to spot just by looking at the code: it is blocking in nature. You don't download/parse the next page until the current one has been fully processed.

If you want to speed things up, do the downloading asynchronously, in a non-blocking way. That is what Scrapy gives you out of the box:

Here you notice one of the main advantages about Scrapy: requests are scheduled and processed asynchronously. This means that Scrapy doesn’t need to wait for a request to be finished and processed, it can send another request or do other things in the meantime. This also means that other requests can keep going even if some request fails or an error happens while handling it.
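For illustration only, here is a rough sketch of what such a spider might look like. It is not a drop-in replacement: the spider name, the way movie_links is passed in, and the CSS selectors are assumptions based on the code in the question.

    import scrapy


    class ImdbRatingsSpider(scrapy.Spider):
        name = 'imdb_ratings'

        def __init__(self, movie_links=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # same relative links as in the question's movie_links
            self.start_urls = ['http://www.imdb.com/' + link
                               for link in (movie_links or [])]

        def parse(self, response):
            # Scrapy schedules all start_urls up front and calls parse()
            # as each response arrives, so pages download concurrently.
            yield {
                'movie name': response.css('h1::text').get(),
                'movie year': response.css('div.title_wrapper a::text').get(),
                'movie rating': response.css('span[itemprop="ratingValue"]::text').get(),
            }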

Another option is to switch from requests to grequests; sample code can be found here (a short sketch follows the link):

  • How to use python-requests and event hooks to write a web crawler with a callback function?
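As a rough illustration, and assuming grequests is installed, the request loop could look something like this (it reuses movie_links from the question; the parsing logic itself stays the same):

    import grequests
    from bs4 import BeautifulSoup

    urls = ['http://www.imdb.com/' + link for link in movie_links.tolist()[0]]

    # build the requests first, then send them all concurrently
    pending = (grequests.get(url) for url in urls)
    responses = grequests.map(pending)  # responses come back in input order

    for response in responses:
        if response is None:  # the request failed; skip this page
            continue
        film_soup = BeautifulSoup(response.text, 'lxml')
        # ... same parsing logic as in getMovieRatingDf()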

We can also improve a couple of things at the HTML-parsing stage:

  • Switch from html.parser to lxml (requires lxml to be installed):

    film_soup = BeautifulSoup(request.text, 'lxml')
    
  • Use SoupStrainer to parse only the relevant parts of the document, for example as sketched below.
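A minimal sketch, assuming the only tags you need are the ones queried in the question (h1, the title_wrapper div and the rating span):

    from bs4 import BeautifulSoup, SoupStrainer

    # parse only the tags the scraper actually queries; everything else
    # in the page is skipped, which cuts down parsing time
    relevant_tags = SoupStrainer(['h1', 'div', 'span'])
    film_soup = BeautifulSoup(request.text, 'lxml', parse_only=relevant_tags)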