Speed up web-scraping
I have a project where I have to scrape all of the top 50 rated actors/actresses, which means I have to visit and scrape roughly 3500 web pages. This is taking longer than I expected, and I'm looking for a way to speed it up. I know there are frameworks like scrapy, but I'd like to do this without any additional modules. Is there a quick and easy way to rewrite my code, or would that take too much time?
My code is as follows:
import requests
import pandas as pd
from bs4 import BeautifulSoup


def getMovieRatingDf(movie_links):
    counter = -1
    movie_name = []
    movie_rating = []
    movie_year = []

    for movie in movie_links.tolist()[0]:
        counter += 1
        request = requests.get('http://www.imdb.com/' + movie_links.tolist()[0][counter])
        film_soup = BeautifulSoup(request.text, 'html.parser')

        if (film_soup.find('div', {'class': 'title_wrapper'}).find('a').text).isdigit():
            movie_year.append(int(film_soup.find('div', {'class': 'title_wrapper'}).find('a').text))

            # scrape the name and rating of the current film
            movie_name.append(list(film_soup.find('h1'))[0])
            try:
                movie_rating.append(float(film_soup.find('span', {'itemprop': 'ratingValue'}).text))
            except AttributeError:
                movie_rating.append(-1)
        else:
            continue

    rating_df = pd.DataFrame(data={"movie name": movie_name, "movie rating": movie_rating, "movie year": movie_year})
    rating_df = rating_df.sort_values(['movie rating'], ascending=False)
    return rating_df
The main bottleneck is easy to identify just by looking at the code: it is blocking in nature. You don't download/parse the next page until the current one has been processed.
If you want to speed things up, do it asynchronously, in a non-blocking manner. This is what Scrapy offers out of the box:
Here you notice one of the main advantages about Scrapy: requests are scheduled and processed asynchronously. This means that Scrapy doesn’t need to wait for a request to be finished and processed, it can send another request or do other things in the meantime. This also means that other requests can keep going even if some request fails or an error happens while handling it.
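Scrapy aside, if you want to stay with requests and the standard library only (as the question asks), a thread pool gives a similar overlap of the network waits. A minimal sketch, assuming the same list of relative IMDb links that the code above iterates over (the helper names and worker count are illustrative):

import requests
from concurrent.futures import ThreadPoolExecutor
from bs4 import BeautifulSoup

def fetch(relative_link):
    # download a single film page; parsing can happen once the download finishes
    response = requests.get('http://www.imdb.com/' + relative_link)
    return BeautifulSoup(response.text, 'html.parser')

def fetch_all(relative_links, workers=20):
    # run up to `workers` downloads at the same time instead of one after another
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(fetch, relative_links))

Because the threads spend almost all of their time waiting on I/O, even a modest pool size can shorten the overall runtime considerably.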
Another option is to switch from requests to grequests; sample code can be found here:
- How to use python-requests and event hooks to write a web crawler with a callback function?
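For reference, a minimal sketch of what the grequests variant could look like (grequests has to be installed separately; the list of relative IMDb links is the same assumption as above):

import grequests

def fetch_all(relative_links):
    # build the request objects up front, then let grequests send them concurrently
    pending = (grequests.get('http://www.imdb.com/' + link) for link in relative_links)
    return grequests.map(pending)

grequests.map() fires the prepared requests concurrently and returns the responses in order, with None in place of any request that failed.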
We can also improve a couple of things at the HTML-parsing stage:
- Switch to lxml from html.parser (requires lxml to be installed):
  film_soup = BeautifulSoup(request.text, 'lxml')
- Use a SoupStrainer to parse only the relevant parts of the document (see the sketch below).
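For the SoupStrainer option, a minimal sketch that parses only the tags the function actually reads (which tags to keep is an assumption based on the code above):

from bs4 import BeautifulSoup, SoupStrainer

# the function only looks at the <h1>, the title_wrapper <div> and the rating <span>,
# so skip building a tree for everything else on the page
only_relevant = SoupStrainer(['h1', 'div', 'span'])
film_soup = BeautifulSoup(request.text, 'lxml', parse_only=only_relevant)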