Why does requests.get sometimes not get the whole HTML page?
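While scraping this page I want to extract the certificate (PG, PG-13, etc.) of each movie. Everything seems to work except for the movie named "Reis" (12).
It has a certificate, but requests.get does not seem to download that part of the HTML (BeautifulSoup finds nothing, and I also checked response.text). In some cases I have had a similar problem with urllib.request. The response is successful in both cases (it returns 200). What is the best way to handle this?
Here is my code: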
import numpy as np
from requests import get
from bs4 import BeautifulSoup

def movie_catalog_pages(base_url):
    # Return the response, or None if the request itself failed
    response = None
    try:
        response = get(base_url)
    except Exception:
        print("Not loaded " + base_url)
    return response

url = 'https://www.imdb.com/search/title/?release_date=2017-01-01,2017-12-31&sort=num_votes,desc&start=101'
response = movie_catalog_pages(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
movies = html_soup.find_all('div', class_='lister-item mode-advanced')

for movie in movies:
    # Movie number
    try:
        temp = movie.h3.span.text
    except AttributeError:
        temp = None
    if temp is None:
        i = np.nan
    else:
        i = int(temp.replace('.', '').replace(',', ''))

    # Movie certificate
    try:
        temp = movie.p.find('span', class_="certificate").text
    except AttributeError:
        temp = None
        print('Error================================', i)
    if temp is not None:
        print(i, temp)
Thanks to the comments, I noticed that my problem was caused by my own IP address and the computer I was scraping from.
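Because the status code is 200 either way, it cannot reveal that the markup differs from one machine to another. A minimal sketch of how one might check for this: request the page with explicit headers, then audit the returned HTML for list items that have no certificate. The names fetch, missing_certificates and DEFAULT_HEADERS are hypothetical, and the header values are assumptions for illustration. Headers cannot change the client IP address, so this only helps to rule out header-dependent differences.

```python
import requests
from bs4 import BeautifulSoup

# Assumed header values, for illustration only. The server may still vary its
# response on properties the client cannot set this way, such as the IP address.
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; example-scraper)",
    "Accept-Language": "en-US,en;q=0.8",
}


def fetch(url, headers=None, timeout=10):
    """Fetch a page with explicit headers and raise on HTTP error statuses."""
    response = requests.get(url, headers=headers or DEFAULT_HEADERS, timeout=timeout)
    response.raise_for_status()
    return response.text


def missing_certificates(html):
    """Return the list numbers of items that contain no certificate span."""
    soup = BeautifulSoup(html, "html.parser")
    missing = []
    for movie in soup.select("div.lister-item.mode-advanced"):
        number_tag = movie.select_one("h3 span")
        number = number_tag.get_text(strip=True) if number_tag else None
        # Searches the whole item, not only its first <p> as the question's code does
        if movie.find("span", class_="certificate") is None:
            missing.append(number)
    return missing
```

Running missing_certificates on response text saved from two different machines or networks and comparing the results shows whether the server is sending different markup to each.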