How to iterate pages and get the link and title of each news article
I am scraping 10 pages from this site, https://nypost.com/search/China+COVID-19/page/1/?orderby=relevance (and the pages that follow).
I expected a total of 100 links and titles to be stored in pagelinks.
However, only 10 links and 10 titles are saved.
How can I crawl all 10 pages and store the article links/titles?
Any help would be appreciated!
# Imports used by the snippet (not shown in the original excerpt).
import requests
from time import time, sleep
from random import randint
from warnings import warn
from bs4 import BeautifulSoup as bs
from IPython.display import clear_output

def scrape(url):
    user_agent = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; Touch; rv:11.0) like Gecko'}
    request = 0
    urls = [f"{url}{x}" for x in range(1, 11)]
    params = {
        "orderby": "relevance",
    }
    for page in urls:
        response = requests.get(url=page,
                                headers=user_agent,
                                params=params)
        # controlling the crawl-rate
        start_time = time()
        # pause the loop
        sleep(randint(8, 15))
        # monitor the requests
        request += 1
        elapsed_time = time() - start_time
        print('Request:{}; Frequency: {} request/s'.format(request, request/elapsed_time))
        clear_output(wait=True)
        # throw a warning for non-200 status codes
        if response.status_code != 200:
            warn('Request: {}; Status code: {}'.format(request, response.status_code))
        # break the loop if the number of requests is greater than expected
        if request > 72:
            warn('Number of requests was greater than expected.')
            break
        # parse the content
        soup_page = bs(response.text, 'html.parser')
        # select all the articles for a single page
        containers = soup_page.findAll("li", {'class': 'article'})
        # scrape the links of the articles
        pagelinks = []
        for link in containers:
            url = link.find('a')
            pagelinks.append(url.get('href'))
        print(pagelinks)
        # scrape the titles of the articles
        title = []
        for link in containers:
            atitle = link.find(class_='entry-heading').find('a')
            thetitle = atitle.get_text()
            title.append(thetitle)
        print(title)
Move pagelinks = [] (and likewise title = []) out of the for page in urls: loop. By initializing it inside the loop, you overwrite the list of links on every page iteration, so at the end you are left with only the 10 links from the last page.
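To see the effect in isolation, here is a minimal sketch (not part of the scraper itself) of how re-initializing an accumulator inside a loop discards everything collected on earlier iterations:

# Pitfall: the list is re-created on every iteration, so only the
# items appended on the final iteration survive.
for page in range(3):
    results = []              # reset on every pass -> earlier items are lost
    results.append(page)
print(results)                # [2]

# Fix: create the list once, before the loop, and keep appending to it.
results = []
for page in range(3):
    results.append(page)
print(results)                # [0, 1, 2]

The same applies to the title list, which is why both initializations are moved above the page loop in the full corrected scraper below: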
def scrape(url):
    user_agent = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; Touch; rv:11.0) like Gecko'}
    request = 0
    urls = [f"{url}{x}" for x in range(1, 11)]
    params = {
        "orderby": "relevance",
    }
    pagelinks = []
    title = []
    for page in urls:
        response = requests.get(url=page,
                                headers=user_agent,
                                params=params)
        # controlling the crawl-rate
        start_time = time()
        # pause the loop
        sleep(randint(8, 15))
        # monitor the requests
        request += 1
        elapsed_time = time() - start_time
        print('Request:{}; Frequency: {} request/s'.format(request, request/elapsed_time))
        clear_output(wait=True)
        # throw a warning for non-200 status codes
        if response.status_code != 200:
            warn('Request: {}; Status code: {}'.format(request, response.status_code))
        # break the loop if the number of requests is greater than expected
        if request > 72:
            warn('Number of requests was greater than expected.')
            break
        # parse the content
        soup_page = bs(response.text, 'html.parser')
        # select all the articles for a single page
        containers = soup_page.findAll("li", {'class': 'article'})
        # scrape the links of the articles
        for link in containers:
            url = link.find('a')
            pagelinks.append(url.get('href'))
        # scrape the titles of the articles
        for link in containers:
            atitle = link.find(class_='entry-heading').find('a')
            thetitle = atitle.get_text()
            title.append(thetitle)
    print(title)
    print(pagelinks)
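For completeness, a hypothetical way to call it: the argument below is the search URL from the question minus the page number (the f-string inside scrape appends 1 through 10), and it assumes you add return pagelinks, title as the last line of scrape so the results can be reused rather than only printed.

# Hypothetical usage, assuming scrape() ends with: return pagelinks, title
base_url = "https://nypost.com/search/China+COVID-19/page/"
links, titles = scrape(base_url)
print(len(links), len(titles))   # expected: 100 and 100, if every page yields 10 articles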