With this code I can get the list of authors and book titles from the first URL. How do I crawl data from multiple URLs using BeautifulSoup?

import requests
from bs4 import BeautifulSoup
from pandas import DataFrame

urls = ['http://www.gutenberg.org/ebooks/search/?sort_order=title',
        'http://www.gutenberg.org/ebooks/search/?sort_order=title&start_index=26']

# Accumulate results across all pages, not just the last one.
article_title = []
article_author = []

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')

    soup_title = soup.findAll("span", {"class": "title"})      # books
    soup_para = soup.findAll("span", {"class": "subtitle"})    # authors

    for x in range(len(soup_para)):
        article_title.append(soup_title[x].text.strip())
        article_author.append(soup_para[x].text)

data = {'Article_Author': article_author, 'Article_Title': article_title}
df = DataFrame(data, columns = ['Article_Title', 'Article_Author'])
print(df)
len(df)

I need to crawl data from 'http://www.gutenberg.org/ebooks/search/?sort_order=title' through to the last page. How can I iterate through the pages to get all the authors and the titles of their works in that section?

Do you mean that after the first 25 results you want to navigate to the next page and fetch its results as well? You can use BeautifulSoup to get the URL of the "Next" button at the bottom right of the page:

next_link = soup.find('a', {'title': 'Go to the next page results.'})
next_url = 'http://www.gutenberg.org' + next_link['href']  # find() returns the tag, not the URL; read its href

Then run your code again with the new URL.
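
Putting this together, here is a minimal sketch of the full loop. It assumes, as your code does, that titles and subtitles pair up one-to-one on each page, and that the "Next" link keeps the title attribute shown above on every page:

import requests
from bs4 import BeautifulSoup
from pandas import DataFrame

base = 'http://www.gutenberg.org'
url = base + '/ebooks/search/?sort_order=title'

article_title = []
article_author = []

while url:
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')

    # Pair each title with its subtitle (author) on the current page.
    for title, author in zip(soup.findAll('span', {'class': 'title'}),
                             soup.findAll('span', {'class': 'subtitle'})):
        article_title.append(title.text.strip())
        article_author.append(author.text.strip())

    # Follow the "Next" link; find() returns None on the last page, which ends the loop.
    next_link = soup.find('a', {'title': 'Go to the next page results.'})
    url = base + next_link['href'] if next_link else None

df = DataFrame({'Article_Title': article_title, 'Article_Author': article_author},
               columns=['Article_Title', 'Article_Author'])
print(df)

Note that zip() stops at the shorter of the two lists, so a page where a book has no subtitle can shift the pairing; that limitation is inherited from your original index-based loop.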