How can I automatically parse tables spanning over multiple pages with Python

I want to parse a table (or several tables) that spans multiple pages. My approach below works, but it is too manual; I would like it to parse the tables from the different pages automatically and merge them into one. The number of pages may not always be the same.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

one = "https://rittresultater.no/nb/sb_tid/923?page=0&pv2=11027&pv1=U"
two = "https://rittresultater.no/nb/sb_tid/923?page=1&pv2=11027&pv1=U"
three = "https://rittresultater.no/nb/sb_tid/923?page=2&pv2=11027&pv1=U"

#parse the first page
html = urlopen(one)
soup = BeautifulSoup(html, "lxml")
table = soup.find_all(class_="table-condensed")
one = pd.read_html(str(table))[0]

#parse the second page
html = urlopen(two)
soup = BeautifulSoup(html, "lxml")
table = soup.find_all(class_="table-condensed")
two = pd.read_html(str(table))[0]

#parse the third page
html = urlopen(three)
soup = BeautifulSoup(html, "lxml")
table = soup.find_all(class_="table-condensed")
three = pd.read_html(str(table))[0]

df = pd.concat([one,two,three], axis = 0)
df

Note that the URLs differ only in the "page=X" part. The web page itself also contains links to, e.g., the next page.

results = {}
for page_num in range(1, 10): #change depending on max page
    address = 'https://rittresultater.no/nb/sb_tid/923?page=' + \
               str(page_num) + '&pv2=11027&pv1=U' 

    html = urlopen(address)
    soup = BeautifulSoup(html, 'lxml')
    table = soup.find_all(class_='table-condensed')
    output = pd.read_html(str(table))[0]
    results[page_num] = output

When that has finished, use a list comprehension to concatenate the relevant outputs, like the last line of your code, but scaled up to cover all pages:

df = pd.concat([v for v in results.values()], axis = 0)
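
Since the number of pages may vary, one option is to drop the hard-coded `range(1, 10)` and keep requesting pages until one comes back without a `table-condensed` table. The sketch below demonstrates that stop condition with a hypothetical `fetch_page` stand-in for `urlopen`; in your case it would return `urlopen(base.format(page_num))` instead of the canned HTML strings used here:

```python
from io import StringIO

from bs4 import BeautifulSoup
import pandas as pd

# Fake pages standing in for the live site: two result pages, then a
# page with no table, which is how we detect we ran past the last page.
pages = [
    '<table class="table-condensed"><tr><th>Rank</th></tr><tr><td>1</td></tr></table>',
    '<table class="table-condensed"><tr><th>Rank</th></tr><tr><td>2</td></tr></table>',
    '<p>no results here</p>',
]

def fetch_page(page_num):
    # Hypothetical stand-in; with the real site this would be
    # urlopen('https://rittresultater.no/nb/sb_tid/923?page={}&pv2=11027&pv1=U'.format(page_num))
    return pages[page_num]

frames = []
page_num = 0
while True:
    soup = BeautifulSoup(fetch_page(page_num), 'lxml')
    table = soup.find_all(class_='table-condensed')
    if not table:  # no table on this page -> we are past the last page
        break
    # StringIO avoids the pandas deprecation warning for literal HTML strings
    frames.append(pd.read_html(StringIO(str(table)))[0])
    page_num += 1

# ignore_index=True gives the combined frame one continuous index
df = pd.concat(frames, ignore_index=True)
```

With the real URLs you would also want to guard the `urlopen` call (e.g. catch `HTTPError`), since some sites return an error page rather than an empty one after the last page.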