Read tables from HTML page by changing the ID using python

Question

I am reading a table from a page using the following HTML link:
http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin=2016664

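The last part of the link (allbin) is an ID. By using a different ID you can access a different table and different records. The rest of the link stays the same; only the ID at the end changes. I have 1000 such IDs. How can I use these different IDs to access the different tables and concatenate them together?

The output should look like this: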

ID         Number         Type             FileDate
2016664   NB 14581-26     New Building    12/21/2020
4257909   NB 1481-29      New Building    3/6/2021
4138920   NB 481-29       New Building    9/4/2020

List of other IDs to use:

['4257909', '4138920', '4533715']

Here is my attempt. With this code I can read a single table.

import requests
import pandas as pd

url = 'http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin=2016664'
html = requests.get(url).content
df_list = pd.read_html(html,header=0)
df = df_list[3]
    
df

Answer 1

To get all pages for every ID in a list, you can use the following example:

import requests
import pandas as pd
from io import StringIO

url = "http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin={}&allcount={}"


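def get_info(ID, page=1):
    # Collect one DataFrame per result page for the given ID.
    out = []
    while True:
        try:
            print("ID: {} Page: {}".format(ID, page))
            t = requests.get(url.format(ID, page), timeout=1).text
            # Table 3 holds the records; row 0 is the header row, so skip it.
            # .copy() avoids a SettingWithCopyWarning when adding the ID column.
            df = pd.read_html(StringIO(t))[3].loc[1:, :].copy()
            if len(df) == 0:
                # An empty table means there are no more pages for this ID.
                break
            df.columns = ["NUMBER", "NUMBER", "TYPE", "FILE DATE"]
            df["ID"] = ID
            out.append(df)
            # allcount is the index of the first record; each page shows 25.
            page += 25
        except requests.exceptions.ReadTimeout:
            # Retry the same page after a timeout (retries are unbounded).
            print("Timeout...")
            continue
    return out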


list_of_ids = [2016664, 4257909, 4138920, 4533715]

dfs = []
for ID in list_of_ids:
    dfs.extend(get_info(ID))

df = pd.concat(dfs)
print(df)
df.to_csv("data.csv", index=None)

Prints:

            NUMBER NUMBER                      TYPE   FILE DATE       ID
1    ALT 1469-1890    NaN                ALTERATION  00/00/0000  2016664
2    ALT 1313-1874    NaN                ALTERATION  00/00/0000  2016664
3      BN 332-1938    NaN           BUILDING NOTICE  00/00/0000  2016664
4      BN 636-1916    NaN           BUILDING NOTICE  00/00/0000  2016664
5  CO NB 1295-1923  (PDF)  CERTIFICATE OF OCCUPANCY  00/00/0000  2016664

...

and saves data.csv (screenshot from LibreOffice).
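With 1000 IDs, the unbounded retry in get_info could loop forever if the server stops responding. The following is a minimal sketch of a bounded retry, not part of the original answer: the helper name fetch_page and its parameters max_retries, pause and get are hypothetical, and the timeout and retry counts are arbitrary assumptions to tune for your case.

```python
import time

import requests

URL = ("http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet"
       "?requestid=1&allbin={}&allcount={}")


def fetch_page(bin_id, start, max_retries=3, timeout=5, pause=1.0,
               get=requests.get):
    """Return the page HTML as text, or None if every attempt timed out.

    `get` is injectable (hypothetical parameter) so the function can be
    exercised without touching the network.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return get(URL.format(bin_id, start), timeout=timeout).text
        except requests.exceptions.Timeout:
            # Timeout is the parent of ReadTimeout and ConnectTimeout.
            print("Timeout on ID {} (attempt {}/{})".format(
                bin_id, attempt, max_retries))
            time.sleep(pause)
    return None
```

Inside get_info, a None result could be treated as "skip this ID and record it for a later rerun" instead of retrying indefinitely.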

Answer 2

The code below extracts all the tables from the web page:

import pandas as pd

url = 'http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin=2016664'

df_list = pd.read_html(url)  # returns a list of DataFrames, one per table on the page

print(len(df_list))  # print the number of DataFrames

# iterate over the list to print every table
for df in df_list:
    print(df)
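Once every table has been printed, the records table still has to be picked out of df_list. The question and Answer 1 do this by position (index 3), which breaks if the page layout changes. As a hedged alternative, the sketch below picks the table by its header labels instead. The function name find_table is hypothetical, and the labels NUMBER, TYPE and FILE DATE are an assumption taken from the column names used in Answer 1; check them against the real page.

```python
import pandas as pd


def find_table(tables, required=("NUMBER", "TYPE", "FILE DATE")):
    """Return the first DataFrame that contains all `required` labels,
    either as column names or as values in its first row; else None."""
    for table in tables:
        labels = {str(c).strip().upper() for c in table.columns}
        if len(table) > 0:
            # read_html may leave the header as the first data row.
            labels |= {str(v).strip().upper() for v in table.iloc[0]}
        if set(required) <= labels:
            return table
    return None
```

Usage would be along the lines of records = find_table(df_list), followed by a check for None before processing.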
