How to extract specific table data (div\tr\td) from multiple URLs in a website in a literate way into CSV (with sample)

Question

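I am learning Python and practicing it by extracting data from a public site, but I ran into a problem along the way. I would appreciate your kind help. Thanks in advance! I will check this thread every day and wait for your comments :)

Purpose:
Extract the columns, rows and their contents from all 65 pages into one csv file with a single script

URL pattern for looping over the 65 pages: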
http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo=1
..........
http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo=65

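Problem 1:
When I run the one-page script below to extract a single page of data into a csv, I have to run it twice with different file names before the data shows up in the file from the first run. For example, if I run it with test.csv, the file stays at 0 KB. After I change the file name to test2 and run the script again, the data appears in test.csv, but now test2.csv is the one left at 0 KB. Any ideas?

Here is the one-page extraction code: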

import requests
import csv
from bs4 import BeautifulSoup as bs
url = requests.get("http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo=1")
soup = bs(url.content, 'html.parser')
filename = "test.csv"
csv_writer = csv.writer(open(filename, 'w', newline=''))
divs = soup.find_all("div", class_ = "iiright")
for div in divs:
    for tr in div.find_all("tr")[1:]:
        data = []
        for td in tr.find_all("td"):
            data.append(td.text.strip())
        if data:
            print("Inserting data: {}".format(','.join(data)))
            csv_writer.writerow(data)

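Problem 2: I ran into a problem when reading the 65 page URLs to extract the data into a csv. It does not work... any ideas on how to fix it?

Here is the extraction code for the 65 page URLs: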

import requests
import csv
from bs4 import BeautifulSoup as bs
url = "http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo={}"
def get_data(url):
    for url in [url.format(pageNo) for pageNo in range(1,65)]:
        soup = bs(url.content, 'html.parser')
        for div in soup.find_all("div", class_ = "iiright"):
            for tr in div.find_all("tr"):
                data = []
                for td in tr.find_all("td"):
                    data.append(td.text.strip())
                    if data:
                        print("Inserting data: {}".format(','.join(data)))
                        writer.writerow(data)

if __name__ == '__main__':
    with open("test.csv","w",newline="") as infile:
        writer = csv.writer(infile)
        get_data(url)

Answer 1

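Just another approach

To keep it simple, you could use pandas, since it does all of this for you under the hood.

Define a list (data) to hold your results
Read each page with pd.read_html
concat the DataFrames in data and write them out with to_csv or to_excel

read_html

Find the table that matches a string -> match='预售信息查询：' and select it with [0], because read_html() always gives you a list of tables
Take a specific row as the header: header=2
Get rid of the last row and last column, which come from a wrong colspan, with .iloc[:-1,:-1]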
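
To make the trimming step concrete, here is a minimal sketch on a small hand-made DataFrame. The column names and values are invented for illustration and are not taken from the site; only the `.iloc[:-1, :-1]` slice and the `pd.concat` call mirror the steps above.

```python
import pandas as pd

# Hypothetical stand-in for one parsed page: the last column and the last
# row play the role of the junk produced by the bad colspan.
page = pd.DataFrame(
    {
        "project": ["A", "B", "pager"],
        "permit": ["001", "002", "pager"],
        "junk": [None, None, None],
    }
)

# Keep every row except the last and every column except the last.
trimmed = page.iloc[:-1, :-1]

# Stack several trimmed pages into one frame; ignore_index=True only
# renumbers the rows so the index does not repeat per page.
combined = pd.concat([trimmed, trimmed], ignore_index=True)
print(combined)
```

Whether the real pages need exactly one row and one column removed depends on the site's markup, so it is worth printing one untrimmed page first.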

Example

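import pandas as pd

data = []  # one DataFrame per page

# Demo: pages 1-4 only; use range(1, 66) to cover all 65 pages
for pageNo in range(1, 5):
    url = f'http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo={pageNo}'
    # read_html returns a list of tables; [0] takes the first one matching `match`
    table = pd.read_html(url, header=2, match='预售信息查询：')[0]
    # drop the last row and the last column
    data.append(table.iloc[:-1, :-1])

pd.concat(data).to_csv('test.csv', index=False)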

Example (based on your function code)

import pandas as pd

url = "http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey="

def get_data(url):
    data = []  # one DataFrame per page

    # Demo: page 1 only; use range(1, 66) to cover all 65 pages
    for pageNo in range(1, 2):
        page = pd.read_html(f'{url}&pageNo={pageNo}', header=2, match='预售信息查询：')[0]
        # drop the last row and the last column
        data.append(page.iloc[:-1, :-1])

    pd.concat(data).to_csv('test.csv', index=False)

if __name__ == '__main__':
    get_data(url)
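
Separately from the pandas route, a few details in the question's own scripts are worth a look. In the one-page script the file object passed to csv.writer is never closed, so buffered rows may not reach the disk while the interpreter session is still alive; that is a likely, though not confirmed, explanation for the 0 KB file. In the 65-page script, range(1,65) stops at page 64, url.content is called on a plain string because requests.get is never invoked, and the `if data:` check sits inside the td loop, so a partial row is written for every cell. Below is a minimal standard-library sketch of the first two points; `page_urls`, `write_rows` and the sample rows are hypothetical names and data made up for illustration.

```python
import csv
import os
import tempfile

BASE = ("http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml"
        "?method=doQuery&selYsxk=xmmc&txtkey=&pageNo={}")

def page_urls(last_page=65):
    """Build the URLs for pages 1..last_page (range's stop is exclusive)."""
    return [BASE.format(n) for n in range(1, last_page + 1)]

def write_rows(path, rows):
    """Write rows to a CSV file; the with-block closes and flushes it."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)

urls = page_urls()
print(len(urls), urls[-1])

# Made-up rows standing in for the scraped <td> text.
sample_rows = [["name", "permit"], ["demo project", "0001"]]
out_path = os.path.join(tempfile.mkdtemp(), "test.csv")
write_rows(out_path, sample_rows)
print(os.path.getsize(out_path))  # non-zero as soon as the with-block exits
```

In a real run, each URL would still need to be fetched with requests.get before its content is handed to BeautifulSoup.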
