Iterating through multiple HTML files and converting to CSV
I have 32 individual HTML files with data in table format, containing 8 columns of data. Each file covers a specific species of fungus.
I need to convert the 32 HTML files, along with their data, into 32 CSV files. I have a script for a single file, but I can't figure out how to do this systematically for all 32 files with a few commands, rather than running the command I have 32 times.
Here is the script I'm using while trying to get it to loop through all 32 files:
import os
from bs4 import BeautifulSoup

directory = r'../html/species'
data = []
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory, filename)
        with open(fname, 'r') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
        # Rows of the first table, skipping the header row
        HTML_data = soup.find_all("table")[0].find_all("tr")[1:]
        for element in HTML_data:
            sub_data = []
            for sub_element in element:
                try:
                    sub_data.append(sub_element.get_text())
                except AttributeError:
                    # Skip stray whitespace strings between cells
                    continue
            data.append(sub_data)
data  # displayed when run in a notebook cell
Here is some of the output data from the script above, simplified for reproducibility:
[['\n\t\t\t\t\t\tAfrica\n\t\t\t\t\t'],
['Kenya',
'Present',
'',
'Introduced',
'',
'',
'Shomari (1996); Ohler (1979); Mniu (1998); Nayar (1998)',
''],
['Malawi',
'Present',
'',
'',
'',
'',
'Malawi, Ministry of Agriculture (1990)',
''],
['Mozambique',
'Present',
'',
'Introduced',
'',
'',
'Ohler (1979); Shomari (1996); Mniu (1998); CABI (Undated)',
''],
['Nigeria',
'Present',
'',
'Introduced',
'',
'',
'Ohler (1979); Shomari (1996); Nayar (1998); CABI (Undated)',
''],
['South Africa', 'Present', '', '', '', '', 'Swart (2004)', ''],
['Tanzania',
'Present',
'',
'',
'',
'',
'Casulli (1979); Martin et al. (1997)',
''],
['Zambia',
'Present',
'',
'Introduced',
'',
'',
'Ohler (1979); Shomari (1996); Mniu (1998); Nayar (1998)',
''],
['\n\t\t\t\t\t\tAsia\n\t\t\t\t\t'],
['India', 'Present', '', 'Introduced', '', '', 'Intini (1987)', ''],
['\n\t\t\t\t\t\tSouth America\n\t\t\t\t\t'],
['Brazil', 'Present', '', '', '', '', 'Ponte (1986)', ''],
['-Sao Paulo',
'Present',
'',
'Native',
'',
'',
'Waller et al. (1992); Shomari (1996)',
''],
['\n\t\t\t\t\t\tAfrica\n\t\t\t\t\t'],
['Egypt',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Ethiopia',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Libya',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Malawi',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Morocco',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Mozambique',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['South Africa',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Sudan',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Tanzania',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Tunisia', 'Present', '', '', '', '', 'Djébali et al. (2009)', ''],
['Uganda',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['\n\t\t\t\t\t\tAsia\n\t\t\t\t\t'],
['Afghanistan',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Armenia',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Azerbaijan',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Bhutan', 'Present', '', '', '', '', 'CABI and EPPO (2010)', '']]
I think what I need is for each species to be formatted more like this: [[info_species1],[info_species1],[info_species1]], [[info_species2],[info_species2],[info_species2]]
Or, in my output, I need:
['-Sao Paulo',
'Present',
'',
'Native',
'',
'',
'Waller et al. (1992); Shomari (1996)',
'']], # AN EXTRA SQUARE BRACKET RIGHT HERE
['\n\t\t\t\t\t\tAfrica\n\t\t\t\t\t'],
['Egypt',
'Present',
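One way to get that extra level of nesting, building on the question's own script, is to collect each file's rows into their own sub-list before appending to the overall list. This is a minimal sketch; the `file_data` name and the one-group-per-file shape are my reading of the desired output, not something tested against the real files:

import os
from bs4 import BeautifulSoup

directory = r'../html/species'
data = []
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory, filename)
        with open(fname, 'r') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
        rows = soup.find_all("table")[0].find_all("tr")[1:]
        # Gather this file's rows in their own list so each species
        # (file) becomes one nested group instead of being flattened
        file_data = []
        for element in rows:
            sub_data = []
            for sub_element in element:
                try:
                    sub_data.append(sub_element.get_text())
                except AttributeError:
                    continue
            file_data.append(sub_data)
        data.append(file_data)  # one sub-list per species/file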
Have you considered using pandas to read the table tags?
import pandas as pd
import os

directory = r'../html/species'
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        csv_filename = filename.replace('.html', '.csv')
        fname = os.path.join(directory, filename)
        with open(fname, 'r') as f:
            # read_html returns a list of DataFrames, one per <table>
            table = pd.read_html(f.read())[0]
        table.to_csv(csv_filename, index=False)
        print(table)
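Note that this writes each CSV into the current working directory. If the CSVs should instead land next to the HTML sources (an assumption about the desired layout, not stated in the question), the output path can be joined onto the directory, and the tab/newline padding around region headers like '\n\t\t\t\t\t\tAfrica\n\t\t\t\t\t' can be stripped first. A sketch under those assumptions:

import pandas as pd
import os

directory = r'../html/species'
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory, filename)
        with open(fname, 'r') as f:
            table = pd.read_html(f.read())[0]
        # Strip whitespace padding from string columns (e.g. region headers)
        table = table.apply(lambda col: col.str.strip() if col.dtype == object else col)
        # Write the CSV next to its source HTML file (assumed layout)
        csv_path = os.path.join(directory, filename.replace('.html', '.csv'))
        table.to_csv(csv_path, index=False)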