Iterating through multiple HTML files and converting to CSV
I have 32 individual HTML files with data in table format, containing 8 columns of data. Each file covers a specific species of fungus.
I need to convert the 32 HTML files, along with their data, into 32 CSV files. I have a script for a single file, but I can't figure out how to do this systematically for all 32 files with a few commands, rather than running the command I have 32 times.
Here is the script I'm using while trying to get it to loop through all 32 files:
import os
from bs4 import BeautifulSoup

directory = r'../html/species'
data = []
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory, filename)
        with open(fname, 'r') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
        # Rows of the first table, skipping the header row
        HTML_data = soup.find_all("table")[0].find_all("tr")[1:]
        for element in HTML_data:
            sub_data = []
            for sub_element in element:
                try:
                    sub_data.append(sub_element.get_text())
                except AttributeError:
                    # Skip stray whitespace strings between cells
                    continue
            data.append(sub_data)
data  # displayed when run in a notebook cell
Here is some of the output data from the script above, simplified for reproducibility:
[['\n\t\t\t\t\t\tAfrica\n\t\t\t\t\t'],
['Kenya',
'Present',
'',
'Introduced',
'',
'',
'Shomari (1996); Ohler (1979); Mniu (1998); Nayar (1998)',
''],
['Malawi',
'Present',
'',
'',
'',
'',
'Malawi, Ministry of Agriculture (1990)',
''],
['Mozambique',
'Present',
'',
'Introduced',
'',
'',
'Ohler (1979); Shomari (1996); Mniu (1998); CABI (Undated)',
''],
['Nigeria',
'Present',
'',
'Introduced',
'',
'',
'Ohler (1979); Shomari (1996); Nayar (1998); CABI (Undated)',
''],
['South Africa', 'Present', '', '', '', '', 'Swart (2004)', ''],
['Tanzania',
'Present',
'',
'',
'',
'',
'Casulli (1979); Martin et al. (1997)',
''],
['Zambia',
'Present',
'',
'Introduced',
'',
'',
'Ohler (1979); Shomari (1996); Mniu (1998); Nayar (1998)',
''],
['\n\t\t\t\t\t\tAsia\n\t\t\t\t\t'],
['India', 'Present', '', 'Introduced', '', '', 'Intini (1987)', ''],
['\n\t\t\t\t\t\tSouth America\n\t\t\t\t\t'],
['Brazil', 'Present', '', '', '', '', 'Ponte (1986)', ''],
['-Sao Paulo',
'Present',
'',
'Native',
'',
'',
'Waller et al. (1992); Shomari (1996)',
''],
['\n\t\t\t\t\t\tAfrica\n\t\t\t\t\t'],
['Egypt',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Ethiopia',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Libya',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Malawi',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Morocco',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Mozambique',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['South Africa',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Sudan',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Tanzania',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Tunisia', 'Present', '', '', '', '', 'Djébali et al. (2009)', ''],
['Uganda',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['\n\t\t\t\t\t\tAsia\n\t\t\t\t\t'],
['Afghanistan',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Armenia',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Azerbaijan',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Bhutan', 'Present', '', '', '', '', 'CABI and EPPO (2010)', '']]
I think what I need is for each species to be formatted more like this: [[info_species1],[info_species1],[info_species1]], [[info_species2],[info_species2],[info_species2]]
Or, in my output, I need:
['-Sao Paulo',
'Present',
'',
'Native',
'',
'',
'Waller et al. (1992); Shomari (1996)',
'']], # AN EXTRA SQUARE BRACKET RIGHT HERE
['\n\t\t\t\t\t\tAfrica\n\t\t\t\t\t'],
['Egypt',
'Present',
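One way to get that extra level of nesting, building on the question's own script, is to collect each file's rows into their own sub-list before appending to the overall list. This is a minimal sketch; the `file_data` name and the one-group-per-file shape are my reading of the desired output, not something tested against the real files:

import os
from bs4 import BeautifulSoup

directory = r'../html/species'
data = []
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory, filename)
        with open(fname, 'r') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
        rows = soup.find_all("table")[0].find_all("tr")[1:]
        # Gather this file's rows in their own list so each species
        # (file) becomes one nested group instead of being flattened
        file_data = []
        for element in rows:
            sub_data = []
            for sub_element in element:
                try:
                    sub_data.append(sub_element.get_text())
                except AttributeError:
                    continue
            file_data.append(sub_data)
        data.append(file_data)  # one sub-list per species/file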
Have you considered using pandas to read the table tags?
import pandas as pd
import os

directory = r'../html/species'
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        csv_filename = filename.replace('.html', '.csv')
        fname = os.path.join(directory, filename)
        with open(fname, 'r') as f:
            # read_html returns a list of DataFrames, one per <table>
            table = pd.read_html(f.read())[0]
        table.to_csv(csv_filename, index=False)
        print(table)
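Note that this writes each CSV into the current working directory. If the CSVs should instead land next to the HTML sources (an assumption about the desired layout, not stated in the question), the output path can be joined onto the directory, and the tab/newline padding around region headers like '\n\t\t\t\t\t\tAfrica\n\t\t\t\t\t' can be stripped first. A sketch under those assumptions:

import pandas as pd
import os

directory = r'../html/species'
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory, filename)
        with open(fname, 'r') as f:
            table = pd.read_html(f.read())[0]
        # Strip whitespace padding from string columns (e.g. region headers)
        table = table.apply(lambda col: col.str.strip() if col.dtype == object else col)
        # Write the CSV next to its source HTML file (assumed layout)
        csv_path = os.path.join(directory, filename.replace('.html', '.csv'))
        table.to_csv(csv_path, index=False)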