Looping over a folder of HTML files and executing a predefined function on each file

Question

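I'm still new to coding. I need to write code that walks through a data folder containing many HTML files and runs a predefined function on each one (extracting specific tables from the HTML document). I use bs4 to parse the HTML files. The solution suggested below retrieves the files and extracts the tables from each HTML file.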

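from pathlib import Path

from bs4 import BeautifulSoup

def get_soup(html_file_path):
    # Parse one HTML file; the "lxml" parser requires the lxml package
    with html_file_path.open() as f:
        return BeautifulSoup(f, "lxml")

def get_all_tables(soup):
    # Return every <table> element in the parsed document
    return soup.find_all("table")

def get_all_html_files(root_path):
    # Recursively yield every .html file under root_path
    return Path(root_path).glob("**/*.html")

if __name__ == "__main__":
    html_root = Path("data_file_pathname/")

    # Run the extraction on each HTML file found under html_root
    for html_file in get_all_html_files(html_root):
        soup = get_soup(html_file)
        tables = get_all_tables(soup)
        print(f"[+] Found a total of {len(tables)} tables in {html_file}.")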

Thanks.

Answer 1

You can use the Path.glob method from the pathlib standard library module.

For example:

from pathlib import Path

from bs4 import BeautifulSoup

def get_soup(html_file_path):  # added argument
    # Open the file in a with-block so the handle is closed after parsing
    with html_file_path.open() as f:
        return BeautifulSoup(f, "lxml")

def get_all_tables(soup):
    return soup.find_all("table")

def get_all_html_files(root_path):
    # "**/*.html" also matches files in nested subfolders
    return Path(root_path).glob("**/*.html")

if __name__ == "__main__":
    html_root = Path("./html_files/")  # This is the folder with the html files

    for html_file in get_all_html_files(html_root):
        soup = get_soup(html_file)
        tables = get_all_tables(soup)
        print(f"[+] Found a total of {len(tables)} tables in {html_file}.")
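
To see what the recursive pattern picks up, here is a minimal sketch that builds a throwaway folder tree and applies a function to every HTML file in it. The names apply_to_each, count_table_tags and demo are hypothetical and not part of the answer above. count_table_tags is only a stand-in for the bs4-based extraction: it counts opening table tags by plain text search, so the sketch runs with the standard library alone.

```python
import tempfile
from pathlib import Path


def get_all_html_files(root_path):
    # "**/*.html" matches .html files in root_path and in every subfolder.
    # sorted() turns the lazy generator into a list with a stable order.
    return sorted(Path(root_path).glob("**/*.html"))


def apply_to_each(root_path, func):
    # Hypothetical helper: call func on every HTML file and map each path
    # (relative to the root, written POSIX-style) to the result.
    root = Path(root_path)
    return {
        path.relative_to(root).as_posix(): func(path)
        for path in get_all_html_files(root)
    }


def count_table_tags(html_file_path):
    # Stand-in for the bs4 extraction (an assumption for this sketch):
    # counts opening <table> tags with a plain, case-insensitive text search.
    text = html_file_path.read_text(encoding="utf-8").lower()
    return text.count("<table")


def demo():
    # Build a small temporary folder tree to run the loop against.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "sub").mkdir()
        (root / "a.html").write_text(
            "<table></table><table></table>", encoding="utf-8"
        )
        (root / "sub" / "b.html").write_text("<p>no tables</p>", encoding="utf-8")
        # Not an .html file, so the glob pattern skips it.
        (root / "notes.txt").write_text("<table>", encoding="utf-8")
        return apply_to_each(root, count_table_tags)


if __name__ == "__main__":
    print(demo())
```

Running it prints {'a.html': 2, 'sub/b.html': 0}: the file in the nested folder is found and the .txt file is ignored. In real use, a function that calls get_soup and get_all_tables could be passed in place of count_table_tags.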

