How to read and overwrite all *.txt files in a folder with Python and bs4?

I have a folder containing thousands of files. I am trying to parse the XML tags in them using beautifulsoup4.

I can do this for each file individually, but I can't get my script to work with a for loop.

Here is my code so far:

import bs4 as bs
import glob


path = r"~/Desktop/pythontest/*.txt"
files = glob.glob(path)

# ------------------------READ AND PARSE TEXT-----------------------------------------


for f in files:
    # open file in read mode
    source = open(f, "rt")

    # parse xml as soup
    soup = bs.BeautifulSoup(source, "lxml")
    soupText = soup.get_text()
    text = soupText.replace(r"\n", " ")

    # close file
    source.close()


# --------------------------OVERWRITE FILE---------------------------------------------
for f in files:
    # open file in write mode
    source = open(f, "wt")

    # overwrite the file with the soup
    source.write((text))
    # # close file
    source.close()

print(text)

When I run it, the console gives me this:

Traceback (most recent call last):
  File "./camltest.py", line 34, in <module>
    print(text)
NameError: name 'text' is not defined

I suspect this is a scope issue, but I can't fix it. Any suggestions? Thanks.

Note that text is only assigned inside your first for loop.

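If files is an empty list, the loop body never runs and text is never defined. That is probably what happens here, because glob does not expand ~ in the path, so the pattern matches nothing.

You can simply read and then write each file in the same loop. With two separate loops, the second loop would write the text of the last file into every file.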
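A minimal sketch of one way to make the pattern match, assuming the files live under the home directory as in the question: os.path.expanduser resolves the leading ~ before glob.glob sees the pattern. The helper name find_txt_files is hypothetical and only used for illustration.

```python
import glob
import os


def find_txt_files(pattern):
    """Return the sorted paths matching pattern, expanding a leading "~" first.

    glob.glob treats "~" as a literal directory name, so the pattern is
    passed through os.path.expanduser before globbing.
    """
    return sorted(glob.glob(os.path.expanduser(pattern)))


# Path taken from the question; adjust it to your own folder.
files = find_txt_files("~/Desktop/pythontest/*.txt")
if not files:
    # Failing loudly here is easier to debug than a NameError later on.
    print("No files matched the pattern; check the path.")
```

Checking for an empty list right after globbing makes the real cause visible instead of surfacing later as an undefined variable.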

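for f in files:
    # read and parse first; opening with "w+" would truncate the file before it is read
    with open(f, "rt") as source:
        soup = bs.BeautifulSoup(source, "lxml")
    # "\n" (not r"\n") matches real newline characters
    text = soup.get_text().replace("\n", " ")
    # then overwrite the same file with the extracted text
    with open(f, "wt") as target:
        target.write(text)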
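If you would rather keep a single file handle per file, a possible alternative is "r+" mode, which opens for reading and writing without truncating. The sketch below is an illustration under assumptions, not the only way to do it: the function name rewrite_in_place, the transform parameter, and the UTF-8 default encoding are all hypothetical choices that do not come from the question.

```python
def rewrite_in_place(path, transform, encoding="utf-8"):
    """Read a file, apply transform to its text, and overwrite it via one handle.

    transform is any callable that takes the old text and returns the new text.
    The utf-8 default is an assumption; use the encoding your files really have.
    """
    # "r+" keeps the existing content (unlike "w+", which empties the file)
    with open(path, "r+", encoding=encoding) as handle:
        original = handle.read()
        updated = transform(original)
        handle.seek(0)       # go back to the start before writing
        handle.write(updated)
        handle.truncate()    # drop leftover content if the new text is shorter
    return updated


# Example use with the parsing step from the question (requires bs4 and lxml):
# for f in files:
#     rewrite_in_place(
#         f, lambda s: bs.BeautifulSoup(s, "lxml").get_text().replace("\n", " ")
#     )
```

The truncate() call matters here because the text without tags is usually shorter than the original, and without it the tail of the old content would remain in the file.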