How to read and overwrite all *.txt files in a folder with Python and bs4?
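I have a folder containing thousands of files. I am trying to parse the XML tags in them with beautifulsoup4.
I can do this for each file individually, but I can't get my script to work with a for loop.
Here is my code so far: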
import bs4 as bs
import glob
path = r"~/Desktop/pythontest/*.txt"
files = glob.glob(path)
# ------------------------READ AND PARSE TEXT-----------------------------------------
for f in files:
    # open file in read mode
    source = open(f, "rt")
    # parse xml as soup
    soup = bs.BeautifulSoup(source, "lxml")
    soupText = soup.get_text()
    text = soupText.replace(r"\n", " ")
    # close file
    source.close()
# --------------------------OVERWRITE FILE---------------------------------------------
for f in files:
    # open file in write mode
    source = open(f, "wt")
    # overwrite the file with the soup
    source.write((text))
    # # close file
    source.close()
print(text)
When I run it, the console gives me this:
Traceback (most recent call last):
  File "./camltest.py", line 34, in <module>
    print(text)
NameError: name 'text' is not defined
I suspect this is a scope issue, but I can't fix it. Any suggestions? Thanks.
Note that text is defined inside your first for loop. If files is an empty list, the loop body never runs and text is never defined.
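One likely reason for files being empty here: glob.glob does not expand "~", so the pattern "~/Desktop/pythontest/*.txt" is matched literally and finds nothing. A minimal sketch of expanding the home directory first, where find_txt_files is a hypothetical helper name and not part of the original script:

```python
import glob
import os


def find_txt_files(folder):
    # Hypothetical helper: list the *.txt files directly inside `folder`.
    # glob does not expand a leading "~" on its own, so expanduser runs first.
    pattern = os.path.join(os.path.expanduser(folder), "*.txt")
    # Sorting only makes the order predictable; glob itself gives no guarantee.
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    files = find_txt_files("~/Desktop/pythontest")
    if not files:
        print("No *.txt files matched; check the folder path.")
    else:
        print(f"Found {len(files)} file(s).")
```

Checking for an empty list before looping makes this kind of path problem visible right away instead of surfacing later as a NameError.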
You can simply read and then write each file within the same loop.
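for f in files:
    # read and parse first; opening with "w+" would truncate the file before it is read
    with open(f, "rt") as source:
        soup = bs.BeautifulSoup(source, "lxml")
    # "\n" (not r"\n") matches real newline characters
    text = soup.get_text().replace("\n", " ")
    # reopen in write mode to overwrite the file with the extracted text
    with open(f, "wt") as target:
        target.write(text)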
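The same read-then-write pattern can be sketched as a self-contained example. The names rewrite_in_place and flatten_newlines are hypothetical, UTF-8 encoding is an assumption about the files, and the parsing step is passed in as a function so the sketch runs without third-party packages:

```python
def flatten_newlines(text):
    # Replace real newline characters with spaces. The raw string r"\n" is a
    # backslash followed by "n", so replace(r"\n", " ") leaves line breaks alone.
    return text.replace("\n", " ")


def rewrite_in_place(paths, transform):
    # Hypothetical helper: overwrite each file with transform(its contents).
    for path in paths:
        # Read everything and close the file first; opening with "w" or "w+"
        # at this point would truncate it before anything could be read.
        # encoding="utf-8" is an assumption about the input files.
        with open(path, "rt", encoding="utf-8") as source:
            original = source.read()
        # Each file gets its own result, written within the same iteration,
        # so no file receives text that was computed from another file.
        with open(path, "wt", encoding="utf-8") as target:
            target.write(transform(original))
```

With bs4 and lxml installed, transform could be something like lambda raw: flatten_newlines(bs.BeautifulSoup(raw, "lxml").get_text()). Since the files are overwritten, trying this on a copy of the folder first is a sensible precaution.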