Remove corrupted xlsx files from a given directory
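Some .xlsx files in a particular directory are corrupted: trying to open the workbook produces the following Windows message:
Excel cannot open the file 'filename.xlsx' because the file format or file extension is not valid. Verify that the file has not been corrupted and that the file extension matches the format of the file.
I would like to know whether it is possible to detect these corrupted files and remove them from the directory.
My attempt: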
import os

import openpyxl
import pandas as pd

############### path settlement and file names ##########
path_reportes = os.path.join(os.getcwd(), 'Reports', 'xlsx_folder')
file_names = os.listdir(path_reportes)
overall_df = dict()

############## concatenate all reports ##################
for file_name in file_names:
    data_file_path = os.path.join(path_reportes, file_name)
    # Try to open each spreadsheet, save it and store its sheets in a
    # dictionary keyed by sheet name; if the file is corrupted, remove it
    # from the folder.
    try:
        # Start by opening the spreadsheet and selecting the main sheet
        workbook = openpyxl.load_workbook(filename=data_file_path)
        sheet = workbook.active
        # Save the spreadsheet
        workbook.save(filename=data_file_path)
        df_report_dict = pd.read_excel(data_file_path, sheet_name=None, engine='openpyxl')
        for key in df_report_dict:
            df_report_dict[key]['report_name'] = file_name
            if key in overall_df:
                # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
                overall_df[key] = pd.concat([overall_df[key], df_report_dict[key]], ignore_index=True)
            else:
                overall_df[key] = df_report_dict[key]
    # when file corrupted then remove it from the folder
    except BadZipFile:
        os.remove(data_file_path)
This raises the following error:
NameError: name 'BadZipFile' is not defined
Is it possible to detect corrupted files? How can I handle them?
What exception do you get when you try to load a corrupted Excel file? Run that experiment, then write a try-except block to handle that condition.
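try:
    df = pd.read_excel(filename)  # load the pandas DataFrame
except CorruptedExcelFile:  # placeholder: substitute the exception you actually observed
    os.remove(filename)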
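From the post you reference, the problem appears to occur when the file is being unzipped, so the appropriate exception is BadZipFile. Use it in the except statement. You will probably want to restrict the handling to that specific exception, because the consequence is deleting the offending file.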
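As a minimal sketch of that advice, the block below relies only on the standard library. It assumes the corruption shows up as an invalid zip container (an .xlsx file is a zip archive), so it does not validate the workbook contents themselves. The names is_corrupted_xlsx, remove_corrupted_xlsx and the dry_run flag are illustrative, not part of any library.

```python
import os
import zipfile
from zipfile import BadZipFile


def is_corrupted_xlsx(path):
    """Return True if the file is not a readable zip archive."""
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() returns the name of the first bad member, or None
            return archive.testzip() is not None
    except BadZipFile:
        # Only this specific exception is treated as "corrupted";
        # anything else (e.g. PermissionError) propagates to the caller.
        return True


def remove_corrupted_xlsx(directory, dry_run=True):
    """Return the corrupted .xlsx paths; delete them only if dry_run is False."""
    corrupted = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not name.lower().endswith(".xlsx") or not os.path.isfile(path):
            continue
        if is_corrupted_xlsx(path):
            corrupted.append(path)
            if not dry_run:
                os.remove(path)
    return corrupted
```

Calling remove_corrupted_xlsx(folder) first with the default dry_run=True lets you review the list before anything is deleted.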
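Scenario: I created three identical Excel files in a directory named xlsx_folder and want to combine them all into a single data frame. For listing the files, I suggest using glob rather than the os module.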
import os  # for deleting corrupted files
import glob  # to list files of a specific type
import pandas as pd

# list all the .xlsx files in the directory
print(glob.glob("xlsx_folder/*.xlsx"))
Output:
['xlsx_folder\file1 - Copy (2).xlsx',
'xlsx_folder\file1 - Copy.xlsx',
'xlsx_folder\file1.xlsx',
'xlsx_folder\~$file1.xlsx']
Note: on Windows, while an Excel file is open, Excel creates a temporary file whose name is prefixed with ~$ (in this case I treat it as a corrupted file).
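If you would rather skip those ~$ files than treat them as corrupted, one option is to filter them out when listing the directory. This is a small sketch using only the standard library; the helper name list_workbooks is hypothetical.

```python
import glob
import os


def list_workbooks(directory):
    """List .xlsx files in a directory, skipping Excel's '~$' temporary files."""
    # glob.escape guards against special characters in the directory name
    pattern = os.path.join(glob.escape(directory), "*.xlsx")
    return sorted(
        path
        for path in glob.glob(pattern)
        if not os.path.basename(path).startswith("~$")
    )
```

This way a workbook that is merely open in Excel never reaches the code that deletes files.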
Now you can read all the files in the directory and build a single data frame as follows:
overall_df = []
for f in glob.glob("xlsx_folder/*.xlsx"):
    try:
        overall_df.append(pd.read_excel(f))  # if there is an encoding error, handle it here
    except Exception as err:
        print(f"Unable to Read: {f}.\n{err}")  # use str.format if not familiar with f-strings
        # delete the file with os.remove
        # os.remove(f)

overall_df = pd.concat(overall_df, ignore_index=True)
This prints a warning such as:
Unable to Read: xlsx_folder\~$file1.xlsx.
[Errno 13] Permission denied: 'xlsx_folder\~$file1.xlsx'
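Because a permission error like this one usually means the file is in use rather than damaged, it may be safer to keep the two cases apart before deleting anything. The sketch below is one way to do that; load_all and its loader parameter are hypothetical names, and the loader is whatever function you use to read a workbook.

```python
from zipfile import BadZipFile


def load_all(paths, loader):
    """Load every path with `loader`, separating corrupted files from skipped ones."""
    loaded, corrupted, skipped = [], [], []
    for path in paths:
        try:
            loaded.append(loader(path))
        except BadZipFile:
            # Not a valid zip container: a candidate for deletion
            corrupted.append(path)
        except OSError as err:
            # PermissionError and similar: file is locked or unreadable, leave it alone
            skipped.append((path, err))
    return loaded, corrupted, skipped
```

With pandas you might pass something like lambda p: pd.read_excel(p, engine='openpyxl') as the loader. Which exception a corrupted file actually raises can depend on the pandas version and on whether an engine is specified, so check what you observe before relying on BadZipFile alone.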
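If you are still getting the error that BadZipFile is not defined:
The exception class BadZipFile lives in the zipfile module, so all you need is an import statement such as:
from zipfile import BadZipFile
You should then be able to handle the exception.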