Remove corrupted xlsx files from a given directory

Some .xlsx files in a particular directory are corrupted: when I try to open one of the workbooks, Windows shows the following message:

Excel cannot open the file 'filename.xlsx' because the file format or file extension is not valid. Verify that the file has not been corrupted and that the file extension matches the format of the file.

I would like to know whether these corrupted files can be detected and removed from the directory.

My attempt:

############### paths and file names ##########
import os
import openpyxl
import pandas as pd

path_reportes = os.path.join(os.getcwd(), 'Reports', 'xlsx_folder')
file_names = os.listdir(path_reportes)
overall_df = dict()

############## concatenate all reports ##################

for file_name in file_names:

    data_file_path = os.path.join(path_reportes, file_name)
    # Try to open each spreadsheet, re-save it, and store its sheets in a
    # dictionary keyed by sheet name. If the file is corrupted, remove it
    # from the folder instead.
    try:
        # Start by opening the spreadsheet and selecting the main sheet
        workbook = openpyxl.load_workbook(filename=data_file_path)
        sheet = workbook.active

        # Save the spreadsheet
        workbook.save(filename=data_file_path)
        df_report_dict = pd.read_excel(data_file_path, sheet_name=None, engine='openpyxl')

        for key in df_report_dict:
            df_report_dict[key]['report_name'] = file_name

            # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
            if key in overall_df:
                overall_df[key] = pd.concat([overall_df[key], df_report_dict[key]], ignore_index=True)
            else:
                overall_df[key] = df_report_dict[key]

    # when the file is corrupted, remove it from the folder
    except BadZipFile:
        os.remove(data_file_path)
            

This raises the following error:

NameError: name 'BadZipFile' is not defined

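Is it possible to detect the corrupted files? How should I handle them?

Which exception do you get when you try to load a corrupted Excel file? Run that experiment, then write a try-except block that handles exactly that condition.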

try:
    df = pd.read_excel(filename)  # load the pandas DataFrame

except CorruptedExcelFile:  # placeholder: use the exception you actually observed
    os.remove(filename)

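Judging from the post you cite, the failure happens while the file is being unzipped, so the appropriate exception is BadZipFile. Use it in the except clause. You probably want to restrict the handler to that specific exception, because the consequence of catching it is that the offending file is deleted.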
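
As a minimal sketch of that advice, the function below deletes a file only when BadZipFile is raised and lets every other exception propagate. The name remove_corrupted_xlsx is hypothetical, and it probes each file with the standard-library zipfile module as a stand-in for openpyxl.load_workbook, on the assumption that both fail with BadZipFile on the same files (an .xlsx file is a zip archive).

```python
import os
import zipfile


def remove_corrupted_xlsx(directory):
    """Delete .xlsx files in `directory` that are not valid zip archives.

    Hypothetical helper: returns the list of paths that were removed.
    """
    removed = []
    for name in sorted(os.listdir(directory)):
        if not name.lower().endswith(".xlsx"):
            continue
        path = os.path.join(directory, name)
        try:
            # Assumption: opening the file as a zip archive fails on the same
            # files that openpyxl.load_workbook would reject with BadZipFile.
            with zipfile.ZipFile(path) as archive:
                archive.namelist()
        except zipfile.BadZipFile:
            # Only this exception triggers deletion; anything else propagates,
            # so a locked or unreadable file is never removed by accident.
            os.remove(path)
            removed.append(path)
    return removed
```

Calling remove_corrupted_xlsx(path_reportes) before the concatenation loop would leave only files that can at least be opened as archives. A file can still be a valid zip and an invalid workbook, so this check does not catch every kind of damage.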

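Scenario: I created three identical Excel files in a directory named xlsx_folder and want to combine them all into a single data frame. For listing the files, I suggest glob rather than os.listdir.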

import os   # for deleting corrupted file
import glob # to list out a specific file type
import pandas as pd

# list every .xlsx file in the directory
print(glob.glob("xlsx_folder/*.xlsx"))

Output:

['xlsx_folder\\file1 - Copy (2).xlsx',
 'xlsx_folder\\file1 - Copy.xlsx',
 'xlsx_folder\\file1.xlsx',
 'xlsx_folder\\~$file1.xlsx']

Note: on Windows, while an Excel file is open, Excel creates a temporary lock file whose name starts with ~$. In this example I treat that file as the corrupted one.
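
Outside of a demonstration you would usually skip the ~$ lock files rather than treat them as corrupted, since they belong to a workbook that is currently open. A minimal sketch of that filter, where the name list_workbooks is hypothetical:

```python
import glob
import os


def list_workbooks(folder):
    """Return the .xlsx paths in `folder`, excluding Excel's ~$ lock files."""
    pattern = os.path.join(folder, "*.xlsx")
    return sorted(
        path
        for path in glob.glob(pattern)
        # Lock files are named after the open workbook with a "~$" prefix.
        if not os.path.basename(path).startswith("~$")
    )
```

Iterating over list_workbooks("xlsx_folder") instead of the raw glob result keeps lock files out of both the read step and any deletion step.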

Now you can read every file in the directory and build a single data frame as follows:

frames = []
for f in glob.glob("xlsx_folder/*.xlsx"):
    try:
        frames.append(pd.read_excel(f))  # if there is an encoding error, handle it here
    except Exception as err:  # broad on purpose; narrow it before enabling deletion
        print(f"Unable to Read: {f}.\n{err}")  # use str.format() if not familiar with f-strings
        # delete the file with os.remove
        # os.remove(f)

# combine everything that could be read into one data frame
overall_df = pd.concat(frames, ignore_index=True)

This prints a warning such as:

Unable to Read: xlsx_folder\~$file1.xlsx.
[Errno 13] Permission denied: 'xlsx_folder\\~$file1.xlsx'

If you still get the error that BadZipFile is not defined:

The exception class BadZipFile lives in the zipfile module, so all you need is an import statement such as:

from zipfile import BadZipFile

After that you should be able to handle the exception.