Exclude Excel rows while importing to Python based on keyword

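Requirements:

  1. Merge all XLS files in a directory into one XLSX sheet
  2. Include only a few columns (identified by column position, e.g. A, F, G)
  3. Because there is too much data, I also need to exclude some rows (determined by a few keywords, e.g. "Category" and "Owner", found in several columns)

I need help with point 3.

Here is the current code.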

import pandas as pd
import glob

path=r'C:\Users\user.name\Documents\TEST'
files_xls = glob.glob(path + "/*.xls")

df = pd.DataFrame()

for f in files_xls:
    data = pd.read_excel(f, usecols="A,F,G,H,I,L,M,Q")
    df = df.append(data)

df.to_excel("CombinedTest.xlsx")

Error message:

C:\Users\user.name\AppData\Local\Programs\Python\Python37\python.exe "C:/Users/user.name/Documents/TEST/Combine XLS.py"

Traceback (most recent call last):
  File "C:/Users/user.name/Documents/TEST/Combine XLS.py", line 14, in <module>
    df.to_excel("CombinedTest.xlsx")
  File "C:\Users\user.name\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\generic.py", line 2181, in to_excel
    engine=engine,
  File "C:\Users\user.name\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\formats\excel.py", line 719, in write
    f"This sheet is too large! Your sheet size is: {num_rows}, {num_cols} "
ValueError: This sheet is too large! Your sheet size is: 1233080, 8 Max sheet size is: 1048576, 16384

Process finished with exit code 1

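It is easier, and probably faster, to do the filtering as a post-processing step: if you filter while reading, you are still growing the DataFrame iteratively, which is inefficient.

To filter, add the following code after the for loop: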
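On the point about growing a DataFrame iteratively, here is a minimal sketch of one common alternative: collect the per-file frames and concatenate them once with `pd.concat`. The helper name `combine_frames` and the in-memory sample frames are hypothetical and stand in for the results of `pd.read_excel`.

```python
import pandas as pd


def combine_frames(frames):
    """Concatenate an iterable of DataFrames with a single pd.concat call."""
    frames = list(frames)
    if not frames:
        # pd.concat raises on an empty list, so return an empty frame instead
        return pd.DataFrame()
    # ignore_index=True gives the result a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)


# Hypothetical in-memory stand-ins for the frames read from each XLS file
parts = [
    pd.DataFrame({"A": [1, 2], "F": ["x", "y"]}),
    pd.DataFrame({"A": [3], "F": ["z"]}),
]
combined = combine_frames(parts)
```

With the real files, the same helper could be called as `combine_frames(pd.read_excel(f, usecols="A,F,G,H,I,L,M,Q") for f in files_xls)`.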

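drop_list = ['Category', 'Owner']
# Cells equal to one of the keywords are replaced by NaN under this mask
df = df[~df.isin(drop_list)]
# Drop every row containing a NaN (this also removes rows that already had missing values)
df = df.dropna()
df.reset_index(drop=True, inplace=True)

df.to_excel("CombinedTest.xlsx")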
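If rows with legitimately empty cells should be kept, a possible variant is to build a row-wise boolean mask instead of masking cells and then calling `dropna()`. This is only a sketch: the helper name `drop_keyword_rows`, its `columns` parameter, and the sample data are assumptions made for illustration. Note that `isin` matches whole cell values exactly, not substrings.

```python
import pandas as pd


def drop_keyword_rows(df, keywords, columns=None):
    """Return df without the rows in which any checked cell equals a keyword."""
    # Optionally restrict the check to a subset of columns
    checked = df if columns is None else df[columns]
    # True for each row that has at least one cell equal to a keyword
    hit = checked.isin(keywords).any(axis=1)
    return df[~hit].reset_index(drop=True)


# Hypothetical sample data; the None cell is a genuinely missing value
sample = pd.DataFrame({
    "Type": ["Category", "fruit", "veg", None],
    "Name": ["Owner", "apple", "Owner", "pear"],
})
cleaned = drop_keyword_rows(sample, ["Category", "Owner"])
```

Here the row with the missing `Type` survives, because only rows containing a keyword are removed. Passing `columns=["Type"]` would limit the check to that one column.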

Hope this helps :)