Exclude Excel rows while importing to Python based on keyword

Question

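Requirements:

1. Combine all XLS files in a directory into a single XLSX sheet
2. Include only a few columns (identified by column position, e.g. A, F, G)
3. Because the data set is too large, exclude certain rows (identified by a few keywords, e.g. "Category", "Owner", appearing in a few columns)

I need help with point 3.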

Here is the current code.

import pandas as pd
import glob

path=r'C:\Users\user.name\Documents\TEST'
files_xls = glob.glob(path + "/*.xls")

df = pd.DataFrame()

for f in files_xls:
    data = pd.read_excel(f, usecols="A,F,G,H,I,L,M,Q")
    df = df.append(data)

df.to_excel("CombinedTest.xlsx")

Error message:

C:\Users\user.name\AppData\Local\Programs\Python\Python37\python.exe "C:/Users/user.name/Documents/TEST/Combine XLS.py"

Traceback (most recent call last):
  File "C:/Users/user.name/Documents/TEST/Combine XLS.py", line 14, in <module>
    df.to_excel("CombinedTest.xlsx")
  File "C:\Users\user.name\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\generic.py", line 2181, in to_excel
    engine=engine,
  File "C:\Users\user.name\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\formats\excel.py", line 719, in write
    f"This sheet is too large! Your sheet size is: {num_rows}, {num_cols} "
ValueError: This sheet is too large! Your sheet size is: 1233080, 8 Max sheet size is: 1048576, 16384

Process finished with exit code 1

Answer 1

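It is easier, and probably faster, to do the filtering as a post-processing step: if you filter while reading, you end up growing the DataFrame iteratively, which is inefficient.

So you should use the following code after the for loop: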

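# Keywords that mark a row for exclusion
drop_list = ['Category', 'Owner']
# Replace every cell that equals one of the keywords with NaN
df = df[~df.isin(drop_list)]
# Drop every row that contains NaN (this also drops rows that already had empty cells)
df = df.dropna()
df.reset_index(drop=True, inplace=True)

df.to_excel("CombinedTest.xlsx")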
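
One caveat with masking cells and then calling dropna(): rows that already had empty cells are dropped as well. If that is not wanted, a row-level mask may be a better fit. The following is only a minimal sketch on made-up data; the column names "Type", "Person" and "Amount" and the helper drop_keyword_rows are hypothetical and not part of the original code.

```python
import pandas as pd

# Example keywords taken from the question
DROP_KEYWORDS = ["Category", "Owner"]


def drop_keyword_rows(frame, keywords, columns=None):
    """Return frame without the rows where any checked cell equals a keyword.

    columns is an optional list of column labels to check; by default
    every column is checked. (Hypothetical helper, for illustration only.)
    """
    checked = frame if columns is None else frame[columns]
    # True for each row in which at least one checked cell is a keyword
    has_keyword = checked.isin(keywords).any(axis=1)
    return frame.loc[~has_keyword].reset_index(drop=True)


# Made-up sample data; the last row has a genuinely empty cell
sample = pd.DataFrame({
    "Type": ["Category", "Sales", "Sales", "Ops"],
    "Person": ["Alice", "Owner", "Bob", None],
    "Amount": [1, 2, 3, 4],
})

filtered = drop_keyword_rows(sample, DROP_KEYWORDS, columns=["Type", "Person"])
print(filtered)
```

Here the two rows containing "Category" or "Owner" are removed, while the row with the empty "Person" cell is kept because it does not contain a keyword.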

Hope this helps :)
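
If memory becomes a concern, another option is to filter each file's rows before combining and to concatenate once at the end with pd.concat (DataFrame.append was removed in pandas 2.0). This is just a sketch under assumptions: combine_filtered and fits_in_one_sheet are hypothetical helpers, the frames are made-up stand-ins for what pd.read_excel would return, and reserving one row for the header is a conservative guess on my part.

```python
import pandas as pd

# Sheet row limit quoted in the error message from the question
MAX_XLSX_ROWS = 1_048_576


def combine_filtered(frames, keywords):
    """Filter each frame, then concatenate once instead of appending in a loop."""
    kept = []
    for frame in frames:
        # True for each row in which at least one cell equals a keyword
        has_keyword = frame.isin(keywords).any(axis=1)
        kept.append(frame.loc[~has_keyword])
    if not kept:
        return pd.DataFrame()
    return pd.concat(kept, ignore_index=True)


def fits_in_one_sheet(frame, header_rows=1):
    """Check the row count against the limit before calling to_excel."""
    return len(frame) + header_rows <= MAX_XLSX_ROWS


# Made-up frames standing in for the per-file results, e.g.
# frames = (pd.read_excel(f, usecols="A,F,G,H,I,L,M,Q") for f in files_xls)
first = pd.DataFrame({"x": ["Category", "a"], "y": [1, 2]})
second = pd.DataFrame({"x": ["b", "Owner"], "y": [3, 4]})

combined = combine_filtered([first, second], ["Category", "Owner"])
print(combined)
print(fits_in_one_sheet(combined))
```

Checking the row count first gives a chance to filter further, or to split the output, before to_excel raises the "sheet is too large" error.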
