Pandas read excel and skip cells with strikethrough

Question

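I have to process some xlsx files that I receive from an external source. Is there a more direct way to load an xlsx into pandas while also skipping the rows that contain strikethrough text?

Currently I have to do something like this: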

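import openpyxl
import pandas as pd

working_file = r"something.xlsx"

working_wb = openpyxl.load_workbook(working_file, data_only=True)
working_sheet = working_wb.active

# IDs of the rows whose column-B cell is struck through
empty = []

# Older openpyxl releases accepted iter_rows("B", row_offset=3); current ones
# take keyword bounds instead. Column B from row 4 down is assumed to be the
# equivalent range.
for row in working_sheet.iter_rows(min_row=4, min_col=2, max_col=2):
    for cell in row:
        if cell.font.strike:
            # the ID is stored in column 37 of the same row
            p_id = working_sheet.cell(row=cell.row, column=37).value
            empty.append(p_id)

df = pd.read_excel(working_file, skiprows=3)
df = df[~df["ID"].isin(empty)]
...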

This works, but only by going through each Excel sheet twice.

Answer 1

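In this situation I wouldn't use Pandas. Just use openpyxl, start from the end of the worksheet and delete rows as needed. Working backwards from the end of the worksheet means that deleting a row never shifts a row you still have to check, so you avoid side effects.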
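A minimal sketch of that backwards-deletion idea, not necessarily what the answerer had in mind. The function name `delete_struck_rows`, its parameters, and the demo sheet are made up for illustration; only `Worksheet.cell`, `Worksheet.delete_rows`, `max_row` and `Font(strike=...)` come from openpyxl.

```python
from openpyxl import Workbook
from openpyxl.styles import Font


def delete_struck_rows(sheet, check_col=2, first_row=1):
    """Delete every row whose cell in `check_col` (1-based) is struck through.

    Rows are visited from the bottom up, so removing a row only shifts rows
    that have already been inspected.
    """
    deleted = 0
    for row_idx in range(sheet.max_row, first_row - 1, -1):
        cell = sheet.cell(row=row_idx, column=check_col)
        if cell.font.strike:
            sheet.delete_rows(row_idx)
            deleted += 1
    return deleted


def build_demo_sheet():
    """Hypothetical in-memory sheet: rows 3 and 4 are struck through in column B."""
    wb = Workbook()
    ws = wb.active
    ws.append(["ID", "Name"])
    ws.append([1, "keep"])
    ws.append([2, "drop"])
    ws.append([3, "drop too"])
    ws.append([4, "keep"])
    ws["B3"].font = Font(strike=True)
    ws["B4"].font = Font(strike=True)
    return ws


if __name__ == "__main__":
    ws = build_demo_sheet()
    print(delete_struck_rows(ws), "rows deleted")
    for row in ws.iter_rows():
        print([cell.value for cell in row])
```

With a real file you would load it with `openpyxl.load_workbook`, call the function on the relevant sheet, and then either save the workbook or read the remaining rows.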

Answer 2

I ended up subclassing pd.ExcelFile and _OpenpyxlReader. It was easier than I thought :)

from typing import List

import pandas as pd
from pandas._typing import Scalar
from pandas.io.excel._odfreader import _ODFReader
from pandas.io.excel._openpyxl import _OpenpyxlReader
from pandas.io.excel._xlrd import _XlrdReader

# Note: these are private pandas internals. The class names and the
# get_sheet_data signature match the pandas version this was written
# against and may differ in other releases.

class CustomReader(_OpenpyxlReader):
    def get_sheet_data(self, sheet, convert_float: bool) -> List[List[Scalar]]:
        data = []
        for row in sheet.rows:
            first = row[1]  # I need the strikethrough check on this cell only
            if first.value is not None and first.font.strike:
                continue  # skip the whole row
            data.append([self._convert_cell(cell, convert_float) for cell in row])
        return data

class CustomExcelFile(pd.ExcelFile):
    # route the "openpyxl" engine to the custom reader
    _engines = {"xlrd": _XlrdReader, "openpyxl": CustomReader, "odf": _ODFReader}

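With the custom classes in place, just pass the file as you would to a normal ExcelFile, specify openpyxl as the engine, and voilà! The rows with strikethrough cells are gone.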

excel = CustomExcelFile(r"excel_file_name.xlsx", engine="openpyxl")

df = excel.parse()

print(df)
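Because the subclass above depends on private pandas internals, here is a hedged alternative sketch that reaches the same result in a single pass using only public openpyxl and pandas calls. The names `sheet_to_frame` and `demo_sheet`, and the `header_row` and `check_col` parameters, are hypothetical; the sketch assumes the header sits on one row and that the workbook is loaded in the default (not read-only) mode.

```python
import pandas as pd
from openpyxl import Workbook
from openpyxl.styles import Font


def sheet_to_frame(sheet, header_row=1, check_col=2):
    """Build a DataFrame from `sheet` in one pass, skipping struck-through rows.

    `header_row` and `check_col` are 1-based, following openpyxl conventions.
    A row is skipped when its cell in `check_col` has a value and that value
    is formatted with strikethrough.
    """
    rows = sheet.iter_rows(min_row=header_row)
    header = [cell.value for cell in next(rows)]
    data = []
    for row in rows:
        marker = row[check_col - 1]
        if marker.value is not None and marker.font.strike:
            continue
        data.append([cell.value for cell in row])
    return pd.DataFrame(data, columns=header)


def demo_sheet():
    """Hypothetical in-memory sheet with two struck-through rows."""
    wb = Workbook()
    ws = wb.active
    ws.append(["ID", "Name"])
    ws.append([1, "keep"])
    ws.append([2, "drop"])
    ws.append([3, "drop too"])
    ws.append([4, "keep"])
    ws["B3"].font = Font(strike=True)
    ws["B4"].font = Font(strike=True)
    return ws


if __name__ == "__main__":
    print(sheet_to_frame(demo_sheet()))
```

For a file like the one in the question, where the header is on the fourth row, you would pass the sheet from `openpyxl.load_workbook(...)` together with `header_row=4`. Unlike `pd.read_excel`, this sketch does no type inference or date parsing beyond what openpyxl returns.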

python-3.x

pandas

openpyxl