How to detect the number of rows to skip when reading an Excel file with pandas

Question

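I want to read an .xlsx file with Python pandas. The problem is that the Excel file starts with some extra data, such as a title or description of the table, and the table contents only begin a few rows further down. This produces unnamed columns, because pandas treats that first row as the column header.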

A                              B                     C
this is description
last updated: Mar 18th,2014
                               Table content
Country                        Year                 Product_output
Canada                         2017                 3002
Bulgaria                       2016                 2201
...

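The table starts at row 4. The columns should be "Country", "Year", "Product_output" rather than "this is description", "unnamed", "unnamed". For this particular case, setting the skiprows parameter to 3 solved the problem (credit to Mikhail Venkov). But I have to process many Excel files and do not know in advance how many rows to skip. I think there may be a solution, since every table column header has a filter.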

Answer 1

If you know that a specific text (such as "Country") must appear in the first column, you can do the following:

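import xlrd  # note: xlrd 2.0+ reads only .xls; .xlsx needs xlrd < 2.0

xl_work = xlrd.open_workbook("Classeur1.xlsx")
mySheet = xl_work.sheet_by_index(0)

# Walk down column A until the cell holding the "Country" header is found.
# The nrows check stops the loop at the end of the sheet if it is missing.
nl = 0
while nl < mySheet.nrows and mySheet.cell_value(nl, 0) != "Country":
    nl += 1

line_with_headers = nl  # 0-based index of the header row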

Then pass nl to skiprows instead of 3.
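The same idea can be sketched with pandas alone, which avoids depending on xlrd. This is a minimal sketch, not a definitive implementation: `find_header_row` and `read_table` are hypothetical helper names, and it assumes the expected column labels are known and that the header sits within the first `preview_rows` rows of the sheet.

```python
def find_header_row(rows, expected):
    """Return the 0-based index of the first row containing every expected label."""
    wanted = {str(name).strip().lower() for name in expected}
    for index, row in enumerate(rows):
        cells = {str(cell).strip().lower() for cell in row}
        if wanted <= cells:
            return index
    raise ValueError(f"no row contains all of {sorted(wanted)}")


def read_table(path, expected, sheet_name=0, preview_rows=50):
    """Read an Excel sheet whose table starts at an unknown row (hypothetical helper)."""
    import pandas as pd  # imported here so find_header_row has no dependencies

    # First pass: header=None, so the description lines come back as plain data.
    preview = pd.read_excel(
        path, sheet_name=sheet_name, header=None, nrows=preview_rows
    )
    skip = find_header_row(preview.itertuples(index=False, name=None), expected)
    # Second pass: skip the rows above the header so pandas uses it for the columns.
    return pd.read_excel(path, sheet_name=sheet_name, skiprows=skip)
```

For the sample sheet in the question, `read_table("Classeur1.xlsx", ["Country", "Year"])` would skip the three description rows. Matching on several labels rather than one cell makes a false match on the description text less likely.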

Answer 2

I was looking for the same solution as you. In the meantime, I can make your code shorter and more efficient:

import pandas as pd

# header is the 0-based index of the row that holds the column names
file = pd.read_excel("Classeur1.xlsx", header=10)
file.head()

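Reading the file this way skips rows 0 through 9 and takes the column names from row 10 (counting from zero). For the sample sheet in the question, that would be header=3.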
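When the header position is not known in advance, the filter mentioned in the question might be used to find the value for `header`. The sketch below assumes the filter is a plain worksheet autofilter in an .xlsx file, that its range begins on the header row, and that openpyxl is installed. `header_index_from_filter_ref` and `read_filtered_table` are hypothetical helper names.

```python
import re


def header_index_from_filter_ref(ref):
    """Turn an autofilter range such as "A4:C20" into a 0-based header row index."""
    match = re.match(r"\$?[A-Za-z]+\$?(\d+)", ref)
    if match is None:
        raise ValueError(f"unrecognised range: {ref!r}")
    # Excel rows are 1-based, while the pandas header argument is 0-based.
    return int(match.group(1)) - 1


def read_filtered_table(path):
    """Read the first sheet, locating the header via its autofilter range."""
    import pandas as pd
    from openpyxl import load_workbook

    sheet = load_workbook(path).worksheets[0]
    ref = sheet.auto_filter.ref  # empty when the sheet has no autofilter
    if not ref:
        raise ValueError("sheet has no autofilter range")
    return pd.read_excel(path, header=header_index_from_filter_ref(ref))
```

If a sheet has no autofilter, this raises an error rather than guessing, so a text-based search for a known column name remains a useful fallback.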

python

excel

xlsx

tableau-api

pandas