Pandas read csv with repeating header rows

Question

I have a csv file with the following data:

    Col1    Col2    Col3
v1  5       9       5
v2  6       10      6
    Col1    Col2    Col3
x1  2       4       6
x2  1       2       10
x3  10      2       1
    Col1    Col2    Col3
y1  9       2       7

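That is, there are 3 different tables with the same header, stacked on top of each other. I am trying to get rid of the repeated header rows in a Pythonic way and get the following result: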

    Col1    Col2    Col3
v1  5       9       5
v2  6       10      6
x1  2       4       6
x2  1       2       10
x3  10      2       1
y1  9       2       7

I am not sure how to proceed.

Answer 1

You can read the data and then drop the rows that are identical to the column names:

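import pandas as pd

df = pd.read_csv('file.csv')

# keep only the rows where at least one value differs from the column names
df = df[df.ne(df.columns).any(axis=1)]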

Output:

   Col1 Col2 Col3
v1    5    9    5
v2    6   10    6
x1    2    4    6
x2    1    2   10
x3   10    2    1
y1    9    2    7
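
One caveat worth noting: because the repeated header rows are read in as data, every column comes back as strings, so a conversion step such as astype() is typically still needed afterwards. A minimal sketch, assuming a comma-separated file whose header starts with an empty field for the row labels (the sample text and the index_col=0 argument are assumptions for illustration, not taken from the question's actual file):

```python
import io

import pandas as pd

# Hypothetical sample mirroring the question's data. The comma-separated
# layout with an empty first header field is an assumption.
raw = (
    ",Col1,Col2,Col3\n"
    "v1,5,9,5\n"
    "v2,6,10,6\n"
    ",Col1,Col2,Col3\n"
    "x1,2,4,6\n"
    "x2,1,2,10\n"
    "x3,10,2,1\n"
    ",Col1,Col2,Col3\n"
    "y1,9,2,7\n"
)

df = pd.read_csv(io.StringIO(raw), index_col=0)

# Drop the rows whose values are identical to the column names
df = df[df.ne(df.columns).any(axis=1)]

# The repeated headers forced every column to be read as text,
# so convert the remaining values back to integers
df = df.astype(int)

print(df.dtypes)
```

If some columns may hold non-integer values, pd.to_numeric applied per column is a possible alternative to astype(int).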

Answer 2

Another solution is to first detect the repeated header rows and then use the skiprows=... argument in read_csv().

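This has the disadvantage of reading the data twice, but the advantage that it lets read_csv() parse the correct data types automatically, so you don't have to convert them with astype() afterwards.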

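This example uses a hard-coded name for the first column, but a more advanced version could determine the header from the first row and then detect the repeats.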

import pandas as pd

# read the file once to detect the repeated header rows
header_rows = []
header_start = "Col1"
with open('file.csv') as f:
    for i, line in enumerate(f):
        if line.startswith(header_start):
            header_rows.append(i)

# the first (real) header row should always be detected
assert header_rows[0] == 0

# skip all header rows except for the first one (the real one)
df = pd.read_csv('file.csv', skiprows=header_rows[1:])

Output:

   Col1 Col2 Col3
v1    5    9    5
v2    6   10    6
x1    2    4    6
x2    1    2   10
x3   10    2    1
y1    9    2    7
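
The more advanced version mentioned above, which takes the header from the first line instead of hard-coding it, might look roughly like this. The function name read_skipping_repeated_headers and the sample data are hypothetical, and the sketch assumes every record occupies exactly one physical line (no quoted fields containing newlines):

```python
import io

import pandas as pd


def read_skipping_repeated_headers(buffer, **read_csv_kwargs):
    """Read CSV text from a seekable buffer, skipping repeats of the first line.

    Hypothetical helper: the name and signature are illustrative only.
    """
    header = buffer.readline().rstrip("\r\n")
    # 0-based line numbers of every later line that equals the header
    repeats = [
        i
        for i, line in enumerate(buffer, start=1)
        if line.rstrip("\r\n") == header
    ]
    buffer.seek(0)  # rewind so read_csv sees the whole input again
    return pd.read_csv(buffer, skiprows=repeats, **read_csv_kwargs)


# Hypothetical sample in the same shape as the question's data
sample = (
    ",Col1,Col2,Col3\n"
    "v1,5,9,5\n"
    "v2,6,10,6\n"
    ",Col1,Col2,Col3\n"
    "x1,2,4,6\n"
    "x2,1,2,10\n"
    "x3,10,2,1\n"
    ",Col1,Col2,Col3\n"
    "y1,9,2,7\n"
)

df = read_skipping_repeated_headers(io.StringIO(sample), index_col=0)
print(df)
```

For a file on disk, the same helper should work with a handle opened in text mode, for example with open('file.csv') as f: df = read_skipping_repeated_headers(f).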

pandas

python-3.8