Split a dataframe with multiple header rows into unique dataframes

Question

I have some annoying csv files that contain multiple headers of different lengths, and they look something like this:

import pandas as pd

data = {'Line': ['0', '0', 'Line', '0', '0'], 
        'Date': ['8/25/2021', '8/25/2021', 'Date', '8/25/2021', '8/25/2021'], 
        'LibraryFile':['PSI_210825_G2_ASD4_F.LIB','PSI_210825_G2_ASD4_F.LIB',
                       'LibraryFile','PSI_210825_G2_ASD3.LIB','PSI_210825_G2_ASD3.LIB']}
df = pd.DataFrame(data)

which looks like:

   Line       Date               LibraryFile
0     0  8/25/2021  PSI_210825_G2_ASD4_F.LIB
1     0  8/25/2021  PSI_210825_G2_ASD4_F.LIB
2  Line       Date               LibraryFile
3     0  8/25/2021    PSI_210825_G2_ASD3.LIB
4     0  8/25/2021    PSI_210825_G2_ASD3.LIB

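Each "header" row has different column names after the LibraryFile column, so what I would like to do is split the file at each "Line" row and keep that row as the new header for the data below it. I looked into options using split functions but had no luck. Currently I am trying to use the LibraryFile column, which is unique to each block of data. I tried the pandas groupby function: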

grouped = df.groupby(df['LibraryFile'])
path_to_directory = 'filepath'
for lib in df['LibraryFile'].unique():
    temporary_df = grouped.get_group(lib)
    temporary_df.to_csv(f'filepath/temp.csv')

This gives me a chunk of the data, but I can't figure out from here how best to keep the "Line" row as the new header for each chunk.

I also tried numpy:

dfs = np.split(df, np.flatnonzero(df[0] == 'Line'))
print(*dfs, sep='\n\n')

But this just throws an error. Unfortunately, I am relearning Python after a long time away from it, so I'm sure there is a solution that I'm not aware of.

Answer 1

Here is a solution that splits the dataframe at each row where the column names occur again:

f = df.eq(df.columns)
groups = [g.reset_index(drop=True) for _, g in df[~f.iloc[:, 0]].groupby(f.cumsum()[~f.iloc[:, 0]].iloc[:, 0])]

Output:

>>> groups
[  Line       Date               LibraryFile
 0    0  8/25/2021  PSI_210825_G2_ASD4_F.LIB
 1    0  8/25/2021  PSI_210825_G2_ASD4_F.LIB,
   Line       Date             LibraryFile
 0    0  8/25/2021  PSI_210825_G2_ASD3.LIB
 1    0  8/25/2021  PSI_210825_G2_ASD3.LIB]
 
>>> groups[0]
  Line       Date               LibraryFile
0    0  8/25/2021  PSI_210825_G2_ASD4_F.LIB
1    0  8/25/2021  PSI_210825_G2_ASD4_F.LIB

>>> groups[1]
  Line       Date             LibraryFile
0    0  8/25/2021  PSI_210825_G2_ASD3.LIB
1    0  8/25/2021  PSI_210825_G2_ASD3.LIB
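
The one-liner above packs several steps together. As a rough sketch of the same idea spelled out step by step, assuming an embedded header row always repeats the first column's name in its first cell (the names `is_header`, `block_id` and `chunks` are mine, not from the answer):

```python
import pandas as pd

data = {'Line': ['0', '0', 'Line', '0', '0'],
        'Date': ['8/25/2021', '8/25/2021', 'Date', '8/25/2021', '8/25/2021'],
        'LibraryFile': ['PSI_210825_G2_ASD4_F.LIB', 'PSI_210825_G2_ASD4_F.LIB',
                        'LibraryFile', 'PSI_210825_G2_ASD3.LIB', 'PSI_210825_G2_ASD3.LIB']}
df = pd.DataFrame(data)

# True on rows whose first cell repeats the first column's name,
# i.e. the embedded header rows.
is_header = df.iloc[:, 0].eq(df.columns[0])

# Running count of header rows seen so far; each data row ends up
# labelled with the number of the block it belongs to.
block_id = is_header.cumsum()

# Drop the header rows, then group what is left by block number.
chunks = [g.reset_index(drop=True)
          for _, g in df[~is_header].groupby(block_id[~is_header])]

for chunk in chunks:
    print(chunk, end="\n\n")
```

Like the one-liner, this keeps the original column names on every chunk, so it fits the case where each embedded header row matches the first one. If the header rows differ from block to block, the matching header row would have to be promoted to the column names for each chunk separately.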

Answer 2

Below is a brute-force approach that I put together.

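I also used a snippet from @enke to get the row-header indices. I originally assumed that every row-header would have the value Line, so I commented that line out and used the snippet from @enke instead.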

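Note that I changed your data to make the split output easier to see. I changed row index 2 to contain DateX and LibraryFileX so that the newly applied header is visible, and I also changed the values in the Line column.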

"""
Extract row as header and split up df into chunks
"""

import pandas as pd

# Original data
#data = {'Line': ['0', '0', 'Line', '0', '0'], 'Date': ['8/25/2021', '8/25/2021', 'Date', '8/25/2021', '8/25/2021'],
#        'LibraryFile':['PSI_210825_G2_ASD4_F.LIB','PSI_210825_G2_ASD4_F.LIB','LibraryFile','PSI_210825_G2_ASD3.LIB','PSI_210825_G2_ASD3.LIB']}

# Changed 'Line' Values and the row-header values to show outputs better
data = {'Line': ['0', '1', 'Line', '2', '3'], 'Date': ['8/25/2021', '8/25/2021', 'DateX', '8/25/2021', '8/25/2021'],
        'LibraryFile':['PSI_210825_G2_ASD4_F.LIB','PSI_210825_G2_ASD4_F.LIB','LibraryFileX','PSI_210825_G2_ASD3.LIB','PSI_210825_G2_ASD3.LIB']}

# Create DataFrame.
df = pd.DataFrame(data)
# Print the output.
print("INPUT")
print(df)
print("")

FIELD_TO_SEARCH = 'Line'
# VALUE_TO_MATCH_FIELD = 'Line'
# header_indices = df.index[df[FIELD_TO_SEARCH] == VALUE_TO_MATCH_FIELD].tolist()
header_indices = df.index[pd.to_numeric(df[FIELD_TO_SEARCH], errors='coerce').isna()].tolist()
# Append the row count as a sentinel so the last chunk has a stopping point
header_indices.append(df.shape[0])

# Preallocate output df list with the first chunk (using existing headers).
list_of_dfs = [df.iloc[0:header_indices[0]]]

if len(header_indices) > 1:
    for idx in range(len(header_indices) - 1):
        # Extract new header
        header_index = header_indices[idx]
        next_header_index = header_indices[idx + 1]
        current_header = df.iloc[[header_index]].values.flatten().tolist()
    
        # Make a df from this chunk (copy so the original df is left untouched)
        current_df = df.iloc[header_index + 1:next_header_index].copy()
        # Apply the new header
        current_df.columns = current_header
        current_df = current_df.reset_index(drop=True)
        list_of_dfs.append(current_df)

# Show output
print("OUTPUT")
for df_index, current_df in enumerate(list_of_dfs):
    print("DF chunk index: {}".format(df_index))
    print(current_df)

Here is the output when I run it:

INPUT
   Line       Date               LibraryFile
0     0  8/25/2021  PSI_210825_G2_ASD4_F.LIB
1     1  8/25/2021  PSI_210825_G2_ASD4_F.LIB
2  Line      DateX              LibraryFileX
3     2  8/25/2021    PSI_210825_G2_ASD3.LIB
4     3  8/25/2021    PSI_210825_G2_ASD3.LIB

OUTPUT
DF chunk index: 0
  Line       Date               LibraryFile
0    0  8/25/2021  PSI_210825_G2_ASD4_F.LIB
1    1  8/25/2021  PSI_210825_G2_ASD4_F.LIB
DF chunk index: 1
  Line      DateX            LibraryFileX
0    2  8/25/2021  PSI_210825_G2_ASD3.LIB
1    3  8/25/2021  PSI_210825_G2_ASD3.LIB
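
Since the question mentions that the headers in the real files have different lengths, it may be easier to split the raw text before building any dataframe. The following is only a sketch under some assumptions: the file starts with a header row, every header row has `Line` in its first field, and each data row has as many fields as the header above it. The sample text, the `Extra` column and the function name `split_on_header_rows` are made up for illustration.

```python
import csv
import io

import pandas as pd

# Hypothetical file contents: the second header has one extra column.
raw_text = """Line,Date,LibraryFile
0,8/25/2021,PSI_210825_G2_ASD4_F.LIB
1,8/25/2021,PSI_210825_G2_ASD4_F.LIB
Line,Date,LibraryFile,Extra
2,8/25/2021,PSI_210825_G2_ASD3.LIB,a
3,8/25/2021,PSI_210825_G2_ASD3.LIB,b
"""


def split_on_header_rows(lines, marker="Line"):
    """Build one DataFrame per block; a block starts at a row whose first field equals `marker`."""
    blocks = []
    header, rows = None, []
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        if row[0] == marker:
            # A new header starts here: close the previous block, if any.
            if header is not None:
                blocks.append(pd.DataFrame(rows, columns=header))
            header, rows = row, []
        else:
            rows.append(row)
    # Close the final block.
    if header is not None:
        blocks.append(pd.DataFrame(rows, columns=header))
    return blocks


frames = split_on_header_rows(io.StringIO(raw_text))
for frame in frames:
    print(frame, end="\n\n")
```

For a real file, an open file handle (for example `open(path, newline="")`) could be passed in place of the `io.StringIO` object. All values come back as strings, so numeric columns would still need converting afterwards, for instance with `pd.to_numeric`.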
