CSV & Pandas:未命名列和 multi-index

CSV & Pandas: Unnamed columns and multi-index

我有一组数据:

,,England,,,,,,,,,,,,France,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,
,,Store 1,,,,Store 2,,,,Store 3,,,,Store 1,,,,Store 2,,,,Store 3,,,
,,,,,,,,,,,,,,,,,,,,,,,,,
,,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D
,,,,,,,,,,,,,,,,,,,,,,,,,
Year 1,M,0,5,7,9,2,18,5,10,4,9,6,2,4,14,18,11,10,19,18,20,3,17,19,13
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,0,13,14,11,0,6,8,6,2,12,14,9,9,17,12,18,6,17,16,14,0,4,2,5
,,,,,,,,,,,,,,,,,,,,,,,,,
Year 2,M,5,10,6,6,1,20,5,18,4,9,6,2,10,13,15,19,2,18,16,13,1,19,5,12
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,1,11,14,15,0,9,9,2,2,12,14,9,7,17,18,14,9,18,13,14,0,9,2,10
,,,,,,,,,,,,,,,,,,,,,,,,,
Evening,M,4,10,6,5,3,13,19,5,4,9,6,2,8,17,10,18,3,11,20,11,4,18,17,20
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,4,12,12,13,0,9,3,8,2,12,14,9,0,18,11,18,1,13,13,10,0,6,2,8

我想要实现的期望输出是:

我知道我可以读取 CSV 并删除任何 NaN 行:

df = pd.read_csv("Stores.csv",skipinitialspace=True)
df.dropna(how="all", inplace=True)

我的两个主要问题是:

  1. 如何对未命名的列进行分组,使它们只是国家“英格兰”和“法国”
  2. 如何设置索引,使 3 家商店都属于相关国家/地区?

我相信我可以对标题使用分层索引,但我遇到的所有示例都使用漂亮、干净的数据框,这与我的 CSV 不同。如果有人能指出正确的方向,我将不胜感激,因为我对 pandas.

还很陌生

谢谢。

您必须自己设置(多)索引和 headers:

df = pd.read_csv("Stores.csv", header=None)
df.dropna(how='all', inplace=True)
df.reset_index(inplace=True, drop=True)

# getting headers as a product of [England, France], [Store1, Store2, Store3] and [F, P, M, D]
headers = pd.MultiIndex.from_product([df.iloc[0].dropna().unique(),
                                    df.iloc[1].dropna().unique(),
                                    df.iloc[2].dropna().unique()])

df.drop([0, 1, 2], inplace=True) # removing header rows
df[0].ffill(inplace=True) # filling nan values for first index col 
df.set_index([0,1], inplace=True) # setting mulitiindex
df.columns = headers
print(df)

输出:

          England                                                  ...  France                                                
          Store 1             Store 2             Store 3          ... Store 1         Store 2             Store 3            
                F   P   M   D       F   P   M   D       F   P   M  ...       P   M   D       F   P   M   D       F   P   M   D
0       1                                                          ...                                                        
Year 1  M       0   5   7   9       2  18   5  10       4   9   6  ...      14  18  11      10  19  18  20       3  17  19  13
        F       0  13  14  11       0   6   8   6       2  12  14  ...      17  12  18       6  17  16  14       0   4   2   5
Year 2  M       5  10   6   6       1  20   5  18       4   9   6  ...      13  15  19       2  18  16  13       1  19   5  12
        F       1  11  14  15       0   9   9   2       2  12  14  ...      17  18  14       9  18  13  14       0   9   2  10
Evening M       4  10   6   5       3  13  19   5       4   9   6  ...      17  10  18       3  11  20  11       4  18  17  20
        F       4  12  12  13       0   9   3   8       2  12  14  ...      18  11  18       1  13  13  10       0   6   2   8

[6 rows x 24 columns]

你可以试试这个:

from io import StringIO
import pandas as pd
import numpy as np

test=StringIO(""",,England,,,,,,,,,,,,France,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,
,,Store 1,,,,Store 2,,,,Store 3,,,,Store 1,,,,Store 2,,,,Store 3,,,
,,,,,,,,,,,,,,,,,,,,,,,,,
,,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D
,,,,,,,,,,,,,,,,,,,,,,,,,
Year 1,M,0,5,7,9,2,18,5,10,4,9,6,2,4,14,18,11,10,19,18,20,3,17,19,13
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,0,13,14,11,0,6,8,6,2,12,14,9,9,17,12,18,6,17,16,14,0,4,2,5
,,,,,,,,,,,,,,,,,,,,,,,,,
Year 2,M,5,10,6,6,1,20,5,18,4,9,6,2,10,13,15,19,2,18,16,13,1,19,5,12
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,1,11,14,15,0,9,9,2,2,12,14,9,7,17,18,14,9,18,13,14,0,9,2,10
,,,,,,,,,,,,,,,,,,,,,,,,,
Evening,M,4,10,6,5,3,13,19,5,4,9,6,2,8,17,10,18,3,11,20,11,4,18,17,20
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,4,12,12,13,0,9,3,8,2,12,14,9,0,18,11,18,1,13,13,10,0,6,2,8""")

df = pd.read_csv(test, index_col=[0,1], header=[0,1,2], skiprows=lambda x: x%2 == 1)


df.columns = pd.MultiIndex.from_frame(df.columns
                                        .to_frame()
                                        .apply(lambda x: np.where(x.str.contains('Unnamed'), np.nan, x))\
                          .ffill())

df.index = pd.MultiIndex.from_frame(df.index.to_frame().ffill())

print(df)

输出:

0         England                                              ...  France                                            
1         Store 1             Store 2             Store 3      ... Store 1     Store 2             Store 3            
2               F   P   M   D       F   P   M   D       F   P  ...       M   D       F   P   M   D       F   P   M   D
0       1                                                      ...                                                    
Year 1  M       0   5   7   9       2  18   5  10       4   9  ...      18  11      10  19  18  20       3  17  19  13
        F       0  13  14  11       0   6   8   6       2  12  ...      12  18       6  17  16  14       0   4   2   5
Year 2  M       5  10   6   6       1  20   5  18       4   9  ...      15  19       2  18  16  13       1  19   5  12
        F       1  11  14  15       0   9   9   2       2  12  ...      18  14       9  18  13  14       0   9   2  10
Evening M       4  10   6   5       3  13  19   5       4   9  ...      10  18       3  11  20  11       4  18  17  20
        F       4  12  12  13       0   9   3   8       2  12  ...      11  18       1  13  13  10       0   6   2   8

[6 rows x 24 columns]