CSV & Pandas: Unnamed columns and multi-index
I have a set of data:
,,England,,,,,,,,,,,,France,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,
,,Store 1,,,,Store 2,,,,Store 3,,,,Store 1,,,,Store 2,,,,Store 3,,,
,,,,,,,,,,,,,,,,,,,,,,,,,
,,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D
,,,,,,,,,,,,,,,,,,,,,,,,,
Year 1,M,0,5,7,9,2,18,5,10,4,9,6,2,4,14,18,11,10,19,18,20,3,17,19,13
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,0,13,14,11,0,6,8,6,2,12,14,9,9,17,12,18,6,17,16,14,0,4,2,5
,,,,,,,,,,,,,,,,,,,,,,,,,
Year 2,M,5,10,6,6,1,20,5,18,4,9,6,2,10,13,15,19,2,18,16,13,1,19,5,12
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,1,11,14,15,0,9,9,2,2,12,14,9,7,17,18,14,9,18,13,14,0,9,2,10
,,,,,,,,,,,,,,,,,,,,,,,,,
Evening,M,4,10,6,5,3,13,19,5,4,9,6,2,8,17,10,18,3,11,20,11,4,18,17,20
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,4,12,12,13,0,9,3,8,2,12,14,9,0,18,11,18,1,13,13,10,0,6,2,8
The output I want has the columns grouped by country, then by store.
I know I can read the CSV and drop any all-NaN rows:
df = pd.read_csv("Stores.csv",skipinitialspace=True)
df.dropna(how="all", inplace=True)
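My two main questions are:
- How do I group the unnamed columns so that they simply fall under the countries "England" and "France"?
- How do I set up the index so that the 3 stores each belong to the relevant country?
I believe I can use a hierarchical index for the headers, but every example I have come across uses nice, clean dataframes, unlike my CSV. I would be very grateful if someone could point me in the right direction, as I am still quite new to pandas.
Thanks.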
You have to set the (multi)index and the headers yourself:
import pandas as pd

df = pd.read_csv("Stores.csv", header=None)
df.dropna(how='all', inplace=True)
df.reset_index(inplace=True, drop=True)
# build the headers as the product of [England, France], [Store 1, Store 2, Store 3] and [F, P, M, D]
headers = pd.MultiIndex.from_product([df.iloc[0].dropna().unique(),
                                      df.iloc[1].dropna().unique(),
                                      df.iloc[2].dropna().unique()])
df.drop([0, 1, 2], inplace=True)  # remove the header rows
# fill NaN values in the first index column; assign back, since an in-place ffill on a selected column may not update df
df[0] = df[0].ffill()
df.set_index([0, 1], inplace=True)  # set the MultiIndex
df.columns = headers
print(df)
Output:
England ... France
Store 1 Store 2 Store 3 ... Store 1 Store 2 Store 3
F P M D F P M D F P M ... P M D F P M D F P M D
0 1 ...
Year 1 M 0 5 7 9 2 18 5 10 4 9 6 ... 14 18 11 10 19 18 20 3 17 19 13
F 0 13 14 11 0 6 8 6 2 12 14 ... 17 12 18 6 17 16 14 0 4 2 5
Year 2 M 5 10 6 6 1 20 5 18 4 9 6 ... 13 15 19 2 18 16 13 1 19 5 12
F 1 11 14 15 0 9 9 2 2 12 14 ... 17 18 14 9 18 13 14 0 9 2 10
Evening M 4 10 6 5 3 13 19 5 4 9 6 ... 17 10 18 3 11 20 11 4 18 17 20
F 4 12 12 13 0 9 3 8 2 12 14 ... 18 11 18 1 13 13 10 0 6 2 8
[6 rows x 24 columns]
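One caveat with `from_product`: it only fits when every country has the same stores and every store has the same measures, as in this CSV. If the layout were ragged, a possible variant is to forward-fill the header rows left to right and build the columns with `MultiIndex.from_arrays`. The sketch below uses a made-up, shortened sample (England with two stores, France with one); the data and the names `raw`, `header_rows` and `body` are illustrative assumptions, not part of the original file.

```python
from io import StringIO

import pandas as pd

# Hypothetical, shortened sample in the same layout as Stores.csv:
# blank separator rows, England with two stores, France with one.
raw = StringIO(
    ",,England,,,,France,\n"
    ",,,,,,,\n"
    ",,Store 1,,Store 2,,Store 1,\n"
    ",,,,,,,\n"
    ",,F,P,F,P,F,P\n"
    ",,,,,,,\n"
    "Year 1,M,0,5,2,18,4,14\n"
    ",,,,,,,\n"
    ",F,0,13,0,6,9,17\n"
)

df = pd.read_csv(raw, header=None).dropna(how="all").reset_index(drop=True)

# Forward-fill the three header rows left to right so that every data
# column carries its own (country, store, measure) label.
header_rows = df.iloc[:3, 2:].ffill(axis=1)
columns = pd.MultiIndex.from_arrays(header_rows.to_numpy().tolist())

body = df.iloc[3:].copy()
body[0] = body[0].ffill()        # repeat "Year 1" on the F row
body = body.set_index([0, 1])
body.columns = columns
body = body.astype(int)          # values were read as strings
print(body)
```

Here `from_product` would have produced 2 x 2 x 2 = 8 labels for 6 data columns and failed on assignment, whereas the forward-filled labels follow the file as written.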
You can try this:
from io import StringIO
import pandas as pd
import numpy as np
test=StringIO(""",,England,,,,,,,,,,,,France,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,
,,Store 1,,,,Store 2,,,,Store 3,,,,Store 1,,,,Store 2,,,,Store 3,,,
,,,,,,,,,,,,,,,,,,,,,,,,,
,,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D
,,,,,,,,,,,,,,,,,,,,,,,,,
Year 1,M,0,5,7,9,2,18,5,10,4,9,6,2,4,14,18,11,10,19,18,20,3,17,19,13
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,0,13,14,11,0,6,8,6,2,12,14,9,9,17,12,18,6,17,16,14,0,4,2,5
,,,,,,,,,,,,,,,,,,,,,,,,,
Year 2,M,5,10,6,6,1,20,5,18,4,9,6,2,10,13,15,19,2,18,16,13,1,19,5,12
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,1,11,14,15,0,9,9,2,2,12,14,9,7,17,18,14,9,18,13,14,0,9,2,10
,,,,,,,,,,,,,,,,,,,,,,,,,
Evening,M,4,10,6,5,3,13,19,5,4,9,6,2,8,17,10,18,3,11,20,11,4,18,17,20
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,4,12,12,13,0,9,3,8,2,12,14,9,0,18,11,18,1,13,13,10,0,6,2,8""")
df = pd.read_csv(test, index_col=[0,1], header=[0,1,2], skiprows=lambda x: x%2 == 1)
df.columns = pd.MultiIndex.from_frame(
    df.columns
      .to_frame()
      .apply(lambda x: np.where(x.str.contains('Unnamed'), np.nan, x))
      .ffill()
)
df.index = pd.MultiIndex.from_frame(df.index.to_frame().ffill())
print(df)
Output:
0 England ... France
1 Store 1 Store 2 Store 3 ... Store 1 Store 2 Store 3
2 F P M D F P M D F P ... M D F P M D F P M D
0 1 ...
Year 1 M 0 5 7 9 2 18 5 10 4 9 ... 18 11 10 19 18 20 3 17 19 13
F 0 13 14 11 0 6 8 6 2 12 ... 12 18 6 17 16 14 0 4 2 5
Year 2 M 5 10 6 6 1 20 5 18 4 9 ... 15 19 2 18 16 13 1 19 5 12
F 1 11 14 15 0 9 9 2 2 12 ... 18 14 9 18 13 14 0 9 2 10
Evening M 4 10 6 5 3 13 19 5 4 9 ... 10 18 3 11 20 11 4 18 17 20
F 4 12 12 13 0 9 3 8 2 12 ... 11 18 1 13 13 10 0 6 2 8
[6 rows x 24 columns]
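Once the columns are a three-level MultiIndex, the stores really do "belong" to their country, and that can be checked by selecting on the levels. The sketch below builds a frame of the same shape with placeholder values (0 to 143, not the numbers from the CSV) so that it runs on its own; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["England", "France"], ["Store 1", "Store 2", "Store 3"], list("FPMD")]
)
index = pd.MultiIndex.from_product([["Year 1", "Year 2", "Evening"], ["M", "F"]])

# Placeholder values, only the shape matches the CSV in the question.
df = pd.DataFrame(np.arange(6 * 24).reshape(6, 24), index=index, columns=columns)

england = df["England"]                       # every England store, 12 columns
store1 = df.xs("Store 1", axis=1, level=1)    # Store 1 in both countries
cell = df.loc[("Year 1", "F"), ("France", "Store 2", "P")]  # a single value

# Sum the four measures per (country, store); transposing avoids
# grouping along axis=1, which newer pandas versions deprecate.
per_store = df.T.groupby(level=[0, 1]).sum().T

print(england.shape, store1.shape, cell)
print(per_store)
```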