Create a dictionary of multiple DataFrame splits from a large, uneven Pandas DataFrame
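I have a large, messy CSV file containing many NaN values, which I read into a DataFrame using pd.read_csv(file, names = range(int)). I want to split this data into multiple DataFrames and store them in a dictionary under keys taken from the data. I have prepared a simple example to explain my problem.
Raw data example: my data looks similar to the following, but with more columns and rows.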
import pandas as pd
import numpy as np
df = pd.DataFrame(columns=([1,2,3,4]))
df.loc[0,:] = ['Home -AA',np.nan,np.nan,np.nan]
df.loc[1,:] = ['place/time','value1','value2','value3']
df.loc[2,:] = ['Home time1',1, 2, 3]
df.loc[3,:] = ['Home time2',4, 5, 6]
df.loc[4,:] = ['Home time3',7, 8, 9]
df.loc[5,:] = ['sum',11,np.nan , np.nan]
df.loc[6,:] = ['agg',12,np.nan , np.nan]
df.loc[7,:] = ['max',6,np.nan , np.nan]
df.loc[8,:] = ['min',8,np.nan , np.nan]
df.loc[9,:] = ['med',1,np.nan , np.nan]
df.loc[10,:] = ['Home -BB',np.nan,np.nan,np.nan]
df.loc[11,:] = ['place/time','value1','value2','value3']
df.loc[12,:] = ['Home time1',11, 12, 13]
df.loc[13,:] = ['Home time2',14, 15, 16]
df.loc[14,:] = ['Home time3',17, 18, 19]
df.loc[15,:] = ['sum',101,np.nan , np.nan]
df.loc[16,:] = ['agg',122,np.nan , np.nan]
df.loc[17,:] = ['max',62,np.nan , np.nan]
df.loc[18,:] = ['min',83,np.nan , np.nan]
df.loc[19,:] = ['med',12,np.nan , np.nan]
df.loc[20,:] = ['Home -CC',np.nan,np.nan,np.nan]
df.loc[21,:] = ['place/time','value1','value2','value3']
df.loc[22,:] = ['Home -DD',np.nan,np.nan,np.nan]
df.loc[23,:] = ['place/time','value1','value2','value3']
df.loc[24,:] = ['Home -EE',np.nan,np.nan,np.nan]
df.loc[25,:] = ['place/time','value1','value2','value3']
df.loc[26,:] = ['Home -FF',np.nan,np.nan,np.nan]
df.loc[27,:] = ['place/time','value1','value2','value3']
df.loc[28,:] = ['Home time1',211, 212, 213]
df.loc[29,:] = ['Home time1',212, 213, 214]
df.loc[30,:] = ['sum',115,np.nan , np.nan]
df.loc[31,:] = ['agg',124,np.nan , np.nan]
df.loc[32,:] = ['max',65,np.nan , np.nan]
df.loc[33,:] = ['min',85,np.nan , np.nan]
df.loc[34,:] = ['med',16,np.nan , np.nan]
Desired result: I want to convert this DataFrame into multiple DataFrames, keyed by house name, and store them in a dictionary dict1. (Example result)
df1 = pd.DataFrame(columns=([1,2,3,4]))
df1.loc[1,:] = ['place/time','value1','value2','value3']
df1.loc[2,:] = ['Home time1',1, 2, 3]
df1.loc[3,:] = ['Home time2',4, 5, 6]
df1.loc[4,:] = ['Home time3',7, 8, 9]
df2 = pd.DataFrame(columns=([1,2,3,4]))
df2.loc[11,:] = ['place/time','value1','value2','value3']
df2.loc[12,:] = ['Home time1',11, 12, 13]
df2.loc[13,:] = ['Home time2',14, 15, 16]
df2.loc[14,:] = ['Home time3',17, 18, 19]
df3 = pd.DataFrame(columns=([1,2,3,4]))
df3.loc[21,:] = ['place/time','value1','value2','value3']
df4 = pd.DataFrame(columns=([1,2,3,4]))
df4.loc[23,:] = ['place/time','value1','value2','value3']
df5 = pd.DataFrame(columns=([1,2,3,4]))
df5.loc[25,:] = ['place/time','value1','value2','value3']
df6 = pd.DataFrame(columns=([1,2,3,4]))
df6.loc[27,:] = ['place/time','value1','value2','value3']
df6.loc[28,:] = ['Home time1',211, 212, 213]
df6.loc[29,:] = ['Home time1',212, 213, 214]
dict1 = {'House -AA':df1, 'House -BB': df2,'House -CC': df3 , 'House -DD':df4, 'House -EE':df5, 'House -FF':df6}
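I wrote some code using a for loop, but I could not split all the DataFrames correctly. If I do not break out of the loop, I get an error (list index out of range). Can you help me get a result similar to the one described above?
My attempt: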
namesplit = lambda x: x.split('-')[0]
postion = 'Home '
rawname = []
for i in df[1]:
    x = namesplit(i)
    if postion == x:
        rawname.append(i)
test = {}
for i in range(len(rawname)):
    x = df[df[1]==rawname[i]].index.values
    y = df[df[1]==rawname[i+1]].index.values
    if y == len(df) -9:
        break
    df_1 = df.iloc[x[0]:y[0], :]
    test[rawname[i]] = df_1
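You can do this by iterating over the whole DataFrame and starting a new, smaller DataFrame at each separator row. It is brute force, but it works.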
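results = {}
for i, row in df.iterrows():
    if "Home -" in row[1]:
        # A separator row starts a new, empty DataFrame under its own key
        key = row[1]
        results[key] = pd.DataFrame(columns=[1, 2, 3, 4])
    else:
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        results[key] = pd.concat([results[key], row.to_frame().T])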
Output:
In [9]: results
Out[9]:
{'Home -AA': 1 2 3 4
1 place/time value1 value2 value3
2 Home time1 1 2 3
3 Home time2 4 5 6
4 Home time3 7 8 9
5 sum 11 NaN NaN
6 agg 12 NaN NaN
7 max 6 NaN NaN
8 min 8 NaN NaN
9 med 1 NaN NaN,
'Home -BB': 1 2 3 4
11 place/time value1 value2 value3
12 Home time1 11 12 13
13 Home time2 14 15 16
14 Home time3 17 18 19
15 sum 101 NaN NaN
16 agg 122 NaN NaN
17 max 62 NaN NaN
18 min 83 NaN NaN
19 med 12 NaN NaN,
'Home -CC': 1 2 3 4
21 place/time value1 value2 value3,
'Home -DD': 1 2 3 4
23 place/time value1 value2 value3,
'Home -EE': 1 2 3 4
25 place/time value1 value2 value3,
'Home -FF': 1 2 3 4
27 place/time value1 value2 value3
28 Home time1 211 212 213
29 Home time1 212 213 214
30 sum 115 NaN NaN
31 agg 124 NaN NaN
32 max 65 NaN NaN
33 min 85 NaN NaN
34 med 16 NaN NaN}
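The output above still contains the sum/agg/max/min/med rows, which the desired result in the question leaves out. As a rough sketch only (not part of the original answer), the same row-by-row idea can collect rows in plain lists, skip those summary labels, and build each DataFrame once at the end. The small `df` below is a hypothetical stand-in with the same layout as the question's data, and `SUMMARY_LABELS` and `collected` are names made up for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's frame (same layout, fewer rows).
df = pd.DataFrame(
    [
        ["Home -AA", np.nan, np.nan, np.nan],
        ["place/time", "value1", "value2", "value3"],
        ["Home time1", 1, 2, 3],
        ["sum", 11, np.nan, np.nan],
        ["Home -BB", np.nan, np.nan, np.nan],
        ["place/time", "value1", "value2", "value3"],
        ["Home -CC", np.nan, np.nan, np.nan],
        ["place/time", "value1", "value2", "value3"],
        ["Home time1", 211, 212, 213],
    ],
    columns=[1, 2, 3, 4],
)

# Assumed list of summary labels to drop, taken from the sample data.
SUMMARY_LABELS = {"sum", "agg", "max", "min", "med"}

collected = {}
key = None
for idx, row in df.iterrows():
    label = row[1]
    if label.startswith("Home -"):
        # Separator row: open a new bucket under its own name.
        key = label
        collected[key] = []
    elif key is not None and label not in SUMMARY_LABELS:
        # Ordinary row: keep it in the current bucket.
        collected[key].append(row)

# Build each DataFrame once; the original row labels are preserved.
results = {
    name: pd.DataFrame(rows, columns=df.columns)
    for name, rows in collected.items()
}
```

Appending to a list and constructing each frame once avoids re-copying a growing DataFrame on every row, which may matter on a large file.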
The list index out of range error occurs because y in the loop uses the (i+1)-th value of the list rawname. So you only want to loop up to len(rawname)-1, as follows:
test = {}
for i in range(len(rawname)-1):
    x = df[df[1]==rawname[i]].index.values
    y = df[df[1]==rawname[i+1]].index.values
    df_1 = df.iloc[x[0]:y[0], :]
    test[rawname[i]] = df_1
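One thing to keep in mind with the loop above: because it stops at len(rawname)-1, the last name in rawname never gets an entry in test. A possible way around that, sketched here under assumptions rather than taken from the answer, is to compute the positions of all separator rows and add len(df) as a closing boundary. The small `df` is a hypothetical stand-in for the question's data, `starts` and `bounds` are illustrative names, and unlike the loop above this version leaves the separator row itself out of each slice.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's frame (same layout, fewer rows).
df = pd.DataFrame(
    [
        ["Home -AA", np.nan, np.nan, np.nan],
        ["place/time", "value1", "value2", "value3"],
        ["Home time1", 1, 2, 3],
        ["sum", 11, np.nan, np.nan],
        ["Home -BB", np.nan, np.nan, np.nan],
        ["place/time", "value1", "value2", "value3"],
        ["Home -CC", np.nan, np.nan, np.nan],
        ["place/time", "value1", "value2", "value3"],
        ["Home time1", 211, 212, 213],
    ],
    columns=[1, 2, 3, 4],
)

# Integer positions of the separator rows, plus len(df) as the final boundary.
starts = np.flatnonzero(df[1].str.startswith("Home -").to_numpy())
bounds = list(starts) + [len(df)]

test = {}
for start, stop in zip(bounds[:-1], bounds[1:]):
    # Skip the separator row itself and keep everything up to the next one.
    test[df[1].iat[start]] = df.iloc[start + 1 : stop]
```

Pairing each boundary with the next one means the final block ends at the end of the frame, so no break or special case is needed.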
You can simply use groupby and cumsum:
result = {}
for _, i in df.groupby(df[1].str.startswith("Home -").cumsum()):
    name, d = i[1].iat[0], i.iloc[1:]
    result[name] = d[~d[1].isin(["sum","agg","max","min","med"])]
print(result)
{'Home -AA': 1 2 3 4
1 place/time value1 value2 value3
2 Home time1 1 2 3
3 Home time2 4 5 6
4 Home time3 7 8 9,
'Home -BB': 1 2 3 4
11 place/time value1 value2 value3
12 Home time1 11 12 13
13 Home time2 14 15 16
14 Home time3 17 18 19,
'Home -CC': 1 2 3 4
21 place/time value1 value2 value3,
'Home -DD': 1 2 3 4
23 place/time value1 value2 value3,
'Home -EE': 1 2 3 4
25 place/time value1 value2 value3,
'Home -FF': 1 2 3 4
27 place/time value1 value2 value3
28 Home time1 211 212 213
29 Home time1 212 213 214}
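To see why groupby with cumsum splits the frame at the right places, it may help to look at the grouping key on its own. The sketch below uses a short, made-up Series of labels (`labels`, `is_separator` and `group_id` are illustrative names): each separator row is True, and the running sum of those flags gives every row the number of the block it belongs to.

```python
import pandas as pd

# Made-up labels in the same style as column 1 of the question's data.
labels = pd.Series(["Home -AA", "place/time", "Home time1", "Home -BB", "place/time"])

# True on separator rows, False elsewhere.
is_separator = labels.str.startswith("Home -")

# The cumulative sum increases by one at each separator,
# so all rows of one block share the same integer.
group_id = is_separator.cumsum()

print(group_id.tolist())  # [1, 1, 1, 2, 2]
```

Passing such a Series to df.groupby yields one sub-frame per block, with the separator as the first row of each group, which is why the answer reads the name from the first row and slices it off with iloc[1:].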