Inserting missing categories and dates in a pandas DataFrame
I have the following data frame. I want to add every score level (high, mid, low) for each group (a, b, c, d) on every date (there are two dates: 2020-06-01 and 2020-06-02).
import numpy as np
import pandas as pd
from itertools import product

x = pd.DataFrame(data={
    'date': ['2020-06-01', '2020-06-01', '2020-06-02', '2020-06-01', '2020-06-02', '2020-06-01', '2020-06-02', '2020-06-02', '2020-06-02'],
    'group': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'd'],
    'score': ['high', 'low', 'mid', 'low', 'high', 'high', 'high', 'mid', 'high'],
    'count': [12, 13, 2, 19, 22, 3, 4, 49, 12]})
I can add all the score categories for every group with the code below, but I have not been able to add the dates as well.
cats = ['high', 'mid','low']
x_re = pd.DataFrame(list(product(x['group'].unique(), cats)),columns=['group', 'score'])
x_re.merge(x, how='left').fillna(0)
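The expected output looks like this: each group has 6 rows, 3 rows per date, one for each score category. Wherever a data point is missing, count is filled with np.nan (zero would also be fine).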
pd.DataFrame(data={ 'date' : ['2020-06-01','2020-06-01','2020-06-01','2020-06-02','2020-06-02','2020-06-02','2020-06-01','2020-06-01','2020-06-01','2020-06-02','2020-06-02','2020-06-02','2020-06-01','2020-06-01','2020-06-01','2020-06-02','2020-06-02','2020-06-02','2020-06-01','2020-06-01','2020-06-01','2020-06-02','2020-06-02','2020-06-02'],
'group' : ['a','a','a','a','a','a','b','b','b','b','b','b','c','c','c','c','c','c','d','d','d','d','d','d'],
'score' : ['high','low','mid','high','low','mid','high','low','mid','high','low','mid','high','low','mid','high','low','mid','high','low','mid','high','low','mid'],
'count' : [12, 13, np.nan, np.nan, np.nan, 2, np.nan, 19, np.nan, 22, np.nan, np.nan, 3, np.nan, np.nan, 4, np.nan, 49, np.nan, np.nan, np.nan, 12, np.nan, np.nan]})
Any suggestions would be great, thanks.
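Your solution can be extended by adding the unique values of the date column to the product. This approach also works when the (date, group, score) triples in the input data are not unique: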
cats = ['high', 'mid', 'low']
x_re = pd.DataFrame(list(product(x['date'].unique(),
                                 x['group'].unique(),
                                 cats)), columns=['date', 'group', 'score'])
x = x_re.merge(x, how='left').fillna(0)
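To illustrate the point about non-unique triples, here is a small sketch that rebuilds the question's data, appends one duplicated row, and checks that the left merge keeps both copies. The names grid, x_dup and full are introduced here for illustration only, and filling just the count column (rather than the whole frame) is an assumption about what is wanted.

```python
from itertools import product

import pandas as pd

x = pd.DataFrame({
    'date': ['2020-06-01', '2020-06-01', '2020-06-02', '2020-06-01', '2020-06-02',
             '2020-06-01', '2020-06-02', '2020-06-02', '2020-06-02'],
    'group': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'd'],
    'score': ['high', 'low', 'mid', 'low', 'high', 'high', 'high', 'mid', 'high'],
    'count': [12, 13, 2, 19, 22, 3, 4, 49, 12],
})
cats = ['high', 'mid', 'low']
keys = ['date', 'group', 'score']

# Every (date, group, score) combination, one per row.
grid = pd.DataFrame(
    list(product(x['date'].unique(), x['group'].unique(), cats)),
    columns=keys,
)

# Duplicate the first row so the key triples are no longer unique.
x_dup = pd.concat([x, x.iloc[[0]]], ignore_index=True)

# The left merge keeps both copies of the duplicated triple.
full = grid.merge(x_dup, on=keys, how='left')
# Fill only the count column, so other columns would keep their NaN values.
full['count'] = full['count'].fillna(0).astype(int)

print(len(full))  # 24 combinations plus 1 duplicated row
```

Whether the duplicated rows should then be aggregated (for example summed per triple) depends on the data and is left open here.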
A solution that uses reindex with a three-level MultiIndex is similar:
cats = ['high', 'mid', 'low']
x_re = pd.MultiIndex.from_product([x['date'].unique(),
                                   x['group'].unique(),
                                   cats], names=['date', 'group', 'score'])
x = x.set_index(['date', 'group', 'score']).reindex(x_re).reset_index()
print(x)
          date  group  score  count
 0  2020-06-01      a   high   12.0
 1  2020-06-01      a    mid    NaN
 2  2020-06-01      a    low   13.0
 3  2020-06-01      b   high    NaN
 4  2020-06-01      b    mid    NaN
 5  2020-06-01      b    low   19.0
 6  2020-06-01      c   high    3.0
 7  2020-06-01      c    mid    NaN
 8  2020-06-01      c    low    NaN
 9  2020-06-01      d   high    NaN
10  2020-06-01      d    mid    NaN
11  2020-06-01      d    low    NaN
12  2020-06-02      a   high    NaN
13  2020-06-02      a    mid    2.0
14  2020-06-02      a    low    NaN
15  2020-06-02      b   high   22.0
16  2020-06-02      b    mid    NaN
17  2020-06-02      b    low    NaN
18  2020-06-02      c   high    4.0
19  2020-06-02      c    mid   49.0
20  2020-06-02      c    low    NaN
21  2020-06-02      d   high   12.0
22  2020-06-02      d    mid    NaN
23  2020-06-02      d    low    NaN
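If zeros are preferred over NaN, reindex accepts a fill_value argument, which saves a separate fillna step. The following is a minimal sketch that starts from the question's original x (before any reassignment); full_index and filled are names made up for this example. Note that, unlike the merge approach, reindex needs the (date, group, score) triples to be unique.

```python
import pandas as pd

x = pd.DataFrame({
    'date': ['2020-06-01', '2020-06-01', '2020-06-02', '2020-06-01', '2020-06-02',
             '2020-06-01', '2020-06-02', '2020-06-02', '2020-06-02'],
    'group': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'd'],
    'score': ['high', 'low', 'mid', 'low', 'high', 'high', 'high', 'mid', 'high'],
    'count': [12, 13, 2, 19, 22, 3, 4, 49, 12],
})
cats = ['high', 'mid', 'low']
keys = ['date', 'group', 'score']

# Full grid of keys; the score level comes from the explicit cats list.
full_index = pd.MultiIndex.from_product(
    [x['date'].unique(), x['group'].unique(), cats],
    names=keys,
)

# Missing combinations get count 0 instead of NaN.
filled = (x.set_index(keys)
           .reindex(full_index, fill_value=0)
           .reset_index())

print(filled.head())
```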
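It is also possible to use a single unstack call followed by a single stack call, but this requires every value in cats to appear at least once in the input data: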
x = (x.set_index(['date', 'group', 'score'])
      .unstack(['group', 'score'])
      .stack([1, 2], dropna=False)
      .reset_index())
print(x)
          date  group  score  count
 0  2020-06-01      a   high   12.0
 1  2020-06-01      a    low   13.0
 2  2020-06-01      a    mid    NaN
 3  2020-06-01      b   high    NaN
 4  2020-06-01      b    low   19.0
 5  2020-06-01      b    mid    NaN
 6  2020-06-01      c   high    3.0
 7  2020-06-01      c    low    NaN
 8  2020-06-01      c    mid    NaN
 9  2020-06-01      d   high    NaN
10  2020-06-01      d    low    NaN
11  2020-06-01      d    mid    NaN
12  2020-06-02      a   high    NaN
13  2020-06-02      a    low    NaN
14  2020-06-02      a    mid    2.0
15  2020-06-02      b   high   22.0
16  2020-06-02      b    low    NaN
17  2020-06-02      b    mid    NaN
18  2020-06-02      c   high    4.0
19  2020-06-02      c    low    NaN
20  2020-06-02      c    mid   49.0
21  2020-06-02      d   high   12.0
22  2020-06-02      d    low    NaN
23  2020-06-02      d    mid    NaN
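The caveat about cats shows up as soon as a category is missing from the input. In the sketch below (sub, wide and fixed are illustrative names), no row has the score 'low', so unstacking never creates a 'low' column and stacking back cannot restore it. Reindexing the columns against the explicit cats list is one possible way to add it.

```python
import pandas as pd

x = pd.DataFrame({
    'date': ['2020-06-01', '2020-06-01', '2020-06-02', '2020-06-01', '2020-06-02',
             '2020-06-01', '2020-06-02', '2020-06-02', '2020-06-02'],
    'group': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'd'],
    'score': ['high', 'low', 'mid', 'low', 'high', 'high', 'high', 'mid', 'high'],
    'count': [12, 13, 2, 19, 22, 3, 4, 49, 12],
})
cats = ['high', 'mid', 'low']

# Pretend the 'low' category never occurs in the input.
sub = x[x['score'] != 'low']

# Unstacking only creates columns for the categories present in the data.
wide = sub.set_index(['date', 'group', 'score'])['count'].unstack('score')
print(list(wide.columns))

# Reindexing against the explicit list adds an all-NaN 'low' column.
fixed = wide.reindex(columns=cats)
print(list(fixed.columns))
```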
When you do not have too many levels, a simple approach is to chain unstack/stack:
(x.set_index(['date', 'group', 'score'])
  .unstack('group').stack(dropna=False)
  .unstack('score').stack(dropna=False)
  .reset_index()
)
Output:
          date  group  score  count
 0  2020-06-01      a   high   12.0
 1  2020-06-01      a    low   13.0
 2  2020-06-01      a    mid    NaN
 3  2020-06-01      b   high    NaN
 4  2020-06-01      b    low   19.0
 5  2020-06-01      b    mid    NaN
 6  2020-06-01      c   high    3.0
 7  2020-06-01      c    low    NaN
 8  2020-06-01      c    mid    NaN
 9  2020-06-01      d   high    NaN
10  2020-06-01      d    low    NaN
11  2020-06-01      d    mid    NaN
12  2020-06-02      a   high    NaN
13  2020-06-02      a    low    NaN
14  2020-06-02      a    mid    2.0
15  2020-06-02      b   high   22.0
16  2020-06-02      b    low    NaN
17  2020-06-02      b    mid    NaN
18  2020-06-02      c   high    4.0
19  2020-06-02      c    low    NaN
20  2020-06-02      c    mid   49.0
21  2020-06-02      d   high   12.0
22  2020-06-02      d    low    NaN
23  2020-06-02      d    mid    NaN
If I understand correctly, you can abstract this away with the complete function from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
x.complete(['date', 'group', 'score'])
          date  group  score  count
 0  2020-06-01      a   high   12.0
 1  2020-06-01      a    low   13.0
 2  2020-06-01      a    mid    NaN
 3  2020-06-01      b   high    NaN
 4  2020-06-01      b    low   19.0
 5  2020-06-01      b    mid    NaN
 6  2020-06-01      c   high    3.0
 7  2020-06-01      c    low    NaN
 8  2020-06-01      c    mid    NaN
 9  2020-06-01      d   high    NaN
10  2020-06-01      d    low    NaN
11  2020-06-01      d    mid    NaN
12  2020-06-02      a   high    NaN
13  2020-06-02      a    low    NaN
14  2020-06-02      a    mid    2.0
15  2020-06-02      b   high   22.0
16  2020-06-02      b    low    NaN
17  2020-06-02      b    mid    NaN
18  2020-06-02      c   high    4.0
19  2020-06-02      c    low    NaN
20  2020-06-02      c    mid   49.0
21  2020-06-02      d   high   12.0
22  2020-06-02      d    low    NaN
23  2020-06-02      d    mid    NaN
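If an extra dependency is not wanted, roughly the same idea can be wrapped in a small plain-pandas helper. This is only a sketch: complete_grid and its levels parameter are hypothetical names, not part of pandas or pyjanitor, and the helper assumes the key combinations in the input are unique.

```python
import pandas as pd


def complete_grid(df, keys, levels=None):
    """Return df with one row for every combination of the key columns.

    levels optionally maps a key column to the full list of values it
    should take; other keys use the values observed in df.
    """
    levels = levels or {}
    values = [levels.get(k, df[k].unique()) for k in keys]
    full_index = pd.MultiIndex.from_product(values, names=keys)
    # Combinations absent from df get NaN in the non-key columns.
    return df.set_index(keys).reindex(full_index).reset_index()


x = pd.DataFrame({
    'date': ['2020-06-01', '2020-06-01', '2020-06-02', '2020-06-01', '2020-06-02',
             '2020-06-01', '2020-06-02', '2020-06-02', '2020-06-02'],
    'group': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'd'],
    'score': ['high', 'low', 'mid', 'low', 'high', 'high', 'high', 'mid', 'high'],
    'count': [12, 13, 2, 19, 22, 3, 4, 49, 12],
})

out = complete_grid(x, ['date', 'group', 'score'],
                    levels={'score': ['high', 'mid', 'low']})
print(out.shape)
```

Passing the score levels explicitly means a category that never occurs in the data still gets its rows.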