Insert missing category for each group in pandas dataframe
I need to insert the missing categories for each group. Here is an example:
import pandas as pd
import numpy as np
df = pd.DataFrame({"group": [1, 1, 1, 2, 2],
                   "cat": ['a', 'b', 'c', 'a', 'c'],
                   "value": range(5),
                   "value2": np.array(range(5)) * 2})
df
# test dataframe
cat group value value2
a 1 0 0
b 1 1 2
c 1 2 4
a 2 3 6
c 2 4 8
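Say I have categories = ['a', 'b', 'c', 'd']. If a group's cat column does not contain a category from the list, I want to insert a row for that group with the values set to 0.
How can I insert a row per group for each missing category, so that every group has all the categories?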
cat group value value2
a 1 0 0
b 1 1 2
c 1 2 4
d 1 0 0
a 2 3 6
c 2 4 8
b 2 0 0
d 2 0 0
It's a little involved, but you can use groupby + reindex:
categories = ['a', 'b', 'c', 'd']
def f(x):
    # x is indexed by cat, so take the group label by position
    return x.reindex(categories, fill_value=0)\
            .assign(group=x['group'].iloc[0])
df.set_index('cat').groupby('group', group_keys=False).apply(f).reset_index()
cat group value value2
0 a 1 0 0
1 b 1 1 2
2 c 1 2 4
3 d 1 0 0
4 a 2 3 6
5 b 2 0 0
6 c 2 4 8
7 d 2 0 0
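To see what f does to a single group, the sketch below applies the same reindex to group 2 only. It is a minimal illustration on the question's sample data; the names `g2` and `filled` are made up here and are not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({"group": [1, 1, 1, 2, 2],
                   "cat": ["a", "b", "c", "a", "c"],
                   "value": range(5),
                   "value2": [0, 2, 4, 6, 8]})
categories = ["a", "b", "c", "d"]

# Rows of group 2, indexed by category (only 'a' and 'c' are present).
g2 = df[df["group"] == 2].set_index("cat")

# reindex adds the missing labels 'b' and 'd'; fill_value=0 fills every
# column of the new rows with 0, including 'group'.
filled = g2.reindex(categories, fill_value=0)

# Restore the group label on the inserted rows, then move cat back to a column.
filled = filled.assign(group=2).reset_index()
print(filled)
```

Because fill_value=0 is applied during the reindex, no NaN is ever created, which is why value and value2 stay integers in this approach.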
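Here is a one-line solution (note that fillna(0) also sets group to 0 on the inserted rows, as the output shows)...
df.groupby('group',as_index=False).apply(lambda x : x.set_index('cat').\
reindex(categories)).fillna(0).reset_index().drop(columns='level_0')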
Out[601]:
cat group value value2
0 a 1.0 0.0 0.0
1 b 1.0 1.0 2.0
2 c 1.0 2.0 4.0
3 d 0.0 0.0 0.0
4 a 2.0 3.0 6.0
5 b 0.0 0.0 0.0
6 c 2.0 4.0 8.0
7 d 0.0 0.0 0.0
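In the output above, the inserted rows end up with group 0.0 because fillna(0) fills every column. One possible way to keep the group label, sketched here on the question's sample data, is to assign the group key before filling. This swaps the apply call for a plain loop over the groups, and `parts` and `out` are illustrative names only.

```python
import pandas as pd

df = pd.DataFrame({"group": [1, 1, 1, 2, 2],
                   "cat": ["a", "b", "c", "a", "c"],
                   "value": range(5),
                   "value2": [0, 2, 4, 6, 8]})
categories = ["a", "b", "c", "d"]

# For each group: reindex on cat (missing categories become NaN rows),
# then overwrite the group column with the group key so it is never NaN.
parts = [
    g.set_index("cat").reindex(categories).assign(group=key)
    for key, g in df.groupby("group")
]

# Only value and value2 still contain NaN at this point, so fillna(0)
# no longer touches the group labels.
out = pd.concat(parts).fillna(0).reset_index()
print(out)
```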
groupby is not necessary here; you only need to reindex by a MultiIndex:
categories = ['a', 'b', 'c', 'd']
mux = pd.MultiIndex.from_product([df['group'].unique(), categories], names=('group','cat'))
df = df.set_index(['group','cat']).reindex(mux, fill_value=0).swaplevel(0,1).reset_index()
print (df)
cat group value value2
0 a 1 0 0
1 b 1 1 2
2 c 1 2 4
3 d 1 0 0
4 a 2 3 6
5 b 2 0 0
6 c 2 4 8
7 d 2 0 0
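A short sketch of what the MultiIndex approach relies on, using the question's sample data: the product index holds every (group, cat) pair, and reindex only works if the existing (group, cat) pairs are unique. The explicit uniqueness check and the name `indexed` are additions for illustration.

```python
import pandas as pd

df = pd.DataFrame({"group": [1, 1, 1, 2, 2],
                   "cat": ["a", "b", "c", "a", "c"],
                   "value": range(5),
                   "value2": [0, 2, 4, 6, 8]})
categories = ["a", "b", "c", "d"]

# Cartesian product: 2 groups x 4 categories = 8 target rows.
mux = pd.MultiIndex.from_product([df["group"].unique(), categories],
                                 names=("group", "cat"))
print(len(mux))

indexed = df.set_index(["group", "cat"])

# reindex raises an error on duplicate labels, so check first
# (duplicates would have to be aggregated or dropped beforehand).
assert indexed.index.is_unique

# fill_value=0 fills the new rows directly, so the integer dtypes survive.
out = indexed.reindex(mux, fill_value=0).reset_index()
print(out.dtypes)
```

This is also why the timing setup further down calls drop_duplicates(['group','cat']) before running the functions.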
There are many solutions, so I added timings:
np.random.seed(123)
N = 1000000
L = list('abcd') #235,94.1,156ms
df = pd.DataFrame({'cat': np.random.choice(L, N, p=(0.002,0.002,0.005, 0.991)),
'group':np.random.randint(10000,size=N),
'value':np.random.randint(1000,size=N),
'value2':np.random.randint(5000,size=N)})
df = df.sort_values(['group','cat']).drop_duplicates(['group','cat']).reset_index(drop=True)
print (df.head(10))
categories = ['a', 'b', 'c', 'd']
def jez(df):
    mux = pd.MultiIndex.from_product([df['group'].unique(), categories], names=('group','cat'))
    return df.set_index(['group','cat']).reindex(mux, fill_value=0).swaplevel(0,1).reset_index()
def f(x):
    # x is indexed by cat, so take the group label by position
    return x.reindex(categories, fill_value=0).assign(group=x['group'].iloc[0])
def coldspeed(df):
    return df.set_index('cat').groupby('group', group_keys=False).apply(f).reset_index()
def zero(df):
    from itertools import product
    dfo = pd.DataFrame(list(product(df['group'].unique(), categories)),
                       columns=['group', 'cat'])
    return dfo.merge(df, how='left').fillna(0)
def wen(df):
    return df.groupby('group',as_index=False).apply(lambda x : x.set_index('cat').reindex(categories)).fillna(0).reset_index().drop(columns='level_0')
def bharath(df):
    mux = pd.MultiIndex.from_product([df['group'].unique(), categories], names=('group','cat'))
    # index=False keeps group/cat as columns only, so the merge keys are unambiguous
    return mux.to_frame(index=False).merge(df,on=['cat','group'],how='outer').fillna(0)
def akilat90(df):
    grouped = df.groupby('group')
    categories = pd.DataFrame(['a', 'b', 'c', 'd'], columns=['cat'])
    merged_list = []
    for g in grouped:
        merged = pd.merge(categories, g[1], how = 'outer', on='cat')
        # replace the `group` column's `NA`s by mode
        merged['group'] = merged['group'].fillna(merged['group'].mode()[0])
        merged.fillna(0, inplace=True)
        merged_list.append(merged)
    return pd.concat(merged_list)
print (jez(df))
print (coldspeed(df))
print (zero(df))
print (wen(df))
print (bharath(df))
print (akilat90(df))
In [262]: %timeit (jez(df))
100 loops, best of 3: 11.5 ms per loop
In [263]: %timeit (bharath(df))
100 loops, best of 3: 16 ms per loop
In [264]: %timeit (zero(df))
10 loops, best of 3: 28.3 ms per loop
In [265]: %timeit (wen(df))
1 loop, best of 3: 8.74 s per loop
In [266]: %timeit (coldspeed(df))
1 loop, best of 3: 8.2 s per loop
In [297]: %timeit (akilat90(df))
1 loop, best of 3: 23.6 s per loop
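Timings are only comparable if the functions return the same rows. The sketch below checks that for the two fast approaches (jez and zero) on a small random frame. The data size, the seed, and the `normalize` helper are assumptions made for this illustration and were not part of the benchmark.

```python
from itertools import product

import numpy as np
import pandas as pd

categories = ["a", "b", "c", "d"]

def jez(df):
    mux = pd.MultiIndex.from_product([df["group"].unique(), categories],
                                     names=("group", "cat"))
    return (df.set_index(["group", "cat"])
              .reindex(mux, fill_value=0)
              .swaplevel(0, 1)
              .reset_index())

def zero(df):
    dfo = pd.DataFrame(list(product(df["group"].unique(), categories)),
                       columns=["group", "cat"])
    return dfo.merge(df, how="left").fillna(0)

# Small random input with unique (group, cat) pairs, like the benchmark data.
rng = np.random.default_rng(0)
n = 1000
small = pd.DataFrame({"cat": rng.choice(categories, n),
                      "group": rng.integers(50, size=n),
                      "value": rng.integers(1000, size=n),
                      "value2": rng.integers(5000, size=n)})
small = small.drop_duplicates(["group", "cat"]).reset_index(drop=True)

def normalize(out):
    # Hypothetical helper: same column order, row order and integer dtypes.
    cols = ["group", "cat", "value", "value2"]
    return (out[cols].sort_values(["group", "cat"])
                     .reset_index(drop=True)
                     .astype({"value": "int64", "value2": "int64"}))

a = normalize(jez(small))
b = normalize(zero(small))

# Raises AssertionError if the two results differ.
pd.testing.assert_frame_equal(a, b, check_dtype=False)
print("jez and zero return the same rows")
```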
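This is not an elegant approach; I wish I knew a way to merge at the group level so that the for loop could be eliminated.
Solution
Treat the categories list as a dataframe and, after the groupby, merge at the group level.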
categories = pd.DataFrame(['a', 'b', 'c', 'd'], columns=['cat'])
print(categories)
grouped = df.groupby('group')
This is the ugly part. I wonder whether there is a pandas way to eliminate this for loop:
merged_list = []
for g in grouped:
    merged = pd.merge(categories, g[1], how = 'outer', on='cat')
    # replace the `group` column's `NA`s by mode
    merged['group'] = merged['group'].fillna(merged['group'].mode()[0])
    merged.fillna(0, inplace=True)
    merged_list.append(merged)
    print(merged)
cat group value value2
0 a 1.0 0.0 0.0
1 b 1.0 1.0 2.0
2 c 1.0 2.0 4.0
3 d 1.0 0.0 0.0
cat group value value2
0 a 2.0 3.0 6.0
1 b 2.0 0.0 0.0
2 c 2.0 4.0 8.0
3 d 2.0 0.0 0.0
Then we can simply concatenate merged_list:
out = pd.concat(merged_list)
print(out)
cat group value value2
0 a 1.0 0.0 0.0
1 b 1.0 1.0 2.0
2 c 1.0 2.0 4.0
3 d 1.0 0.0 0.0
0 a 2.0 3.0 6.0
1 b 2.0 0.0 0.0
2 c 2.0 4.0 8.0
3 d 2.0 0.0 0.0
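On the question of dropping the for loop: one possible loop-free variant, sketched here on the question's sample data, is to cross-join the distinct groups with the categories dataframe and then left-merge the data onto that grid. It assumes a pandas version that supports how='cross' in merge (1.2 or later); `groups` and `full` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"group": [1, 1, 1, 2, 2],
                   "cat": ["a", "b", "c", "a", "c"],
                   "value": range(5),
                   "value2": [0, 2, 4, 6, 8]})
categories = pd.DataFrame({"cat": ["a", "b", "c", "d"]})

# One row per distinct group, crossed with every category:
# this is the full (group, cat) grid.
groups = df[["group"]].drop_duplicates()
full = groups.merge(categories, how="cross")

# Left-merge the data onto the grid; pairs missing from df get NaN,
# which fillna(0) turns into 0. The group column comes from the grid,
# so it never needs to be repaired.
out = full.merge(df, on=["group", "cat"], how="left").fillna(0)
print(out)
```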
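We can also use a MultiIndex, as @jezreal suggested, and then merge the data, which is a very fast solution, i.e.
categories = ['a', 'b', 'c', 'd']  # the plain list, not the dataframe used above
mux = pd.MultiIndex.from_product([df['group'].unique(), categories], names=('group','cat'))
# index=False keeps group/cat as columns only, so the merge keys are unambiguous
ndf = mux.to_frame(index=False).merge(df,on=['cat','group'],how='outer').fillna(0)
Output: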
cat group value value2
0 a 1 0.0 0.0
1 b 1 1.0 2.0
2 c 1 2.0 4.0
3 d 1 0.0 0.0
4 a 2 3.0 6.0
5 b 2 0.0 0.0
6 c 2 4.0 8.0
7 d 2 0.0 0.0
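The merge introduces NaN for the missing pairs, so value and value2 come out as floats in the output above. If integers are wanted, a cast after fillna should be enough. This is a small sketch on the question's sample data; the astype step is an addition and not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({"group": [1, 1, 1, 2, 2],
                   "cat": ["a", "b", "c", "a", "c"],
                   "value": range(5),
                   "value2": [0, 2, 4, 6, 8]})
categories = ["a", "b", "c", "d"]

mux = pd.MultiIndex.from_product([df["group"].unique(), categories],
                                 names=("group", "cat"))

# Same merge as in the answer, with the product index turned into plain columns.
ndf = (mux.to_frame(index=False)
          .merge(df, on=["cat", "group"], how="outer")
          .fillna(0))

# After fillna there is no NaN left, so casting back to int is safe.
ndf = ndf.astype({"value": "int64", "value2": "int64"})
print(ndf.dtypes)
```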
Use merge against the precomputed combinations of cat and group:
In [35]: from itertools import product
In [36]: cats = ['a', 'b', 'c', 'd']
In [37]: dfo = pd.DataFrame(list(product(df['group'].unique(), cats)),
    ...:                    columns=['group', 'cat'])
In [38]: dfo.merge(df, how='left').fillna(0)
Out[38]:
group cat value value2
0 1 a 0.0 0.0
1 1 b 1.0 2.0
2 1 c 2.0 4.0
3 1 d 0.0 0.0
4 2 a 3.0 6.0
5 2 b 0.0 0.0
6 2 c 4.0 8.0
7 2 d 0.0 0.0
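The merge above joins on the columns the two frames share (group and cat). A variant that may help when checking the result: passing the keys explicitly and using merge's indicator option marks which rows were inserted. This is a sketch on the question's sample data; `merged` and `inserted` are illustrative names.

```python
from itertools import product

import pandas as pd

df = pd.DataFrame({"group": [1, 1, 1, 2, 2],
                   "cat": ["a", "b", "c", "a", "c"],
                   "value": range(5),
                   "value2": [0, 2, 4, 6, 8]})
cats = ["a", "b", "c", "d"]

dfo = pd.DataFrame(list(product(df["group"].unique(), cats)),
                   columns=["group", "cat"])

# indicator=True adds a _merge column: 'both' for pairs that exist in df,
# 'left_only' for pairs that exist only in the precomputed grid.
merged = dfo.merge(df, on=["group", "cat"], how="left", indicator=True)
inserted = merged["_merge"] == "left_only"

# Drop the helper column before filling, then show only the inserted rows.
out = merged.drop(columns="_merge").fillna(0)
print(out[inserted])
```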