Drop groups whose variance is zero

Question

Suppose we have the following df:

import numpy as np
import pandas as pd

d = {'month': ['01/01/2020', '01/02/2020', '01/03/2020', '01/01/2020', '01/02/2020', '01/03/2020'],
     'country': ['Japan', 'Japan', 'Japan', 'Poland', 'Poland', 'Poland'],
     'level': ['A01', 'A01', 'A01', 'A00', 'A00', 'A00'],
     'job title': ['Insights Manager', 'Insights Manager', 'Insights Manager', 'Sales Director', 'Sales Director', 'Sales Director'],
     'number': [0, 0.001, 0, 0, 0, np.nan],
     'age': [24, 22, 45, np.nan, 60, 32]}

df = pd.DataFrame(d)

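The idea is to compute the variance of a specific column per group (here the groups are defined by country, level and job title), then select the segments whose variance is below a given threshold and drop them from the original df.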

But when I run:

# define variance threshold   
threshold = 0.0000000001 

# get the variance by group for specific column 
group_vars=df.groupby(['country', 'level', 'job title']).var()['number']

# select the rows to drop 
rows_to_drop = df[group_vars<threshold].index

# drop the rows in place
#df.drop(rows_to_drop, axis=0, inplace=True)

the following error is raised:

ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long long'

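The expected DataFrame would drop Poland A00 Sales Director (variance 0.000000e+00) for all months, since it is a zero-variance segment.

Is it possible to reindex group_vars so that these rows can be dropped from the original df?

What am I missing?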

Answer 1

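You can do this with transform. groupby(...).var() returns one value per group, indexed by the group keys, so it cannot be used as a row mask on df. transform('var') broadcasts each group's variance back to its rows, so the result shares df's index: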

# define variance threshold   
threshold = 0.0000000001 

# get the variance by group for specific column 
group_vars=df.groupby(['country', 'level', 'job title'])['number'].transform('var')

# select the rows to drop 
rows_to_drop = df[group_vars<threshold].index

# drop the rows in place
df.drop(rows_to_drop, axis=0, inplace=True)

which gives:

        month country level         job title  number   age
0  01/01/2020   Japan   A01  Insights Manager   0.000  24.0
1  01/02/2020   Japan   A01  Insights Manager   0.001  22.0
2  01/03/2020   Japan   A01  Insights Manager   0.000  45.0
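
To make the index difference concrete, here is a minimal sketch (not part of the original answer) that compares the aggregated and transformed results on a trimmed copy of the question's data. It then filters with a boolean mask instead of an in-place drop. The names keys, per_group, per_row and kept are illustrative only.

```python
import numpy as np
import pandas as pd

# Trimmed copy of the question's data (month and age omitted for brevity)
df = pd.DataFrame({
    'country': ['Japan', 'Japan', 'Japan', 'Poland', 'Poland', 'Poland'],
    'level': ['A01', 'A01', 'A01', 'A00', 'A00', 'A00'],
    'job title': ['Insights Manager'] * 3 + ['Sales Director'] * 3,
    'number': [0, 0.001, 0, 0, 0, np.nan],
})
keys = ['country', 'level', 'job title']
threshold = 1e-10

# Aggregation: one value per group, indexed by the group keys (a MultiIndex)
per_group = df.groupby(keys)['number'].var()

# transform: one value per row, carrying the same index as df
per_row = df.groupby(keys)['number'].transform('var')

# Non-mutating alternative to drop(): keep rows whose group variance is
# not below the threshold. A NaN variance (e.g. a single-row group)
# compares as False, so such groups are kept, as in the drop-based version.
kept = df[~(per_row < threshold)]
print(kept)
```

Assigning the filtered frame to a new name leaves df untouched, which can be easier to reason about than inplace=True when the original data is still needed later.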

Tags: python, statistics, dataframe, pandas