丢弃方差为零的组
Drop groups whose variance is zero
假设下一个df:
d={'month': ['01/01/2020', '01/02/2020', '01/03/2020', '01/01/2020', '01/02/2020', '01/03/2020'],
'country': ['Japan', 'Japan', 'Japan', 'Poland', 'Poland', 'Poland'],
'level':['A01', 'A01', 'A01', 'A00','A00', 'A00'],
'job title':['Insights Manager', 'Insights Manager', 'Insights Manager', 'Sales Director', 'Sales Director', 'Sales Director'],
'number':[0, 0.001, 0, 0, 0, np.nan],
'age':[24, 22, 45, np.nan, 60, 32]}
df=pd.DataFrame(d)
想法是按组获取特定列的方差(在本例中为:country
、level
和 job title
),然后 select方差低于特定阈值的段并将其从原始 df 中删除。
但是当应用时:
# define variance threshold
threshold = 0.0000000001
# get the variance by group for specific column
group_vars=df.groupby(['country', 'level', 'job title']).var()['number']
# select the rows to drop
rows_to_drop = df[group_vars<threshold].index
# drop the rows in place
#df.drop(rows_to_drop, axis=0, inplace=True)
出现下一个错误:
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long long'
预期数据帧将下降:Poland A00 Sales Director 0.000000e+00
所有月份,因为它是零方差的段。
是否可以重新索引 group_vars
以便从原始 df 中删除它?
我错过了什么?
您可以通过 transform
实现
# define variance threshold
threshold = 0.0000000001
# get the variance by group for specific column
group_vars=df.groupby(['country', 'level', 'job title'])['number'].transform('var')
# select the rows to drop
rows_to_drop = df[group_vars<threshold].index
# drop the rows in place
df.drop(rows_to_drop, axis=0, inplace=True)
给出:
month country level job title number age
0 01/01/2020 Japan A01 Insights Manager 0.000 24.0
1 01/02/2020 Japan A01 Insights Manager 0.001 22.0
2 01/03/2020 Japan A01 Insights Manager 0.000 45.0
假设下一个df:
d={'month': ['01/01/2020', '01/02/2020', '01/03/2020', '01/01/2020', '01/02/2020', '01/03/2020'],
'country': ['Japan', 'Japan', 'Japan', 'Poland', 'Poland', 'Poland'],
'level':['A01', 'A01', 'A01', 'A00','A00', 'A00'],
'job title':['Insights Manager', 'Insights Manager', 'Insights Manager', 'Sales Director', 'Sales Director', 'Sales Director'],
'number':[0, 0.001, 0, 0, 0, np.nan],
'age':[24, 22, 45, np.nan, 60, 32]}
df=pd.DataFrame(d)
想法是按组获取特定列的方差(在本例中为:country
、level
和 job title
),然后 select方差低于特定阈值的段并将其从原始 df 中删除。
但是当应用时:
# define variance threshold
threshold = 0.0000000001
# get the variance by group for specific column
group_vars=df.groupby(['country', 'level', 'job title']).var()['number']
# select the rows to drop
rows_to_drop = df[group_vars<threshold].index
# drop the rows in place
#df.drop(rows_to_drop, axis=0, inplace=True)
出现下一个错误:
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long long'
预期数据帧将下降:Poland A00 Sales Director 0.000000e+00
所有月份,因为它是零方差的段。
是否可以重新索引 group_vars
以便从原始 df 中删除它?
我错过了什么?
您可以通过 transform
实现# define variance threshold
threshold = 0.0000000001
# get the variance by group for specific column
group_vars=df.groupby(['country', 'level', 'job title'])['number'].transform('var')
# select the rows to drop
rows_to_drop = df[group_vars<threshold].index
# drop the rows in place
df.drop(rows_to_drop, axis=0, inplace=True)
给出:
month country level job title number age
0 01/01/2020 Japan A01 Insights Manager 0.000 24.0
1 01/02/2020 Japan A01 Insights Manager 0.001 22.0
2 01/03/2020 Japan A01 Insights Manager 0.000 45.0