Optimization when searching for max value and percentage max using Python Pandas

Question

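I have the following df:

Target output:

I tried the code below, but it only produces the output for a single column, so I would have to add a for loop to get the full result.

I have a large dataset, so is there a faster solution?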

import pandas as pd

data = {'item': ["y1","y2","y3","y4","y5","y6","y7","y8","y9","y10"],
        'X1':   [1,1,1,1,1,7,7,7,5,4],
        'X2':   [8,9,10,10,10,8,8,10,8,9],
        'X3':   [11,12,13,11,11,11,11,11,1,2],
        }
df = pd.DataFrame(data, columns=['item', 'X1', 'X2', 'X3'])
# number of unique values
df['X1'].nunique()
# most frequent value
df['X1'].value_counts().idxmax()
# share of rows holding the most frequent value
df['X1'].value_counts().max()/df['X1'].size
# attempt at the second most frequent value
(df.nlargest(2, ['X1'])['X1']).value_counts().idxmax()
# attempt at the share of rows holding that second value
df['X1'][df['X1']==(df.nlargest(2, ['X1'])['X1']).value_counts().idxmax()].size/df['X1'].size

Answer 1

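You can build a dictionary for each column under test and pick the most frequent and second most frequent values by position in the index, because Series.value_counts sorts by count in descending order by default: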

L = []
cols = ['X1', 'X2', 'X3']

for c in cols:
    u = df[c].nunique()
    # counts are sorted in descending order, so position 0 is the most frequent value
    a = df[c].value_counts()
    d = {'No of unique': u,
         'Highest rep': a.index[0],
         '% of Highest rep': a.iat[0] / len(df),
         'Second Highest rep': a.index[1],
         'Second % of Highest rep': a.iat[1] / len(df)}
    L.append(d)

df = pd.DataFrame(L, index=cols)
print(df)
    No of unique  Highest rep  % of Highest rep  Second Highest rep  \
X1             4            1               0.5                   7   
X2             3           10               0.4                   8   
X3             5           11               0.6                  13   

    Second % of Highest rep  
X1                      0.3  
X2                      0.4  
X3                      0.1
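
As a variation on the loop above, here is a minimal sketch that reads the shares directly from `Series.value_counts(normalize=True)` instead of dividing by `len(df)`. The helper name `top_two` is hypothetical. The sketch assumes the columns contain no missing values: `value_counts` drops NaN by default, so with missing data the shares would be relative to the non-missing rows only. It does the same per-column work as the loop, just packaged through `DataFrame.apply`.

```python
import numpy as np
import pandas as pd

data = {'item': ["y1", "y2", "y3", "y4", "y5", "y6", "y7", "y8", "y9", "y10"],
        'X1':   [1, 1, 1, 1, 1, 7, 7, 7, 5, 4],
        'X2':   [8, 9, 10, 10, 10, 8, 8, 10, 8, 9],
        'X3':   [11, 12, 13, 11, 11, 11, 11, 11, 1, 2]}
df = pd.DataFrame(data)


def top_two(s):
    """Summarise the two most frequent values of a Series (hypothetical helper)."""
    # Relative frequencies, sorted from most to least frequent.
    freq = s.value_counts(normalize=True)
    has_second = len(freq) > 1
    return pd.Series({
        'No of unique': s.nunique(),
        'Highest rep': freq.index[0],
        '% of Highest rep': freq.iat[0],
        'Second Highest rep': freq.index[1] if has_second else np.nan,
        'Second % of Highest rep': freq.iat[1] if has_second else np.nan,
    })


# apply() calls top_two once per column; transpose so each column becomes a row.
summary = df[['X1', 'X2', 'X3']].apply(top_two).T
print(summary)
```

When two values tie on count (as 8 and 10 do in X2), the order among the tied values may differ between pandas versions, so the reported "Highest rep" for such a column should not be relied on.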

A more general solution checks whether a second most frequent value exists:

import numpy as np

L = []
cols = ['X1', 'X2', 'X3']

for c in cols:
    u = df[c].nunique()
    a = df[c].value_counts()

    # a column with a single distinct value has no second entry
    if len(a) > 1:
        secondmax = a.index[1]
        secondperc = a.iat[1] / len(df)
    else:
        secondmax = np.nan
        secondperc = np.nan

    d = {'No of unique': u,
         'Highest rep': a.index[0],
         '% of Highest rep': a.iat[0] / len(df),
         'Second Highest rep': secondmax,
         'Second % of Highest rep': secondperc}

    L.append(d)

df = pd.DataFrame(L, index=cols)
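
To see the guard at work, the following minimal sketch wraps the same logic in a function and runs it on a column that holds a single distinct value, so the second-place fields fall back to NaN. The function name `summarize` and the small `demo` frame are made up for illustration.

```python
import numpy as np
import pandas as pd


def summarize(frame, cols):
    """Per-column summary with a NaN fallback when no second value exists."""
    rows = []
    for c in cols:
        counts = frame[c].value_counts()  # sorted by count, descending
        has_second = len(counts) > 1
        rows.append({
            'No of unique': frame[c].nunique(),
            'Highest rep': counts.index[0],
            '% of Highest rep': counts.iat[0] / len(frame),
            'Second Highest rep': counts.index[1] if has_second else np.nan,
            'Second % of Highest rep': (counts.iat[1] / len(frame)
                                        if has_second else np.nan),
        })
    return pd.DataFrame(rows, index=cols)


# Column A has one distinct value; column B has two.
demo = pd.DataFrame({'A': [3, 3, 3, 3], 'B': [1, 1, 1, 2]})
result = summarize(demo, ['A', 'B'])
print(result)
```

Returning a new frame from the function also leaves the input frame untouched, so the summary can be recomputed without rebuilding the data.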
