使用 Python Pandas 搜索最大值和百分比最大值时的优化
Optimization when searching for max value and percentage max using Python Pandas
我有如下df
目标输出
我尝试了下面的代码,但它会得到一列的输出,我必须添加 for 循环才能得到整个结果
那么我有大数据,有什么快速解决方案吗
data = {'item':["y1","y2","y3","y4","y5","y6","y7","y8","y9","y10"],
'X1': [1,1,1,1,1,7,7,7,5,4],
'X2': [8,9,10,10,10,8,8,10,8,9],
'X3': [11,12,13,11,11,11,11,11,1,2],
}
df = pd.DataFrame(data, columns = ['item', 'X1','X2','X3'])
# get count of unique values
df['X1'].nunique()
# get max Value
df['X1'].value_counts().idxmax()
# get percentage of max value
df['X1'].value_counts().max()/df['X1'].size
# get Second value of Max Value
(df.nlargest(2, ['X1'])['X1']).value_counts().idxmax()
# Get Second Value of %
df['X1'][df['X1']==(df.nlargest(2, ['X1'])['X1']).value_counts().idxmax()].size/df['X1'].size
您可以为每个测试列以及最大和第二最大使用索引创建字典,因为 Series.value_counts
默认排序:
L = []
cols = ['X1','X2','X3']
for c in cols:
u = df[c].nunique()
a = df[c].value_counts()
d = {'No of unique': u,
'Highest rep': a.index[0],
'% of Highest rep': a.iat[0] / len(df),
'Second Highest rep': a.index[1],
'Second % of Highest rep': a.iat[1] / len(df)}
L.append(d)
df = pd.DataFrame(L, index=cols)
print (df)
No of unique Highest rep % of Highest rep Second Highest rep \
X1 4 1 0.5 7
X2 3 10 0.4 8
X3 5 11 0.6 13
Second % of Highest rep
X1 0.3
X2 0.4
X3 0.1
更一般的解决方案测试是否存在最大值:
L = []
cols = ['X1','X2','X3']
for c in cols:
u = df[c].nunique()
a = df[c].value_counts()
if len(a) > 1:
secondmax = a.index[1]
secondperc = a.iat[1] / len(df)
else:
secondmax = np.nan
secondsecondperc = np.nan
d = {'No of unique': u,
'Highest rep': a.index[0],
'% of Highest rep': a.iat[0] / len(df),
'Second Highest rep': secondmax,
'Second % of Highest rep': secondperc}
L.append(d)
df = pd.DataFrame(L, index=cols)
我有如下df
目标输出
我尝试了下面的代码,但它会得到一列的输出,我必须添加 for 循环才能得到整个结果
那么我有大数据,有什么快速解决方案吗
data = {'item':["y1","y2","y3","y4","y5","y6","y7","y8","y9","y10"],
'X1': [1,1,1,1,1,7,7,7,5,4],
'X2': [8,9,10,10,10,8,8,10,8,9],
'X3': [11,12,13,11,11,11,11,11,1,2],
}
df = pd.DataFrame(data, columns = ['item', 'X1','X2','X3'])
# get count of unique values
df['X1'].nunique()
# get max Value
df['X1'].value_counts().idxmax()
# get percentage of max value
df['X1'].value_counts().max()/df['X1'].size
# get Second value of Max Value
(df.nlargest(2, ['X1'])['X1']).value_counts().idxmax()
# Get Second Value of %
df['X1'][df['X1']==(df.nlargest(2, ['X1'])['X1']).value_counts().idxmax()].size/df['X1'].size
您可以为每个测试列以及最大和第二最大使用索引创建字典,因为 Series.value_counts
默认排序:
L = []
cols = ['X1','X2','X3']
for c in cols:
u = df[c].nunique()
a = df[c].value_counts()
d = {'No of unique': u,
'Highest rep': a.index[0],
'% of Highest rep': a.iat[0] / len(df),
'Second Highest rep': a.index[1],
'Second % of Highest rep': a.iat[1] / len(df)}
L.append(d)
df = pd.DataFrame(L, index=cols)
print (df)
No of unique Highest rep % of Highest rep Second Highest rep \
X1 4 1 0.5 7
X2 3 10 0.4 8
X3 5 11 0.6 13
Second % of Highest rep
X1 0.3
X2 0.4
X3 0.1
更一般的解决方案测试是否存在最大值:
L = []
cols = ['X1','X2','X3']
for c in cols:
u = df[c].nunique()
a = df[c].value_counts()
if len(a) > 1:
secondmax = a.index[1]
secondperc = a.iat[1] / len(df)
else:
secondmax = np.nan
secondsecondperc = np.nan
d = {'No of unique': u,
'Highest rep': a.index[0],
'% of Highest rep': a.iat[0] / len(df),
'Second Highest rep': secondmax,
'Second % of Highest rep': secondperc}
L.append(d)
df = pd.DataFrame(L, index=cols)