How to binarize float values in pandas?
I have float data like this, produced as the output of a neural network with 3 neurons. I want to convert it to binary classification labels (mutually exclusive) based on the maximum value in each row.
0.423201 0.368718 0.338091
0.246899 0.437535 0.000262
0.978685 0.136219 0.027693
The output should be
1 0 0
0 1 0
1 0 0
This means each row gets the value 1 exactly once (where the maximum is) and zeros everywhere else.
How can this be done in pandas or Python? I know get_dummies in pandas is supposed to be the way to do this, but it doesn't work for me.
Please help if you can.
I think you can use rank and then compare it with the max value of df1. Finally, convert the boolean DataFrame to int with astype:
print(df)
0 1 2
0 0.423201 0.368718 0.338091
1 0.246899 0.437535 0.000262
2 0.978685 0.136219 0.027693
df1 = df.rank(method='max', axis=1)
print(df1)
0 1 2
0 3 2 1
1 2 3 1
2 3 2 1
#get max value of df1
ma = df1.max().max()
print(ma)
3.0
print (df1 == ma)
0 1 2
0 True False False
1 False True False
2 True False False
print (df1 == ma).astype(int)
0 1 2
0 1 0 0
1 0 1 0
2 1 0 0
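Since the question mentions get_dummies, here is a minimal sketch (my addition, not part of the answer above) of how it can still be used: take the column label of each row maximum with idxmax, one-hot encode those labels, and reindex so columns that never hold the maximum still appear. The df construction just reproduces the sample data.
import pandas as pd

df = pd.DataFrame([[0.423201, 0.368718, 0.338091],
                   [0.246899, 0.437535, 0.000262],
                   [0.978685, 0.136219, 0.027693]])

# idxmax returns the column label of the maximum in each row
labels = df.idxmax(axis=1)

# one-hot encode the labels; reindex restores columns that never win
binarized = (pd.get_dummies(labels)
               .reindex(columns=df.columns, fill_value=0)
               .astype(int))
print(binarized)
#    0  1  2
# 0  1  0  0
# 1  0  1  0
# 2  1  0  0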
EDIT:
I think you can use eq to compare each row with the max of df, and then convert to int with astype:
print(df.max(axis=1))
0    0.423201
1    0.437535
2    0.978685
dtype: float64
print(df.eq(df.max(axis=1), axis=0).astype(int))
0 1 2
0 1 0 0
1 0 1 0
2 1 0 0
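One caveat (my note, not the answerer's): df.eq(df.max(axis=1), axis=0) marks every cell that ties for the row maximum, so a row with two equal maxima gets two 1s. If exactly one 1 per row is required even with ties, a sketch using argmax (which keeps only the first maximum) could look like this:
import numpy as np
import pandas as pd

# a row with a tie: both 0.5 values are the row maximum
df_tie = pd.DataFrame([[0.5, 0.5, 0.0],
                       [0.2, 0.7, 0.1]])

print(df_tie.eq(df_tie.max(axis=1), axis=0).astype(int))
#    0  1  2
# 0  1  1  0   <- two 1s in the tied row
# 1  0  1  0

# mark only the first maximum in each row
one_hot = np.zeros(df_tie.shape, dtype=int)
one_hot[np.arange(len(df_tie)), df_tie.values.argmax(axis=1)] = 1
print(pd.DataFrame(one_hot, index=df_tie.index, columns=df_tie.columns))
#    0  1  2
# 0  1  0  0
# 1  0  1  0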
Timings
len(df) = 3:
In [418]: %timeit df.eq(df.max(axis=1), axis=0).astype(int)
The slowest run took 5.44 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 334 µs per loop
In [419]: %timeit df.apply(lambda x: x == x.max(), axis='columns').astype(int)
The slowest run took 4.49 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 1.44 ms per loop
In [420]: %timeit (df.rank(method='max', axis=1) == df.rank(method='max', axis=1).max().max()).astype(int)
The slowest run took 4.83 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 656 µs per loop
len(df) = 3000:
In [426]: %timeit df.eq(df.max(axis=1), axis=0).astype(int)
The slowest run took 5.44 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 456 µs per loop
In [427]: %timeit df.apply(lambda x: x == x.max(), axis='columns').astype(int)
1 loops, best of 3: 496 ms per loop
In [428]: %timeit (df.rank(method='max', axis=1) == df.rank(method='max', axis=1).max().max()).astype(int)
The slowest run took 4.50 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 1.32 ms per loop
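The benchmark setup is not shown above; a plausible way to reproduce a frame of roughly that size and check that the three approaches agree (an assumption about the data, not the answerer's exact setup) would be:
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.rand(3000, 3))   # hypothetical 3000-row frame

out_eq    = df.eq(df.max(axis=1), axis=0).astype(int)
out_apply = df.apply(lambda x: x == x.max(), axis='columns').astype(int)
out_rank  = (df.rank(method='max', axis=1) ==
             df.rank(method='max', axis=1).max().max()).astype(int)

# with distinct random floats the three results are identical
assert out_eq.equals(out_apply) and out_eq.equals(out_rank)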
I think this would be simpler and faster.
df.apply(lambda x: x == x.max(), axis='columns').astype(int)
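As a quick check, this one-liner reproduces the expected output on the sample data from the question (df rebuilt here so the snippet is self-contained):
import pandas as pd

df = pd.DataFrame([[0.423201, 0.368718, 0.338091],
                   [0.246899, 0.437535, 0.000262],
                   [0.978685, 0.136219, 0.027693]])

print(df.apply(lambda x: x == x.max(), axis='columns').astype(int))
#    0  1  2
# 0  1  0  0
# 1  0  1  0
# 2  1  0  0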