Pandas: The apply function I am using is giving me wrong results
I have a dataset that looks like this:
a_id b_received brand_id c_consumed type_received date output \
0 sam soap bill oil edibles 2011-01-01 1
1 sam oil chris NaN utility 2011-01-02 1
2 sam brush dan soap grocery 2011-01-03 0
3 harry oil sam shoes clothing 2011-01-04 1
4 harry shoes bill oil edibles 2011-01-05 1
5 alice beer sam eggs breakfast 2011-01-06 0
6 alice brush chris brush cleaning 2011-01-07 1
7 alice eggs NaN NaN edibles 2011-01-08 1
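For reference, the example DataFrame above can be reproduced with the following sketch (values copied from the table; the exact dtypes in the original data are an assumption):

```python
import pandas as pd
import numpy as np

# Rebuild the example DataFrame shown above
df2 = pd.DataFrame({
    'a_id': ['sam', 'sam', 'sam', 'harry', 'harry', 'alice', 'alice', 'alice'],
    'b_received': ['soap', 'oil', 'brush', 'oil', 'shoes', 'beer', 'brush', 'eggs'],
    'brand_id': ['bill', 'chris', 'dan', 'sam', 'bill', 'sam', 'chris', np.nan],
    'c_consumed': ['oil', np.nan, 'soap', 'shoes', 'oil', 'eggs', 'brush', np.nan],
    'type_received': ['edibles', 'utility', 'grocery', 'clothing', 'edibles',
                      'breakfast', 'cleaning', 'edibles'],
    'date': pd.to_datetime(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04',
                            '2011-01-05', '2011-01-06', '2011-01-07', '2011-01-08']),
    'output': [1, 1, 0, 1, 1, 0, 1, 1],
})
```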
I am using the following code:
def probability(x):
    y = []
    for i in range(len(x)):
        y.append(float(x[i]) / float(len(x)))
    return y

df2['prob'] = (df2.groupby('a_id')
                  .apply(probability(['output']))
                  .reset_index(level='a_id', drop=True))
The desired result is a new column with the following values:
prob
0 0.333334
1 0.333334
2 0.0
3 0.5
4 0.5
5 0
6 0.333334
7 0.333334
But I get an error:
y.append(float(x[i])/float(len(x)))
ValueError: could not convert string to float: output
The output column is of int type. I don't understand why this error occurs.
I am trying to calculate, for each person, the probability of consuming a product, based on the output column. For example, if sam received soap and soap also appears in the 'c_consumed' column, the result is 1; otherwise it is 0.
Now, since sam received 3 products and consumed 2 of them, the probability of consuming each product is 1/3. So the probability for each row with output 1 should be 0.333334, and for a row with output 0 it should be 0.
How can I achieve the expected result?
The error occurs because probability(['output']) calls the function immediately with the literal list ['output'], so float('output') is attempted, which raises the ValueError. I think you can simply select the output column from the already computed grouping, .groupby('a_id')['output'], and then apply a function probability that just returns the column output divided by its len:
def probability(x):
    # print(x)
    return x / len(x)

df2['prob'] = (df2.groupby('a_id')['output']
                  .apply(probability)
                  .reset_index(level='a_id', drop=True))
Or with a lambda:
df2['prob'] = (df2.groupby('a_id')['output']
                  .apply(lambda x: x / len(x))
                  .reset_index(level='a_id', drop=True))
A simpler and faster solution is transform:
df2['prob'] = df2['output'] / df2.groupby('a_id')['output'].transform('count')
print(df2)
a_id b_received brand_id c_consumed type_received date output \
0 sam soap bill oil edibles 2011-01-01 1
1 sam oil chris NaN utility 2011-01-02 1
2 sam brush dan soap grocery 2011-01-03 0
3 harry oil sam shoes clothing 2011-01-04 1
4 harry shoes bill oil edibles 2011-01-05 1
5 alice beer sam eggs breakfast 2011-01-06 0
6 alice brush chris brush cleaning 2011-01-07 1
7 alice eggs NaN NaN edibles 2011-01-08 1
prob
0 0.333333
1 0.333333
2 0.000000
3 0.500000
4 0.500000
5 0.000000
6 0.333333
7 0.333333
Timings:
In [505]: %timeit (df2.groupby('a_id')['output'].apply(lambda x: x / len(x) ).reset_index(level='a_id', drop=True))
The slowest run took 10.99 times longer than the fastest. This could mean that an intermediate result is being cached
100 loops, best of 3: 1.73 ms per loop
In [506]: %timeit df2['output'] / df2.groupby('a_id')['output'].transform('count')
The slowest run took 5.03 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 449 µs per loop
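One detail worth noting: transform('count') counts non-NaN values per group, while transform('size') counts all rows. They agree here only because the output column has no NaNs. A minimal sketch (using a reduced DataFrame with just the relevant columns):

```python
import pandas as pd

# Reduced example: only the grouping key and the output column
df2 = pd.DataFrame({
    'a_id': ['sam', 'sam', 'sam', 'harry', 'harry', 'alice', 'alice', 'alice'],
    'output': [1, 1, 0, 1, 1, 0, 1, 1],
})

# 'count' ignores NaN values inside each group; 'size' counts every row.
# With no NaNs in 'output', both yield the same per-group denominator.
by_count = df2['output'] / df2.groupby('a_id')['output'].transform('count')
by_size = df2['output'] / df2.groupby('a_id')['output'].transform('size')
assert by_count.equals(by_size)
```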