How to retain float precision with character values using pandas?

I have a dataframe like the one shown below:

import numpy as np
import pandas as pd

df = pd.DataFrame({'source_code':['A250.00','C791.0','716.90','493.90','143.21','134.52'],
                   'source_description':['test1', 'test1','test2','test3','test4','test5'],
                   'key_id':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]})

hash_file = pd.DataFrame({'source_id':['A250','C791','716.9','493.9','143.21','134.52'],
                          'source_code':['test1','test1','test2','test3','test4','test5'],
                          'hash_id':[911,512,713,814,616,717]})
id_file =  hash_file.set_index(['source_id','source_code'])['hash_id']

I would like to update the key_id column by comparing the source_code and source_description columns of df with the source_id and source_code columns of hash_file.

So, based on this, I tried the following:
df['key_id'] = df.set_index(['source_code','source_description']).index.map(id_file)

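While this works fine in the usual cases, it does not work for specific mismatches such as 250 vs. 250.00 or 791.0 vs. 791, and it produces incorrect output for those rows.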

So I tried converting them to strings, but that still doesn't work.
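The reason the string conversion does not help can be seen in a small check: as strings, '250.00' and '250' are different keys, even though they are equal as numbers. A minimal sketch, using numeric-only sample values taken from the mismatches mentioned above (the variable names are only for illustration):

```python
import pandas as pd

left = pd.Series(['250.00', '791.0', '716.90'])
right = pd.Series(['250', '791', '716.9'])

# Compared as strings, none of the pairs are equal
string_equal = (left == right).tolist()

# Compared as floats, every pair is equal
float_equal = (left.astype(float) == right.astype(float)).tolist()

print(string_equal)
print(float_equal)
```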

I expect key_id to be filled with the matching hash_id for every row.

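If possible (that is, if every value is purely numeric, without a letter prefix such as the A in A250.00), convert the values to floats: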

df['source_code'] = df['source_code'].astype(float)
hash_file['source_id'] = hash_file['source_id'].astype(float)

id_file =  hash_file.set_index(['source_id','source_code'])['hash_id']

df['key_id'] = df.set_index(['source_code','source_description']).index.map(id_file)


print (df)
   source_code source_description  key_id
0       250.00              test1     911
1       791.00              test1     512
2       716.90              test2     713
3       493.90              test3     814
4       143.21              test4     616
5       134.52              test5     717
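Whether the float conversion is possible can be checked up front. The sketch below is one way to do it, assuming pd.to_numeric with errors='coerce' is acceptable for the check; the series name codes is hypothetical. Any value listed in not_numeric would make astype(float) raise an error.

```python
import pandas as pd

codes = pd.Series(['A250.00', 'C791.0', '716.90', '493.90'])

# errors='coerce' turns strings that cannot be parsed into NaN instead of raising
as_numbers = pd.to_numeric(codes, errors='coerce')

# Values that cannot be converted to float
not_numeric = codes[as_numbers.isna()].tolist()
print(not_numeric)
```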

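But float precision may still cause problems. One possible trick is to multiply the values by a scalar such as 100 and then convert them to integers: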

df['source_code'] = df['source_code'].astype(float).mul(100).astype(int)
hash_file['source_id'] = hash_file['source_id'].astype(float).mul(100).astype(int)

id_file =  hash_file.set_index(['source_id','source_code'])['hash_id']

df['key_id'] = df.set_index(['source_code','source_description']).index.map(id_file)


print (df)
   source_code source_description  key_id
0        25000              test1     911
1        79100              test1     512
2        71690              test2     713
3        49390              test3     814
4        14321              test4     616
5        13452              test5     717
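One caveat with the integer trick: astype(int) truncates instead of rounding, so a product that lands just below a whole number loses one unit. The sketch below shows the effect and a possible guard (calling round() before the cast); the value '1.15' is a made-up example chosen because it triggers the problem, it is not from the data above.

```python
import pandas as pd

s = pd.Series(['1.15', '2.50'])  # illustrative values, not from the question

scaled = s.astype(float).mul(100)  # 1.15 * 100 is slightly below 115 in binary floating point

truncated = scaled.astype(int).tolist()        # plain cast drops the fractional part
rounded = scaled.round().astype(int).tolist()  # round to the nearest integer first

print(truncated)
print(rounded)
```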

EDIT:

If the problem is only trailing zeros or a trailing .0, use:

df['source_code'] = df['source_code'].str.replace(r'[\.]*[0]+$','', regex=True)
print (df)
  source_code source_description  key_id
0        A250              test1     NaN
1        C791              test1     NaN
2       716.9              test2     NaN
3       493.9              test3     NaN
4      143.21              test4     NaN
5      134.52              test5     NaN

id_file =  hash_file.set_index(['source_id','source_code'])['hash_id']

df['key_id'] = df.set_index(['source_code','source_description']).index.map(id_file)

print (df)
  source_code source_description  key_id
0        A250              test1     911
1        C791              test1     512
2       716.9              test2     713
3       493.9              test3     814
4      143.21              test4     616
5      134.52              test5     717
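A caveat with this pattern: because the dot is optional, it also strips trailing zeros from codes that have no decimal part at all. A quick check with Python's re module (the samples 'A250' and '100' are illustrative additions, the other two come from the data above):

```python
import re

pattern = r'[\.]*[0]+$'

samples = ['A250.00', '716.90', 'A250', '100']
stripped = [re.sub(pattern, '', s) for s in samples]

# The last two values lose significant zeros
print(stripped)
```

So this version is only safe when every code that ends in 0 also contains a decimal point.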

A better (I hope) regex that removes trailing zeros after the decimal point, including a trailing .0, if present:

import re

# group 2 captures the decimal digits to keep; the trailing zeros after it are dropped
rgx = re.compile(r'(?:(\.)|(\.\d*?[1-9]\d*?))0+(?=\b|[^0-9])')
df['source_code'] = df['source_code'].str.replace(rgx, r'\2', regex=True)
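A quick sanity check of this pattern with plain re.sub, assuming the replacement r'\2' so that the significant decimals captured by the second group are put back (in Python 3.5+ an unmatched group is substituted as an empty string). The samples 'A250' and '100' are illustrative additions to show that codes without a decimal point are left alone.

```python
import re

rgx = re.compile(r'(?:(\.)|(\.\d*?[1-9]\d*?))0+(?=\b|[^0-9])')

samples = ['A250.00', 'C791.0', '716.90', '143.21', 'A250', '100']

# r'\2' restores the kept decimals (empty when only '.00' style zeros matched)
cleaned = [rgx.sub(r'\2', s) for s in samples]
print(cleaned)
```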