How to eliminate missing data
I want to remove the missing values from the ret and dlret columns of a file named crsp_data. This is my code:
crsp_data_ret=crsp_data['ret'].dropna()
crsp_data_dlret=crsp_data['dlret'].dropna()
crsp_data['retadj']=(1+crsp_data['ret'])*(1+crsp_data['dlret'])-1
But it gives me the following error:
KeyError Traceback (most recent call last)
/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3062 try:
-> 3063 return self._engine.get_loc(key)
3064 except KeyError:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'dlret'
Can anyone help by pointing out what I am doing wrong?
Thanks for your help!
There are NaNs in ret:
crsp_data['retadj']=(1+crsp_data['ret'])*(1+crsp_data['dlret_x'])-1
crsp_data_retadj=crsp_data.dropna(subset=['retadj'])
crsp_data['retadj'].head(50)
0 NaN
1 -0.248538
2 0.428202
3 -0.086215
4 -0.125488
5 0.030425
6 -0.203367
7 -0.611781
8 -0.051796
9 -0.328013
10 0.065550
11 -0.413984
12 -0.343434
13 0.052632
14 -0.420102
15 -0.089628
16 -0.036559
17 NaN
18 NaN
19 0.039082
20 0.480844
21 0.025029
22 0.056209
23 -0.013069
24 -0.060239
25 NaN
26 0.033846
27 NaN
28 0.121294
29 0.185520
30 -0.035714
31 NaN
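The NaN rows are still visible here because dropna returns a new DataFrame and leaves crsp_data untouched; the filtered rows end up in crsp_data_retadj, while the head(50) call prints the original column. A minimal sketch with invented values (only the names crsp_data, crsp_data_retadj, ret, dlret_x and retadj come from the snippet above, the numbers are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for crsp_data; the values are invented for illustration.
crsp_data = pd.DataFrame({
    "ret": [np.nan, 0.10, -0.20, 0.05],
    "dlret_x": [0.00, 0.00, np.nan, -0.30],
})

crsp_data["retadj"] = (1 + crsp_data["ret"]) * (1 + crsp_data["dlret_x"]) - 1

# dropna returns a new DataFrame; crsp_data itself is not modified.
crsp_data_retadj = crsp_data.dropna(subset=["retadj"])

print(crsp_data["retadj"].isna().sum())         # NaN count in the original column
print(crsp_data_retadj["retadj"].isna().sum())  # NaN count in the filtered copy
```

Printing crsp_data_retadj['retadj'] instead of crsp_data['retadj'] would show only the rows that survived the filter.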
The problem is that
crsp_data_ret=crsp_data['ret']
crsp_data_dlret=crsp_data['dlret']
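return a Series, so no column can be selected from the result once Series.dropna has been applied (indexing it with ['dlret'] looks for a row label instead):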
crsp_data_ret=crsp_data['ret'].dropna()
crsp_data_dlret=crsp_data['dlret'].dropna()
One solution is to drop the ['ret'] and ['dlret'] selections and use the two Series directly:
crsp_data['retadj']=(1+crsp_data_ret)*(1+crsp_data_dlret)-1
Another solution is to use DataFrame.dropna, so that a DataFrame is returned:
crsp_data_ret=crsp_data.dropna(subset=['ret'])
crsp_data_dlret=crsp_data.dropna(subset=['dlret'])
crsp_data['retadj']=(1+crsp_data_ret['ret'])*(1+crsp_data_dlret['dlret'])-1
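If the goal is to keep only the rows where both columns are present, a single dropna call with both names in subset may be simpler. This is a sketch on invented data; clean is a hypothetical name, and .copy() is used so that the new column is added to an independent DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical data; only the column names follow the question.
crsp_data = pd.DataFrame({
    "ret": [0.10, np.nan, -0.20, 0.05],
    "dlret": [0.00, 0.02, np.nan, -0.30],
})

# Drop every row where ret or dlret is missing, then compute on what is left.
clean = crsp_data.dropna(subset=["ret", "dlret"]).copy()
clean["retadj"] = (1 + clean["ret"]) * (1 + clean["dlret"]) - 1
print(clean)
```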
EDIT:
If the NaNs should be ignored, one possible solution is to use add with the parameter fill_value=0, which gives:
import numpy as np
import pandas as pd

crsp_data = pd.DataFrame({'ret':[1,2,'C', 5, np.nan],
'dlret':[10, np.nan, 7, 1, np.nan]})
crsp_data['ret'] = pd.to_numeric(crsp_data['ret'], errors='coerce')
crsp_data['retadj1']=(1+crsp_data['ret'])*(1+crsp_data['dlret'])-1
crsp_data['retadj2']= crsp_data['ret'].add(1, fill_value=0).mul(crsp_data['dlret'].add(1, fill_value=0)).sub(1)
print (crsp_data)
   ret  dlret  retadj1  retadj2
0  1.0   10.0     21.0     21.0
1  2.0    NaN      NaN      2.0
2  NaN    7.0      NaN      7.0
3  5.0    1.0     11.0     11.0
4  NaN    NaN      NaN      0.0
Details:
print (crsp_data['ret'].add(1, fill_value=0))
0 2.0
1 3.0
2 1.0
3 6.0
4 1.0
Name: ret, dtype: float64
print (crsp_data['dlret'].add(1, fill_value=0).sub(1))
0 10.0
1 0.0
2 7.0
3 1.0
4 0.0
Name: dlret, dtype: float64
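The same idea can be written with fillna(0), which some readers may find easier to follow. Note that a row where both values are missing ends up as 0.0 rather than NaN; the sketch below masks those rows back to NaN, which is an optional choice and not part of the answer above. The names via_add, via_fillna and both_missing are hypothetical.

```python
import numpy as np
import pandas as pd

crsp_data = pd.DataFrame({
    "ret": [1, 2, np.nan, 5, np.nan],
    "dlret": [10, np.nan, 7, 1, np.nan],
})

# Form used above: a missing value is treated as 0 before adding 1.
via_add = (
    crsp_data["ret"].add(1, fill_value=0)
    .mul(crsp_data["dlret"].add(1, fill_value=0))
    .sub(1)
)

# Same arithmetic spelled out with fillna(0).
via_fillna = (1 + crsp_data["ret"].fillna(0)) * (1 + crsp_data["dlret"].fillna(0)) - 1
print(via_add.tolist() == via_fillna.tolist())

# Optional: keep NaN where neither column has a value.
both_missing = crsp_data["ret"].isna() & crsp_data["dlret"].isna()
crsp_data["retadj"] = via_fillna.mask(both_missing)
print(crsp_data)
```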