Pandas reading NULL as a NaN float instead of str
Given the file:
$ cat test.csv
a,b,c,NULL,d
e,f,g,h,i
j,k,l,m,n
Column 3 should be treated as str. When I run a string function on that column, it turns out pandas has read the NULL string as a NaN float:
>>> import pandas as pd
>>> df = pd.read_csv('test.csv', names=[0,1,2,3,4], dtype={0:str, 1:str, 2:str, 3:str, 4:str})
>>> df[3].apply(str.strip)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/site-packages/pandas/core/series.py", line 2355, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)
  File "pandas/_libs/src/inference.pyx", line 1569, in pandas._libs.lib.map_infer (pandas/_libs/lib.c:66440)
TypeError: descriptor 'strip' requires a 'str' object but received a 'float'
Verifying:
>>> for i in df[3]:
...     print(type(i), i)
...
<class 'float'> nan
<class 'str'> h
<class 'str'> m
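I specified the dtype when reading the file, but somehow it was overridden.
How can I force a specific column to keep its type?
Is there a way to automatically find these stray NaN floats and turn them back into 'NULL' strings?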
For me, astype works, although the missing value then becomes the string 'nan' rather than 'NULL':
df[3] = df[3].astype(str)
for i in df[3]:
    print(type(i), i)
<class 'str'> nan
<class 'str'> h
<class 'str'> m
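If the goal is to get the literal 'NULL' text back rather than 'nan', one possible approach is to fill the missing values instead of casting them. The following is only a sketch: it inlines the sample data, and it assumes that every missing value in the column was originally the text NULL, which may not hold for a real file (an empty field is also read as missing by default).

```python
from io import StringIO

import pandas as pd

# Same sample data as the question, inlined so the sketch is self-contained.
csv_text = "a,b,c,NULL,d\ne,f,g,h,i\nj,k,l,m,n\n"
df = pd.read_csv(StringIO(csv_text), names=[0, 1, 2, 3, 4], dtype=str)

# Assumption: every missing value in column 3 was the literal text NULL.
df[3] = df[3].fillna("NULL")

restored = df[3].tolist()
print(restored)
```

This repairs the column after the fact; the read_csv options below avoid the conversion in the first place.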
Another solution is to use keep_default_na=False in read_csv, because NULL is one of the strings pandas treats as missing by default:
import pandas as pd
from io import StringIO  # pandas.compat.StringIO was removed in newer pandas versions
temp = u"""a,b,c,NULL,d
e,f,g,h,i
j,k,l,m,n"""
# after testing, replace StringIO(temp) with 'filename.csv'
df = pd.read_csv(StringIO(temp), names=[0,1,2,3,4], keep_default_na=False)
print(df)
   0  1  2     3  4
0  a  b  c  NULL  d
1  e  f  g     h  i
2  j  k  l     m  n
for i in df[3]:
    print(type(i), i)
<class 'str'> NULL
<class 'str'> h
<class 'str'> m
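With the default missing-value markers switched off, the string call from the question should no longer fail. A small sketch, assuming the same column layout; the padding spaces around NULL are added here only so that strip has something visible to do:

```python
from io import StringIO

import pandas as pd

# Hypothetical variant of the sample data with padding around NULL.
csv_text = "a,b,c, NULL ,d\ne,f,g,h,i\nj,k,l,m,n\n"
df = pd.read_csv(
    StringIO(csv_text),
    names=[0, 1, 2, 3, 4],
    dtype=str,
    keep_default_na=False,  # NULL is kept as ordinary text
)

# Every value is a str now, so the call from the question works.
stripped = df[3].apply(str.strip).tolist()
print(stripped)
```

Keep in mind that keep_default_na=False applies to the whole file, so other default markers such as NA or NaN are also left as plain text.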
If you still need NaN to be parsed in numeric columns, you can add the na_values parameter, but the marker must be something other than NULL, for example NA:
import pandas as pd
from io import StringIO  # pandas.compat.StringIO was removed in newer pandas versions
temp = u"""a,b,c,NULL,1
e,f,g,h,2
j,k,l,m,NA"""
# after testing, replace StringIO(temp) with 'filename.csv'
df = pd.read_csv(StringIO(temp), names=[0,1,2,3,4], keep_default_na=False, na_values=['NA'])
print(df)
   0  1  2     3    4
0  a  b  c  NULL  1.0
1  e  f  g     h  2.0
2  j  k  l     m  NaN
for i in df[3]:
    print(type(i), i)
<class 'str'> NULL
<class 'str'> h
<class 'str'> m
for i in df[4]:
    print(type(i), i)
<class 'numpy.float64'> 1.0
<class 'numpy.float64'> 2.0
<class 'numpy.float64'> nan
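A variation on the same idea: na_values also accepts a dict keyed by column, so the NA marker can be limited to the numeric column while the text columns keep every value verbatim. This is a sketch under the assumption that only column 4 should have missing-value detection; the column choice is illustrative.

```python
from io import StringIO

import pandas as pd

csv_text = "a,b,c,NULL,1\ne,f,g,h,2\nj,k,l,m,NA\n"
df = pd.read_csv(
    StringIO(csv_text),
    names=[0, 1, 2, 3, 4],
    keep_default_na=False,   # drop the built-in markers, including NULL
    na_values={4: ["NA"]},   # treat NA as missing in column 4 only
)

text_values = df[3].tolist()         # NULL survives as a string
missing_mask = df[4].isna().tolist() # only the NA row is missing
print(text_values, missing_mask)
```

Scoping the marker per column means a literal NA in one of the text columns would also be preserved as a string.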