Dealing with missing data in Pandas read_csv
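I haven't found a satisfactory solution to the missing-data problem when importing CSV data into a pandas DataFrame.
I have datasets where I don't know in advance what the columns or data types are. I would like pandas to do a better job of inferring how to read the data.
I haven't found any combination of na_values=... that really helps.
Consider the following CSV files: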
no_holes.csv
letter,number
a,1
b,2
c,3
d,4
with_holes.csv
letter,number
a,1
,2
b,
,4
empty_column.csv
letters,numbers
,1
,2
,3
,4
with_NA.csv
letter,number
a,1
b,NA
NA,3
d,4
Here is what happens when I read them into DataFrames (code below):
**no holes**
letter number
0 a 1
1 b 2
2 c 3
3 d 4
letter object
number int64
dtype: object
**with holes**
letter number
0 a 1
1 NaN 2
2 b
3 NaN 4
letter object
number object
dtype: object
**empty_column**
letters numbers
0 NaN 1
1 NaN 2
2 NaN 3
3 NaN 4
letters float64
numbers int64
dtype: object
**with NA**
letter number
0 a 1.0
1 b NaN
2 NaN 3.0
3 d 4.0
letter object
number float64
dtype: object
Is there a way to tell pandas to assume that empty values are of object type? I've tried na_values=[""].
demo_holes.py
import pandas as pd
with_holes = pd.read_csv("with_holes.csv")
no_holes = pd.read_csv("no_holes.csv")
empty_column = pd.read_csv("empty_column.csv")
with_NA = pd.read_csv("with_NA.csv")
print("\n**no holes**")
print(no_holes.head())
print(no_holes.dtypes)
print("\n**with holes**")
print(with_holes.head())
print(with_holes.dtypes)
print("\n**empty_column**")
print(empty_column.head())
print(empty_column.dtypes)
print("\n**with NA**")
print(with_NA.head())
print(with_NA.dtypes)
You want to use the parameter skipinitialspace=True, which skips spaces after the delimiter.
Setup:
no_holes = """letter,number
a,1
b,2
c,3
d,4"""
with_holes = """letter,number
a,1
,2
b,
,4"""
empty_column = """letters,numbers
,1
,2
,3
,4"""
with_NA = """letter,number
a,1
b,NA
NA,3
d,4"""
from io import StringIO
import pandas as pd
d1 = pd.read_csv(StringIO(no_holes), skipinitialspace=True)
d2 = pd.read_csv(StringIO(with_holes), skipinitialspace=True)
d3 = pd.read_csv(StringIO(empty_column), skipinitialspace=True)
d4 = pd.read_csv(StringIO(with_NA), skipinitialspace=True)
pd.concat([d1, d2, d3, d4], axis=1,
keys=['no_holes', 'with_holes',
'empty_column', 'with_NA'])
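To see why this flag matters, here is a minimal sketch. It assumes the blank number cell in with_holes.csv actually contains a single space after the comma; that whitespace is not visible in the listing, so treat it as an assumption that would explain the object dtype reported in the question.

```python
from io import StringIO
import pandas as pd

# Assumption: the "empty" number field on the third data row holds one space.
raw = "letter,number\na,1\n,2\nb, \n,4\n"

# Default parsing keeps the lone space as a string, so the column is not numeric.
plain = pd.read_csv(StringIO(raw))

# skipinitialspace=True drops spaces after the delimiter, leaving a truly
# empty field that pandas treats as missing.
stripped = pd.read_csv(StringIO(raw), skipinitialspace=True)

print(plain.dtypes)
print(stripped.dtypes)
```

With the flag set, the number column should come back as a float column containing one NaN rather than a column of strings.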
If you want those NaN values to be '', use fillna:
d1 = pd.read_csv(StringIO(no_holes), skipinitialspace=True).fillna('')
d2 = pd.read_csv(StringIO(with_holes), skipinitialspace=True).fillna('')
d3 = pd.read_csv(StringIO(empty_column), skipinitialspace=True).fillna('')
d4 = pd.read_csv(StringIO(with_NA), skipinitialspace=True).fillna('')
pd.concat([d1, d2, d3, d4], axis=1,
keys=['no_holes', 'with_holes',
'empty_column', 'with_NA'])
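One thing to keep in mind is that filling a numeric column with '' typically leaves it as an object column mixing numbers and strings. If the goal is instead to treat every field as text and keep blanks as empty strings, one alternative (not part of the answer above, just a sketch) is to combine the read_csv parameters dtype=str and keep_default_na=False:

```python
from io import StringIO
import pandas as pd

with_holes = "letter,number\na,1\n,2\nb,\n,4\n"

# dtype=str reads every column as text; keep_default_na=False stops pandas
# from turning empty fields (and strings such as "NA") into NaN.
df = pd.read_csv(StringIO(with_holes), dtype=str, keep_default_na=False)

print(df)
print(df.dtypes)
```

This gives up numeric inference entirely, so any columns that turn out to be numeric would have to be converted afterwards, for example with pd.to_numeric.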