Prevent pandas read_csv from interpreting NA as NaN while retaining NaN for empty values
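My question is related to this one. I have a file named 'test.csv' in which 'NA' appears as a value of region. I want it read as the string 'NA', not as NaN. However, other columns in test.csv have missing values, and I want those to stay NaN. How can I do this?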
# test.csv looks like this:
Here is what I tried:
import pandas as pd
# This reads NA as NaN
df = pd.read_csv('test.csv')
df
region date expenses
0 NaN 1/1/2019 53
1 EU 1/2/2019 NaN
# This reads NA as NA, but doesn't read missing expense as NaN
df = pd.read_csv('test.csv', keep_default_na=False, na_values='_')
df
region date expenses
0 NA 1/1/2019 53
1 EU 1/2/2019
# What I want:
region date expenses
0 NA 1/1/2019 53
1 EU 1/2/2019 NaN
The problem with adding the argument keep_default_na=False is that the second value of expenses is no longer read in as NaN. So if I then try pd.isnull(df['expenses'][1]), it returns False.
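To make the behaviour described in the question reproducible, here is a minimal sketch. The CSV text is an assumption reconstructed from the outputs shown above (the original file contents are not included), and io.StringIO stands in for the file on disk.

```python
import io
import pandas as pd

# Assumed file contents, reconstructed from the outputs in the question.
CSV_TEXT = "region,date,expenses\nNA,1/1/2019,53\nEU,1/2/2019,\n"

# Default parsing: both the literal "NA" and the empty field become NaN.
default_df = pd.read_csv(io.StringIO(CSV_TEXT))

# keep_default_na=False with na_values='_': only "_" counts as missing,
# so "NA" survives but the empty expenses field stays an empty string.
raw_df = pd.read_csv(io.StringIO(CSV_TEXT), keep_default_na=False, na_values="_")

print(default_df["region"].isna().tolist())   # [True, False]
print(raw_df["region"][0])                    # NA
print(pd.isnull(raw_df["expenses"][1]))       # False
```

The empty field is not in the list of missing-value markers in the second call, which is why pd.isnull reports False for it.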
When you specify keep_default_na=False, none of the default values are treated as NaN, so you have to list the ones you still want yourself:
Use keep_default_na=False, na_values=['', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', 'N/A', 'NULL', 'NaN', 'n/a', 'nan', 'null']
For me, this works:
df = pd.read_csv('file.csv', keep_default_na=False, na_values=[''])
This gives:
region date expenses
0 NA 1/1/2019 53.0
1 EU 1/2/2019 NaN
But I would rather play it safe, since other columns might contain other NaN markers, and do:
df = pd.read_csv('file.csv')
df['region'] = df['region'].fillna('NA')
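The two approaches in this answer differ when region itself has a genuinely empty cell. The sketch below compares them on hypothetical data (the third row with an empty region is an assumption added for illustration, not part of the original file).

```python
import io
import pandas as pd

# Hypothetical data: row 0 has the literal region "NA", row 2 has an empty region.
CSV_TEXT = "region,date,expenses\nNA,1/1/2019,53\nEU,1/2/2019,\n,1/3/2019,7\n"

# Approach 1: only the empty string counts as missing.
explicit_df = pd.read_csv(io.StringIO(CSV_TEXT), keep_default_na=False, na_values=[""])

# Approach 2: default parsing, then fill the region column afterwards.
filled_df = pd.read_csv(io.StringIO(CSV_TEXT))
filled_df["region"] = filled_df["region"].fillna("NA")

print(explicit_df["region"].isna().tolist())  # [False, False, True]
print(filled_df["region"].tolist())           # ['NA', 'EU', 'NA']
```

With fillna, the empty region can no longer be told apart from the literal 'NA'. With na_values=[''], markers such as 'N/A' or 'null' in other columns are kept as plain strings. Which trade-off is acceptable depends on the data.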
This approach works for me:
import pandas as pd
df = pd.read_csv('Test.csv')
co1 col2 col3 col4
a b c d e
NaN NaN NaN NaN NaN
2 3 4 5 NaN
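I copied the values that are interpreted as NaN by default into a list, then commented out NA, which I want to be read as an ordinary value. With this approach, every other value in the list is still treated as NaN.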
# You can also create your own list of values that should be treated as NaN,
# then pass it to na_values and set keep_default_na=False.
na_values = ["",
"#N/A",
"#N/A N/A",
"#NA",
"-1.#IND",
"-1.#QNAN",
"-NaN",
"-nan",
"1.#IND",
"1.#QNAN",
"<NA>",
"N/A",
# "NA",
"NULL",
"NaN",
"n/a",
"nan",
"null"]
df1 = pd.read_csv('Test.csv', na_values=na_values, keep_default_na=False)
co1 col2 col3 col4
a b c d e
NaN NA NaN NA NaN
2 3 4 5 NaN
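The same idea can be wrapped in a small helper so the list is filtered rather than edited by hand. This is only a sketch: read_csv_keeping, LISTED_NA_MARKERS and the sample data are hypothetical names and values introduced here, and the marker list is the one from this answer plus "NA". The defaults pandas actually uses can differ between versions, so check the read_csv documentation for yours.

```python
import io
import pandas as pd

# Markers copied from the list above, with "NA" restored so it can be filtered out.
LISTED_NA_MARKERS = [
    "", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan",
    "1.#IND", "1.#QNAN", "<NA>", "N/A", "NA", "NULL", "NaN", "n/a", "nan", "null",
]

def read_csv_keeping(source, keep_literal=("NA",), **kwargs):
    """Hypothetical helper: read a CSV, treating strings in keep_literal as data."""
    markers = [m for m in LISTED_NA_MARKERS if m not in keep_literal]
    return pd.read_csv(source, na_values=markers, keep_default_na=False, **kwargs)

# Hypothetical sample: "NA" should survive, the other markers should become NaN.
CSV_TEXT = "code,note\nNA,N/A\nEU,null\nUS,\nNULL,ok\n"
sample_df = read_csv_keeping(io.StringIO(CSV_TEXT))

print(sample_df["code"].isna().tolist())  # [False, False, False, True]
print(sample_df["note"].isna().tolist())  # [True, True, True, False]
```

Passing a different keep_literal tuple, for example ("NA", "null"), keeps more strings as data without touching the list itself.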