DataFrame.equals() failing - String conversion to NaN with to_csv()
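I am writing a DataFrame to a CSV file on disk and then reading the CSV file back in to check that it matches the in-memory version of the DataFrame. When I read it back, I force the column types to match those of the original DataFrame by using dtypes and astype.
This appears to work, but when I call equals() on the two DataFrames, they are reported as different. When I check each individual field, I see these differences: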
MISMATCH AT INDEX: 21417
Column: REGISTRATION_NUMBER
Source Value: N/A
Source Type: <class 'str'>
Target Value: nan
Target Type: <class 'float'>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
MISMATCH AT INDEX: 21709
Column: REGISTRATION_NUMBER
Source Value: N/A
Source Type: <class 'str'>
Target Value: nan
Target Type: <class 'float'>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I am using cx_Oracle to read the original data from Oracle:
import cx_Oracle
import pandas as pd


def get_data(sql):
    # Executes the SQL and returns the resulting recordset together with a list of the column names.
    # Results are returned as a tuple (recordset, column names)
    print("Running SQL:\n\n" + sql)
    dsn_tns = cx_Oracle.makedsn('myserver.my_company.net', '1521', service_name='myservice')
    con = cx_Oracle.connect(user='me', password='password', dsn=dsn_tns)
    cur = con.cursor()
    cur.execute(sql)
    rs = cur.fetchall()
    col_names = []
    # the first element of each entry in cur.description is the column name
    for field in cur.description:
        col_names.append(field[0])
    return (rs, col_names)


def get_datframe_from_sql_file(filename):
    with open(filename, 'r') as sql_file:
        sql = sql_file.read()
    rs, col_names = get_data(sql)
    df = pd.DataFrame(rs, columns=col_names)
    return df
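This is the original DataFrame that I write to disk and then read back for comparison.
The question is: if this is a string value of N/A in the original DataFrame, why does it end up as NaN once I have written it to CSV?
I call this function, write the Oracle DataFrame to disk, and read it back in for comparison, using dtypes to force the data types of the DataFrame that was read back.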
from data import get_data, get_datframe_from_sql_file
from files import write_data_frames
from datetime import date

if __name__ == "__main__":
    maa_file = "marketing_applications_all.sql"
    maa_file_types = "marketing_applicastions_types.csv"
    df_maa = get_datframe_from_sql_file(maa_file)
    print(df_maa.head(20))
    maa_file = ()
    todays_date = date.today()
    file_info = ('maa.csv', todays_date, df_maa, maa_file_types)
    files = [file_info]
    write_data_frames(files)
import csv

import numpy as np
import pandas as pd


def write_data_frames(data_to_write):
    # the data passed is a list of tuples
    # each tuple is of the format (file_name, file_date, DataFrame, datatypes_file_name)
    for file_name, file_date, df, datatypes_file_name in data_to_write:
        file_name_final = str(file_date.year).zfill(4) + '_' + str(file_date.month).zfill(2) + '_' + str(file_date.day).zfill(2) + '__' + file_name
        df.to_csv(file_name_final, index=True, quotechar='"', doublequote=True, quoting=csv.QUOTE_NONNUMERIC, header=True)
        df.dtypes.to_csv(datatypes_file_name, header=False)
        validate_data_frame_vs_file(df, file_name_final, datatypes_file_name)


def validate_data_frame_vs_file(df, file_name, datatypes_file_name):
    print("Validating file " + file_name + " against DataFrame")
    print("Comparing in memory:")
    print(df.dtypes)
    print(str(len(df)) + ' rows')
    df_file = pd.read_csv(file_name, index_col=0, header=0, parse_dates=True)
    df_file_types = pd.read_csv(datatypes_file_name, names=["COLUMN", "DATA_TYPE"], header=None)
    # change date columns
    for index, row in df_file_types.iterrows():
        col_name = row["COLUMN"]
        dtype = row["DATA_TYPE"]
        if "DATE" in dtype.upper():
            df_file[col_name] = df_file[col_name].astype(dtype)
    print("Comparing file:")
    print("In Memory....")
    print(df.dtypes)
    print(str(len(df)) + ' rows')
    print("From File....")
    print(df_file.dtypes)
    print(str(len(df_file)) + ' rows')
    print("Are DataFrames equal?")
    frames_equal = df.equals(df_file)
    counter = 0
    print("Source Data Frame:")
    print(df.head(10))
    # pd.np is no longer available in current pandas, so use numpy directly
    df.fillna(value=np.nan, inplace=True)
    df_file.fillna(value=np.nan, inplace=True)
    if not frames_equal:
        source_columns = df.columns
        print("Source Columns:")
        print(source_columns)
        for source_index, source_row in df.iterrows():
            counter = counter + 1
            for source_col in source_columns:
                source_value = source_row[source_col]
                target_value = df_file.loc[source_index, source_col]
                if source_value == source_value:  # deals with NaN (NaN != NaN)
                    if source_value != target_value:
                        print("~" * 50)
                        print("MISMATCH AT INDEX:", source_index)
                        print("Column:", source_col)
                        print("Source Value: ", source_value)
                        print("Source Type: ", type(source_value))
                        print("Target Value: ", target_value)
                        print("Target Type: ", type(target_value))
                        print("~" * 50)
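Update:
In Oracle it is a string "N/A" value, so I am not sure why writing it to CSV and reading it back presents it as NaN.
I read the DataFrame from Oracle and then write it to CSV. The file has the "N/A" value in it. I read that CSV back in through pandas, and the field now has a NaN in that record. It is the read-back of the CSV that causes the problem; the "N/A" value is present in the original write-out of the DataFrame.
I quickly checked the CSV file, and the N/A value is not the problem there: it is present in the CSV file. Is it being read in as NaN in an object-typed column?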
I figured it out:
df = pd.read_csv("2020_07_24__maa.csv", header=0, keep_default_na=False)
By default, pandas interprets N/A as a NaN value (!)
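To make the effect easy to reproduce without Oracle, here is a minimal sketch of the same round trip. The one-column frame and its values are made up for illustration, and an in-memory buffer stands in for the CSV file on disk.

```python
import io

import pandas as pd

# Hypothetical frame standing in for the Oracle result set.
df = pd.DataFrame({"REGISTRATION_NUMBER": ["A123", "N/A", "B456"]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=True)
csv_text = buffer.getvalue()

# Default read: the string "N/A" is parsed as a missing value.
df_default = pd.read_csv(io.StringIO(csv_text), index_col=0)

# With keep_default_na=False the string survives the round trip.
df_strict = pd.read_csv(io.StringIO(csv_text), index_col=0, keep_default_na=False)

print(df.equals(df_default))
print(df.equals(df_strict))
```

Be aware that keep_default_na=False on its own also stops empty fields from being read as NaN, so a column with genuinely missing values will come back with empty strings instead.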
There is a small mismatch between pd.read_csv() and DataFrame.to_csv() that is probably the cause of this behavior.
When writing to CSV, missing values are written as '' by default. Quoting the official documentation:
na_rep : str, default ''
Missing data representation.
However, when reading from CSV, the NA values are defined as:
na_values : scalar, str, list-like, or dict, optional
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null'.
So read_csv() accepts many more NA markers than to_csv() writes, and converts all of them to NaN.
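As a quick check of that list, the following sketch feeds a few of the documented markers, plus two ordinary strings, through read_csv() with default settings and reports which ones come back as NaN. The column name VALUE and the sample tokens are chosen only for this illustration.

```python
import io

import pandas as pd

# A few tokens from the default NA list plus two ordinary strings.
tokens = ["N/A", "NA", "NULL", "n/a", "null", "nan", "none of these", "ABC"]
csv_text = "VALUE\n" + "\n".join(tokens) + "\n"

parsed = pd.read_csv(io.StringIO(csv_text))["VALUE"]

# Map each original token to whether read_csv turned it into NaN.
converted = dict(zip(tokens, parsed.isna()))
for token, is_nan in converted.items():
    print(f"{token!r:>16} -> {'NaN' if is_nan else 'kept as string'}")
```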
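You can resolve the mismatch by defining the accepted NA value explicitly on both sides, using na_rep when writing and na_values together with keep_default_na=False when reading:
df.to_csv(file_name_final, index=True, quotechar='"', doublequote=True,
          quoting=csv.QUOTE_NONNUMERIC, header=True, na_rep='NA')
df_file = pd.read_csv(file_name, index_col=0, header=0, parse_dates=True,
                      na_values='NA', keep_default_na=False)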
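Note that this might break other values in your DataFrame: any genuine 'NA' string in the data will now be read back as NaN.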
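That caveat can be illustrated with a small sketch. It assumes made-up data in which 'NA' is a legitimate value, and it compares the 'NA' marker with a marker that does not occur in the data. The helper round_trip and the marker '__MISSING__' are hypothetical names for this example, not part of pandas.

```python
import io

import numpy as np
import pandas as pd

# Made-up data: "NA" is a real value here (e.g. a country code), np.nan is truly missing.
df = pd.DataFrame({"CODE": ["NA", np.nan, "N/A", "DE"]})


def round_trip(frame, marker):
    """Write frame to CSV text and read it back, using marker for missing values."""
    buffer = io.StringIO()
    frame.to_csv(buffer, index=True, na_rep=marker)
    buffer.seek(0)
    return pd.read_csv(buffer, index_col=0, na_values=[marker], keep_default_na=False)


# The marker "NA" collides with the genuine "NA" string in row 0.
collided = round_trip(df, "NA")

# A marker that does not occur in the data keeps every string intact.
SENTINEL = "__MISSING__"  # hypothetical; pick any token absent from your data
clean = round_trip(df, SENTINEL)

print(df.equals(collided))
print(df.equals(clean))
```

Whichever marker you choose, it is worth checking that it does not already appear in the column before relying on it.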