How to dynamically add missing columns while reading batch CSV files
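I have 12 CSV files that I want to read into a single output dataframe. The columns I want in the final output dataframe are spread across multiple files, as shown below.
Column list in files 1-8
person_ID, Test_CODE, REGISTRATION_DATE, subject_CD, subject_DESCRIPTION, subject_TYPE
Column list in file 9
person_ID, Test_CODE, REGISTRATION_DATE, subject_Code, subject_DESCRIPTION, subject_Indicator
Column list in files 10-12
person_ID, Test_CODE, START_DATE, END_DATE, subject_Code, subject_DESCRIPTION, subject_Indicator
From my understanding of the domain, I know that START_DATE and REGISTRATION_DATE mean the same thing.
Similarly, subject_CD and subject_Code mean the same thing.
So, with some help, I tried the following approach to rename the columns.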
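import pandas as pd

# files: list of paths to the 12 CSV files
dfs = []
for f in files:
    # the inputs are CSV, so read_csv is needed (read_excel does not accept sep or low_memory)
    df = pd.read_csv(f, sep=",", low_memory=False)
    print(df.columns)
    df1 = df[df.columns.intersection(['person_ID','Test_CODE','REGISTRATION_DATE','subject_CD','subject_DESCRIPTION'])].rename(columns={'subject_CD':'subject_Code','REGISTRATION_DATE':'START_DATE'})
    dfs.append(df1)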
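However, I am not sure how I can add a column on the fly, because END_DATE is missing from files 1-9. For those files I just want an END_DATE column with no data. Only when every dataframe has an END_DATE column can I append all the input dataframes and get the final output dataframe.
Alternatively, is it possible to append the dataframes based on their common columns and then add an END_DATE column to the final output dataframe (after appending)?
I would like my final dataframe to have the columns shown below.
Column list of the final output dataframe
person_ID, Test_CODE, START_DATE, END_DATE, subject_Code, subject_DESCRIPTION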
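I think you can use rename first and then DataFrame.reindex, which returns only the columns passed in the list; any column in the list that does not exist in the DataFrame is added and filled with missing values: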
d = {'subject_CD':'subject_Code','REGISTRATION_DATE':'START_DATE'}
cols = ['person_ID','Test_CODE','START_DATE','END_DATE',
        'subject_Code','subject_DESCRIPTION']
dfs = []
for f in files:
    # read_csv, not read_excel: the inputs are CSV files
    df = pd.read_csv(f, sep=",", low_memory=False)
    print(df.columns)
    # unify the names first, then force the final column layout
    df1 = df.rename(columns=d).reindex(columns=cols)
    dfs.append(df1)
List comprehension alternative:
dfs = [pd.read_csv(f, sep=",", low_memory=False).rename(columns=d).reindex(columns=cols)
       for f in files]
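To try the read step without touching the real files, here is a minimal sketch that feeds three in-memory CSV strings through the same rename and reindex chain. The strings csv_a, csv_b and csv_c and the helper load_normalized are hypothetical names made up for this illustration; they stand in for one file from each of the three layouts in the question.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for one file of each layout.
csv_a = (
    "person_ID,Test_CODE,REGISTRATION_DATE,subject_CD,subject_DESCRIPTION,subject_TYPE\n"
    "id1,aa,2015-01-01,sub1,desc,type1\n"
)
csv_b = (
    "person_ID,Test_CODE,REGISTRATION_DATE,subject_Code,subject_DESCRIPTION,subject_Indicator\n"
    "id2,bb,2017-01-01,sub1,desc2,type2\n"
)
csv_c = (
    "person_ID,Test_CODE,START_DATE,END_DATE,subject_Code,subject_DESCRIPTION,subject_Indicator\n"
    "id3,cc,2017-01-01,2017-08-06,sub3,desc3,type3\n"
)

d = {'subject_CD': 'subject_Code', 'REGISTRATION_DATE': 'START_DATE'}
cols = ['person_ID', 'Test_CODE', 'START_DATE', 'END_DATE',
        'subject_Code', 'subject_DESCRIPTION']


def load_normalized(source):
    """Read one CSV, unify the column names, and force the final layout.

    Columns not listed in cols are dropped; listed columns that the file
    lacks (END_DATE here) are created and filled with NaN.
    """
    return pd.read_csv(source).rename(columns=d).reindex(columns=cols)


frames = [load_normalized(io.StringIO(text)) for text in (csv_a, csv_b, csv_c)]
result = pd.concat(frames, ignore_index=True)
print(result)
```

With real files, the same helper should work if you pass it each path from files instead of a StringIO object.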
Test data:
print (df1)
person_ID Test_CODE REGISTRATION_DATE subject_CD subject_DESCRIPTION \
0 id1 aa 2015-01-01 sub1 desc
subject_TYPE
0 type1
print (df2)
person_ID Test_CODE REGISTRATION_DATE subject_Code subject_DESCRIPTION \
0 id2 bb 2017-01-01 sub1 desc2
subject_Indica
0 type2
print (df3)
person_ID Test_CODE START_DATE END_DATE subject_Code \
0 id3 cc 2017-01-01 2017-08-06 sub3
subject_DESCRIPTION subject_Indicator
0 desc3 type3
d = {'subject_CD':'subject_Code','REGISTRATION_DATE':'START_DATE'}
cols = ['person_ID','Test_CODE','START_DATE','END_DATE',
        'subject_Code','subject_DESCRIPTION']
dfs = []
for df in [df1, df2, df3]:
    # df = pd.read_csv(f, sep=",", low_memory=False)
    # print(df.columns)
    df1 = df.rename(columns=d).reindex(columns=cols)
    dfs.append(df1)
df = pd.concat(dfs, ignore_index=True)
print (df)
person_ID Test_CODE START_DATE END_DATE subject_Code subject_DESCRIPTION
0 id1 aa 2015-01-01 NaN sub1 desc
1 id2 bb 2017-01-01 NaN sub1 desc2
2 id3 cc 2017-01-01 2017-08-06 sub3 desc3
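On the second part of the question (appending first and sorting out END_DATE afterwards): once the columns are renamed, pd.concat aligns frames by column name and fills whatever a frame lacks with NaN, so a single reindex at the end may be enough. The sketch below rebuilds the three test frames by hand; the names renamed, combined and final are made up for this illustration, and the date parsing at the end is an optional extra, not something the question asked for.

```python
import pandas as pd

# The three test frames from above, rebuilt by hand.
df1 = pd.DataFrame({'person_ID': ['id1'], 'Test_CODE': ['aa'],
                    'REGISTRATION_DATE': ['2015-01-01'], 'subject_CD': ['sub1'],
                    'subject_DESCRIPTION': ['desc'], 'subject_TYPE': ['type1']})
df2 = pd.DataFrame({'person_ID': ['id2'], 'Test_CODE': ['bb'],
                    'REGISTRATION_DATE': ['2017-01-01'], 'subject_Code': ['sub1'],
                    'subject_DESCRIPTION': ['desc2'], 'subject_Indicator': ['type2']})
df3 = pd.DataFrame({'person_ID': ['id3'], 'Test_CODE': ['cc'],
                    'START_DATE': ['2017-01-01'], 'END_DATE': ['2017-08-06'],
                    'subject_Code': ['sub3'], 'subject_DESCRIPTION': ['desc3'],
                    'subject_Indicator': ['type3']})

d = {'subject_CD': 'subject_Code', 'REGISTRATION_DATE': 'START_DATE'}
cols = ['person_ID', 'Test_CODE', 'START_DATE', 'END_DATE',
        'subject_Code', 'subject_DESCRIPTION']

# Rename first so equivalent columns share one name across all frames.
renamed = [df.rename(columns=d) for df in (df1, df2, df3)]

# concat uses the union of the column names (outer join by default);
# rows from frames without END_DATE get NaN in that column.
combined = pd.concat(renamed, ignore_index=True, sort=False)

# A single reindex drops the extra columns and fixes the column order.
final = combined.reindex(columns=cols)

# Optional: parse the date columns so missing END_DATE values become NaT.
date_cols = ['START_DATE', 'END_DATE']
final[date_cols] = final[date_cols].apply(pd.to_datetime)
print(final)
```

The final reindex also covers the case where none of the files contains END_DATE: the column is still created, just entirely empty.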