How to compare all entries in columns of 2 CSV files (csv1 and csv2) in Python/pandas?
I have 2 CSV files:
CSV1:
"Hypervisor","IP","ABCD","Operating System","Domain","Memory","No. CPU","Availability (%)","Last Collection Time","lol"
"lglac125.lss.com","10.247.52.125","VMware ESXi 5.5.0 build-9919047","lss.com","524278.03125","4.0","100.0","1.558599031E9"
"lglac126.lss.com","10.247.52.126","VMware ESXi 5.5.0 build-9919047","lss.com","524278.03125","4.0","100.0","1.558599931E9"
"lglac127.lss.com","10.247.52.127","VMware ESXi 5.5.0 build-9919047","lss.com","524278.03125","4.0","0.0","1.558599031E9"
"lglac128.lss.com","10.247.52.128","VMware ESXi 5.5.0 build-9919047","lss.com","524278.03125","4.0","100.0","1.558599931E9"
"lglac129.lss.com","10.247.52.129","VMware ESXi 5.5.0 build-9919047","lss.com","524278.03125","4.0","100.0","1.558599931E9"
CSV2:
"Hypervisor","IP","Arrays","Operating System","Domain","Memory","No. CPU","Availability (%)","Last Collection Time","DummyColumn"
"lglac125.lss.com","10.247.52.125",,"VMware ESXi 5.5.0 build-9919047","lss.com","524278.03125","4.0","100.0","1.558599031E9","A"
"lglac126.lss.com","10.247.52.126",,"VMware ESXi 5.5.0 build-9919047","lss.com","524278.03125","4.0","100.0","1.558599931E9","B"
"lglac127.lss.com","10.247.52.127",,"VMware ESXi 5.5.0 build-9919047","lss.com","524278.03125","4.0","0.0","1.558599031E9","C"
"lglac128.lss.com","10.247.52.128",,"VMware ESXi 5.5.0 build-9919047","lss.com","524278.03125","4.0","100.0","1.558599931E9","D"
"lglac129.lss.com","10.247.52.129",,"VMware ESXi 5.5.0 build-9919047","lss.com","524278.03125","4.0","100.0","1.558599931E9","E"
"DummyRow","10.247.52.129",,"VMware ESXi 5.5.0 build-9919047","lss.com","524278.03125","4.0","100.0","1.558599931E9","F"
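I am trying to compare all entries of each column (if the column is available in csv2) against the corresponding rows. If any entry is missing or changed, I need to raise a flag. Any column may be added to or removed from either file, so I first need to check whether column x exists in csv2 and then match it against the values from csv1.
I have been struggling with this for three days and cannot find a solution. I would really appreciate any help.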
You can use merge with indicator=True, and then query() to keep only the rows whose _merge value is not 'both':
# columns present in both frames, used as the merge keys
matching_cols = df1.columns.intersection(df2.columns).tolist()
df1.merge(df2, on=matching_cols, how='outer', indicator=True).query("_merge != 'both'")
This shows the rows that are not common to both DataFrames:
Hypervisor IP Operating System \
0 lglac125.lss.emc.com 10.247.52.125 VMware ESXi 5.5.0 build-9919047
5 lglac125.lss.emc.com VMware ESXi 5.5.0 build-9919047
6 DummyRow 10.247.52.129 VMware ESXi 5.5.0 build-9919047
Domain Memory No. CPU Availability (%) Last Collection Time \
0 lss.emc.com 524278.03125 4.0 100.0 1.558599e+09
5 lss.emc.com 524278.03125 4.0 100.0 1.558599e+09
6 lss.emc.com 524278.03125 4.0 100.0 1.558600e+09
Arrays DummyColumn _merge
0 NaN NaN left_only
5 NaN A right_only
6 NaN F right_only
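To make the merge approach reproducible, here is a minimal sketch on two tiny, made-up frames. The host names, IPs, and the OnlyIn1/OnlyIn2 columns are assumptions for illustration and are not taken from the CSV files above.

```python
import pandas as pd

# Hypothetical miniature stand-ins for csv1 and csv2
df1 = pd.DataFrame({
    "Hypervisor": ["host1", "host2", "host3"],
    "IP": ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    "OnlyIn1": ["x", "y", "z"],          # column that exists only in df1
})
df2 = pd.DataFrame({
    "Hypervisor": ["host1", "host2", "host3", "DummyRow"],
    "IP": ["10.0.0.1", "10.0.0.9", "10.0.0.3", "10.0.0.3"],
    "OnlyIn2": ["A", "B", "C", "D"],     # column that exists only in df2
})

# Merge on the shared columns only; _merge records where each row came from
matching_cols = df1.columns.intersection(df2.columns).tolist()
diff = (
    df1.merge(df2, on=matching_cols, how="outer", indicator=True)
       .query("_merge != 'both'")
)
print(diff)
```

A changed row shows up twice: once as left_only (the csv1 version) and once as right_only (the csv2 version). A row that exists in only one file shows up once.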
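IIUC, assume csv1 and csv2 have been imported into pandas as df1 and df2. Use intersection on the columns to find the matching columns and sort them. Pass that list to both df1 and df2 to select the same columns from each. Finally, call eq on the matching-column subsets of df1 and df2: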
matched_list = df1.columns.intersection(df2.columns).sort_values()
df1_mask = df1[matched_list].eq(df2[matched_list])
Out[853]:
Availability (%) Domain Hypervisor IP Last Collection Time Memory \
0 True True True False True True
1 True True True True True True
2 True True True True True True
3 True True True True True True
4 True True True True True True
5 False False False False False False
No. CPU Operating System
0 True True
1 True True
2 True True
3 True True
4 True True
5 False False
Note: I changed df1.loc[0, 'IP'] to 10.247.52.124 so that one value in row 0 of df1_mask shows False, for demonstration.
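The following sketch repeats the eq step on small, made-up frames so it can be run as is. The data is hypothetical. It relies on the fact that DataFrame.eq aligns the two frames on their row and column labels, so a row that exists in only one frame comes out as all False.

```python
import pandas as pd

# Hypothetical miniature stand-ins for csv1 and csv2
df1 = pd.DataFrame({
    "Hypervisor": ["host1", "host2"],
    "IP": ["10.0.0.1", "10.0.0.2"],
    "OnlyIn1": [None, None],
})
df2 = pd.DataFrame({
    "Hypervisor": ["host1", "host2", "DummyRow"],
    "IP": ["10.0.0.9", "10.0.0.2", "10.0.0.2"],
    "OnlyIn2": ["A", "B", "C"],
})

# Columns present in both frames, sorted by name
matched = df1.columns.intersection(df2.columns).sort_values()

# Element-wise comparison; row 2 exists only in df2, so it is False everywhere
mask = df1[matched].eq(df2[matched])
print(mask)
```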
With this df1_mask, you can index df1 with it and look for NaN. Any NaN is either a value that was NaN originally or a value that changed between df1 and df2:
df1[df1_mask]
Out[854]:
Hypervisor IP Operating System Domain \
0 lglac125.lss.com NaN VMware ESXi 5.5.0 build-9919047 lss.com
1 lglac126.lss.com 10.247.52.126 VMware ESXi 5.5.0 build-9919047 lss.com
2 lglac127.lss.com 10.247.52.127 VMware ESXi 5.5.0 build-9919047 lss.com
3 lglac128.lss.com 10.247.52.128 VMware ESXi 5.5.0 build-9919047 lss.com
4 lglac129.lss.com 10.247.52.129 VMware ESXi 5.5.0 build-9919047 lss.com
Memory No. CPU Availability (%) Last Collection Time lol
0 524278.03125 4.0 100.0 1.558599e+09 NaN
1 524278.03125 4.0 100.0 1.558600e+09 NaN
2 524278.03125 4.0 0.0 1.558599e+09 NaN
3 524278.03125 4.0 100.0 1.558600e+09 NaN
4 524278.03125 4.0 100.0 1.558600e+09 NaN
Note: your df1 has a column lol with no values in it, so it was NaN to begin with.
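Since the question asks for a flag on every missing or changed entry, one possible way to turn the mask into an explicit list of flagged cells is sketched here. The frames and the report layout (row, column, csv1, csv2) are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical miniature stand-ins for csv1 and csv2
df1 = pd.DataFrame({
    "Hypervisor": ["host1", "host2"],
    "IP": ["10.0.0.1", "10.0.0.2"],
})
df2 = pd.DataFrame({
    "Hypervisor": ["host1", "host2", "DummyRow"],
    "IP": ["10.0.0.9", "10.0.0.2", "10.0.0.2"],
})

matched = df1.columns.intersection(df2.columns)
left, right = df1[matched], df2[matched]
mask = left.eq(right)

# Collect every cell where the mask is False; Series.get returns None
# when the row label does not exist in that frame
flags = [
    (row, col, left[col].get(row), right[col].get(row))
    for row in mask.index
    for col in mask.columns
    if not mask.at[row, col]
]
report = pd.DataFrame(flags, columns=["row", "column", "csv1", "csv2"])
print(report)
```

Note that eq treats NaN as unequal to NaN, so a cell that is empty in both files would also be listed here.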
Alternatively, you can check df2:
df2[df1_mask]
Out[855]:
Hypervisor IP Arrays Operating System \
0 lglac125.lss.com NaN NaN VMware ESXi 5.5.0 build-9919047
1 lglac126.lss.com 10.247.52.126 NaN VMware ESXi 5.5.0 build-9919047
2 lglac127.lss.com 10.247.52.127 NaN VMware ESXi 5.5.0 build-9919047
3 lglac128.lss.com 10.247.52.128 NaN VMware ESXi 5.5.0 build-9919047
4 lglac129.lss.com 10.247.52.129 NaN VMware ESXi 5.5.0 build-9919047
5 NaN NaN NaN NaN
Domain Memory No. CPU Availability (%) Last Collection Time \
0 lss.com 524278.03125 4.0 100.0 1.558599e+09
1 lss.com 524278.03125 4.0 100.0 1.558600e+09
2 lss.com 524278.03125 4.0 0.0 1.558599e+09
3 lss.com 524278.03125 4.0 100.0 1.558600e+09
4 lss.com 524278.03125 4.0 100.0 1.558600e+09
5 NaN NaN NaN NaN NaN
DummyColumn
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
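The comparisons above match rows by their position in the file. If the rows could appear in a different order in the two files, one option is to index both frames by a key column before calling eq. This is only a sketch: it assumes Hypervisor is unique in each file, which the question does not state, and the data is made up.

```python
import pandas as pd

# Hypothetical frames with the same rows in a different order
df1 = pd.DataFrame({
    "Hypervisor": ["host1", "host2"],
    "IP": ["10.0.0.1", "10.0.0.2"],
})
df2 = pd.DataFrame({
    "Hypervisor": ["host2", "host1"],
    "IP": ["10.0.0.2", "10.0.0.1"],
})

# Positional comparison: every cell differs because the rows are swapped
by_position = df1.eq(df2)

# Keyed comparison: eq aligns on the Hypervisor index, so the rows line up
by_key = df1.set_index("Hypervisor").eq(df2.set_index("Hypervisor"))

print(by_position)
print(by_key)
```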