How can I do a string object comparison between two csv files in Pandas (Python 2.7) and generate a new dataframe or csv file?

I'm reading in two .csv files, "CSV_1" and "CSV_2". For the most part they have different columns and data, apart from the columns I'll be using as string keys.

To join the data from CSV_2 onto the end of each matching row in CSV_1, the elements in each row of four columns (basically my keys) need to be the same.

The four columns in both csv files are "Date" (format: Month/Day/Year), "Hour" (format: 0-23), "Make", and "Model", and all of them are of str object dtype.
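For reference, a minimal sketch of loading both files so that the key columns stay as strings. The in-memory CSV text and the names `csv_1_text`, `csv_2_text`, and `key_cols` are stand-ins invented for this example; with real files, the paths would be passed to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-ins for the two files
csv_1_text = u"""Date,Hour,Make,Model,Gas Rating,Safety Rating
5/10/2012,1,Honda,Accord,9,8
10/17/2012,0,Toyota,Camry,9,10
"""
csv_2_text = u"""Date,Hour,Make,Model,Mileage Rating,Speed Rating
5/10/2012,1,Honda,Accord,10,7
"""

# dtype=str reads every column as strings, so an Hour of '0' is not
# converted to the integer 0 on one side and left as text on the other
csv_1 = pd.read_csv(io.StringIO(csv_1_text), dtype=str)
csv_2 = pd.read_csv(io.StringIO(csv_2_text), dtype=str)

key_cols = ['Date', 'Hour', 'Make', 'Model']
print(csv_1['Hour'].tolist())
```

Reading both sides with the same dtype matters because a merge compares key values exactly, and a string '1' does not equal an integer 1.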

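It would basically read the first row in CSV_1 and get those four values, then look through CSV_2 for any instance of the same four column values. Once a match is found in both, I want to take all the columns from CSV_1 and CSV_2 and join them into a new dataframe, without duplicating the "Date", "Hour", "Make", and "Model" columns, since they are the same in both. I suppose I could just drop the duplicate columns from the new dataframe.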
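The row-by-row comparison described above can be sketched as a tuple lookup: each row's four key values form one tuple, and a row matches only when the whole tuple is present in the other frame. The frames `left` and `right` and the variable names are made up for this illustration; in practice a merge does this matching internally.

```python
import pandas as pd

keys = ['Date', 'Hour', 'Make', 'Model']

# Small made-up frames; only the key columns matter for the comparison
left = pd.DataFrame({
    'Date': ['5/10/2012', '1/2/2013'],
    'Hour': ['1', '23'],
    'Make': ['Honda', 'Chevy'],
    'Model': ['Accord', 'Sonic'],
})
right = pd.DataFrame({
    'Date': ['5/10/2012'],
    'Hour': ['1'],
    'Make': ['Honda'],
    'Model': ['Accord'],
})

# Collect every four-value key that occurs in the right-hand frame
right_keys = set(zip(*[right[k] for k in keys]))

# A left-hand row matches only if all four of its key values agree
has_match = [key in right_keys for key in zip(*[left[k] for k in keys])]
print(has_match)
```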

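I'm sure there will be cases with no match at all, and I still need that data. So I think I'll need fillna or something similar to generate blank cells at the end of the CSV_1 row, one for each column that would have been added from CSV_2, before the row is added to the new dataframe.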

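I generated fake data for this example, but the output should look something like this (apart from reading in the data from the two csv files, I can't provide snippets of the actual code):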

CSV_1:

import pandas as pd

date = ['5/10/2012', '10/17/2012', '1/2/2013', '5/3/2014']
hr = ['1', '0', '23', '13']
make = ['Honda', 'Toyota', 'Chevy', 'Honda']
model = ['Accord', 'Camry', 'Sonic', 'Civic']
gas = ['9', '9', '7', '8']
safe = ['8', '10', '6', '7']

# list() makes the zip result behave the same on Python 2 and Python 3
dataSet = list(zip(date, hr, make, model, gas, safe))
df = pd.DataFrame(data=dataSet, columns=['Date', 'Hour', 'Make', 'Model', 'Gas Rating', 'Safety Rating'])
print(df)

CSV_2:

make2 = ['Honda', 'Toyota', 'Honda']
model2 = ['Accord', 'Camry', 'Civic']
mile = ['10', '10', '9']
speed = ['7', '7', '6']

# Reuses date and hr from CSV_1; zip stops at the shortest list,
# so only their first three values end up in df2
dataSet2 = list(zip(date, hr, make2, model2, mile, speed))
df2 = pd.DataFrame(data=dataSet2, columns=['Date', 'Hour', 'Make', 'Model', 'Mileage Rating', 'Speed Rating'])
print(df2)

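This is where the string key comparison should come into play, basically giving me the output of the following code (the headers will always be the same but the amount of data will not; in reality both files have close to 100 or more columns):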

Final_df:

date = ['5/10/2012', '10/17/2012', '1/2/2013', '5/3/2014']
hr = ['1', '0', '23', '13']
make = ['Honda', 'Toyota', 'Chevy', 'Honda']
model = ['Accord', 'Camry', 'Sonic', 'Civic']
gas = ['9', '9', '7', '8']
safe = ['8', '10', '6', '7']
mile = ['10', '10', ' ', '9']
speed = ['7', '7', ' ', '6']

# list() makes the zip result behave the same on Python 2 and Python 3
dataSet3 = list(zip(date, hr, make, model, gas, safe, mile, speed))
df3 = pd.DataFrame(data=dataSet3, columns=['Date', 'Hour', 'Make', 'Model', 'Gas Rating', 'Safety Rating', 'Mileage Rating', 'Speed Rating'])
print(df3)

Originally you asked to join on 4 different columns (Make, Model, Date, Hour), but the second table actually only has two that match (Make, Model).
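One thing worth noting about the sample data as posted, which may be related: df2 is built by zipping the four-item `date` and `hr` lists with three-item lists, and `zip` stops at the shortest input. A small sketch using the question's own lists shows which Date and Hour the Honda Civic row ends up with (the name `rows` is introduced here only for illustration):

```python
date = ['5/10/2012', '10/17/2012', '1/2/2013', '5/3/2014']
hr = ['1', '0', '23', '13']
make2 = ['Honda', 'Toyota', 'Honda']
model2 = ['Accord', 'Camry', 'Civic']

# zip pairs items by position and stops at the shortest input,
# so the fourth date and hour are never used
rows = list(zip(date, hr, make2, model2))

print(len(rows))   # 3
print(rows[-1])    # ('1/2/2013', '23', 'Honda', 'Civic')
```

In df, the Honda Civic row carries 5/3/2014 and hour 13, so on a four-column key those two Civic rows do not line up.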

An outer join still applies, though; the original answer is correct:

In [8]: df
Out[8]: 
         Date Hour    Make   Model Gas Rating Safety Rating
0   5/10/2012    1   Honda  Accord          9             8
1  10/17/2012    0  Toyota   Camry          9            10
2    1/2/2013   23   Chevy   Sonic          7             6
3    5/3/2014   13   Honda   Civic          8             7

In [9]: df2
Out[9]: 
         Date Hour    Make   Model Mileage Rating Speed Rating
0   5/10/2012    1   Honda  Accord             10            7
1  10/17/2012    0  Toyota   Camry             10            7
2    1/2/2013   23   Honda   Civic              9            6

In [10]: final = pd.merge(df,df2, how='outer', on=['Date', 'Hour', 'Make', 'Model'])

In [11]: final
Out[11]: 
         Date Hour    Make   Model Gas Rating Safety Rating Mileage Rating  \
0   5/10/2012    1   Honda  Accord          9             8             10   
1  10/17/2012    0  Toyota   Camry          9            10             10   
2    1/2/2013   23   Chevy   Sonic          7             6            NaN   
3    5/3/2014   13   Honda   Civic          8             7            NaN   
4    1/2/2013   23   Honda   Civic        NaN           NaN              9   

  Speed Rating  
0            7  
1            7  
2          NaN  
3          NaN  
4            6  

In [12]: final.fillna(0, inplace=True)

In [13]: final
Out[13]: 
         Date Hour    Make   Model Gas Rating Safety Rating Mileage Rating  \
0   5/10/2012    1   Honda  Accord          9             8             10   
1  10/17/2012    0  Toyota   Camry          9            10             10   
2    1/2/2013   23   Chevy   Sonic          7             6              0   
3    5/3/2014   13   Honda   Civic          8             7              0   
4    1/2/2013   23   Honda   Civic          0             0              9   

  Speed Rating  
0            7  
1            7  
2            0  
3            0  
4            6
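
If only the rows from the first file should be kept, with blank cells where the second file has no match (closer to the Final_df layout in the question), a left join is one possible variation. This is a sketch under assumptions, not part of the session above: the two small frames are made up, and `left_final` and `csv_text` are names chosen for this example.

```python
import pandas as pd

keys = ['Date', 'Hour', 'Make', 'Model']

# Made-up frames: the second row of df has no counterpart in df2
df = pd.DataFrame({
    'Date': ['5/10/2012', '1/2/2013'],
    'Hour': ['1', '23'],
    'Make': ['Honda', 'Chevy'],
    'Model': ['Accord', 'Sonic'],
    'Gas Rating': ['9', '7'],
}, columns=keys + ['Gas Rating'])
df2 = pd.DataFrame({
    'Date': ['5/10/2012'],
    'Hour': ['1'],
    'Make': ['Honda'],
    'Model': ['Accord'],
    'Mileage Rating': ['10'],
}, columns=keys + ['Mileage Rating'])

# how='left' keeps every row of df and adds df2 columns where the keys match
left_final = pd.merge(df, df2, how='left', on=keys)

# Use empty strings rather than 0 so unmatched cells stay blank
left_final = left_final.fillna('')

# With no path given, to_csv returns the CSV text; pass a file path to write a file
csv_text = left_final.to_csv(index=False)
print(csv_text)
```

Because the key columns are listed in `on`, they appear once in the result, so there are no duplicate key columns to drop afterwards.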