How can I do a string object comparison between two csv files in Pandas (Python 2.7) and generate a new dataframe or csv file?
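I am reading in two .csv files, "CSV_1" and "CSV_2". For the most part they have different columns and data, apart from the columns I will use as string keys.
To join the data from CSV_2 onto the end of each matching row in CSV_1, the elements of four columns (basically my keys) need to be identical in each row.
The four columns, present in both csv files, are "Date" (format: Month/Day/Year), "Hour" (format: 0-23), "Make", and "Model", and they are all of str object dtype.
It would basically read the first row of CSV_1, take those four values, then look through CSV_2 for any rows with the same four values. Once a match is found in both, I want to take all the columns from CSV_1 and CSV_2 and join them into a new dataframe, without duplicating the "Date", "Hour", "Make", and "Model" columns, since they are the same in both. I suppose I could drop the duplicate columns from the new dataframe afterwards.
I am sure there will be cases where there is no match, and I still need that data, so I think I need fillna or something similar to produce blank cells at the end of the CSV_1 row, one for each column that would have been added from CSV_2, before the row is added to the new dataframe.
I generated fake data for this example, but the output should look something like this (I cannot provide a snippet of the actual code, other than reading in the data from the two csv files).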
CSV_1:
import pandas as pd
from pandas import DataFrame
date = ['5/10/2012', '10/17/2012', '1/2/2013', '5/3/2014']
hr = ['1', '0', '23', '13']
make = ['Honda', 'Toyota', 'Chevy', 'Honda']
model = ['Accord', 'Camry', 'Sonic', 'Civic']
gas = ['9', '9', '7','8']
safe = ['8', '10', '6','7']
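# list() keeps this working on Python 3 too, where zip() returns an iterator rather than a list
dataSet = list(zip(date, hr, make, model, gas, safe))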
df = pd.DataFrame(data = dataSet, columns=['Date', 'Hour', 'Make', 'Model', 'Gas Rating', 'Safety Rating'])
>>>df
CSV_2:
make2 = ['Honda', 'Toyota', 'Honda']
model2 = ['Accord', 'Camry', 'Civic']
mile = ['10', '10','9']
speed = ['7', '7', '6']
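# zip() stops at the shortest list, so only the first three values of date and hr are used here
dataSet2 = list(zip(date, hr, make2, model2, mile, speed))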
df2 = pd.DataFrame(data = dataSet2, columns=['Date', 'Hour', 'Make', 'Model', 'Mileage Rating', 'Speed Rating'])
>>>df2
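This is where the string key comparison should come into play, basically giving me the output of the code below (the headers will always be the same, but the amount of data will not; the real files each have close to 100 or more columns).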
Final_df:
date = ['5/10/2012', '10/17/2012', '1/2/2013', '5/3/2014']
hr = ['1', '0', '23', '13']
make = ['Honda', 'Toyota', 'Chevy', 'Honda']
model = ['Accord', 'Camry', 'Sonic', 'Civic']
gas = ['9', '9', '7', '8']
safe = ['8', '10', '6', '7']
mile = ['10', '10', ' ','9']
speed = ['7', '7', ' ', '6']
dataSet3 = list(zip(date, hr, make, model, gas, safe, mile, speed))
df3 = pd.DataFrame(data = dataSet3, columns=['Date', 'Hour', 'Make', 'Model', 'Gas Rating', 'Safety Rating', 'Mileage Rating', 'Speed Rating'])
>>>df3
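Originally, you asked to join on four different columns (Make, Model, Date, Hour), but the second table really only has two that match (Make, Model).
The outer join still applies, though, and the original answer is correct: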
In [8]: df
Out[8]:
         Date  Hour    Make   Model  Gas Rating  Safety Rating
0   5/10/2012     1   Honda  Accord           9              8
1  10/17/2012     0  Toyota   Camry           9             10
2    1/2/2013    23   Chevy   Sonic           7              6
3    5/3/2014    13   Honda   Civic           8              7
In [9]: df2
Out[9]:
         Date  Hour    Make   Model  Mileage Rating  Speed Rating
0   5/10/2012     1   Honda  Accord              10             7
1  10/17/2012     0  Toyota   Camry              10             7
2    1/2/2013    23   Honda   Civic               9             6
In [10]: final = pd.merge(df,df2, how='outer', on=['Date', 'Hour', 'Make', 'Model'])
In [11]: final
Out[11]:
         Date  Hour    Make   Model  Gas Rating  Safety Rating  Mileage Rating  Speed Rating
0   5/10/2012     1   Honda  Accord           9              8              10             7
1  10/17/2012     0  Toyota   Camry           9             10              10             7
2    1/2/2013    23   Chevy   Sonic           7              6             NaN           NaN
3    5/3/2014    13   Honda   Civic           8              7             NaN           NaN
4    1/2/2013    23   Honda   Civic         NaN            NaN               9             6
In [12]: final.fillna(0, inplace=True)
In [13]: final
Out[13]:
         Date  Hour    Make   Model  Gas Rating  Safety Rating  Mileage Rating  Speed Rating
0   5/10/2012     1   Honda  Accord           9              8              10             7
1  10/17/2012     0  Toyota   Camry           9             10              10             7
2    1/2/2013    23   Chevy   Sonic           7              6               0             0
3    5/3/2014    13   Honda   Civic           8              7               0             0
4    1/2/2013    23   Honda   Civic           0              0               9             6
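If the goal is to keep every row of CSV_1 and leave blanks where CSV_2 has no match, as the question describes, a left join may be closer to what is wanted than the outer join. The following is only a sketch under that assumption: the frames are rebuilt from the sample data above, and the names cols1, cols2, keys, left and csv_text are made up for illustration.

```python
import pandas as pd

# Sample frames mirroring the question's data; every value is kept as a string.
cols1 = ['Date', 'Hour', 'Make', 'Model', 'Gas Rating', 'Safety Rating']
df = pd.DataFrame({
    'Date': ['5/10/2012', '10/17/2012', '1/2/2013', '5/3/2014'],
    'Hour': ['1', '0', '23', '13'],
    'Make': ['Honda', 'Toyota', 'Chevy', 'Honda'],
    'Model': ['Accord', 'Camry', 'Sonic', 'Civic'],
    'Gas Rating': ['9', '9', '7', '8'],
    'Safety Rating': ['8', '10', '6', '7'],
}, columns=cols1)

cols2 = ['Date', 'Hour', 'Make', 'Model', 'Mileage Rating', 'Speed Rating']
df2 = pd.DataFrame({
    'Date': ['5/10/2012', '10/17/2012', '1/2/2013'],
    'Hour': ['1', '0', '23'],
    'Make': ['Honda', 'Toyota', 'Honda'],
    'Model': ['Accord', 'Camry', 'Civic'],
    'Mileage Rating': ['10', '10', '9'],
    'Speed Rating': ['7', '7', '6'],
}, columns=cols2)

# The four string key columns shared by both frames.
keys = ['Date', 'Hour', 'Make', 'Model']

# how='left' keeps every row of df exactly once per match and adds df2's
# columns; rows of df with no match in df2 get NaN in the df2 columns.
left = pd.merge(df, df2, how='left', on=keys)

# Turn the NaN placeholders into blank cells.
left = left.fillna('')

# With no path given, to_csv returns the csv text instead of writing a file.
csv_text = left.to_csv(index=False)
print(left)
```

With real files, reading them via pd.read_csv(path, dtype=str) should keep the key columns as strings so the comparison is string against string, and passing a file name to to_csv writes the result to disk instead of returning it. Note that with the sample data the Honda Civic row stays blank, because its Date and Hour differ between the two frames.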