Python: Comparing 2 CSV files for a difference in one column value and writing the output to a 3rd CSV file
I have 2 CSV files with the same number of columns and the same format; each row contains details about a server, and each file refers to a different day.
I want to compare the Size (GB) column (column D) of every server (row) in the Day2 CSV file against the Size (GB) column of the Day1 CSV file, and write the output either to column E of the Day2 CSV file or to a separate third CSV file, so that I can track the size difference/growth from day to day.
I am struggling to implement this in Python.
Here is an example:
day1.csv
Server Site Platform Size(GB)
a Primary Windows 100
b Secondary Unix 200
c Primary Oracle 500
day2.csv
Server Site Platform Size(GB)
a Primary Windows 150
b Secondary Unix 100
c Primary Oracle 500
Expected result:
output.csv
Server Site Platform Size(GB) Growth(GB)
a Primary Windows 150 50
b Secondary Unix 100 -100
c Primary Oracle 500 0
Edit 1:
Here is the code I have developed so far:
import csv
t1 = open('/day1.csv', 'r')
t2 = open('/day2.csv', 'r')
outputt = open("/growth.csv", "w")
fileone = t1.readlines()
filetwo = t2.readlines()
for line in filetwo:
    row = row.split(',')
    a = str(row[0])
    b = str(row[1])
    c = str(row[2])
    d = float(row[3])
    f = float(filetwo.row[3] - fileone.row[3])
    outputt.writerow([a, b, c, d, e, f])
    outputt.write(line.replace("\n", "") + ";6column\n")
outputt.close()
fileone.close()
This can be done using Python's csv library and an OrderedDict to keep the original file order:
from collections import OrderedDict
import csv

with open('day1.csv', 'rb') as f_day1, open('day2.csv', 'rb') as f_day2:
    csv_day1 = csv.reader(f_day1)
    csv_day2 = csv.reader(f_day2)
    header = next(csv_day1) + ['Growth(GB)']
    next(csv_day2)
    day1 = OrderedDict([row[0], [row[1], row[2], int(row[3])]] for row in csv_day1)
    day2 = OrderedDict([row[0], [row[1], row[2], int(row[3])]] for row in csv_day2)

with open('output.csv', 'wb') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(header)
    for server, data in day1.items():
        data.append(day2[server][2] - data[2])
        data[2] = day2[server][2]
        csv_output.writerow([server] + data)
This gives you an output CSV file as follows:
Server,Site,Platform,Size(GB),Growth(GB)
a,Primary,Windows,150,50
b,Secondary,Unix,100,-100
c,Primary,Oracle,500,0
Note: the files are closed automatically when using with.
Tested on Python 2.7.12.
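If you are on Python 3 instead, the 'rb'/'wb' modes above will fail. A minimal sketch of the same idea adapted for Python 3 (the load helper name is my own), using text-mode files with newline='' and a plain dict, which preserves insertion order from Python 3.7 onwards:

import csv

# Load one day's file into (header, {server: [site, platform, size]}).
def load(path):
    with open(path, newline='') as f:
        reader = csv.reader(f)
        header = next(reader)
        return header, {row[0]: [row[1], row[2], int(row[3])] for row in reader}

header, day1 = load('day1.csv')
_, day2 = load('day2.csv')

with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(header + ['Growth(GB)'])
    for server, (site, platform, size_day1) in day1.items():
        size_day2 = day2[server][2]
        csv_output.writerow([server, site, platform, size_day2, size_day2 - size_day1])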
This is not a very generic solution, but I tried to follow your approach as closely as possible:
import csv

# Open input files
file1 = open('day1.csv', 'r')
file2 = open('day2.csv', 'r')

# Open output file
outputFile = open('day3.csv', 'w')
csvWriter = csv.writer(outputFile, delimiter=',')

# Write the output file header
csvWriter.writerow(["Server", "Site", "Platform", "Size", "Growth"])

# Process input files
csvReader1 = csv.reader(file1, delimiter=',')
csvReader2 = csv.reader(file2, delimiter=',')

# Skip headers
csvReader1.next()
csvReader2.next()

# Process data
for rowF2 in csvReader2:
    # Get the corresponding line from file 1
    rowF1 = csvReader1.next()
    # Uncomment for debugging
    #print rowF1
    #print rowF2
    # Construct the output line from the day-2 values
    colA = str(rowF2[0])
    colB = str(rowF2[1])
    colC = str(rowF2[2])
    colD = str(rowF2[3])
    # Compute the growth
    colE = str(int(rowF2[3]) - int(rowF1[3]))
    # Write the output row
    csvWriter.writerow([colA, colB, colC, colD, colE])

file1.close()
file2.close()
outputFile.close()
In my opinion, the main points to pay attention to are:
- you need to use the csv library (csv reader and writer)
- you need to skip the headers where appropriate
- you need to close all files at the end of execution
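For reference, on Python 3 all three points can be covered with with blocks (which close the files automatically), next() to skip the headers, and zip() to walk both readers in step. This is only a sketch and assumes both files list the servers in the same order:

import csv

# Assumes day1.csv and day2.csv list the servers in the same order.
with open('day1.csv', newline='') as f1, \
     open('day2.csv', newline='') as f2, \
     open('day3.csv', 'w', newline='') as out:
    reader1 = csv.reader(f1)
    reader2 = csv.reader(f2)
    writer = csv.writer(out)

    next(reader1)  # skip day-1 header
    next(reader2)  # skip day-2 header
    writer.writerow(["Server", "Site", "Platform", "Size", "Growth"])

    for row1, row2 in zip(reader1, reader2):
        growth = int(row2[3]) - int(row1[3])
        writer.writerow(row2 + [growth])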
A pandas-based way to spot the mismatching records between the two files. The snippet below assumes the two CSVs have already been loaded and compared element-wise into a DataFrame called difference (matching cells become NaN), and that output_file holds the path of the CSV to write; the first few lines show one possible way to build that frame:

import pandas as pd

# Load both days into DataFrames (assumption: the comparison frame was built this way)
day1 = pd.read_csv('day1.csv')
day2 = pd.read_csv('day2.csv')
# Element-wise comparison: matching cells become NaN, mismatched cells keep the day-2 value
difference = day2.where(day2 != day1)

# Show True/False against columns containing NaN (matched data)
print(difference.isnull().any())
# Count of NaN (matched data) in each column
print(difference.isnull().sum())
# Count of mismatched data in each column
print(difference.count())
# Keep only the records that differ between the 2 CSVs
df = difference.dropna(axis=0, how='all')
# OutputFile to be saved as 'output_file'
df.to_csv(output_file)
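If the goal is specifically the Growth(GB) column from the question, a pandas sketch that joins the two days on the Server column (column names taken from the example above; the '_day1' suffix is my own) might look like this:

import pandas as pd

day1 = pd.read_csv('day1.csv')
day2 = pd.read_csv('day2.csv')

# Join day 2 against day 1 on the server name so row order does not matter.
merged = day2.merge(day1[['Server', 'Size(GB)']], on='Server', suffixes=('', '_day1'))

# Growth is the day-2 size minus the day-1 size.
merged['Growth(GB)'] = merged['Size(GB)'] - merged['Size(GB)_day1']

# Keep the day-2 columns plus the growth, as in the expected output.
merged.drop(columns=['Size(GB)_day1']).to_csv('output.csv', index=False)

Merging on Server means the result does not depend on the two files listing the servers in the same order.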