CSV joining based on keys
This may be a simple or repeat question, but I haven't been able to find or figure out how to do it yet.
I have two csv files:
info.csv:
"Last Name", First Name, ID, phone, adress, age X [Total age: 100] |009076
abc, xyz, 1234, 982-128-0000, pqt,
bcd, uvw, 3124, 813-222-1111, tre,
poi, ccc, 9087, 123-45607890, weq,
and
age.csv:
student_id,age_1
3124,20
9087,21
1234,45
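I want to compare the two csv files based on the column "ID" from info.csv and the column "student_id" from age.csv, then take the corresponding "age_1" data and put it into the "age" column in info.csv.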
So the final output should be:
info.csv:
"Last Name", First Name, ID, phone, adress, age X [Total age: 100] |009076
abc, xyz, 1234, 982-128-0000, pqt,45
bcd, uvw, 3124, 813-222-1111, tre,20
poi, ccc, 9087, 123-45607890, weq,21
I can simply join the tables on the keys into new.csv, but I can't get the data into the column titled "age". I used "csvkit" to do that.
Here is what I used:
csvjoin -c 3,1 info.csv age.csv > new.csv
Try this...
import csv

# Python 3: open csv files in text mode with newline=''
with open("info.csv", newline='') as f:
    info = list(csv.reader(f))
with open("age.csv", newline='') as f:
    age = list(csv.reader(f))

def copyCSV(age, info, outFileName='out.csv'):
    # put age into dict, indexed by ID
    # assumes no duplicate entries
    # 1 - build a dict ageDict to represent data
    ageDict = {entry[0].replace(' ', ''): entry[1] for entry in age[1:] if entry}
    # 2 - set up output
    with open(outFileName, 'w', newline='') as outFile:
        outwriter = csv.writer(outFile)
        # 3 - run through info, slot in ages and write to output
        # nb: had to use .replace(' ','') to strip out whitespace - it may not be in the original .csv
        outwriter.writerow(info[0])
        for entry in info[1:]:
            if entry:  # skip blank lines
                key = entry[2].replace(' ', '')
                if key in ageDict:  # checks that you have data from age.csv
                    entry[5] = ageDict[key]
                outwriter.writerow(entry)

copyCSV(age, info)
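Let me know if it works or if anything is unclear. I used a dict because it should be faster if your files are big: you only need to loop over the data in age.csv once.
There may be simpler ways or something already implemented... but this should do the trick.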
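One possible simplification along those lines, as a minimal sketch rather than a drop-in replacement: csv.DictReader addresses columns by header name instead of position, and skipinitialspace=True drops the blank after each comma, so no manual stripping is needed. The function name fill_ages and the in-memory sample data are made up for illustration; the column names are the ones from the question.

```python
import csv
import io

AGE_COL = "age X [Total age: 100] |009076"

def fill_ages(info_file, age_file, out_file):
    """Copy age_1 from age_file into the age column of info_file, matched on ID."""
    # Build the lookup once: student_id -> age_1 (assumes no duplicate IDs).
    ages = {row["student_id"]: row["age_1"]
            for row in csv.DictReader(age_file, skipinitialspace=True)}

    reader = csv.DictReader(info_file, skipinitialspace=True)
    writer = csv.DictWriter(out_file, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # Keep the existing value when age.csv has no entry for this ID.
        row[AGE_COL] = ages.get(row["ID"], row[AGE_COL])
        writer.writerow(row)

# Hypothetical in-memory stand-ins for info.csv and age.csv.
info_text = '''"Last Name", First Name, ID, phone, adress, age X [Total age: 100] |009076
abc, xyz, 1234, 982-128-0000, pqt,
bcd, uvw, 3124, 813-222-1111, tre,
poi, ccc, 9087, 123-45607890, weq,
'''
age_text = "student_id,age_1\n3124,20\n9087,21\n1234,45\n"

out = io.StringIO()
fill_ages(io.StringIO(info_text), io.StringIO(age_text), out)
result = out.getvalue()
print(result)
```

With real files you would pass handles opened with newline='' instead of the StringIO objects. Note that this version writes the rows without the blank after each comma.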
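You can use Pandas and update the info dataframe with the age data. You do this by setting the index of the two dataframes to ID and student_id respectively, then updating the age column in the info dataframe. After that you reset the index so that ID becomes a column again.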
from io import StringIO  # Python 3 location of StringIO
import pandas as pd

info = StringIO("""Last Name,First Name,ID,phone,adress,age X [Total age: 100] |009076
abc, xyz, 1234, 982-128-0000, pqt,
bcd, uvw, 3124, 813-222-1111, tre,
poi, ccc, 9087, 123-45607890, weq,""")
age = StringIO("""student_id,age_1
3124,20
9087,21
1234,45""")

info_df = pd.read_csv(info, sep=",", engine='python')
age_df = pd.read_csv(age, sep=",", engine='python')

# Index both frames by the join key so rows align on ID
info_df = info_df.set_index('ID')
age_df = age_df.set_index('student_id')

# Update the age column in place from age_1, matched on the index
age_col = 'age X [Total age: 100] |009076'
info_df.update(age_df.rename(columns={'age_1': age_col}))

# Turn ID back into a regular column
info_df.reset_index(level=0, inplace=True)
print(info_df)
Output:
     ID Last Name First Name         phone adress  age X [Total age: 100] |009076
0  1234       abc        xyz  982-128-0000    pqt                              45
1  3124       bcd        uvw  813-222-1111    tre                              20
2  9087       poi        ccc  123-45607890    weq                              21
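A variation on the same idea, offered as a sketch and not as the approach above: instead of moving ID into the index and back, Series.map can look up each ID in a student_id -> age_1 Series, which leaves the column order of info untouched. The constant AGE_COL and the inline sample data are only for illustration, and skipinitialspace=True is an added assumption to drop the blank after each comma.

```python
from io import StringIO

import pandas as pd

AGE_COL = "age X [Total age: 100] |009076"

# Inline stand-ins for info.csv and age.csv (same data as in the question).
info_df = pd.read_csv(StringIO(
    "Last Name,First Name,ID,phone,adress," + AGE_COL + "\n"
    "abc, xyz, 1234, 982-128-0000, pqt,\n"
    "bcd, uvw, 3124, 813-222-1111, tre,\n"
    "poi, ccc, 9087, 123-45607890, weq,\n"
), skipinitialspace=True)
age_df = pd.read_csv(StringIO("student_id,age_1\n3124,20\n9087,21\n1234,45\n"))

# Lookup Series: index is student_id, values are age_1.
lookup = age_df.set_index("student_id")["age_1"]

# Map each ID to its age; IDs missing from age.csv keep their old value.
info_df[AGE_COL] = info_df["ID"].map(lookup).fillna(info_df[AGE_COL])

print(info_df)
```

From here, info_df.to_csv("info.csv", index=False) would write the result back out without the extra index column.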