Count repeated values in a specific column in a CSV file and return the value to another column (python2)
I'm currently trying to count the repeated values in a column of a CSV file and return the count to another CSV column, in Python.
For example, my CSV file:
KeyID GeneralID
145258 KL456
145259 BG486
145260 HJ789
145261 KL456
What I want to achieve is to count how many rows share the same GeneralID and insert that count into a new CSV column. For example:
KeyID Total_GeneralID
145258 2
145259 1
145260 1
145261 2
I tried splitting each column with the split method, but it didn't work well.
My code:
case_id_list_data = []
with open(file_path_1, "rU") as g:
    for line in g:
        case_id_list_data.append(line.split('\t'))
#print case_id_list_data[0][0] #the result is dissatisfying
#I'm stuck here..
import pandas as pd
# Read your CSV into a DataFrame (sep='\t' because the question's file is tab-delimited).
df = pd.read_csv(file_path_1, sep='\t')
# Count the occurrences of each GeneralID once, then look up the count for every row.
counts = df.GeneralID.value_counts()
df['Total_GeneralID'] = df.GeneralID.map(counts)
df = df[['KeyID','Total_GeneralID']]
df
Out[442]:
    KeyID  Total_GeneralID
0  145258                2
1  145259                1
2  145260                1
3  145261                2
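You can use the pandas library:
- first, read_csv
- get the counts of the values in column GeneralID with value_counts, and rename the result to the output column name
- join it to the original DataFrame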
import pandas as pd
df = pd.read_csv('file')
s = df['GeneralID'].value_counts().rename('Total_GeneralID')
df = df.join(s, on='GeneralID')
print (df)
KeyID GeneralID Total_GeneralID
0 145258 KL456 2
1 145259 BG486 1
2 145260 HJ789 1
3 145261 KL456 2
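If you are opposed to pandas and want to stay with the standard library:
Code:
import csv
from collections import Counter
# Read the tab-delimited input, keeping the header separate from the data rows.
with open('file1', 'rU') as f:
    reader = csv.reader(f, delimiter='\t')
    header = next(reader)
    lines = [line for line in reader]
# Count each GeneralID (second column) and append its count to every row.
counts = Counter([l[1] for l in lines])
new_lines = [l + [str(counts[l[1]])] for l in lines]
# Write the result to a new file ('wb' is the Python 2 mode for the csv module).
with open('file2', 'wb') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerow(header + ['Total_GeneralID'])
    writer.writerows(new_lines)
Result: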
KeyID GeneralID Total_GeneralID
145258 KL456 2
145259 BG486 1
145260 HJ789 1
145261 KL456 2
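The file modes in the answer above ('rU' and 'wb') are Python 2 conventions. As a rough sketch only, the same Counter approach might look like this on Python 3. The function name add_count_column and its parameters are made up for illustration, and the sample data is the table from the question, assumed to be tab-delimited:

```python
import csv
import io
from collections import Counter

def add_count_column(src, dst, key_index=1, new_header="Total_GeneralID"):
    """Copy tab-delimited rows from src to dst, appending a per-value count.

    Hypothetical helper: src and dst are any text file objects.
    """
    reader = csv.reader(src, delimiter="\t")
    header = next(reader)
    rows = list(reader)
    # One pass over the key column to count every distinct value.
    counts = Counter(row[key_index] for row in rows)
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    writer.writerow(header + [new_header])
    writer.writerows(row + [str(counts[row[key_index]])] for row in rows)

# Sample data from the question (assumed tab-delimited).
sample = (
    "KeyID\tGeneralID\n"
    "145258\tKL456\n"
    "145259\tBG486\n"
    "145260\tHJ789\n"
    "145261\tKL456\n"
)
out = io.StringIO()
add_count_column(io.StringIO(sample), out)
result_text = out.getvalue()
print(result_text)
```

With real files on Python 3, both would be opened in text mode with newline='' (for example open('file1', newline='') and open('file2', 'w', newline='')), which is what the csv module documentation recommends.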
You have to divide the task into three steps:
1. Read the CSV file
2. Generate the new column's values
3. Add the values back to the file
import csv
import fileinput
import sys
# 1. Read CSV file
# Open the CSV and read the values from it.
with open("dev.csv") as filein:
    reader = csv.reader(filein, skipinitialspace = True)
    xs, ys = zip(*reader)
result=["Total_GeneralID"]
# 2. Generate new column's value
# Count how often each "GeneralID" element occurs (index 0 is the header).
for i in range(1,len(ys),1):
    result.append(ys.count(ys[i]))
# 3. Add value to the file back
# Rewrite each line in place with the new column appended.
for ind,line in enumerate(fileinput.input("dev.csv",inplace=True)):
    sys.stdout.write("{}, {}\n".format(line.rstrip(),result[ind]))
I didn't use a temporary file or any high-level module such as pandas.
Use csv.reader instead of the split() method. It's simpler.
Thanks
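One note on step 2 of the last answer: ys.count(ys[i]) rescans the whole column for every row, which can get slow on large files. A possible variation, sketched here on in-memory data, counts everything in a single pass with collections.Counter. The function name count_column and the comma-separated sample layout are assumptions for illustration, not part of the answer:

```python
import csv
import io
from collections import Counter

# Comma-separated sample with a space after each comma (assumed layout,
# matching the skipinitialspace=True setting used in the answer).
SAMPLE = (
    "KeyID, GeneralID\n"
    "145258, KL456\n"
    "145259, BG486\n"
    "145260, HJ789\n"
    "145261, KL456\n"
)

def count_column(text, key_index=1):
    """Return, for each data row, how often its key value occurs (hypothetical helper)."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    next(reader)  # skip the header row
    rows = list(reader)
    # Single pass over the key column instead of one scan per row.
    counts = Counter(row[key_index] for row in rows)
    return [counts[row[key_index]] for row in rows]

totals = count_column(SAMPLE)
print(totals)
```

The resulting list can be written back in step 3 the same way as before, with the header string placed in front of it.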