Count repeated values in a specific column of a CSV file and return the count in another column (python2)

Question

I am currently trying to count the repeated values in a column of a CSV file and return the count in another CSV column, using Python.

For example, my CSV file:

KeyID    GeneralID
145258   KL456
145259   BG486
145260   HJ789
145261   KL456

What I want to achieve is to count how many rows share the same GeneralID and insert that count into a new CSV column. For example:

KeyID    Total_GeneralID
145258   2
145259   1
145260   1
145261   2

I tried splitting each column with the split method, but it did not work well.

My code:

case_id_list_data = []

with open(file_path_1, "rU") as g:
    for line in g:
        case_id_list_data.append(line.split('\t'))
        #print case_id_list_data[0][0] #the result is dissatisfying 
        #I'm stuck here..

Answer 1

import pandas as pd
# Read the CSV into a DataFrame (the question's file is tab-separated).
df = pd.read_csv(file_path_1, sep='\t')
# Count each GeneralID once, then look up that count for every row.
counts = df.GeneralID.value_counts()
df['Total_GeneralID'] = df.GeneralID.map(counts)
df = df[['KeyID', 'Total_GeneralID']]
print(df)
    KeyID  Total_GeneralID
0  145258                2
1  145259                1
2  145260                1
3  145261                2

Answer 2

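You can use the pandas library:

1. First, read_csv
2. Get the counts of the values in the GeneralID column with value_counts, then rename the result to the output column name
3. join to the original DataFrame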

import pandas as pd

df = pd.read_csv('file')
s = df['GeneralID'].value_counts().rename('Total_GeneralID')
df = df.join(s, on='GeneralID')
print (df)
    KeyID GeneralID  Total_GeneralID
0  145258     KL456                2
1  145259     BG486                1
2  145260     HJ789                1
3  145261     KL456                2
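
For a self-contained illustration of those steps, here is a minimal sketch that runs on the question's sample rows held in memory. The `io.StringIO` stand-in for the file and the `sep="\t"` argument are assumptions (the question's own code splits on tabs); with a comma-separated file the default separator would do.

```python
import io
import pandas as pd

# Hypothetical in-memory copy of the question's file (assumed tab-separated).
sample = io.StringIO(
    "KeyID\tGeneralID\n"
    "145258\tKL456\n"
    "145259\tBG486\n"
    "145260\tHJ789\n"
    "145261\tKL456\n"
)

df = pd.read_csv(sample, sep="\t")
# Count how often each GeneralID occurs and name the resulting Series.
counts = df["GeneralID"].value_counts().rename("Total_GeneralID")
# Attach the per-value count to every row by matching on GeneralID.
df = df.join(counts, on="GeneralID")
# Keep only the two columns the question asks for.
result = df[["KeyID", "Total_GeneralID"]]
```

Calling `result.to_csv(...)` with the same separator would then write the two columns back out.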

Answer 3

If you are opposed to pandas and want to stay with the standard library:

Code:

import csv
from collections import Counter
with open('file1', 'rU') as f:
    reader = csv.reader(f, delimiter='\t')
    header = next(reader)
    lines = [line for line in reader]
    counts = Counter([l[1] for l in lines])

new_lines = [l + [str(counts[l[1]])] for l in lines]
with open('file2', 'wb') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerow(header + ['Total_GeneralID'])
    writer.writerows(new_lines)

Result:

KeyID   GeneralID   Total_GeneralID
145258  KL456   2
145259  BG486   1
145260  HJ789   1
145261  KL456   2
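
The code above targets Python 2, as the question does. Under Python 3, `csv.writer` expects a text-mode file rather than `'wb'`, and the `'rU'` mode was removed in Python 3.11. A rough Python 3 sketch of the same approach follows; the function name `add_count_column` and its parameters are made up for illustration.

```python
import csv
from collections import Counter

def add_count_column(src_path, dst_path, column=1, delimiter="\t"):
    """Copy src_path to dst_path, appending a count of each row's value in `column`."""
    # newline="" lets the csv module handle line endings itself.
    with open(src_path, newline="") as f:
        reader = csv.reader(f, delimiter=delimiter)
        header = next(reader)
        rows = list(reader)

    # Tally every value of the chosen column in one pass.
    counts = Counter(row[column] for row in rows)

    with open(dst_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter=delimiter)
        writer.writerow(header + ["Total_" + header[column]])
        writer.writerows(row + [counts[row[column]]] for row in rows)
```

With the question's data, `add_count_column('file1', 'file2')` should produce the same three-column file as shown in the result above.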

Answer 4

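You have to split the task into three steps: 1. Read the CSV file 2. Generate the new column's values 3. Add the values back to the file

import csv
import fileinput
import sys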

# 1. Read CSV file
# This is opening CSV and reading value from it.
with open("dev.csv") as filein:
    reader = csv.reader(filein, skipinitialspace = True)
    xs, ys = zip(*reader)

result=["Total_GeneralID"]

# 2. Generate new column's value
# This loop is for counting the "GeneralID" element.
for i in range(1,len(ys),1):
    result.append(ys.count(ys[i]))

# 3. Add value to the file back
# This loop is for writing new column
for ind,line in enumerate(fileinput.input("dev.csv",inplace=True)):
    sys.stdout.write("{} {}, {}\n".format("",line.rstrip(),result[ind]))

I did not use a temporary file or any high-level module such as pandas.
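
One note on step 2: `ys.count(ys[i])` rescans the whole column for every row, which is fine for small files. For larger ones, a single pass that tallies into a plain dict should give the same result. A sketch, where `count_per_row` is a hypothetical helper name and `ys` is filled in by hand with the question's sample column:

```python
def count_per_row(values):
    """Return, for each item in values, how many times it occurs in values."""
    tally = {}
    for value in values:
        tally[value] = tally.get(value, 0) + 1
    return [tally[value] for value in values]

# Sample column as step 1 would produce it, header cell first.
ys = ("GeneralID", "KL456", "BG486", "HJ789", "KL456")
# Skip the header cell, as the loop in step 2 does with range(1, len(ys)).
result = ["Total_GeneralID"] + count_per_row(ys[1:])
```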

Answer 5

Use csv.reader instead of the split() method. It is simpler.

Thanks
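
To illustrate the difference, here is a small sketch comparing the two on in-memory text. The sample string is an assumed tab-separated excerpt of the question's file; `io.StringIO` merely stands in for the opened file.

```python
import csv
import io

# Hypothetical in-memory stand-in for the question's tab-separated file.
text = "KeyID\tGeneralID\n145258\tKL456\n145259\tBG486\n"

# str.split keeps the trailing newline on the last field of each line.
split_rows = [line.split("\t") for line in io.StringIO(text)]

# csv.reader strips the line ending and also understands quoted fields.
csv_rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
```

Here `split_rows[1]` is `['145258', 'KL456\n']`, while `csv_rows[1]` is `['145258', 'KL456']`, so only the second form can be counted or compared without extra cleanup.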
