Python: append phrases that share the same id but have different values to a list?

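I have a csv file that contains general concepts and the corresponding medical terms or phrases. How can I write a loop that groups all the phrases under their corresponding concept? I'm not very experienced with Python, so I'm not sure how to write the loop.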

id   concept           phrase
--------------------------------
1    general_history   H&P
1    general_history   history and physical
1    general_history   history physical
2    clinic_history    clinic history physical
2    clinic_history    outpatient h p
3    discharge         discharge summary
3    discharge         DCS

For rows with the same concept (or the same id), how can I append the phrases to a list to get something like this:

var = [['general_history', ['history and physical', 'history physical']],
       ['clinic_history', ['clinic history physical', 'outpatient h p']],
       ['discharge', ['discharge summary', 'DCS']]]

Assuming you can already parse the csv, here is how to group the phrases by concept:

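from collections import defaultdict

concepts = defaultdict(list)

# rows: the parsed csv, an iterable of (id, concept, phrase) records
for row in rows:
    _id, concept, phrase = row
    concepts[concept].append(phrase)

# turn the dictionary into a list of [concept, phrases] pairs
var = [[k, v] for k, v in concepts.items()]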

var will contain something like this:

[['general_history', ['H&P', 'history and physical', 'history physical']], ...]

It may even be more useful to keep the dictionary itself, since concepts looks like this:

{
  "general_history": [
    "H&P",
    "history and physical",
    "history physical"
  ],
  ...
}
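
To make the grouping step concrete, here is a minimal runnable sketch. It assumes the file is comma-separated with a header row; the sample rows from the question are embedded as a string (csv_text is a name introduced here for illustration) so the snippet runs without a file on disk.

```python
import csv
import io
from collections import defaultdict

# Sample data from the question, written as comma-separated text.
# The real file's delimiter is an assumption.
csv_text = """id,concept,phrase
1,general_history,H&P
1,general_history,history and physical
1,general_history,history physical
2,clinic_history,clinic history physical
2,clinic_history,outpatient h p
3,discharge,discharge summary
3,discharge,DCS
"""

concepts = defaultdict(list)

reader = csv.reader(io.StringIO(csv_text))
next(reader)  # skip the header row
for _id, concept, phrase in reader:
    # a missing key starts as an empty list, so append always works
    concepts[concept].append(phrase)

# list of [concept, phrases] pairs, in order of first appearance
var = [[concept, phrases] for concept, phrases in concepts.items()]
print(var)
```

For a real file, the io.StringIO(csv_text) part would presumably be replaced by a file object opened with open(path, newline="").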

Use a for loop and a defaultdict to accumulate the items.

import csv
from collections import defaultdict

var = defaultdict(list)
records = ...  # read csv with csv.DictReader
for row in records:
    concept = row.get('concept')
    phrase = row.get('phrase')
    # skip rows that are missing either field
    if concept is None or phrase is None:
        continue
    var[concept].append(phrase)
print(var)
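
One way the records placeholder could be filled in is sketched below. It assumes a comma-separated file whose header row names the columns id, concept and phrase; the data is embedded as a string so the snippet is self-contained, and the last row is a hypothetical incomplete record added only to show the skip logic.

```python
import csv
import io
from collections import defaultdict

# Sample data from the question as comma-separated text (layout is an assumption).
# The final row is hypothetical: it has no phrase field at all.
csv_text = """id,concept,phrase
1,general_history,H&P
1,general_history,history and physical
1,general_history,history physical
2,clinic_history,clinic history physical
2,clinic_history,outpatient h p
3,discharge,discharge summary
3,discharge,DCS
4,orphan_concept
"""

# DictReader maps each row to a dict keyed by the header names;
# fields missing from a short row come back as None.
records = csv.DictReader(io.StringIO(csv_text))

var = defaultdict(list)
for row in records:
    concept = row.get("concept")
    phrase = row.get("phrase")
    # skip rows where either field is missing or empty
    if not concept or not phrase:
        continue
    var[concept].append(phrase)

print(dict(var))
```

Checking with "not phrase" rather than "is None" also skips empty strings, which is what a row with a trailing comma would produce.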

If you are using pandas, try filtering. It should look something like this:

new_dataframe = dataframe[dataframe['id'] == id]

Then concatenate the data frames:

final_df = pd.concat([new_dataframe1, new_dataframe2], axis = 0)

You can also try doing the same thing with the concept.
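
If pandas is available, a groupby may get to the requested shape more directly than filtering one id at a time. This is a sketch rather than the approach described above: it assumes a comma-separated file with the columns id, concept and phrase, and embeds the sample data as a string.

```python
import io

import pandas as pd

# Sample data from the question as comma-separated text (layout is an assumption).
csv_text = """id,concept,phrase
1,general_history,H&P
1,general_history,history and physical
1,general_history,history physical
2,clinic_history,clinic history physical
2,clinic_history,outpatient h p
3,discharge,discharge summary
3,discharge,DCS
"""

df = pd.read_csv(io.StringIO(csv_text))

# Collect the phrases of each concept into a list;
# sort=False keeps the concepts in order of first appearance.
grouped = df.groupby("concept", sort=False)["phrase"].apply(list)

# Convert the resulting Series into a list of [concept, phrases] pairs.
var = [[concept, phrases] for concept, phrases in grouped.items()]
print(var)
```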

Hope this solves your problem:

# a quick way to get the data into Python
csv_string = """id, concept, phrase
1, general_history, H&P
1, general_history, history and physical
1, general_history, history physical
2, clinic_history, clinic history physical
2, clinic_history, outpatient h p
3, discharge, discharge summary
3, discharge, DCS"""

# splits the text into rows of stripped fields, header row first
rows = [[x.strip() for x in line.split(", ")] for line in csv_string.split("\n")]

# makes a dictionary with an empty list for every id
id_dict = {line[0]: [] for line in rows[1:]}

# appends each row's phrase to the list for its id
for line in rows[1:]:
    current_id = line[0]
    phrase = line[2]
    id_dict[current_id].append(phrase)

# builds a list of [id, phrases] pairs with unique phrases (set() does not preserve order)
var = [[current_id, list(set(phrases))] for current_id, phrases in id_dict.items()]
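
Because set() discards the original order, a variant that removes duplicates while keeping phrases in order of first appearance may be preferable. The sketch below relies on dicts preserving insertion order (Python 3.7+) and keys the result by concept name, as the question asks; the rows list and its repeated last entry are hypothetical, added only to show the de-duplication.

```python
# Rows as (id, concept, phrase) tuples, taken from the question.
rows = [
    ("1", "general_history", "H&P"),
    ("1", "general_history", "history and physical"),
    ("1", "general_history", "history physical"),
    ("2", "clinic_history", "clinic history physical"),
    ("2", "clinic_history", "outpatient h p"),
    ("3", "discharge", "discharge summary"),
    ("3", "discharge", "DCS"),
    ("3", "discharge", "DCS"),  # hypothetical duplicate row
]

grouped = {}
for _id, concept, phrase in rows:
    # The inner dict's keys act as an ordered set: a repeated phrase
    # overwrites its own entry instead of adding a new one.
    grouped.setdefault(concept, {})[phrase] = None

# list of [concept, unique phrases] pairs, order preserved
var = [[concept, list(phrases)] for concept, phrases in grouped.items()]
print(var)
```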