Python: append terms that have the same id but different values to a list?
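I have a csv file that contains general concepts and their corresponding medical terms or phrases. How can I write a loop that groups all the phrases under their corresponding concept? I'm not very experienced with Python, so I'm not sure how to write the loop.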
id  concept          phrase
--------------------------------------------
1   general_history  H&P
1   general_history  history and physical
1   general_history  history physical
2   clinic_history   clinic history physical
2   clinic_history   outpatient h p
3   discharge        discharge summary
3   discharge        DCS
For rows with the same concept (or the same id), how can I append the phrases to a list to get something like this:
var = [['general_history', ['history and physical', 'history physical']],
       ['clinic_history', ['clinic history physical', 'outpatient h p']],
       ['discharge', ['discharge summary', 'DCS']]]
Assuming you can already parse the csv, here is how to group the phrases by concept:
from collections import defaultdict
concepts = defaultdict(list)
# parse the csv into `rows` here; each row is an (id, concept, phrase) triple
for row in rows:
    _id, concept, phrase = row
    concepts[concept].append(phrase)
var = [[k, concepts[k]] for k in concepts.keys()]
`var` will then contain something like this:
[['general_history', ['H&P', 'history and physical', 'history physical']], ...]
It may even be more useful to keep the dictionary itself (`concepts`), keys and all, because it looks like this:
{
    "general_history": [
        "H&P",
        "history and physical",
        "history physical",
    ],
    ...
}
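To check the grouping end to end, here is a minimal sketch that hard-codes the question's rows instead of parsing a file. The `rows` name and the tuple layout are assumptions made for illustration; real code would get these values from the csv parser.

```python
from collections import defaultdict

# Sample rows copied from the question; `rows` stands in for parsed csv data.
rows = [
    ("1", "general_history", "H&P"),
    ("1", "general_history", "history and physical"),
    ("1", "general_history", "history physical"),
    ("2", "clinic_history", "clinic history physical"),
    ("2", "clinic_history", "outpatient h p"),
    ("3", "discharge", "discharge summary"),
    ("3", "discharge", "DCS"),
]

# Missing keys start out as empty lists, so append works on first sight.
concepts = defaultdict(list)
for _id, concept, phrase in rows:
    concepts[concept].append(phrase)

# Convert the mapping into the list-of-lists shape asked for in the question.
var = [[concept, phrases] for concept, phrases in concepts.items()]
print(var)
```

Because dictionaries keep insertion order (Python 3.7 and later), the concepts appear in `var` in the order they are first seen in the input.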
Use a for loop and a defaultdict to accumulate the items.
import csv
from collections import defaultdict
var = defaultdict(list)
records = ...  # read csv with csv.DictReader
for row in records:
    # skip rows that lack either column
    concept = row.get('concept')
    if concept is None:
        continue
    phrase = row.get('phrase')
    if phrase is None:
        continue
    var[concept].append(phrase)
print(var)
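The reading step is left open above. One way it might look is sketched below, assuming the data is comma-delimited with a header row named `id,concept,phrase` (the question only shows a whitespace-aligned table, so the delimiter is a guess). The `csv_text` string and the `grouped` name are illustrative stand-ins; with a real file you would pass `open(path, newline="")` to `csv.DictReader` instead of `io.StringIO`.

```python
import csv
import io
from collections import defaultdict

# In-memory stand-in for the file contents (assumed comma-delimited).
csv_text = """id,concept,phrase
1,general_history,H&P
1,general_history,history and physical
1,general_history,history physical
2,clinic_history,clinic history physical
2,clinic_history,outpatient h p
3,discharge,discharge summary
3,discharge,DCS
"""

grouped = defaultdict(list)
reader = csv.DictReader(io.StringIO(csv_text))
for row in reader:
    concept = row.get("concept")
    phrase = row.get("phrase")
    if not concept or not phrase:
        continue  # skip rows with a missing or empty field
    grouped[concept].append(phrase)

print(dict(grouped))
```

`csv.DictReader` takes the column names from the first row, so each `row` is a mapping keyed by `id`, `concept` and `phrase`.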
If you are using pandas, try filtering. It should look something like this:
new_dataframe = dataframe[dataframe['id'] == id]
Then concatenate the filtered DataFrames:
final_df = pd.concat([new_dataframe1, new_dataframe2], axis=0)
You can also try doing the same thing with the concept column.
Hope this solves your problem:
# a quick way to get the data into Python
csv_string = """id, concept, phrase
1, general_history, H&P
1, general_history, history and physical
1, general_history, history physical
2, clinic_history, clinic history physical
2, clinic_history, outpatient h p
3, discharge, discharge summary
3, discharge, DCS"""
# split the text into rows of stripped fields (the first row is the header)
rows = [[x.strip() for x in line.split(",")] for line in csv_string.split("\n")]
# make a dictionary with an empty list for every id
id_dict = {line[0]: [] for line in rows[1:]}
# append each phrase to the list for its id
for line in rows[1:]:
    current_id = line[0]
    phrase = line[2]
    id_dict[current_id].append(phrase)
# build a list of lists containing only the unique phrases for each id
result = [[current_id, list(set(phrases))] for current_id, phrases in id_dict.items()]
print(result)
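One caveat of converting through `set` is that the original phrase order is not guaranteed to survive. If order matters, a possible alternative (not part of the answer above) is to collect phrases as dictionary keys, which are unique and keep insertion order. The sketch below groups by concept rather than id and uses hypothetical `pairs` data that includes one repeated phrase.

```python
from collections import defaultdict

# Hypothetical (concept, phrase) pairs, including a repeated phrase.
pairs = [
    ("discharge", "discharge summary"),
    ("discharge", "DCS"),
    ("discharge", "discharge summary"),
    ("general_history", "H&P"),
]

grouped = defaultdict(dict)
for concept, phrase in pairs:
    # Dict keys are unique and keep insertion order, so repeats are dropped.
    grouped[concept][phrase] = None

result = [[concept, list(phrases)] for concept, phrases in grouped.items()]
print(result)
```

The inner dictionaries are used only for their keys; the `None` values are placeholders.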