How to obtain a dictionary of dictionaries that stores aggregated values from a CSV file
I have a data file with the following contents:
Part#1
A 10 20 10 10 30 10 20 10 30 10 20
B 10 10 20 10 10 30 10 30 10 20 30
Part#2
A 30 30 30 10 10 20 20 20 10 10 10
B 10 10 20 10 10 30 10 30 10 30 10
Part#3
A 10 20 10 30 10 20 10 20 10 20 10
B 10 10 20 20 20 30 10 10 20 20 30
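From that, I would like a dictionary of dictionaries holding the aggregated counts for each letter, so it would look like this:
dictionary = {'Part#1': {'A': {10: 6, 20: 3, 30: 2},
                         'B': {10: 6, 20: 2, 30: 3}},
              'Part#2': {'A': {10: 5, 20: 3, 30: 3},
                         'B': {10: 7, 20: 1, 30: 3}},
              'Part#3': {'A': {10: 6, 20: 4, 30: 1},
                         'B': {10: 4, 20: 5, 30: 2}}}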
If I want to display a single part, it should give me output like this:
dictionary['Part#1']
A
10: 6
20: 3
30: 2
B
10: 6
20: 2
30: 3
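… and so on for the following partitions in the file.
So far I have been able to parse the file from txt to csv and convert it into a dictionary, call it the outer dictionary. I have been testing several approaches to see what output I get, and so far this code comes closest (though not all the way) to the structure I described above.
partitions_dict = df.head(5).to_dict(orient='list')
print(partitions_dict)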
Output:
{0: ['A', 'B', 'A', 'B', 'A'], 1: ['10', '10', '10', '10', '10'], 2: [10, 10, 10, 10, 10], 3: [10, 10, 10, 10, 10], 4: [10, 10, 10, 10, 10], 5: [10, 10, 10, 10, 10], 6: [10, 10, 10, 10, 10], 7: [10, 10, 10, 10, 10]
The function I use to parse the file:
import csv
import os

import pandas as pd


def fileFormatConverter(txt_file):
    """Receives a generated text file of partitions as a parameter
    and converts it into csv format.
    input: text file
    return: path of the csv file"""
    filename, ext = os.path.splitext(txt_file)
    csv_file = filename + ".csv"
    # Context managers make sure both files are flushed and closed before pandas reads the csv
    with open(txt_file, "r", newline="") as src, open(csv_file, "w", newline="") as dst:
        in_txt = csv.reader(src, delimiter=' ')
        out_csv = csv.writer(dst)
        out_csv.writerows(in_txt)
    return csv_file


# removes "Part#0" as a header from the dataframe
df_traces = pd.read_csv(fileFormatConverter("sample.txt"), skiprows=1, header=None)  #, error_bad_lines=False)
df_traces.head()
Output:
0 1 2 3 4 5 6 7 8 9 ... 15 16 17 18 19 20 21 22 23 24
0 A, 10, 20, 10, 10, 30, 10, 20, 10, 30, ... 20, 10, 10, 30, 10, 30, 10, 20, 30.0 NaN
1 Part#2 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 A, 30, 30, 30, 10, 10, 20, 20, 20, 10, ... 20, 10, 10, 30, 10, 30, 10, 30, 10.0 NaN
3 Part#3 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 A, 10, 20, 10, 30, 10, 20, 10, 20, 10, ... 20, 20, 20, 30, 10, 10, 20, 20, 30.0 NaN
I used a function to change the headers so that the letters inside each partition are easier to work with:
def changeDFHeaders(df):
    df_transpose = df.T
    new_header = df_transpose.iloc[0]  # stores the first row for the header
    df_transpose = df_transpose[1:]  # take the data less the header row
    df_transpose.columns = new_header  # set the header row as the df header
    return df_transpose


# The counter column serves as an index for the entire dataframe
#df_transpose['counter'] = range(len(df_transpose)) # adds the counter for rows column
#df_transpose.set_index('counter', inplace=True)
df_transpose_headers = changeDFHeaders(df_traces)
df_transpose_headers.infer_objects()
Output:
A, Part#2 A, Part#3 A,
1 10, NaN 30, NaN 10,
2 20, NaN 30, NaN 20,
3 10, NaN 30, NaN 10,
4 10, NaN 10, NaN 30,
5 30, NaN 10, NaN 10,
6 10, NaN 20, NaN 20,
7 20, NaN 20, NaN 10,
8 10, NaN 20, NaN 20,
9 30, NaN 10, NaN 10,
10 10, NaN 10, NaN 20,
11 20, NaN 10, NaN 10,
12 B, NaN B, NaN B,
13 10, NaN 10, NaN 10,
14 10, NaN 10, NaN 10,
15 20, NaN 20, NaN 20,
16 10, NaN 10, NaN 20,
17 10, NaN 10, NaN 20,
18 30, NaN 30, NaN 30,
19 10, NaN 10, NaN 10,
20 30, NaN 30, NaN 10,
21 10, NaN 10, NaN 20,
22 20, NaN 30, NaN 20,
23 30 NaN 10 NaN 30
24 NaN NaN NaN NaN NaN
-- still not quite right...
If you check this statement:
df = df_transpose_headers
partitions_dict = df.head(5).to_dict(orient='list')
print(partitions_dict)
Output:
{'A,': ['10,', '20,', '10,', '30,', '10,'], 'Part#2': [nan, nan, nan, nan, nan], 'Part#3': [nan, nan, nan, nan, nan]}
I would avoid pandas, simply because I don't know it well:
from collections import Counter

result = {}
part = ""
group = ""
for line in f:  # f being an open file
    sline = line.strip()
    if sline.startswith("Part"):
        part = sline
        result[part] = {}
        continue
    group = sline.split()[0]
    result[part][group] = Counter(sline.split()[1:])
The result has the following form:
{'Part#1': {'A': Counter({'10': 6, '20': 3, '30': 2}), 'B': Counter({'10': 6, '30': 3, '20': 2})},
'Part#2': {'A': Counter({'10': 5, '30': 3, '20': 3}), 'B': Counter({'10': 7, '30': 3, '20': 1})},
'Part#3': {'A': Counter({'10': 6, '20': 4, '30': 1}), 'B': Counter({'20': 5, '10': 4, '30': 2})}}
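The keys in the result above are strings ('10') and the inner values are Counter objects, whereas the question asks for integer keys in plain dictionaries. Here is a minimal sketch of one way to get there, assuming every value token parses as an integer and every group line comes after a "Part" line. The function name `count_parts` and the inline `sample` string are hypothetical, made up for illustration:

```python
from collections import Counter


def count_parts(lines):
    """Count values per group within each 'Part#N' section.

    `lines` is any iterable of strings, e.g. an open file.
    Assumes every value token parses as an integer.
    """
    result = {}
    part = None
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        if tokens[0].startswith("Part"):
            part = tokens[0]
            result[part] = {}
            continue
        group, values = tokens[0], tokens[1:]
        # Convert to int so the keys are 10, 20, 30 rather than '10', '20', '30'
        result[part][group] = dict(Counter(int(v) for v in values))
    return result


# Hypothetical sample: the first section of the data file from the question
sample = """Part#1
A 10 20 10 10 30 10 20 10 30 10 20
B 10 10 20 10 10 30 10 30 10 20 30
"""
counts = count_parts(sample.splitlines())
print(counts["Part#1"]["A"])
```

Because the function accepts any iterable of lines, an open file object can be passed in place of `sample.splitlines()`.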
If you start directly from a file with no line breaks, you can use "Part" to find the sections and then use the index of "B" to separate the two groups of data:
result = {}
# here f must be the whole file content as a single string, not a file object
sf = f.split("Part")[1:]  # drop the empty first part
for line in sf:
    line = line.strip()  # remove surrounding whitespace
    sline = line.split()  # split on whitespace
    part = "Part%s" % sline[0]
    result[part] = {}
    # Use the index of "B" to split the value lists
    b_index = sline.index("B")
    result[part][sline[1]] = Counter(sline[2:b_index])
    result[part]["B"] = Counter(sline[b_index + 1:])
With the input file:
Part#1
A 10 20 10 10 30 10 20 10 30 10 20
B 10 10 20 10 10 30 10 30 10 20 30
Part#2
A 30 30 30 10 10 20 20 20 10 10 10
B 10 10 20 10 10 30 10 30 10 30 10
Part#3
A 10 20 10 30 10 20 10 20 10 20 10
B 10 10 20 20 20 30 10 10 20 20 30
This should work:
def parse_file(file_name):
    return_dict = dict()
    section = str()
    with open(file_name, "r") as source:
        for line in source.readlines():
            if "#" in line:
                # a line containing "#" starts a new section
                section = line.strip()
                return_dict[section] = dict()
                continue
            tmp = line.strip().split()
            group = tmp.pop(0)  # first token is the group label, the rest are values
            return_dict[section][group] = dict()
            for item in tmp:
                if item in return_dict[section][group].keys():
                    return_dict[section][group][item] += 1
                else:
                    return_dict[section][group][item] = 1
    return return_dict
Output:
{'Part#1': {'A': {'10': 6, '20': 3, '30': 2},
'B': {'10': 6, '20': 2, '30': 3}},
'Part#2': {'A': {'10': 5, '20': 3, '30': 3},
'B': {'10': 7, '20': 1, '30': 3}},
'Part#3': {'A': {'10': 6, '20': 4, '30': 1},
'B': {'10': 4, '20': 5, '30': 2}}}
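To print one section in the layout shown in the question, the nested dictionary can be walked directly. This is a small sketch: `format_part` is a hypothetical helper, and `parsed` is a hand-copied excerpt of the output above.

```python
def format_part(parsed, part):
    """Return one section as text: a group label, then 'value: count' lines."""
    lines = []
    for group, counts in parsed[part].items():
        lines.append(group)
        # Sort numerically so the order stays 10, 20, 30 even though the keys are strings
        for value in sorted(counts, key=int):
            lines.append(f"{value}: {counts[value]}")
    return "\n".join(lines)


# Hand-copied excerpt of the parsed result, for illustration only
parsed = {"Part#1": {"A": {"10": 6, "20": 3, "30": 2},
                     "B": {"10": 6, "20": 2, "30": 3}}}
print(format_part(parsed, "Part#1"))
```

Returning a string rather than printing inside the helper keeps it easy to reuse, for example when writing the summary to a file.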
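Honestly, I don't see why you want an intermediate stage; if you have to parse the file once to create a CSV, you could put your logic there and build the dict() directly. So I apologise if I have missed some subtlety in the question.
Edit: reworked the answer following the comment that the input file is actually a single line.
So with the input file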
Part#1 A 10 20 10 10 30 10 20 10 30 10 20 B 10 10 20 10 10 30 10 30 10 20 30 Part#2 A 30 30 30 10 10 20 20 20 10 10 10 B 10 10 20 10 10 30 10 30 10 30 10 Part#3 A 10 20 10 30 10 20 10 20 10 20 10 B 10 10 20 20 20 30 10 10 20 20 30
The following modified code will work:
import string
from pprint import pprint


def parse_file2(file_name):
    return_dict = dict()
    section = None
    group = None
    with open(file_name, "r") as source:
        for line in source.readlines():
            tmp_line = line.strip().split()
            for token in tmp_line:
                if "#" in token:
                    # a token such as "Part#1" starts a new section
                    section = token
                    return_dict[section] = dict()
                    continue
                elif token in string.ascii_uppercase:
                    # a single uppercase letter starts a new group
                    group = token
                    return_dict[section][group] = dict()
                    continue
                if section and group:
                    if token in return_dict[section][group].keys():
                        return_dict[section][group][token] += 1
                    else:
                        return_dict[section][group][token] = 1
    return return_dict


if __name__ == "__main__":
    # file_name and file_name2 must be set to the paths of the multi-line and single-line input files
    pprint(parse_file(file_name))
    pprint(parse_file2(file_name2))
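Note that this function is written specifically for the file format you described in the comments. If the file is not in that format, it may blow up.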
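One way to make such a failure easier to diagnose is to reject unexpected tokens explicitly instead of letting a KeyError surface later. The following is a hedged sketch, not a replacement for the function above. `parse_tokens` and its error messages are hypothetical, the shortened `one_line` input is made up, and the sketch assumes that section markers contain "#", group labels are purely alphabetic, and values consist only of digits:

```python
def parse_tokens(text):
    """Parse whitespace-separated tokens into {section: {group: {value: count}}}.

    Raises ValueError on a token that does not fit the assumed format.
    """
    result = {}
    section = None
    group = None
    for token in text.split():
        if "#" in token:
            section, group = token, None  # a new section resets the current group
            result[section] = {}
        elif token.isalpha():
            if section is None:
                raise ValueError(f"group {token!r} appears before any section")
            group = token
            result[section][group] = {}
        elif token.isdigit():
            if group is None:
                raise ValueError(f"value {token!r} appears before any group")
            counts = result[section][group]
            counts[token] = counts.get(token, 0) + 1
        else:
            raise ValueError(f"unrecognised token {token!r}")
    return result


# Hypothetical, shortened single-line input
one_line = "Part#1 A 10 20 10 B 30 30 Part#2 A 10"
print(parse_tokens(one_line))
```

Since `str.split()` with no arguments splits on any whitespace, the same function handles both the single-line and the multi-line layout when given the full file contents.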
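Given the question, this should work.
Also, if you could simplify the question post above to state only the actual file contents and the desired result, or simply "I have structure A and want to convert it to structure B", I will clear out all the history in this post and you will get a simpler answer.
Hope this helps! :)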