How to obtain a dictionary of dictionaries that stores aggregated values from a csv file

Question

I have a data file with the following content:

 Part#1
         A 10 20 10 10 30 10 20 10 30 10 20
         B 10 10 20 10 10 30 10 30 10 20 30
  Part#2
         A 30 30 30 10 10 20 20 20 10 10 10
         B 10 10 20 10 10 30 10 30 10 30 10
  Part#3
         A 10 20 10 30 10 20 10 20 10 20 10
         B 10 10 20 20 20 30 10 10 20 20 30

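From it I would like to build a dictionary of dictionaries that holds the aggregated counts for each letter, so it would look like this: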

dictionary = {'Part#1': {'A': {10: 6, 20: 3, 30: 2},
                         'B': {10: 6, 20: 2, 30: 3}},
              'Part#2': {'A': {10: 5, 20: 3, 30: 3},
                         'B': {10: 7, 20: 1, 30: 3}},
              'Part#3': {'A': {10: 6, 20: 4, 30: 1},
                         'B': {10: 4, 20: 5, 30: 2}}}

If I want to display a single part, it should give me output like this:

dictionary['Part#1']

A
 10: 6
 20: 3
 30: 2

B
 10: 6
 20: 2
 30: 3

… and so on for the following partitions in the file.

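So far I have been able to parse the file from txt to csv and convert it into a dictionary, let's call it the outer dictionary. I have been testing several approaches to see what output I get, and so far this code comes closest to (but does not fully match) the structure I described above.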

partitions_dict = df.head(5).to_dict(orient='list')

print(partitions_dict)

Output:

{0: ['A', 'B', 'A', 'B', 'A'], 1: ['10', '10', '10', '10', '10'], 2: [10, 10, 10, 10, 10], 3: [10, 10, 10, 10, 10], 4: [10, 10, 10, 10, 10], 5: [10, 10, 10, 10, 10], 6: [10, 10, 10, 10, 10], 7: [10, 10, 10, 10, 10]

The function I use to parse the file:

import csv
import os

def fileFormatConverter(txt_file):
    """ Receives a generated text file of partitions as a parameter
        and converts it into csv format.
        input: text file
        return: csv file """

    filename, ext = os.path.splitext(txt_file)
    csv_file = filename + ".csv"
    # "with" closes both files; newline='' is what the csv module expects
    with open(txt_file, "r", newline='') as src, open(csv_file, "w", newline='') as dst:
        csv.writer(dst).writerows(csv.reader(src, delimiter=' '))
    return csv_file

import pandas as pd

# skiprows=1 skips the first line (the first "Part#" header) so it does not end up in the dataframe
df_traces = pd.read_csv(fileFormatConverter("sample.txt"), skiprows=1, header=None)   #, error_bad_lines=False)
df_traces.head()

Output:

    0   1   2   3   4   5   6   7   8   9   ...     15  16  17  18  19  20  21  22  23  24
0   A,  10,     20,     10,     10,     30,     10,     20,     10,     30,     ...     20,     10,     10,     30,     10,     30,     10,     20,     30.0    NaN
1   Part#2  NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     ...     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
2   A,  30,     30,     30,     10,     10,     20,     20,     20,     10,     ...     20,     10,     10,     30,     10,     30,     10,     30,     10.0    NaN
3   Part#3  NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     ...     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
4   A,  10,     20,     10,     30,     10,     20,     10,     20,     10,     ...     20,     20,     20,     30,     10,     10,     20,     20,     30.0    NaN

I used a function to change the headers so that the letters inside each partition are easier to work with:

def changeDFHeaders(df):

    df_transpose = df.T
    new_header = df_transpose.iloc[0]                       # stores the first row for the header
    df_transpose = df_transpose[1:]                         # take the data less the header row
    df_transpose.columns = new_header                       # set the header row as the df header
    return(df_transpose)


# The counter column serves as an index for the entire dataframe
#df_transpose['counter'] = range(len(df_transpose))      # adds the counter for rows column
#df_transpose.set_index('counter', inplace=True)
df_transpose_headers = changeDFHeaders(df_traces)
df_transpose_headers.infer_objects()

Output:

    A,  Part#2  A,  Part#3  A,
1   10,     NaN     30,     NaN     10,
2   20,     NaN     30,     NaN     20,
3   10,     NaN     30,     NaN     10,
4   10,     NaN     10,     NaN     30,
5   30,     NaN     10,     NaN     10,
6   10,     NaN     20,     NaN     20,
7   20,     NaN     20,     NaN     10,
8   10,     NaN     20,     NaN     20,
9   30,     NaN     10,     NaN     10,
10  10,     NaN     10,     NaN     20,
11  20,     NaN     10,     NaN     10,
12  B,  NaN     B,  NaN     B,
13  10,     NaN     10,     NaN     10,
14  10,     NaN     10,     NaN     10,
15  20,     NaN     20,     NaN     20,
16  10,     NaN     10,     NaN     20,
17  10,     NaN     10,     NaN     20,
18  30,     NaN     30,     NaN     30,
19  10,     NaN     10,     NaN     10,
20  30,     NaN     30,     NaN     10,
21  10,     NaN     10,     NaN     20,
22  20,     NaN     30,     NaN     20,
23  30  NaN     10  NaN     30
24  NaN     NaN     NaN     NaN     NaN

-- still not quite right...

If you check this statement:

df = df_transpose_headers
partitions_dict = df.head(5).to_dict(orient='list')      

print(partitions_dict)

Output:

{'A,': ['10,', '20,', '10,', '30,', '10,'], 'Part#2': [nan, nan, nan, nan, nan], 'Part#3': [nan, nan, nan, nan, nan]}

Answer 1

I would avoid pandas, simply because I don't know it well:

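from collections import Counter

result = {}
part = ""
for line in f:  # f being an open file
    sline = line.strip()
    if not sline:  # skip blank lines
        continue
    if sline.startswith("Part"):
        # a new section starts: remember its name and give it its own dict
        part = sline
        result[part] = {}
        continue
    # the first token is the group letter, the rest are the values to count
    group, *values = sline.split()
    result[part][group] = Counter(values)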

The result takes the following form:

{'Part#1': {'A': Counter({'10': 6, '20': 3, '30': 2}), 'B': Counter({'10': 6, '30': 3, '20': 2})}, 
 'Part#2': {'A': Counter({'10': 5, '30': 3, '20': 3}), 'B': Counter({'10': 7, '30': 3, '20': 1})}, 
 'Part#3': {'A': Counter({'10': 6, '20': 4, '30': 1}), 'B': Counter({'20': 5, '10': 4, '30': 2})}}
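Note that the keys in these Counter objects are strings, whereas the structure in the question uses integer keys. A minimal sketch of one way to convert them and to print a part in the layout the question asks for; `to_int_keys` and `show_part` are hypothetical helper names, and the sample `result` is a shortened copy of the output above:

```python
from collections import Counter

# Shortened copy of the output above (only Part#1, for brevity).
result = {
    "Part#1": {
        "A": Counter({"10": 6, "20": 3, "30": 2}),
        "B": Counter({"10": 6, "30": 3, "20": 2}),
    },
}


def to_int_keys(nested):
    """Return plain dicts whose value keys are ints, in numeric order."""
    converted = {}
    for part, groups in nested.items():
        converted[part] = {}
        for group, counts in groups.items():
            # sort numerically so 10 < 20 < 30 regardless of count order
            converted[part][group] = {
                int(value): counts[value] for value in sorted(counts, key=int)
            }
    return converted


def show_part(nested, part):
    """Build the per-group listing for one part as a single string."""
    lines = []
    for group, counts in nested[part].items():
        lines.append(group)
        lines.extend(" %d: %d" % (value, count) for value, count in counts.items())
        lines.append("")  # blank line between groups
    return "\n".join(lines).rstrip()


dictionary = to_int_keys(result)
print(show_part(dictionary, "Part#1"))
```

Converting once after parsing keeps the parsing loop itself unchanged; the conversion could equally be done while reading, by calling `int` on each value before counting.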

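If you are starting directly from a file with no line breaks, you can use "Part" to find the sections, and then use the index of "B" to separate the two groups of data: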

result = {}
sf = f.split("Part")[1:]  # here f is the file content as one string; drop the empty first part
for line in sf:
    sline = line.split()  # split on whitespace: ['#1', 'A', '10', ..., 'B', '10', ...]
    part = "Part%s" % sline[0]
    b_index = sline.index("B")  # use the index of B to split the value lists
    result[part] = {}
    result[part][sline[1]] = Counter(sline[2:b_index])
    result[part]["B"] = Counter(sline[b_index + 1:])

Answer 2

Given the input file:

  Part#1
         A 10 20 10 10 30 10 20 10 30 10 20
         B 10 10 20 10 10 30 10 30 10 20 30
  Part#2
         A 30 30 30 10 10 20 20 20 10 10 10
         B 10 10 20 10 10 30 10 30 10 30 10
  Part#3
         A 10 20 10 30 10 20 10 20 10 20 10
         B 10 10 20 20 20 30 10 10 20 20 30

This should work:

def parse_file(file_name):
    return_dict = dict()
    section = str()
    with open(file_name, "r") as source:
        for line in source:
            if "#" in line:
                # section header such as "Part#1"
                section = line.strip()
                return_dict[section] = dict()
                continue
            tmp = line.strip().split()
            if not tmp:  # skip blank lines
                continue
            group = tmp.pop(0)
            counts = return_dict[section][group] = dict()
            for item in tmp:
                # dict.get supplies 0 the first time a value is seen
                counts[item] = counts.get(item, 0) + 1

    return return_dict

Output:

{'Part#1': {'A': {'10': 6, '20': 3, '30': 2},
            'B': {'10': 6, '20': 2, '30': 3}},
 'Part#2': {'A': {'10': 5, '20': 3, '30': 3},
            'B': {'10': 7, '20': 1, '30': 3}},
 'Part#3': {'A': {'10': 6, '20': 4, '30': 1},
            'B': {'10': 4, '20': 5, '30': 2}}}

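To be honest, I don't see why you would want an intermediate stage. If you have to parse the file once to create a CSV anyway, you could put the logic that builds your dict() right there. So I apologise if I have missed some subtlety in the question.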
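If a CSV is still wanted for other tools, one option is to write it from the aggregated dictionary rather than before it. This is only a sketch and not something the question requires: `to_rows` is a hypothetical helper, the column names are made up for illustration, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# Shortened copy of the output above.
nested = {
    "Part#1": {"A": {"10": 6, "20": 3, "30": 2},
               "B": {"10": 6, "20": 2, "30": 3}},
}


def to_rows(nested):
    """Flatten {part: {group: {value: count}}} into (part, group, value, count) rows."""
    return [
        (part, group, value, count)
        for part, groups in nested.items()
        for group, counts in groups.items()
        for value, count in counts.items()
    ]


buffer = io.StringIO()  # stands in for open("some_file.csv", "w", newline="")
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["part", "group", "value", "count"])  # illustrative column names
writer.writerows(to_rows(nested))
print(buffer.getvalue())
```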

Edit: reworked the answer based on the comment that the input file is actually a single line.

So with the input file being:

Part#1 A 10 20 10 10 30 10 20 10 30 10 20 B 10 10 20 10 10 30 10 30 10 20 30 Part#2 A 30 30 30 10 10 20 20 20 10 10 10 B 10 10 20 10 10 30 10 30 10 30 10 Part#3 A 10 20 10 30 10 20 10 20 10 20 10 B 10 10 20 20 20 30 10 10 20 20 30

the following modified code will work:

import string
from pprint import pprint

def parse_file2(file_name):
    return_dict = dict()
    section = None
    group = None
    with open(file_name, "r") as source:
        for line in source.readlines():
            tmp_line = line.strip().split()
            for token in tmp_line:
                if "#" in token:
                    section = token
                    return_dict[section] = dict()
                    continue
                elif token in string.ascii_uppercase:
                    group = token
                    return_dict[section][group] = dict()
                    continue
                if section and group:
                    if token in return_dict[section][group].keys():
                        return_dict[section][group][token] += 1
                    else:
                        return_dict[section][group][token] = 1

    return return_dict

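if __name__ == "__main__":
    # placeholder paths: point these at the multi-line and the single-line file
    file_name = "sample.txt"
    file_name2 = "sample_single_line.txt"
    pprint(parse_file(file_name))
    pprint(parse_file2(file_name2))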

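Note that this function is tailored specifically to the file format you described in the comments. If the file format is not what you said, it may blow up.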
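To make that failure mode explicit instead of a bare KeyError, here is a sketch of a stricter variant that works on a string rather than a file. It assumes section names contain "#", group labels are single uppercase letters, and values are non-negative integers; `parse_tokens` is a hypothetical name and the error messages are illustrative.

```python
import string


def parse_tokens(text):
    """Count values per group per section, failing loudly on unexpected input."""
    counts = {}
    section = group = None
    for token in text.split():
        if "#" in token:
            section, group = token, None  # new section: forget the old group
            counts[section] = {}
        elif len(token) == 1 and token in string.ascii_uppercase:
            if section is None:
                raise ValueError("group %r appears before any section" % token)
            group = token
            counts[section][group] = {}
        elif not token.isdigit():
            raise ValueError("unexpected token %r" % token)
        elif group is None:
            raise ValueError("value %r appears outside a group" % token)
        else:
            bucket = counts[section][group]
            bucket[token] = bucket.get(token, 0) + 1
    return counts


print(parse_tokens("Part#1 A 10 20 10 B 30 30"))
# {'Part#1': {'A': {'10': 2, '20': 1}, 'B': {'30': 2}}}
```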

Going by the question, this should work.

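Also, if you could simplify the question post above so that it states only the actual file contents and the desired result, or simply "I have structure A and want to convert it into structure B", I will clear out all the history in this post and give a much simpler answer.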

Hope this helps! :)
