Parse a file for all occurrences of a string and generate key-values in JSON

  1. I have a file (https://pastebin.com/STgtBRS8) in which I need to search for all occurrences of the word "silencedetect".

  2. Then I have to generate a JSON file containing the key-values for "silence_start", "silence_end", and "silence_duration".

The JSON file should look like this:

[
  {
    "id": 1,
    "silence_start": -0.012381,
    "silence_end": 2.2059,
    "silence_duration": 2.21828
  },
  {
    "id": 2,
    "silence_start": 5.79261,
    "silence_end": 6.91955,
    "silence_duration": 1.12694
  }
]

Here is what I tried:

with open('volume_data.csv', 'r') as myfile:
    data = myfile.read().replace('\n', '')

for line in data:
    if "silencedetect" in data:
        #read silence_start, silence_end, and silence_duration and put in json

I am not able to associate the 3 key-values with each occurrence of "silencedetect". How do I parse the key-values and get them in JSON format?
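For context, the relevant lines in the linked file are ffmpeg silencedetect log lines of roughly this shape (the @ address below is illustrative; the format is inferred from the regexes in the answers):

[silencedetect @ 0x7f8a2b602120] silence_start: -0.012381
[silencedetect @ 0x7f8a2b602120] silence_end: 2.2059 | silence_duration: 2.21828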

You can use a regex for it. This works for me:

import re

with open('volume_data.csv', 'r') as myfile:
    data = myfile.read()

d = re.findall(r'silence_start: (-?\d+\.\d+)\n.*?\n?\[silencedetect @ \w{14}\] silence_end: (-?\d+\.\d+) \| silence_duration: (-?\d+\.\d+)', data)
print(d)

You can put them into JSON with:
import json

out = [{'id': i, 'start': a[0], 'end': a[1], 'duration': a[2]} for i, a in enumerate(d)]
print(json.dumps(out))  # or write to file or... whatever

Output:

'[{"duration": "2.21828", "start": "-0.012381", "end": "2.2059", "id": 0}, {"duration": "1.12694", "start": "5.79261", "end": "6.91955", "id": 1}, {"duration": "0.59288", "start": "8.53256", "end": "9.12544", "id": 2}, {"duration": "1.0805", "start": "9.64712", "end": "10.7276", "id": 3}, {"duration": "1.03406", "start": "12.6657", "end": "13.6998", "id": 4}, {"duration": "0.871519", "start": "19.2602", "end": "20.1317", "id": 5}'

Edit: fixed a bug that missed some matches because a frame=... line fell between the start and the end of a match.

Complex solution using the re.findall and enumerate functions:

import re, json

with open('volume_data.txt', 'r') as f:
    result = []
    # one match per silence block: start, end and duration, in order
    pat = re.compile(r'(silence_start: -?\d+\.\d+).+?(silence_end: -?\d+\.\d+).+?(silence_duration: -?\d+\.\d+)')
    silence_items = re.findall(pat, f.read().replace('\n', ''))
    for i, v in enumerate(silence_items):
        d = {'id': i + 1}
        # split each captured 'key: value' string on the colon and convert the value to float
        d.update({pair[:pair.find(':')]: float(pair[pair.find(':') + 2:]) for pair in v})
        result.append(d)

    print(json.dumps(result, indent=4))

Output:

[
    {
        "id": 1,
        "silence_end": 2.2059,
        "silence_duration": 2.21828,
        "silence_start": -0.012381
    },
    {
        "id": 2,
        "silence_end": 6.91955,
        "silence_duration": 1.12694,
        "silence_start": 5.79261
    },
    {
        "id": 3,
        "silence_end": 9.12544,
        "silence_duration": 0.59288,
        "silence_start": 8.53256
    },
    {
        "id": 4,
        "silence_end": 10.7276,
        "silence_duration": 1.0805,
        "silence_start": 9.64712
    },
    {
        "id": 5,
        "silence_end": 13.6998,
        "silence_duration": 1.03406,
        "silence_start": 12.6657
    },
    {
        "id": 6,
        "silence_end": 20.1317,
        "silence_duration": 0.871519,
        "silence_start": 19.2602
    },
    {
        "id": 7,
        "silence_end": 22.4305,
        "silence_duration": 0.801859,
        "silence_start": 21.6286
    },
    ...
]
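A side note on this approach: stripping the newlines with replace('\n', '') is what lets .+? run across what were originally separate lines, so interleaved frame=... lines do not break a match. An equivalent sketch leaves the text intact and passes the re.DOTALL flag, which makes . match newlines as well:

pat = re.compile(r'(silence_start: -?\d+\.\d+).+?(silence_end: -?\d+\.\d+).+?(silence_duration: -?\d+\.\d+)', re.DOTALL)
silence_items = pat.findall(f.read())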

import re
import json

with open('volume_data.csv', 'r') as myfile:
    data = myfile.read()

# (1) (2)
matcher = re.compile(r'(?P<g1>\[silencedetect @ \w+?\])\s+?silence_start:\s+?(?P<g2>-?\d+?\.\d+?).*?\n([^\[]+?\n)?(?P=g1)\s+?silence_end:\s+?(?P<g3>-?\d+?\.\d+?).+?\|\s+?silence_duration:\s+?(?P<g4>-?\d+?\.\d+?).*?\n')
matchiter = matcher.finditer(data)

entries = []
for i, match in enumerate(matchiter):
    entries.append({'id': i,
                    'silence_start': match.group('g2'),
                    'silence_end': match.group('g3'),
                    'silence_duration': match.group('g4')})

print(json.dumps(entries))

(1) You may want to pass some flags, e.g. re.IGNORECASE, so your script is not affected by such changes.

(2) I prefer identifying the pattern with non-greedy sequences, which may have an impact on both recognition and speed. The use of named groups is a matter of personal preference. They can be useful if you decide to use a matcher.sub operation to reformat the read() text in one pass instead of rebuilding it with iteration; I can add the replacement string if you cannot work it out. Otherwise I prefer using the match object's .group, which is made for this, with names of your choosing instead of g1, g2, g3, g4.
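As an illustration of that sub idea, here is a minimal sketch (assuming the matcher pattern above) that rewrites each matched block into a JSON-like fragment in one pass, using the named groups as backreferences:

reformatted = matcher.sub(
    r'{"silence_start": \g<g2>, "silence_end": \g<g3>, "silence_duration": \g<g4>}\n',
    data)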

Overall, I prefer to use finditer because it is basically designed for this kind of operation. findall produces tuples of the captured groups, which is fine, but sometimes you may want to use information related to the full match, the pattern, the position indices in the analyzed text, and so on.
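For example, a short sketch of what finditer exposes beyond findall's tuples of groups:

for m in matcher.finditer(data):
    print(m.start(), m.end())  # position of the full match in the analyzed text
    print(m.group(0))          # the full matched text
    print(m.group('g2'))       # a named capture group, here the silence_start value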

Edit: I made the regex robust against any string appended after the duration number, as well as against multiple whitespace characters. I also accounted for inserted lines, which you can capture with a named group if needed. It catches 189 occurrences; there are 190 "silence_start" entries, but the last one is not followed by end and duration information.

Assuming your data is ordered, you can simply analyze it as a stream, without regex and without loading the whole file at all:

import json

parsed = []  # a list to hold our parsed values
with open("entries.dat", "r") as f:  # open the file for reading
    current_id = 1  # holds our ID
    entry = None  # holds the current parsed entry
    for line in f:  # ... go through the file line by line
        if line[:14] == "[silencedetect":  # parse the lines starting with [silencedetect
            if entry:  # we already picked up silence_start
                index = line.find("silence_end:")  # find where silence_end starts
                value = line[index + 12:line.find("|", index)].strip()  # the number after it
                entry["silence_end"] = float(value)  # store the silence_end
                # the following step is optional, instead of parsing you can just calculate
                # the silence_duration yourself with:
                # entry["silence_duration"] = entry["silence_end"] - entry["silence_start"]
                index = line.find("silence_duration:")  # find where silence_duration starts
                value = line[index + 17:].strip()  # grab the number after it
                entry["silence_duration"] = float(value)  # store the silence_duration
                # and now that we have everything...
                parsed.append(entry)  # add the entry to our parsed list
                entry = None  # blank out the entry for the next step
            else:  # find silence_start first
                index = line.find("silence_start:")  # find where silence_start, well, starts
                value = line[index + 14:].strip()  # grab the number after it
                entry = {"id": current_id}  # store the current ID...
                entry["silence_start"] = float(value)  # ... and the silence_start
                current_id += 1  # increase our ID value for the next entry

# Now that we have our data, we can easily turn it into JSON and print it out if needed
your_json = json.dumps(parsed, indent=4)  # holds the JSON, pretty-printed
print(your_json)  # let's print it...

You get:

[
    {
        "silence_end": 2.2059, 
        "silence_duration": 2.21828, 
        "id": 1, 
        "silence_start": -0.012381
    }, 
    {
        "silence_end": 6.91955, 
        "silence_duration": 1.12694, 
        "id": 2, 
        "silence_start": 5.79261
    }, 
    {
        "silence_end": 9.12544, 
        "silence_duration": 0.59288, 
        "id": 3, 
        "silence_start": 8.53256
    }, 
    {
        "silence_end": 10.7276, 
        "silence_duration": 1.0805, 
        "id": 4, 
        "silence_start": 9.64712
    }, 
    # 
    # etc.
    # 
    {
        "silence_end": 795.516, 
        "silence_duration": 0.68576, 
        "id": 189, 
        "silence_start": 794.83
    }
]

Keep in mind that JSON does not guarantee key order (nor did Python's dict before v3.7), so id does not necessarily come first, but the data is just as valid either way.

I deliberately kept the initial entry creation separate so you can use collections.OrderedDict as a drop-in replacement (i.e. entry = collections.OrderedDict({"id": current_id})) to preserve the order, if you want it.
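A minimal sketch of that substitution in context (only the creation line changes; the later key assignments then keep their insertion order, and json.dumps serializes them in that order):

import collections

entry = collections.OrderedDict({"id": current_id})  # instead of entry = {"id": current_id}
entry["silence_start"] = float(value)                # subsequent keys keep insertion order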