Find keywords in a file, parse the lines they are on, return a dict
I need to find information in a file. The file has many lines, but the part I am looking for looks like this:
Initial command:
/opt/user/program/pg.c01/l1.exe "/scratch/user/pg-18930.inp" -scrdir="/scratch/user/"
Entering Link 1 = /opt/software/program/pg.c01/l1.exe PID= 18941.
Copyright (c) 1950-2050, program, Inc. All Rights Reserved.
I need to parse the file for -scrdir="/scratch/user/"
and PID= 18941.
I want to return a dict like this:
dict = {"-scrdir=":"/scratch/user/", "PID":18941}
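This should be generic, because I want to pass in a set of things to search for (i.e. -scrdir=, and/or PID, and/or others) and get back whatever follows those keywords in the file, if they are present.
What I have so far seems to work, but the logic feels heavy.
As an MWE, I store the information in a list rather than a file, and have the following: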
log = ["this is a line Initial",
       '/opt/user/program/pg.c01/l1.exe "/scratch/user/pg-18930.inp" -scrdir="/scratch/user/"',
       "Entering Link 1 = /opt/software/program/pg.c01/l1.exe PID= 18941.",
       " ",
       "Copyright (c) 1950-2050, program, Inc. All Rights Reserved."]
dicti = {}
phrases = ["-scrdir", "PID"]
# with open(file, 'r') as log:  # would use in real situation
for line in log:
    if any(word in line for word in phrases):
        for phrase in phrases:
            try:
                # keep whatever follows "<phrase>=" on this line
                dicti[phrase] = line.split(phrase + "=")[1]
            except IndexError:
                # this phrase is not on this line
                pass
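Is there a more concise way to write this?
A final note: the files are usually much smaller than 1 MB, and speed is not a priority. It does not need to be fast or efficient... just elegant, I suppose.
You can write down all the specific regular expressions you want to search for in your text, and then combine them with the |
alternation operator (equivalent to an OR operator):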
import re

# raw strings keep \w, \s and \d from being read as string escapes
REGEXES = (
    r'(-scrdir)="([/\w]+)"',
    r'(PID)=\s*(\d+)',
)
dicti = dict(
    [z for z in w if z != '']  # drop the empty groups of the alternatives that did not match
    for y in filter(lambda x: x, map(re.compile("|".join(REGEXES)).findall, log))  # all matches on each line
    for w in y  # loop over the matches found on that line
)
dicti
The resulting dicti is:
{'-scrdir': '/scratch/user/', 'PID': '18941'}
It also works when a single line contains multiple matches. For example, if you have:
log = ["this is a line Initial",
'/opt/user/program/pg.c01/l1.exe "/scratch/user/pg-18930.inp" -scrdir="/scratch/user/" Entering Link 1 = /opt/software/program/pg.c01/l1.exe PID= 18941.',
" ",
"Copyright (c) 1950-2050, program, Inc. All Rights Reserved."]
the output will be:
{'-scrdir': '/scratch/user/', 'PID': '18941'}
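Since the question asks for something generic that takes a set of keywords, the same idea can be wrapped in a small helper that builds the pattern from the keyword names. This is only a sketch under some assumptions: the name parse_keywords is hypothetical, each keyword is assumed to be followed by "=", a value is assumed to be either double-quoted or a single whitespace-free token, and a trailing period on an unquoted value is assumed to be sentence punctuation.

```python
import re


def parse_keywords(lines, keywords):
    """Return {keyword: value} for every 'keyword=value' found in lines."""
    # Alternation of the escaped keywords, then '=', optional whitespace,
    # and either a double-quoted value or a bare token.
    names = "|".join(re.escape(keyword) for keyword in keywords)
    pattern = re.compile("(" + names + r')=\s*(?:"([^"]*)"|([^\s"]+))')
    found = {}
    for line in lines:
        for key, quoted, bare in pattern.findall(line):
            # Assumption: a trailing '.' on a bare value is punctuation.
            found[key] = quoted if quoted else bare.rstrip(".")
    return found


log = ["this is a line Initial",
       '/opt/user/program/pg.c01/l1.exe "/scratch/user/pg-18930.inp" -scrdir="/scratch/user/"',
       "Entering Link 1 = /opt/software/program/pg.c01/l1.exe PID= 18941.",
       " ",
       "Copyright (c) 1950-2050, program, Inc. All Rights Reserved."]
result = parse_keywords(log, ["-scrdir", "PID"])
```

Keywords that never appear are simply absent from the returned dict, and values stay strings, so convert with int() if you need the PID as a number.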
Note that this does not work if the text you are looking for is spread over multiple lines.
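If that limitation matters, one possible workaround is to search the whole text at once instead of line by line, so that whitespace patterns such as \s* can cross a line break. The sketch below assumes the file is small enough to hold in memory (the question says well under 1 MB); the function name parse_text and the wrapped log are made up for illustration, and \s* has been added after -scrdir= so that key can wrap too.

```python
import re

REGEXES = (
    r'(-scrdir)=\s*"([/\w]+)"',
    r'(PID)=\s*(\d+)',
)
PATTERN = re.compile("|".join(REGEXES))


def parse_text(text):
    """Search the whole text at once so whitespace patterns can span lines."""
    return dict(
        [group for group in match if group != ""]  # keep the key/value pair
        for match in PATTERN.findall(text)
    )


# Hypothetical log in which the PID value wraps onto the next line.
log = ['/opt/user/program/pg.c01/l1.exe -scrdir="/scratch/user/"',
       "Entering Link 1 = /opt/software/program/pg.c01/l1.exe PID=",
       " 18941."]
result = parse_text("\n".join(log))
```

With a real file, the same function could be called on the string returned by the file object's read() method. It still will not help if a key or value is itself split in the middle by a line break.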