如何使用 python 正则表达式解析此日志并使用 pandas(可选)导出到 excel?
How to parse this log using python regex and export to excel with pandas (optional)?
我有一个以下格式的日志文件。对于每一行,我需要捕获第 3 列,例如 0102b69880c4b330
、相应的消息 DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG
及其各自的计数(请参阅输出)。我认为使用正则表达式可以让我更轻松地解决问题。
解释:
案例 1:ID 0102b69880c4b330
出现了 3 次(第 1、2、3 行)。因此 ID 的计数为 3,相应的消息 DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG
也出现了 3 次,因此计数为 3.
案例2:现在第4行和第5行的ID 0102b69880c4e3b2
有两条不同的消息JMS DO_METHOD TRACE LAUNCH, DO_METHOD TRACE LAUNCH
,ID计数为2,但计数为他们的消息应该分别是1、1。
案例3:第10行到最后一行的ID0102b6988000000c
有消息DM_WORKFLOW_E_PROCESS_AUTO_TASK
。 ID 计数为 3,消息计数为 3。但是这里我需要获取此错误消息旁边的流程任务 ID 和工作流 ID。
我在输出中使用 [Ignore for this]
只是为了解释我不需要 ID。
最后我还需要维护DM_WORKFLOW_E_PROCESS_AUTO_TASK
的总数。
Input:
2019-05-05T00:05:11.507245 12090[12090] 0102b69880c4b330 [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info: Attempting to status Index Agent Instance host-address_9200_IndexAgent
2019-05-05T00:05:11.759829 12090[12090] 0102b69880c4b330 [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info : Response from HTTP_POST command: HTTP/1.1 200 OK Status: 0 , Time Taken: 0 seconds.
2019-05-05T00:05:11.759898 12090[12090] 0102b69880c4b330 [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info : HTTP_POST with args -command status -docbase SubWayX -user dm_fulltext_index_user -ticket ****** -instance host-address_9200_IndexAgent -details false to Index Agent host-address_9200_IndexAgent is successful.
2019-05-05T01:40:53.148751 20135[20135] 0102b69880c4e3b2 JMS DO_METHOD TRACE LAUNCH: do_method launch: successful: user: Xie Xiaoke, session id: 0102b69880c4e3b2, JMS id: 0802b69880003535, method: D2LifecycleChangeStateMethod, host:host-address.net, port:9082, path:/DmMethods/servlet/DoMethod
2019-05-05T01:40:53.148877 20135[20135] 0102b69880c4e3b2 DO_METHOD TRACE LAUNCH: method launch: successful, user: Xie Xiaoke, session id: 0102b69880c4e3b2, method: D2LifecycleChangeStateMethod
2019-05-07T05:42:21.171087 22484[22484] 0102b6988000000b [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error: "Workflow Agent failed to process task 4a02b698800aad04 of workflow 4d02b6988000f709. The task is using method 'D2WFLifeCycleMethod'. Activity: 'Demote to Draft with new Version'. Check the Java Method Server log for errors."
2019-05-05T05:24:48.483966 17114[17114] 0102b69880c4fb1e JMS DO_METHOD TRACE LAUNCH: user: dmadmin, session id: 0102b69880c4fb1e, JMS id: 0802b69880003535, method: D2LifecycleChangeStateMethod, host:host-address.net, port:9082, path:/DmMethods/servlet/DoMethod, arguments:-method_verb com.emc.d2.api.methods.D2Method -class_name com.emc.d2.api.methods.D2LifecycleChangeStateMethod -__dm_docbase__ SubWayX -__dm_server_config__ host-address_SubWayX -docbase_name SubWayX -user_name dmadmin -method_return_id "0802b6988167b46e" -locale en
2019-05-05T05:24:50.362650 17114[17114] 0102b69880c4fb1e JMS DO_METHOD TRACE LAUNCH: do_method launch: successful: user: dmadmin, session id: 0102b69880c4fb1e, JMS id: 0802b69880003535, method: D2LifecycleChangeStateMethod, host:host-address.net, port:9082, path:/DmMethods/servlet/DoMethod
2019-05-05T05:24:50.362702 17114[17114] 0102b69880c4fb1e DO_METHOD TRACE LAUNCH: method launch: successful, user: dmadmin, session id: 0102b69880c4fb1e, method: D2LifecycleChangeStateMethod
2019-05-05T05:44:35.410674 12791[12791] 0102b6988000000c [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error: "Workflow Agent failed to process task 4a02b698800a977c of workflow 4d02b698800107e9. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs."
2019-05-05T05:50:31.383668 12791[12791] 0102b6988000000c [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error: "Workflow Agent failed to process task 4a02b698800a9782 of workflow 4d02b6988001081e. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs."
2019-05-05T05:53:49.978053 12791[12791] 0102b6988000000c [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error: "Workflow Agent failed to process task 4a02b698800a9784 of workflow 4d02b6988001081c. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs."
2019-05-05T00:50:11.761273 2591[2591] 0102b69880c4ccde [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info: Attempting to status Index Agent Instance phchbs-sp220333_9200_IndexAgent
2019-05-05T00:50:12.015521 2591[2591] 0102b69880c4ccde [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info : Response from HTTP_POST command: HTTP/1.1 200 OK Status: 0 , Time Taken: 1 seconds.
2019-05-05T00:50:12.015563 2591[2591] 0102b69880c4ccde [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info : HTTP_POST with args -command status -docbase SubWayX -user dm_fulltext_index_user -ticket ****** -instance phchbs-sp220333_9200_IndexAgent -details false to Index Agent phchbs-sp220333_9200_IndexAgent is successful.
I need to get the below output:
Output:
ID: Count: Message: Corresponding Message Count Task ID: Workflow ID
0102b69880c4b330 3 DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG 3 [Ignore for this] [Ignore for this]
0102b69880c4e3b2 2 JMS DO_METHOD TRACE LAUNCH, DO_METHOD TRACE LAUNCH 1, 1 [Ignore for this] [Ignore for this]
0102b6988000000b 1 DM_WORKFLOW_E_PROCESS_AUTO_TASK 1 4a02b698800aad04 4d02b6988000f709
0102b69880c4fb1e 3 JMS DO_METHOD TRACE LAUNCH, DO_METHOD TRACE LAUNCH 2, 1 [Ignore for this] [Ignore for this]
0102b6988000000c 3 DM_WORKFLOW_E_PROCESS_AUTO_TASK 3 4a02b698800a977c, 4a02b698800a9782, 4a02b698800a9784 4d02b698800107e9, 4d02b6988001081e, 4d02b6988001081c
我试过测试的程序如下。我没有在 ID 列之后正确使用正则表达式,我只是选择了包含 [] 中的值的值,但它跳过了没有的值。它也不会选择流程任务 ID 和工作流 ID。你能帮我修改我的代码以获得正确的计数、任务 ID 和工作流 ID 吗?
import re
import collections
regexp = re.compile(
r'(?P<date>[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]{6}\s*)'+
'(?P<un_num>[0-9]{3,5}\[[0-9]{3,5}\]\s*)'+
'(?P<id>[a-z0-9]{16}\s*)'+
'(?P<message>\[(.*?)\])'
)
ls = ["2019-05-05T00:05:11.507245 12090[12090] 0102b69880c4b330 [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info: Attempting to status Index Agent Instance host-address_9200_IndexAgent",
"2019-05-05T00:05:11.759829 12090[12090] 0102b69880c4b330 [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info : Response from HTTP_POST command: HTTP/1.1 200 OK Status: 0 , Time Taken: 0 seconds.",
"2019-05-05T00:05:11.759898 12090[12090] 0102b69880c4b330 [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info : HTTP_POST with args -command status -docbase SubWayX -user dm_fulltext_index_user -ticket ****** -instance host-address_9200_IndexAgent -details false to Index Agent host-address_9200_IndexAgent is successful.",
"2019-05-05T01:40:53.148751 20135[20135] 0102b69880c4e3b2 JMS DO_METHOD TRACE LAUNCH: do_method launch: successful: user: Xie Xiaoke, session id: 0102b69880c4e3b2, JMS id: 0802b69880003535, method: D2LifecycleChangeStateMethod, host:host-address.net, port:9082, path:/DmMethods/servlet/DoMethod",
"2019-05-05T01:40:53.148877 20135[20135] 0102b69880c4e3b2 DO_METHOD TRACE LAUNCH: method launch: successful, user: Xie Xiaoke, session id: 0102b69880c4e3b2, method: D2LifecycleChangeStateMethod",
"2019-05-07T05:42:21.171087 22484[22484] 0102b6988000000b [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error: 'Workflow Agent failed to process task 4a02b698800aad04 of workflow 4d02b6988000f709. The task is using method 'D2WFLifeCycleMethod'. Activity: 'Demote to Draft with new Version'. Check the Java Method Server log for errors.'",
"2019-05-05T05:44:35.410674 12791[12791] 0102b6988000000c [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error: 'Workflow Agent failed to process task 4a02b698800a977c of workflow 4d02b698800107e9. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs.'",
"2019-05-05T05:50:31.383668 12791[12791] 0102b6988000000c [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error: 'Workflow Agent failed to process task 4a02b698800a9782 of workflow 4d02b6988001081e. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs.'",
"2019-05-05T05:53:49.978053 12791[12791] 0102b6988000000c [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error: 'Workflow Agent failed to process task 4a02b698800a9784 of workflow 4d02b6988001081c. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs.'"
]
id_counter = collections.Counter()
message_counter = collections.Counter()
print("started......!!!!!")
for i in range(len(ls)):
x = regexp.match(ls[i])
y = re.search(regexp, ls[i])
if x is None or y is None:
print("None")
continue
print("-----------------")
print(y.group('date'))
print(y.group('un_num'))
print(y.group('id'))
id_counter.update([y.group('id')])
print(y.group('message'))
message_counter.update([y.group('message')])
print("end....!!!")
print(id_counter)
print(message_counter)
def print_counts(cdict):
for key, values in enumerate(cdict.items()):
print(key, values)
print_counts(id_counter)
print_counts(message_counter)
这个输出是:
started......!!!!!
-----------------
2019-05-05T00:05:11.507245
12090[12090]
0102b69880c4b330
[DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG]
-----------------
2019-05-05T00:05:11.759829
12090[12090]
0102b69880c4b330
[DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG]
-----------------
2019-05-05T00:05:11.759898
12090[12090]
0102b69880c4b330
[DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG]
None
None
-----------------
2019-05-07T05:42:21.171087
22484[22484]
0102b6988000000b
[DM_WORKFLOW_E_PROCESS_AUTO_TASK]
-----------------
2019-05-05T05:44:35.410674
12791[12791]
0102b6988000000c
[DM_WORKFLOW_E_PROCESS_AUTO_TASK]
-----------------
2019-05-05T05:50:31.383668
12791[12791]
0102b6988000000c
[DM_WORKFLOW_E_PROCESS_AUTO_TASK]
-----------------
2019-05-05T05:53:49.978053
12791[12791]
0102b6988000000c
[DM_WORKFLOW_E_PROCESS_AUTO_TASK]
end....!!!
Counter({'0102b69880c4b330\t': 3, '0102b6988000000c\t': 3, '0102b6988000000b ': 1})
Counter({'[DM_WORKFLOW_E_PROCESS_AUTO_TASK]': 4, '[DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG]': 3})
0 ('0102b69880c4b330\t', 3)
1 ('0102b6988000000b ', 1)
2 ('0102b6988000000c\t', 3)
0 ('[DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG]', 3)
1 ('[DM_WORKFLOW_E_PROCESS_AUTO_TASK]', 4)
以文本形式输入数据开始:
txt = """
2019-05-05T00:05:11.507245 12090[12090] 0102b69880c4b330 [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info: Attempting to status Index Agent Instance host-address_9200_IndexAgent
2019-05-05T00:05:11.759829 12090[12090] 0102b69880c4b330 [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info : Response from HTTP_POST command: HTTP/1.1 200 OK Status: 0 , Time Taken: 0 seconds.
2019-05-05T00:05:11.759898 12090[12090] 0102b69880c4b330 [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info : HTTP_POST with args -command status -docbase SubWayX -user dm_fulltext_index_user -ticket ****** -instance host-address_9200_IndexAgent -details false to Index Agent host-address_9200_IndexAgent is successful.
2019-05-05T01:40:53.148751 20135[20135] 0102b69880c4e3b2 JMS DO_METHOD TRACE LAUNCH: do_method launch: successful: user: Xie Xiaoke, session id: 0102b69880c4e3b2, JMS id: 0802b69880003535, method: D2LifecycleChangeStateMethod, host:host-address.net, port:9082, path:/DmMethods/servlet/DoMethod
2019-05-05T01:40:53.148877 20135[20135] 0102b69880c4e3b2 DO_METHOD TRACE LAUNCH: method launch: successful, user: Xie Xiaoke, session id: 0102b69880c4e3b2, method: D2LifecycleChangeStateMethod
2019-05-07T05:42:21.171087 22484[22484] 0102b6988000000b [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error: "Workflow Agent failed to process task 4a02b698800aad04 of workflow 4d02b6988000f709. The task is using method 'D2WFLifeCycleMethod'. Activity: 'Demote to Draft with new Version'. Check the Java Method Server log for errors."
2019-05-05T05:24:48.483966 17114[17114] 0102b69880c4fb1e JMS DO_METHOD TRACE LAUNCH: user: dmadmin, session id: 0102b69880c4fb1e, JMS id: 0802b69880003535, method: D2LifecycleChangeStateMethod, host:host-address.net, port:9082, path:/DmMethods/servlet/DoMethod, arguments:-method_verb com.emc.d2.api.methods.D2Method -class_name com.emc.d2.api.methods.D2LifecycleChangeStateMethod -__dm_docbase__ SubWayX -__dm_server_config__ host-address_SubWayX -docbase_name SubWayX -user_name dmadmin -method_return_id "0802b6988167b46e" -locale en
2019-05-05T05:24:50.362650 17114[17114] 0102b69880c4fb1e JMS DO_METHOD TRACE LAUNCH: do_method launch: successful: user: dmadmin, session id: 0102b69880c4fb1e, JMS id: 0802b69880003535, method: D2LifecycleChangeStateMethod, host:host-address.net, port:9082, path:/DmMethods/servlet/DoMethod
2019-05-05T05:24:50.362702 17114[17114] 0102b69880c4fb1e DO_METHOD TRACE LAUNCH: method launch: successful, user: dmadmin, session id: 0102b69880c4fb1e, method: D2LifecycleChangeStateMethod
2019-05-05T05:44:35.410674 12791[12791] 0102b6988000000c [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error: "Workflow Agent failed to process task 4a02b698800a977c of workflow 4d02b698800107e9. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs."
2019-05-05T05:50:31.383668 12791[12791] 0102b6988000000c [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error: "Workflow Agent failed to process task 4a02b698800a9782 of workflow 4d02b6988001081e. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs."
2019-05-05T05:53:49.978053 12791[12791] 0102b6988000000c [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error: "Workflow Agent failed to process task 4a02b698800a9784 of workflow 4d02b6988001081c. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs."
"""
我们可以做一些预处理,首先拆分成行并丢弃空行:
lines = [line for line in txt.split('\n') if line.strip()]
然后提取我们感兴趣的块,但只是对数据进行粗略(且非常快速)的拆分
parts = [(line[44:60], line[64:].split(':', 1)) for line in lines]
更新: 因为你的新数据不是固定宽度的,我们需要一些其他方式 pre-processing 它,例如:
# parts = [(line[44:60], line[64:].split(':', 1)) for line in lines]
import re
lines = [re.sub(r'\s+', ' ', line) for line in lines] # squash all multiple spaces to a single space
parts = [line.split() for line in lines] # split on whitespace
parts = [(line[2], ' '.join(line[3:]).split(':', 1)) for line in parts] # this is similar to the original line
记住,这部分只是为了让下面的 InputData class 中的最终处理更容易。
然后我们为我们感兴趣的输入数据创建一个数据结构,它可以获取我们在部分中拥有的 pre-processed 数据:
class InputData(object):
def __init__(self, idtag, (msg, details)): # py3 is more awkward here (*)
self.idtag = idtag
self.error_task = None
self.error_workflow = None
msg = msg.strip()
if msg.endswith('] info'):
self.msg = msg[1:-len('] info')]
elif msg.endswith('error'):
self.msg = msg[1:-len(']error')]
self.error_task = details.split(' task ', 1)[1].split(' ', 1)[0]
self.error_workflow = details.split(' workflow ', 1)[1].split('.', 1)[0]
else:
self.msg = msg
def __repr__(self):
return repr(self.__dict__) # this is a great trick for making debugging easier
(*) 对于 py3 你需要(不确定他们为什么改变这个......?)
def __init__(self, idtag, tmp):
msg, details = tmp
现在我们可以将此 class 应用于 pre-processed 输入:
input_data = [InputData(*part) for part in parts]
如果我们打印出到目前为止的内容:
for d in input_data:
print d
结果是:
{'error_workflow': None, 'error_task': None, 'idtag': '0102b69880c4b330', 'msg': 'DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG'}
{'error_workflow': None, 'error_task': None, 'idtag': '0102b69880c4b330', 'msg': 'DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG'}
{'error_workflow': None, 'error_task': None, 'idtag': '0102b69880c4b330', 'msg': 'DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG'}
{'error_workflow': None, 'error_task': None, 'idtag': '0102b69880c4e3b2', 'msg': 'JMS DO_METHOD TRACE LAUNCH'}
{'error_workflow': None, 'error_task': None, 'idtag': '0102b69880c4e3b2', 'msg': 'DO_METHOD TRACE LAUNCH'}
{'error_workflow': '4d02b6988000f709', 'error_task': '4a02b698800aad04', 'idtag': '0102b6988000000b', 'msg': 'DM_WORKFLOW_E_PROCESS_AUTO_TASK'}
...
现在我们创建一个 class 来表示我们想要的输出数据:
from collections import defaultdict
class OutputData(object):
def __init__(self): # I'm using this class in a defaultdict, so the __init__ method can't take any arguments
self.idtag = None
self.idtag_count = 0
self.messages = defaultdict(int)
self.errors = []
self.workflows = []
def add(self, indata):
"Adds indata to this object."
self.idtag = indata.idtag
self.idtag_count += 1
self.messages[indata.msg] += 1
if indata.error_task:
self.errors.append(indata.error_task)
self.workflows.append(indata.error_workflow)
并将输入数据输入其中:
output_data = defaultdict(OutputData)
for indata in input_data:
output_data[indata.idtag].add(indata)
最后,我们可以将输出数据输出为我们想要的格式:
fmt = '%-20s %-6s %-55s %-15s %-60s %s'
print fmt % ('ID:', 'Count:', 'Message:', 'msg counts', 'taskid', 'workflowid')
for outdata in output_data.values():
print fmt % (
outdata.idtag,
outdata.idtag_count,
', '.join(outdata.messages.keys()),
', '.join(str(outdata.messages[k]) for k in outdata.messages.keys()),
', '.join(outdata.errors),
', '.join(outdata.workflows)
)
这种结构,即:pre-process文本,提取有趣的输入数据,将输入数据转换为输出数据,最后serializing/formatting输出数据;对于所有此类问题都适用,并且它使以后的调试和修改变得更加容易。
我有一个以下格式的日志文件。对于每一行,我需要捕获第 3 列,例如 0102b69880c4b330
、相应的消息 DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG
及其各自的计数(请参阅输出)。我认为使用正则表达式可以让我更轻松地解决问题。
解释:
案例 1:ID 0102b69880c4b330
出现了 3 次(第 1、2、3 行)。因此 ID 的计数为 3,相应的消息 DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG
也出现了 3 次,因此计数为 3.
案例2:现在第4行和第5行的ID 0102b69880c4e3b2
有两条不同的消息JMS DO_METHOD TRACE LAUNCH, DO_METHOD TRACE LAUNCH
,ID计数为2,但计数为他们的消息应该分别是1、1。
案例3:第10行到最后一行的ID0102b6988000000c
有消息DM_WORKFLOW_E_PROCESS_AUTO_TASK
。 ID 计数为 3,消息计数为 3。但是这里我需要获取此错误消息旁边的流程任务 ID 和工作流 ID。
我在输出中使用 [Ignore for this]
只是为了解释我不需要 ID。
最后我还需要维护DM_WORKFLOW_E_PROCESS_AUTO_TASK
的总数。
Input:
2019-05-05T00:05:11.507245 12090[12090] 0102b69880c4b330 [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info: Attempting to status Index Agent Instance host-address_9200_IndexAgent
2019-05-05T00:05:11.759829 12090[12090] 0102b69880c4b330 [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info : Response from HTTP_POST command: HTTP/1.1 200 OK Status: 0 , Time Taken: 0 seconds.
2019-05-05T00:05:11.759898 12090[12090] 0102b69880c4b330 [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info : HTTP_POST with args -command status -docbase SubWayX -user dm_fulltext_index_user -ticket ****** -instance host-address_9200_IndexAgent -details false to Index Agent host-address_9200_IndexAgent is successful.
2019-05-05T01:40:53.148751 20135[20135] 0102b69880c4e3b2 JMS DO_METHOD TRACE LAUNCH: do_method launch: successful: user: Xie Xiaoke, session id: 0102b69880c4e3b2, JMS id: 0802b69880003535, method: D2LifecycleChangeStateMethod, host:host-address.net, port:9082, path:/DmMethods/servlet/DoMethod
2019-05-05T01:40:53.148877 20135[20135] 0102b69880c4e3b2 DO_METHOD TRACE LAUNCH: method launch: successful, user: Xie Xiaoke, session id: 0102b69880c4e3b2, method: D2LifecycleChangeStateMethod
2019-05-07T05:42:21.171087 22484[22484] 0102b6988000000b [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error: "Workflow Agent failed to process task 4a02b698800aad04 of workflow 4d02b6988000f709. The task is using method 'D2WFLifeCycleMethod'. Activity: 'Demote to Draft with new Version'. Check the Java Method Server log for errors."
2019-05-05T05:24:48.483966 17114[17114] 0102b69880c4fb1e JMS DO_METHOD TRACE LAUNCH: user: dmadmin, session id: 0102b69880c4fb1e, JMS id: 0802b69880003535, method: D2LifecycleChangeStateMethod, host:host-address.net, port:9082, path:/DmMethods/servlet/DoMethod, arguments:-method_verb com.emc.d2.api.methods.D2Method -class_name com.emc.d2.api.methods.D2LifecycleChangeStateMethod -__dm_docbase__ SubWayX -__dm_server_config__ host-address_SubWayX -docbase_name SubWayX -user_name dmadmin -method_return_id "0802b6988167b46e" -locale en
2019-05-05T05:24:50.362650 17114[17114] 0102b69880c4fb1e JMS DO_METHOD TRACE LAUNCH: do_method launch: successful: user: dmadmin, session id: 0102b69880c4fb1e, JMS id: 0802b69880003535, method: D2LifecycleChangeStateMethod, host:host-address.net, port:9082, path:/DmMethods/servlet/DoMethod
2019-05-05T05:24:50.362702 17114[17114] 0102b69880c4fb1e DO_METHOD TRACE LAUNCH: method launch: successful, user: dmadmin, session id: 0102b69880c4fb1e, method: D2LifecycleChangeStateMethod
2019-05-05T05:44:35.410674 12791[12791] 0102b6988000000c [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error: "Workflow Agent failed to process task 4a02b698800a977c of workflow 4d02b698800107e9. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs."
2019-05-05T05:50:31.383668 12791[12791] 0102b6988000000c [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error: "Workflow Agent failed to process task 4a02b698800a9782 of workflow 4d02b6988001081e. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs."
2019-05-05T05:53:49.978053 12791[12791] 0102b6988000000c [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error: "Workflow Agent failed to process task 4a02b698800a9784 of workflow 4d02b6988001081c. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs."
2019-05-05T00:50:11.761273 2591[2591] 0102b69880c4ccde [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info: Attempting to status Index Agent Instance phchbs-sp220333_9200_IndexAgent
2019-05-05T00:50:12.015521 2591[2591] 0102b69880c4ccde [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info : Response from HTTP_POST command: HTTP/1.1 200 OK Status: 0 , Time Taken: 1 seconds.
2019-05-05T00:50:12.015563 2591[2591] 0102b69880c4ccde [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info : HTTP_POST with args -command status -docbase SubWayX -user dm_fulltext_index_user -ticket ****** -instance phchbs-sp220333_9200_IndexAgent -details false to Index Agent phchbs-sp220333_9200_IndexAgent is successful.
I need to get the below output:
Output:
ID: Count: Message: Corresponding Message Count Task ID: Workflow ID
0102b69880c4b330 3 DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG 3 [Ignore for this] [Ignore for this]
0102b69880c4e3b2 2 JMS DO_METHOD TRACE LAUNCH, DO_METHOD TRACE LAUNCH 1, 1 [Ignore for this] [Ignore for this]
0102b6988000000b 1 DM_WORKFLOW_E_PROCESS_AUTO_TASK 1 4a02b698800aad04 4d02b6988000f709
0102b69880c4fb1e 3 JMS DO_METHOD TRACE LAUNCH, DO_METHOD TRACE LAUNCH 2, 1 [Ignore for this] [Ignore for this]
0102b6988000000c 3 DM_WORKFLOW_E_PROCESS_AUTO_TASK 3 4a02b698800a977c, 4a02b698800a9782, 4a02b698800a9784 4d02b698800107e9, 4d02b6988001081e, 4d02b6988001081c
我试过测试的程序如下。我没有在 ID 列之后正确使用正则表达式,我只是选择了包含 [] 中的值的值,但它跳过了没有的值。它也不会选择流程任务 ID 和工作流 ID。你能帮我修改我的代码以获得正确的计数、任务 ID 和工作流 ID 吗?
import re
import collections
regexp = re.compile(
r'(?P<date>[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]{6}\s*)'+
'(?P<un_num>[0-9]{3,5}\[[0-9]{3,5}\]\s*)'+
'(?P<id>[a-z0-9]{16}\s*)'+
'(?P<message>\[(.*?)\])'
)
ls = ["2019-05-05T00:05:11.507245 12090[12090] 0102b69880c4b330 [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info: Attempting to status Index Agent Instance host-address_9200_IndexAgent",
"2019-05-05T00:05:11.759829 12090[12090] 0102b69880c4b330 [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info : Response from HTTP_POST command: HTTP/1.1 200 OK Status: 0 , Time Taken: 0 seconds.",
"2019-05-05T00:05:11.759898 12090[12090] 0102b69880c4b330 [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info : HTTP_POST with args -command status -docbase SubWayX -user dm_fulltext_index_user -ticket ****** -instance host-address_9200_IndexAgent -details false to Index Agent host-address_9200_IndexAgent is successful.",
"2019-05-05T01:40:53.148751 20135[20135] 0102b69880c4e3b2 JMS DO_METHOD TRACE LAUNCH: do_method launch: successful: user: Xie Xiaoke, session id: 0102b69880c4e3b2, JMS id: 0802b69880003535, method: D2LifecycleChangeStateMethod, host:host-address.net, port:9082, path:/DmMethods/servlet/DoMethod",
"2019-05-05T01:40:53.148877 20135[20135] 0102b69880c4e3b2 DO_METHOD TRACE LAUNCH: method launch: successful, user: Xie Xiaoke, session id: 0102b69880c4e3b2, method: D2LifecycleChangeStateMethod",
"2019-05-07T05:42:21.171087 22484[22484] 0102b6988000000b [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error: 'Workflow Agent failed to process task 4a02b698800aad04 of workflow 4d02b6988000f709. The task is using method 'D2WFLifeCycleMethod'. Activity: 'Demote to Draft with new Version'. Check the Java Method Server log for errors.'",
"2019-05-05T05:44:35.410674 12791[12791] 0102b6988000000c [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error: 'Workflow Agent failed to process task 4a02b698800a977c of workflow 4d02b698800107e9. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs.'",
"2019-05-05T05:50:31.383668 12791[12791] 0102b6988000000c [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error: 'Workflow Agent failed to process task 4a02b698800a9782 of workflow 4d02b6988001081e. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs.'",
"2019-05-05T05:53:49.978053 12791[12791] 0102b6988000000c [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error: 'Workflow Agent failed to process task 4a02b698800a9784 of workflow 4d02b6988001081c. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs.'"
]
id_counter = collections.Counter()
message_counter = collections.Counter()
print("started......!!!!!")
for i in range(len(ls)):
x = regexp.match(ls[i])
y = re.search(regexp, ls[i])
if x is None or y is None:
print("None")
continue
print("-----------------")
print(y.group('date'))
print(y.group('un_num'))
print(y.group('id'))
id_counter.update([y.group('id')])
print(y.group('message'))
message_counter.update([y.group('message')])
print("end....!!!")
print(id_counter)
print(message_counter)
def print_counts(cdict):
for key, values in enumerate(cdict.items()):
print(key, values)
print_counts(id_counter)
print_counts(message_counter)
这个输出是:
started......!!!!!
-----------------
2019-05-05T00:05:11.507245
12090[12090]
0102b69880c4b330
[DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG]
-----------------
2019-05-05T00:05:11.759829
12090[12090]
0102b69880c4b330
[DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG]
-----------------
2019-05-05T00:05:11.759898
12090[12090]
0102b69880c4b330
[DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG]
None
None
-----------------
2019-05-07T05:42:21.171087
22484[22484]
0102b6988000000b
[DM_WORKFLOW_E_PROCESS_AUTO_TASK]
-----------------
2019-05-05T05:44:35.410674
12791[12791]
0102b6988000000c
[DM_WORKFLOW_E_PROCESS_AUTO_TASK]
-----------------
2019-05-05T05:50:31.383668
12791[12791]
0102b6988000000c
[DM_WORKFLOW_E_PROCESS_AUTO_TASK]
-----------------
2019-05-05T05:53:49.978053
12791[12791]
0102b6988000000c
[DM_WORKFLOW_E_PROCESS_AUTO_TASK]
end....!!!
Counter({'0102b69880c4b330\t': 3, '0102b6988000000c\t': 3, '0102b6988000000b ': 1})
Counter({'[DM_WORKFLOW_E_PROCESS_AUTO_TASK]': 4, '[DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG]': 3})
0 ('0102b69880c4b330\t', 3)
1 ('0102b6988000000b ', 1)
2 ('0102b6988000000c\t', 3)
0 ('[DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG]', 3)
1 ('[DM_WORKFLOW_E_PROCESS_AUTO_TASK]', 4)
以文本形式输入数据开始:
txt = """
2019-05-05T00:05:11.507245 12090[12090] 0102b69880c4b330 [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info: Attempting to status Index Agent Instance host-address_9200_IndexAgent
2019-05-05T00:05:11.759829 12090[12090] 0102b69880c4b330 [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info : Response from HTTP_POST command: HTTP/1.1 200 OK Status: 0 , Time Taken: 0 seconds.
2019-05-05T00:05:11.759898 12090[12090] 0102b69880c4b330 [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info : HTTP_POST with args -command status -docbase SubWayX -user dm_fulltext_index_user -ticket ****** -instance host-address_9200_IndexAgent -details false to Index Agent host-address_9200_IndexAgent is successful.
2019-05-05T01:40:53.148751 20135[20135] 0102b69880c4e3b2 JMS DO_METHOD TRACE LAUNCH: do_method launch: successful: user: Xie Xiaoke, session id: 0102b69880c4e3b2, JMS id: 0802b69880003535, method: D2LifecycleChangeStateMethod, host:host-address.net, port:9082, path:/DmMethods/servlet/DoMethod
2019-05-05T01:40:53.148877 20135[20135] 0102b69880c4e3b2 DO_METHOD TRACE LAUNCH: method launch: successful, user: Xie Xiaoke, session id: 0102b69880c4e3b2, method: D2LifecycleChangeStateMethod
2019-05-07T05:42:21.171087 22484[22484] 0102b6988000000b [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error: "Workflow Agent failed to process task 4a02b698800aad04 of workflow 4d02b6988000f709. The task is using method 'D2WFLifeCycleMethod'. Activity: 'Demote to Draft with new Version'. Check the Java Method Server log for errors."
2019-05-05T05:24:48.483966 17114[17114] 0102b69880c4fb1e JMS DO_METHOD TRACE LAUNCH: user: dmadmin, session id: 0102b69880c4fb1e, JMS id: 0802b69880003535, method: D2LifecycleChangeStateMethod, host:host-address.net, port:9082, path:/DmMethods/servlet/DoMethod, arguments:-method_verb com.emc.d2.api.methods.D2Method -class_name com.emc.d2.api.methods.D2LifecycleChangeStateMethod -__dm_docbase__ SubWayX -__dm_server_config__ host-address_SubWayX -docbase_name SubWayX -user_name dmadmin -method_return_id "0802b6988167b46e" -locale en
2019-05-05T05:24:50.362650 17114[17114] 0102b69880c4fb1e JMS DO_METHOD TRACE LAUNCH: do_method launch: successful: user: dmadmin, session id: 0102b69880c4fb1e, JMS id: 0802b69880003535, method: D2LifecycleChangeStateMethod, host:host-address.net, port:9082, path:/DmMethods/servlet/DoMethod
2019-05-05T05:24:50.362702 17114[17114] 0102b69880c4fb1e DO_METHOD TRACE LAUNCH: method launch: successful, user: dmadmin, session id: 0102b69880c4fb1e, method: D2LifecycleChangeStateMethod
2019-05-05T05:44:35.410674 12791[12791] 0102b6988000000c [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error: "Workflow Agent failed to process task 4a02b698800a977c of workflow 4d02b698800107e9. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs."
2019-05-05T05:50:31.383668 12791[12791] 0102b6988000000c [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error: "Workflow Agent failed to process task 4a02b698800a9782 of workflow 4d02b6988001081e. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs."
2019-05-05T05:53:49.978053 12791[12791] 0102b6988000000c [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error: "Workflow Agent failed to process task 4a02b698800a9784 of workflow 4d02b6988001081c. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs."
"""
我们可以做一些预处理,首先拆分成行并丢弃空行:
lines = [line for line in txt.split('\n') if line.strip()]
然后提取我们感兴趣的块,但只是对数据进行粗略(且非常快速)的拆分
parts = [(line[44:60], line[64:].split(':', 1)) for line in lines]
更新: 因为你的新数据不是固定宽度的,我们需要一些其他方式 pre-processing 它,例如:
# parts = [(line[44:60], line[64:].split(':', 1)) for line in lines]
import re
lines = [re.sub(r'\s+', ' ', line) for line in lines] # squash all multiple spaces to a single space
parts = [line.split() for line in lines] # split on whitespace
parts = [(line[2], ' '.join(line[3:]).split(':', 1)) for line in parts] # this is similar to the original line
记住,这部分只是为了让下面的 InputData class 中的最终处理更容易。
然后我们为我们感兴趣的输入数据创建一个数据结构,它可以获取我们在部分中拥有的 pre-processed 数据:
class InputData(object):
def __init__(self, idtag, (msg, details)): # py3 is more awkward here (*)
self.idtag = idtag
self.error_task = None
self.error_workflow = None
msg = msg.strip()
if msg.endswith('] info'):
self.msg = msg[1:-len('] info')]
elif msg.endswith('error'):
self.msg = msg[1:-len(']error')]
self.error_task = details.split(' task ', 1)[1].split(' ', 1)[0]
self.error_workflow = details.split(' workflow ', 1)[1].split('.', 1)[0]
else:
self.msg = msg
def __repr__(self):
return repr(self.__dict__) # this is a great trick for making debugging easier
(*) 对于 py3 你需要(不确定他们为什么改变这个......?)
def __init__(self, idtag, tmp):
msg, details = tmp
现在我们可以将此 class 应用于 pre-processed 输入:
input_data = [InputData(*part) for part in parts]
如果我们打印出到目前为止的内容:
for d in input_data:
print d
结果是:
{'error_workflow': None, 'error_task': None, 'idtag': '0102b69880c4b330', 'msg': 'DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG'}
{'error_workflow': None, 'error_task': None, 'idtag': '0102b69880c4b330', 'msg': 'DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG'}
{'error_workflow': None, 'error_task': None, 'idtag': '0102b69880c4b330', 'msg': 'DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG'}
{'error_workflow': None, 'error_task': None, 'idtag': '0102b69880c4e3b2', 'msg': 'JMS DO_METHOD TRACE LAUNCH'}
{'error_workflow': None, 'error_task': None, 'idtag': '0102b69880c4e3b2', 'msg': 'DO_METHOD TRACE LAUNCH'}
{'error_workflow': '4d02b6988000f709', 'error_task': '4a02b698800aad04', 'idtag': '0102b6988000000b', 'msg': 'DM_WORKFLOW_E_PROCESS_AUTO_TASK'}
...
现在我们创建一个 class 来表示我们想要的输出数据:
from collections import defaultdict
class OutputData(object):
def __init__(self): # I'm using this class in a defaultdict, so the __init__ method can't take any arguments
self.idtag = None
self.idtag_count = 0
self.messages = defaultdict(int)
self.errors = []
self.workflows = []
def add(self, indata):
"Adds indata to this object."
self.idtag = indata.idtag
self.idtag_count += 1
self.messages[indata.msg] += 1
if indata.error_task:
self.errors.append(indata.error_task)
self.workflows.append(indata.error_workflow)
并将输入数据输入其中:
output_data = defaultdict(OutputData)
for indata in input_data:
output_data[indata.idtag].add(indata)
最后,我们可以将输出数据输出为我们想要的格式:
fmt = '%-20s %-6s %-55s %-15s %-60s %s'
print fmt % ('ID:', 'Count:', 'Message:', 'msg counts', 'taskid', 'workflowid')
for outdata in output_data.values():
print fmt % (
outdata.idtag,
outdata.idtag_count,
', '.join(outdata.messages.keys()),
', '.join(str(outdata.messages[k]) for k in outdata.messages.keys()),
', '.join(outdata.errors),
', '.join(outdata.workflows)
)
这种结构,即:pre-process文本,提取有趣的输入数据,将输入数据转换为输出数据,最后serializing/formatting输出数据;对于所有此类问题都适用,并且它使以后的调试和修改变得更加容易。