If not exist, append. If exist, increment count

Question

I'm new to Python (and fairly new to programming in general) and really need your help.

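I'm trying to read a firewall log file. I'm interested in every line that contains Deny. When one is found, the script should extract the source IP, destination IP, destination port, and protocol. I don't want to see every line, though, only the unique ones. So far, so good: everything works (although I'm sure it could be done more cleverly). I'd also like to add a counter so I can see how many times a particular combination of s_ip, d_ip, d_port, and protocol has occurred, but I don't know how.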

Sample log file:

Nov  9 00:36:10 firewall %ASA-4-106023: Deny tcp src outside:1.1.1.1/43882 dst outside:2.2.2.2/23 by access-group "outside-in" [0x0, 0x0]
Nov  9 00:36:10 firewall %ASA-4-106023: Deny tcp src outside:1.1.1.1/38780 dst outside:2.2.2.2/23 by access-group "outside-in" [0x0, 0x0]
Nov  9 00:36:11 firewall %ASA-4-106023: Deny tcp src outside:1.1.1.1/8273 dst outside:2.2.2.2/23 by access-group "outside-in" [0x0, 0x0]
Nov  9 00:36:12 firewall %ASA-4-106023: Deny tcp src outside:1.1.1.1/23433 dst outside:2.2.2.22/23 by access-group "outside-in" [0x0, 0x0]
Nov  9 00:36:12 firewall %ASA-4-106023: Deny tcp src outside:1.1.1.1/25175 dst outside:2.2.2.24/23 by access-group "outside-in" [0x0, 0x0]
Nov  9 00:36:12 firewall %ASA-4-106023: Deny tcp src outside:1.1.1.1/15855 dst outside:2.2.2.26/23 by access-group "outside-in" [0x0, 0x0]
Nov  9 00:36:12 firewall %ASA-4-106023: Deny tcp src outside:1.1.1.1/24574 dst outside:2.2.2.27/23 by access-group "outside-in" [0x0, 0x0]
Nov  9 00:36:12 firewall %ASA-4-106023: Deny tcp src outside:1.1.1.1/21797 dst outside:2.2.2.29/23 by access-group "outside-in" [0x0, 0x0]
Nov  9 00:36:12 firewall %ASA-4-106023: Deny udp src outside:3.3.3.3/12112 dst outside:2.2.2.99/53031 by access-group "outside-in" [0x0, 0x0]
Nov  9 00:36:13 firewall %ASA-4-106023: Deny icmp src outside:4.4.4.4 dst services:2.2.2.211 (type 11, code 1) by access-group "outside-in" [0x0, 0x0]
Nov  9 00:36:17 firewall %ASA-4-106023: Deny icmp src outside:4.4.4.4 dst services:2.2.2.10 (type 3, code 3) by access-group "outside-in" [0x0, 0x0]

I'm able to get the following result:

'icmp'
'tcp', '1.1.1.1', '2.2.2.2', '23'
'tcp', '1.1.1.1', '2.2.2.22', '23'
'tcp', '1.1.1.1', '2.2.2.24', '23'
'tcp', '1.1.1.1', '2.2.2.26', '23'
'tcp', '1.1.1.1', '2.2.2.27', '23'
'tcp', '1.1.1.1', '2.2.2.29', '23'
'udp', '3.3.3.3', '2.2.2.99', '53031'

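I'm not getting the full icmp output yet (icmp lines have no /port, and my regex relies on it to capture the IP address), and I'll try to make the output a bit nicer (removing the ' and ,). What I really want, though, is a hit count on each line; for example, the first tcp line would have a hit count of 3, and so on.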

import re       #for regular expressions - to match ip's
import sys      #for parsing command line opts

# if file is specified on command line, parse, else ask for file
if sys.argv[1:]:
    print "File: %s" % (sys.argv[1])
    logfile = sys.argv[1]
else:
    logfile = raw_input("Please enter a file to parse, e.g /var/log/secure: ")

match = []
seen = []

# find all Deny lines and append them in a list
for lines in open(logfile) :
    extract = re.findall('Deny.*"' ,lines)
    for i in extract :
        match.append(i)

# extract different keywords from Deny lines
for lines in match :
    prot = re.findall('Deny\s(.+?)\ssrc',lines)
    ip_src = re.findall('src.*?:([0-9a-f].*?)/', lines)
    ip_dst = re.findall('dst.*?:([0-9a-f].*?)/', lines)
    #ip_sport = re.findall('src.*?[0-9a-f].*?/([0-9].*?)\s', lines)     # uncomment if you want source port also, and add ip_sport to summarized below
    ip_dport = re.findall('dst.*?[0-9a-f].*?/([0-9].*?)\s', lines)

    summarized = prot + ip_src + ip_dst + ip_dport

    if summarized not in seen :             # only add unique entries
        seen.append(summarized)


# sort 
seen.sort()

for lines in seen :
    print ( ", ".join( repr(e) for e in lines ) )

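Also, I tried throwing a 3 GB log file at it, and it has now been running for a few hours. Any good ideas for optimizing the code?

I know I'm asking a lot, and any help would be appreciated, but my main question is how to get a counter on each line.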

Answer 1

To avoid duplicate entries you can use a set instead of a list. I would do it this way:

seen = set()
for lines in open(logfile) :
    extract = re.findall('Deny.*"' ,lines)
    for i in extract :
        prot = re.findall('Deny\s(.+?)\ssrc',i)
        ip_src = re.findall('src.*?:([0-9a-f].*?)/', i)
        ip_dst = re.findall('dst.*?:([0-9a-f].*?)/', i)
        #ip_sport = re.findall('src.*?[0-9a-f].*?/([0-9].*?)\s', i)
        ip_dport = re.findall('dst.*?[0-9a-f].*?/([0-9].*?)\s', i)
        # re.findall returns lists, which are unhashable, so join them into one tuple for the set
        seen.add(tuple(prot + ip_src + ip_dst + ip_dport)) #Add here ip_sport if you want

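This should be faster because it uses fewer loops. On the other hand, sets are unordered (here is a recipe for building an ordered set: http://code.activestate.com/recipes/576694/). If you don't want to build one but still want ordered output, convert the set to a list and sort it before printing.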
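As a minimal sketch of that last step (written for Python 3, whereas the question's code is Python 2), the built-in sorted() turns a set of tuples into an ordered list. The contents of seen and the helper name sorted_entries are made up for illustration. Note that the sort is a plain string comparison, not a numeric ordering of IP addresses.

```python
# Hypothetical sample of what `seen` might hold after parsing.
seen = {
    ("tcp", "1.1.1.1", "2.2.2.22", "23"),
    ("udp", "3.3.3.3", "2.2.2.99", "53031"),
    ("tcp", "1.1.1.1", "2.2.2.2", "23"),
}


def sorted_entries(entries):
    """Return the unique entries as a list sorted field by field (string order)."""
    return sorted(entries)


for entry in sorted_entries(seen):
    # join() avoids the quotes and brackets that repr() would print
    print(", ".join(entry))
```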

Answer 2

The Python standard library already has a Counter class.

You can change the seen variable to a Counter:

from collections import Counter

[...]

seen = Counter()

# extract different keywords from Deny lines
for lines in match :

    [...]

    # NOTE: Counter keys must be hashable (a string or tuple), so convert the list to a tuple.
    summarized = tuple(prot + ip_src + ip_dst + ip_dport)

    seen.update([summarized])

In the end, seen is a dictionary whose keys are the unique summarized lines and whose values are the counts for each line.
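Here is a small sketch of reading the counts back out, assuming Python 3 and a handful of made-up, already-extracted tuples in place of real parsed lines. Counter.most_common() returns the entries ordered from most to least frequent.

```python
from collections import Counter

seen = Counter()

# Hypothetical, already-extracted tuples:
# (protocol, source IP, destination IP, destination port)
samples = [
    ("tcp", "1.1.1.1", "2.2.2.2", "23"),
    ("tcp", "1.1.1.1", "2.2.2.2", "23"),
    ("tcp", "1.1.1.1", "2.2.2.2", "23"),
    ("udp", "3.3.3.3", "2.2.2.99", "53031"),
]

for summarized in samples:
    seen[summarized] += 1  # a missing key counts as 0, so no "if exists" check is needed

for summarized, count in seen.most_common():
    print(count, ", ".join(summarized))
```

Incrementing with seen[summarized] += 1 is equivalent to seen.update([summarized]) and covers the "append if missing, otherwise increment" case in one statement.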

As for optimization, it would be best (I think) to process each line as it is encountered inside the for lines in open(logfile) loop.
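A single-pass version could look roughly like the sketch below. It assumes Python 3. The names DENY_RE, count_denies, and SAMPLE_LINES are made up for illustration, as is the "-" placeholder for a missing port. The pattern is derived only from the sample lines in the question, so other log formats may need adjustments. Making the /port part optional lets icmp lines match too.

```python
import re
from collections import Counter

# Compiled once; the "/port" parts are optional so icmp lines (no port) still match.
DENY_RE = re.compile(
    r"Deny (?P<prot>\S+) src [^:\s]+:(?P<src>[^/\s]+)(?:/\d+)?"
    r" dst [^:\s]+:(?P<dst>[^/\s]+)(?:/(?P<dport>\d+))?"
)


def count_denies(lines):
    """Count (protocol, src IP, dst IP, dst port) combinations in one pass."""
    counts = Counter()
    for line in lines:
        m = DENY_RE.search(line)
        if m:
            key = (m.group("prot"), m.group("src"), m.group("dst"),
                   m.group("dport") or "-")  # "-" stands in for "no port"
            counts[key] += 1
    return counts


# Three lines taken from the sample log in the question.
SAMPLE_LINES = [
    'Nov  9 00:36:10 firewall %ASA-4-106023: Deny tcp src outside:1.1.1.1/43882 dst outside:2.2.2.2/23 by access-group "outside-in" [0x0, 0x0]',
    'Nov  9 00:36:10 firewall %ASA-4-106023: Deny tcp src outside:1.1.1.1/38780 dst outside:2.2.2.2/23 by access-group "outside-in" [0x0, 0x0]',
    'Nov  9 00:36:13 firewall %ASA-4-106023: Deny icmp src outside:4.4.4.4 dst services:2.2.2.211 (type 11, code 1) by access-group "outside-in" [0x0, 0x0]',
]

for key, hits in sorted(count_denies(SAMPLE_LINES).items()):
    print(hits, *key)
```

For a real file, passing the open file object (for example, with open(logfile) as fh: counts = count_denies(fh)) reads it line by line, so neither the whole file nor an intermediate list of matches has to be held in memory.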
