Read gz files and get the last 24 hours of lines in Python
I have three files: two .gz files and one .log file. The files are quite large. Below is a sample copy of my raw data. I want to extract the entries that correspond to the last 24 hours.
a.log.1.gz
2018/03/25-00:08:48.638553 508 7FF4A8F3D704 snononsonfvnosnovoosr
2018/03/25-10:08:48.985053 346K 7FE9D2D51706 ahelooa afoaona woom
2018/03/25-20:08:50.486601 1.5M 7FE9D3D41706 qojfcmqcacaeia
2018/03/25-24:08:50.980519 16K 7FE9BD1AF707 user: number is 93823004
2018/03/26-00:08:50.981908 1389 7FE9BDC2B707 user 7fb31ecfa700
2018/03/26-10:08:51.066967 0 7FE9BDC91700 Exit Status = 0x0
2018/03/26-15:08:51.066968 1 7FE9BDC91700 std:ZMD:
a.log.2.gz
2018/03/26-20:08:48.638553 508 7FF4A8F3D704 snononsonfvnosnovoosr
2018/03/26-24:08:48.985053 346K 7FE9D2D51706 ahelooa afoaona woom
2018/03/27-00:08:50.486601 1.5M 7FE9D3D41706 qojfcmqcacaeia
2018/03/27-10:08:50.980519 16K 7FE9BD1AF707 user: number is 93823004
2018/03/27-20:08:50.981908 1389 7FE9BDC2B707 user 7fb31ecfa700
2018/03/27-24:08:51.066967 0 7FE9BDC91700 Exit Status = 0x0
2018/03/28-00:08:51.066968 1 7FE9BDC91700 std:ZMD:
a.log
2018/03/28-10:08:48.638553 508 7FF4A8F3D704 snononsonfvnosnovoosr
2018/03/28-20:08:48.985053 346K 7FE9D2D51706 ahelooa afoaona woom
**Desired Result**
result.txt
2018/03/27-20:08:50.981908 1389 7FE9BDC2B707 user 7fb31ecfa700
2018/03/27-24:08:51.066967 0 7FE9BDC91700 Exit Status = 0x0
2018/03/28-00:08:51.066968 1 7FE9BDC91700 std:ZMD:
2018/03/28-10:08:48.638553 508 7FF4A8F3D704 snononsonfvnosnovoosr
2018/03/28-20:08:48.985053 346K 7FE9D2D51706 ahelooa afoaona woom
I am not sure how to get the entries for the last 24 hours.
I then want to run the following function on the last 24 hours of data.
def _clean_logs(line):
    # noinspection SpellCheckingInspection
    # WordNetLemmatizer/nltk, get_wordnet_pos() and Stopwords come from elsewhere in my code
    lemmatizer = WordNetLemmatizer()
    clean_line = line.strip()
    clean_line = clean_line.lstrip('0123456789.- ')
    cleaned_log = " ".join(
        [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in nltk.word_tokenize(clean_line) if
         word not in Stopwords.ENGLISH_STOP_WORDS and 2 < len(word) <= 30 and not word.startswith('_')])
    cleaned_log = cleaned_log.replace('"', ' ')
    return cleaned_log
Something like this should work.
from datetime import datetime, timedelta
import glob
import gzip
from pathlib import Path
import shutil


def open_file(path):
    if Path(path).suffix == '.gz':
        return gzip.open(path, mode='rt', encoding='utf-8')
    else:
        return open(path, encoding='utf-8')


def parsed_entries(lines):
    for line in lines:
        yield line.split(' ', maxsplit=1)


def earlier():
    return (datetime.now() - timedelta(hours=24)).strftime('%Y/%m/%d-%H:%M:%S')


def get_files():
    return ['a.log'] + list(reversed(sorted(glob.glob('a.log.*'))))


output = open('output.log', 'w', encoding='utf-8')
files = get_files()
cutoff = earlier()

for i, path in enumerate(files):
    with open_file(path) as f:
        lines = parsed_entries(f)
        # Assumes that your files are not empty
        date, line = next(lines)
        if cutoff <= date:
            # Skip files that can just be appended to the output later
            continue
        for date, line in lines:
            if cutoff <= date:
                # We've reached the first entry of our file that should be
                # included; re-join the timestamp with the rest of the line
                output.write(date + ' ' + line)
                break
        # Copies from the current position to the end of the file
        shutil.copyfileobj(f, output)
        break
else:
    # In case ALL the files are within the last 24 hours
    i = len(files)

for path in reversed(files[:i]):
    with open_file(path) as f:
        # Assumes that your files have trailing newlines.
        shutil.copyfileobj(f, output)

# Cleanup, it would get closed anyway when garbage collected or the process exits.
output.close()
Then if we make some test log files:
#!/bin/sh
echo "2019/01/15-00:00:00.000000 hi" > a.log.1
echo "2019/01/31-00:00:00.000000 hi2" > a.log.2
echo "2019/01/31-19:00:00.000000 hi3" > a.log
gzip a.log.1 a.log.2
and run our script, it outputs the expected result (for this point in time):
2019/01/31-00:00:00.000000 hi2
2019/01/31-19:00:00.000000 hi3
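If you then want to run your _clean_logs() function over the extracted entries, a minimal follow-up sketch could look like the one below. The module name log_cleaning is purely hypothetical; it just stands for wherever _clean_logs() and its helpers (get_wordnet_pos, Stopwords) live in your code:

# Hypothetical follow-up: feed the extracted 24-hour window through the
# question's _clean_logs(). The module name 'log_cleaning' is an assumption,
# not part of this answer.
from log_cleaning import _clean_logs

with open('output.log', encoding='utf-8') as src, \
        open('result.txt', 'w', encoding='utf-8') as dst:
    for line in src:
        cleaned = _clean_logs(line)
        if cleaned:  # skip lines that clean down to nothing
            dst.write(cleaned + '\n')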
Working with log files usually means dealing with large amounts of data, so reading in ascending order and reading everything every time is not desirable, since it wastes a lot of resources.

The fastest way I could think of to achieve your goal (better approaches surely exist) is a very simple random search: we search the log files in reverse order, so starting with the newest first. Instead of visiting every line, you choose an arbitrary stepsize and only look at some lines per stepsize. This way, you can search through gigabytes of data in a very short time.

In addition, this approach does not require keeping every line of a file in memory, only some lines and the final result.
Assuming a.log is the current log file, we start searching here:
with open("a.log", "rb+") as fh:
Since we are only interested in the last 24 hours, we first jump to the end of the file and save the timestamp we want to search for:
import datetime  # the snippets below use datetime.datetime / datetime.timedelta

timestamp = datetime.datetime.now() - datetime.timedelta(days=1)  # last 24h

# jump to the logfile's end
fh.seek(0, 2)  # <-- '2': seek relative to the file's end
index = fh.tell()  # current position in file; here: the logfile's *last* byte
Now we can begin our random search. Your lines appear to be about 65 characters long on average, so we move in multiples of that.
average_line_length = 65
stepsize = 1000

while True:
    # we move one step back (absolute seek from the position we saved)
    fh.seek(index - average_line_length * stepsize)
    # save our current position in the file
    index = fh.tell()
    # we try to read a "line" (multiply the avg. line length by a number
    # large enough to cover even long lines; the rare oversized line is
    # ignored here, since handling it would only cost us one extra
    # iteration of the loop)
    r = fh.read(average_line_length * 10)
    # our read now contains (on average) multiple lines, so we split first
    lines = r.split(b"\n")
    # now we check for our timestring
    for l in lines:
        # your timestamps are formatted like '2018/03/28-20:08:48.985053';
        # minutes, seconds, ... are ignored here just for the sake of simplicity
        timestr = l.split(b":")  # this gives us b'2018/03/28-20' in timestr[0]
        try:
            # next we convert this to a datetime (decode first: strptime needs str)
            found_time = datetime.datetime.strptime(timestr[0].decode(), "%Y/%m/%d-%H")
        except ValueError:
            # partial or malformed line (we seeked to an arbitrary byte offset)
            continue
        # finally, we check whether the found time is outside our 24-hour margin
        if found_time < timestamp:
            break
    else:
        # no line older than 24 hours in this chunk: keep stepping back
        continue
    break
With this code we only end up searching a few lines per stepsize (here: 1000 lines) as long as we are within the last 24 hours. Once we have left the 24 hours, we know that we went at most exactly stepsize * average_line_length bytes too far back in the file.

Filtering out this "went too far" part then becomes very easy:
# read in the file's contents from the current position to the end
contents = fh.read()
# split into lines
lines_of_contents = contents.split(b"\n")

# helper function for removing all lines older than 24 hours
def check_line(line):
    # split to extract the datestring
    tstr = line.split(b":")
    try:
        # convert this to a datetime (decode first: strptime needs str)
        ftime = datetime.datetime.strptime(tstr[0].decode(), "%Y/%m/%d-%H")
    except ValueError:
        # partial, empty or otherwise malformed line
        return False
    return ftime > timestamp

# remove all lines that are older than 24 hours
final_result = filter(check_line, lines_of_contents)
Since contents covers all the remaining content of our file (and lines_of_contents all of its lines, which is simply contents split at the newline \n), we can easily use filter to get the result we want.

Each line in lines_of_contents is fed to check_line, which returns True if the line's time is > timestamp, where timestamp is the datetime object describing exactly now - 1day. This means that check_line returns False for all lines older than timestamp, and filter removes those lines.
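If you want the filtered lines written back out as text, for example into the result.txt from the question, one possible way (assuming the logs are UTF-8 encoded) is:

# materialize the filter object and decode the byte lines back to text;
# assumes UTF-8 encoded logs and the final_result from the filter(...) call above
with open("result.txt", "w", encoding="utf-8") as out:
    for raw_line in final_result:
        out.write(raw_line.decode("utf-8") + "\n")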
Obviously, this is far from optimal, but it is easy to understand and easy to extend to also filter on minutes, seconds, and so on.
Covering multiple files is also easy: you just need glob.glob to find all candidate files, start with the newest file and add another loop: you search through the files until our while loop fails for the first time, then break and read all the remaining content of the current file plus all the content of all the files visited before.

Roughly like this:
final_lines = list()

for file in logfiles:
    # our while-loop
    while True:
        ...
    # if the while-loop did not break, all of the current logfile's content
    # is < 24 hours of age
    with open(file, "rb+") as fh:
        final_lines.extend(fh.readlines())
If all lines are < 24 hours old, you simply store all of the logfile's lines. If the loop breaks at some point, i.e. we found a logfile and the exact line that is > 24 hours old, extend final_lines with final_result instead, since that covers only the lines that are < 24 hours old.
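Putting the pieces together, a possible end-to-end sketch of this multi-file variant is shown below. It goes beyond what is spelled out above, so treat the details as assumptions rather than a finished implementation: it assumes the rotated files have already been decompressed to plain text (the byte-level seeking used here does not work directly on .gz files), that every line starts with a 'YYYY/MM/DD-HH:MM:SS.ffffff' timestamp in ascending order, and that higher rotation numbers mean newer files (as in your sample data):

import datetime
import glob

average_line_length = 65
stepsize = 1000
cutoff = datetime.datetime.now() - datetime.timedelta(days=1)  # last 24h


def line_time(raw_line):
    """Hour-granularity timestamp of a log line, or None if unparsable."""
    try:
        return datetime.datetime.strptime(
            raw_line.split(b":")[0].decode(), "%Y/%m/%d-%H")
    except ValueError:
        return None  # partial, empty or malformed line


def newer_than_cutoff(raw_lines):
    """Keep only the lines whose timestamp lies inside the 24-hour window."""
    kept = []
    for raw in raw_lines:
        ftime = line_time(raw)
        if ftime is not None and ftime > cutoff:
            kept.append(raw)
    return kept


def search_file(path):
    """Backward-search one file; return (in-window lines, hit_older_line)."""
    with open(path, "rb") as fh:
        fh.seek(0, 2)  # jump to the file's end
        index = fh.tell()
        while index > 0:
            # step back, clamped so we never seek before the file's start
            index = max(index - average_line_length * stepsize, 0)
            fh.seek(index)
            chunk = fh.read(average_line_length * 10)
            for raw in chunk.split(b"\n"):
                ftime = line_time(raw)
                if ftime is not None and ftime < cutoff:
                    # we stepped past the window: filter from the chunk start
                    fh.seek(index)
                    return newer_than_cutoff(fh.read().split(b"\n")), True
        # reached the beginning without seeing a line older than the cutoff
        fh.seek(0)
        return newer_than_cutoff(fh.read().split(b"\n")), False


# newest first: the live log, then the rotations in reverse name order;
# assumes plain-text rotations, e.g. a.log.2, a.log.1 (decompress the .gz files first)
logfiles = ["a.log"] + sorted(glob.glob("a.log.*"), reverse=True)

final_lines = []
for path in logfiles:
    lines, hit_older = search_file(path)
    final_lines = lines + final_lines  # prepend to keep chronological order
    if hit_older:
        break  # every older file lies entirely outside the window

# final_lines now holds the raw byte lines of the last 24 hours, oldest first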