How to delete subsets of lines based on keywords from a list?
I have the following file:
This
is
a
testfile
wj5j keyword 1
WFEWF
O%LWJZ keyword 2
which
should
lpokpij keyword 3
123123das
kpmnvf keyword 4
just
contain
the
following
lines.
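I need to delete the subsets of lines from keyword 1 through keyword 2 and from keyword 3 through keyword 4 (including the keyword lines themselves), so that the file looks like this: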
This
is
a
testfile
which
should
just
contain
the
following
lines.
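I tried the following, which only prints the lines that contain the keywords, not the lines in between. My idea was that once I could print all of those lines, I could delete them from the file.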
with open ("newfile_TEST1.txt", mode = "r") as file:
    keywords = ['keyword 1', 'keyword 2','keyword 3','keyword 4']
    lines = file.readlines()
    for lineno, line in enumerate(file,1):
        matches = [k for k in keywords if k in line]
        if matches:
            print(line)
What can I do to improve my code?
It's not very elegant, but you could do this:
with open("file.txt", mode="r") as file:
    lines = file.readlines()
keywords = ["keyword 1", "keyword 2", "keyword 3", "keyword 4"]
line = 0
to_keep = True  # False while we are inside a keyword-delimited block
kept = []
while line < len(lines):
    has_keyword = any((keyword in lines[line] for keyword in keywords))
    if to_keep and not has_keyword:
        kept.append(lines[line])
    if has_keyword:
        # every keyword line toggles the state and is itself dropped
        to_keep = not to_keep
    line += 1
for line in kept:
    print(line, end="")
with open("newfile.txt", mode="w") as file:
    file.writelines(kept)
Output:
This
is
a
testfile
which
should
just
contain
the
following
lines.
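The same toggle logic can be wrapped in a small function so it can be reused and checked without touching any files. This is only a sketch: the name drop_between_keywords is made up for illustration, and it assumes, like the code above, that keyword lines strictly alternate between opening and closing a block.

```python
def drop_between_keywords(lines, keywords):
    """Return the lines that fall outside any keyword-delimited block.

    Hypothetical helper: every line containing a keyword toggles the
    "keep" state, and the keyword lines themselves are always dropped.
    """
    kept = []
    to_keep = True
    for line in lines:
        if any(keyword in line for keyword in keywords):
            to_keep = not to_keep  # entering or leaving a block
        elif to_keep:
            kept.append(line)
    return kept


# A shortened version of the sample file from the question.
sample = [
    "This\n", "wj5j keyword 1\n", "WFEWF\n", "O%LWJZ keyword 2\n",
    "which\n", "lpokpij keyword 3\n", "123123das\n", "kpmnvf keyword 4\n",
    "just\n",
]
keywords = ["keyword 1", "keyword 2", "keyword 3", "keyword 4"]
result = drop_between_keywords(sample, keywords)
print("".join(result), end="")
```

Because the function accepts any iterable of lines, an open file object can be passed in directly instead of a list.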
I would use a flag that becomes True at the first match and stays True until the next match, then goes back to False:
with open ("./txt.txt", mode = "r") as file:
    keywords = ['keyword 1', 'keyword 2','keyword 3','keyword 4']
    lines = file.readlines()
    glitch_flair=False
    for lineno, line in enumerate(lines,1):
        matches = [k for k in keywords if k in line]
        if not matches and not glitch_flair:
            print(line, end='')
        elif matches:
            glitch_flair=not glitch_flair
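Note that this version loops over lines rather than over file, as the question's enumerate(file,1) does. That matters because readlines() consumes the whole file, so a loop over the same file object placed after that call has nothing left to iterate. A minimal sketch, using io.StringIO as a stand-in for a real file (an assumption made only to keep the example self-contained):

```python
import io

# io.StringIO stands in for an opened text file here.
file = io.StringIO("first\nsecond keyword 1\nthird\n")

lines = file.readlines()   # reads everything; the position is now at the end
leftover = list(file)      # iterating the same object again yields nothing

print(len(lines))          # 3
print(leftover)            # []

file.seek(0)               # rewinding makes the content readable again
rewound = list(file)
```

Either iterate over the list returned by readlines(), or skip readlines() and iterate over the file object directly.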
This solution suits large text files, where you don't want to hold all the lines in memory with readlines() or similar.
keywords = ['keyword 1', 'keyword 2', 'keyword 3', 'keyword 4']
keywords_it = iter(keywords)
pair = (next(keywords_it), next(keywords_it))
write = True
with open("newfile_TEST1.txt") as f:
    for line in f:
        if not line.rstrip().endswith(pair[0]) and write:
            print(line, end='')
        elif line.rstrip().endswith(pair[1]):
            write = True
            try:
                pair = (next(keywords_it), next(keywords_it))
            except StopIteration:
                pass
        else:
            write = False
Output:
This
is
a
testfile
which
should
just
contain
the
following
lines.
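The idea is to take one pair of keywords at a time from the keywords list, such as ('keyword 1', 'keyword 2'). As we iterate over the lines of the file, a line that does not end with the first item of the pair is a normal line and is printed, provided the write flag is True. If a line ends with the first item of the pair, the write flag is set to False, which means we stop writing.
Now, if a line ends with the second item of the pair, we can start writing again after this line. So we fetch the next pair and set the write flag back to True.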
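The pair-by-pair idea can also be written as a generator, so the caller decides whether to print the lines or write them to another file. This is a sketch under a few assumptions: filter_blocks is a made-up name, zip(it, it) is used to group the keywords into consecutive pairs, and, unlike the code above, every line is kept once all pairs have been used up.

```python
import io


def filter_blocks(lines, keywords):
    """Yield the lines outside each (start, end) keyword pair, one at a time."""
    it = iter(keywords)
    pairs = zip(it, it)          # consecutive pairs: (k1, k2), (k3, k4), ...
    pair = next(pairs, None)
    writing = True
    for line in lines:
        text = line.rstrip()
        if pair is None:
            yield line           # no pairs left: keep everything else
        elif writing and text.endswith(pair[0]):
            writing = False      # start keyword: stop writing
        elif not writing and text.endswith(pair[1]):
            writing = True       # end keyword: resume after this line
            pair = next(pairs, None)
        elif writing:
            yield line


keywords = ['keyword 1', 'keyword 2', 'keyword 3', 'keyword 4']
# io.StringIO stands in for the opened file; lines are read lazily.
source = io.StringIO(
    "This\nis\nwj5j keyword 1\nWFEWF\nO%LWJZ keyword 2\n"
    "which\nlpokpij keyword 3\n123123das\nkpmnvf keyword 4\njust\n"
)
output = "".join(filter_blocks(source, keywords))
print(output, end="")
```

Only one line is held in memory at a time, which is the point of avoiding readlines() for large files.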
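I used the split function of the re (regex) module.
With it I split the whole string into chunks on the keyword lines. I then kept only the chunks at even indices, because the odd-indexed chunks hold the data between the two keywords of a pair, for example pair("keyword 1", "keyword 2") and pair("keyword 3", "keyword 4"), and that is the data to drop.
Skipping the odd-indexed chunks leaves a few blank lines behind, so I just use rstrip() to remove them.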
import re
Lmatches=[]
Loutput=[]
patt=re.compile(r'\b.* keyword [1-4]')
with open("f1.txt","r") as f:
    data=f.read()
matches=patt.split(data)
for i in range(len(matches)):
    if i%2==0:
        Lmatches.append(matches[i])
for elem in Lmatches:
    Loutput.append(elem.rstrip())#to remove empty lines
with open("output.txt","w") as wfile:
    wfile.writelines(Loutput)
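A related regex approach, shown here only as a sketch, is to delete each block directly with re.sub instead of splitting and reassembling. Note that this swaps re.split for re.sub, so it is a variation on the answer above and not the same method. It assumes each keyword sits at the very end of its line, as in the sample file, and the pairs list is written out by hand for illustration.

```python
import re

# A shortened version of the sample file, held in a string for the example.
text = (
    "This\n"
    "wj5j keyword 1\n"
    "WFEWF\n"
    "O%LWJZ keyword 2\n"
    "which\n"
    "lpokpij keyword 3\n"
    "123123das\n"
    "kpmnvf keyword 4\n"
    "just\n"
)

pairs = [("keyword 1", "keyword 2"), ("keyword 3", "keyword 4")]
for start, end in pairs:
    # From the line ending in `start` through the line ending in `end`,
    # including the newline that terminates the closing line.
    block = re.compile(
        r"^[^\n]*" + re.escape(start) + r"$.*?^[^\n]*" + re.escape(end) + r"$\n?",
        flags=re.MULTILINE | re.DOTALL,
    )
    text = block.sub("", text)

print(text, end="")
```

re.MULTILINE lets ^ and $ anchor at each line, while re.DOTALL lets the lazy .*? span the lines in between. Like the split version, this reads the whole file into memory, so it fits small and medium files better than very large ones.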