How to delete subsets of lines based on keywords from a list?

Question

I have the following file:

This
is
a
testfile 
wj5j keyword 1
WFEWF
O%LWJZ keyword 2
which
should
lpokpij keyword 3
123123das
kpmnvf keyword 4
just
contain
the 
following
lines.

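I need to delete the subsets of lines between keyword 1 and keyword 2 and between keyword 3 and keyword 4 (including the keyword lines themselves), so that it looks like this: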

This
is
a
testfile 
which
should
just
contain
the 
following
lines.

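I tried the following, which only prints the lines that contain the keywords, not the lines in between. My idea was that if I could print all of those lines, I could then delete them from the file.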

with open ("newfile_TEST1.txt", mode = "r") as file:
    keywords = ['keyword 1', 'keyword 2','keyword 3','keyword 4']
    lines = file.readlines()
    for lineno, line in enumerate(file,1):
        matches = [k for k in keywords if k  in line]
        if matches:
            print(line)

What can I do to improve my code?

Answer 1

This isn't very elegant, but you could do it like this:

with open("file.txt", mode="r") as file:
    lines = file.readlines()

keywords = ["keyword 1", "keyword 2", "keyword 3", "keyword 4"]
line = 0
to_keep = True
kept = []

while line < len(lines):
    has_keyword = any((keyword in lines[line] for keyword in keywords))
    if to_keep and not has_keyword:
        kept.append(lines[line])
    if has_keyword:
        to_keep = not to_keep
    line += 1

for line in kept:
    print(line, end="")


with open("newfile.txt", mode="w") as file:
    file.writelines(kept)

Output:

This
is
a
testfile
which
should
just
contain
the
following
lines.

Answer 2

I would use a flag that becomes True at the first match and stays True until the next match, then becomes False again:

with open ("./txt.txt", mode = "r") as file:
    keywords = ['keyword 1', 'keyword 2','keyword 3','keyword 4']
    lines = file.readlines()
    glitch_flair=False
    for lineno, line in enumerate(lines,1):
        matches = [k for k in keywords if k  in line]
        if not matches and not glitch_flair:
            print(line, end='')
        elif matches:
            glitch_flair=not glitch_flair
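
The snippet above only prints the surviving lines. If you also want them written to a file, the same flag idea could be wrapped in a generator, roughly like this. This is only a sketch: the name lines_outside_blocks, the temporary file names, and the tiny sample text are made up for illustration, and it assumes keyword lines strictly alternate between opening and closing a block.

```python
import tempfile
from pathlib import Path


def lines_outside_blocks(lines, keywords):
    """Yield the lines that are not inside a keyword-delimited block."""
    skipping = False  # True while we are between two keyword lines
    for line in lines:
        if any(keyword in line for keyword in keywords):
            # A keyword line opens or closes a block and is dropped itself.
            skipping = not skipping
        elif not skipping:
            yield line


keywords = ["keyword 1", "keyword 2", "keyword 3", "keyword 4"]

# Hypothetical throwaway files, just to make the example self-contained.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "input.txt"
    dst = Path(tmp) / "output.txt"
    src.write_text("a\nx keyword 1\nb\ny keyword 2\nc\n")
    with src.open() as fin, dst.open("w") as fout:
        # The file object is iterated lazily, one line at a time.
        fout.writelines(lines_outside_blocks(fin, keywords))
    written = dst.read_text()

print(written, end="")
```

Because the generator accepts any iterable of lines, it works equally well with an open file or with a list returned by readlines().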

Answer 3

This solution is suitable for large text files, where you don't want to store all the lines in memory with readlines() and the like.

keywords = ['keyword 1', 'keyword 2', 'keyword 3', 'keyword 4']

keywords_it = iter(keywords)
pair = (next(keywords_it), next(keywords_it))
write = True

with open("newfile_TEST1.txt") as f:
    for line in f:
        if not line.rstrip().endswith(pair[0]) and write:
            print(line, end='')

        elif line.rstrip().endswith(pair[1]):
            write = True
            try:
                pair = (next(keywords_it), next(keywords_it))
            except StopIteration:
                pass
        else:
            write = False

Output:

This
is
a
testfile 
which
should
just
contain
the 
following
lines.

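The idea is to take one pair of keywords at a time from the keywords list (such as ('keyword 1', 'keyword 2')). As we iterate over the lines of the file, if a line does not end with the first item of the pair and we are currently writing, it is a normal line and should be printed. If it ends with the first item of the pair, the write flag is set to False, which means we stop writing.
Now, if a line ends with the second item of the pair, it means we can start writing again after this line. So we fetch the next pair and set the write flag to True.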
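
For what it's worth, the same pair idea could also be packaged as a reusable generator. The following is only a sketch under a few assumptions: the function name strip_blocks is made up, the pairs are passed in explicitly instead of being drawn from a flat list, and, unlike the code above, every line is passed through once all pairs have been used up.

```python
def strip_blocks(lines, pairs):
    """Yield lines outside the blocks delimited by each (start, end) pair, in order."""
    pairs_it = iter(pairs)
    current = next(pairs_it, None)  # the pair we are currently looking for
    writing = True
    for line in lines:
        text = line.rstrip()
        if current is None:
            # No pairs left: everything else is kept.
            yield line
        elif writing:
            if text.endswith(current[0]):
                writing = False  # opening keyword line: stop writing
            else:
                yield line
        elif text.endswith(current[1]):
            # Closing keyword line: resume writing and move to the next pair.
            writing = True
            current = next(pairs_it, None)


sample = (
    "This\nis\na\ntestfile \nwj5j keyword 1\nWFEWF\nO%LWJZ keyword 2\n"
    "which\nshould\nlpokpij keyword 3\n123123das\nkpmnvf keyword 4\n"
    "just\ncontain\nthe \nfollowing\nlines.\n"
).splitlines(keepends=True)

pairs = [("keyword 1", "keyword 2"), ("keyword 3", "keyword 4")]
result = list(strip_blocks(sample, pairs))
print("".join(result), end="")
```

Passing an open file object instead of sample keeps the memory use low, since the lines are still consumed one at a time.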

Answer 4

I used the split function of the re (regex) module.

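Using it, I split the whole string into chunks at every keyword line. I then kept only the chunks at even positions, since the chunks at odd positions are the data between two keywords, for example the pairs ("keyword 1", "keyword 2") and ("keyword 3", "keyword 4"). A few empty lines were left over (because the odd-positioned chunks were skipped), so I just used rstrip() to remove them.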

import re
Lmatches=[]
Loutput=[]
patt=re.compile(r'\b.* keyword [1-4]')
with open("f1.txt","r") as f:
    data=f.read()
matches=patt.split(data)
for i in range(len(matches)):
    if i%2==0:
        Lmatches.append(matches[i])
for elem in Lmatches:
    Loutput.append(elem.rstrip())#to remove empty lines
with open("output.txt","w") as wfile:
    wfile.writelines(Loutput)
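
A possible variation on the regex idea, in case you prefer to delete the blocks directly instead of splitting and picking chunks, is re.sub with one pattern per keyword pair. Treat this as a sketch: the function name remove_blocks is made up, the pairs are assumed to be given explicitly, and a keyword is matched anywhere in its line (so "keyword 1" would also match a line containing "keyword 10").

```python
import re


def remove_blocks(text, pairs):
    """Delete every block that runs from a start-keyword line to an end-keyword line."""
    for start, end in pairs:
        pattern = re.compile(
            r"^[^\n]*" + re.escape(start) + r"[^\n]*\n"  # the whole opening line
            r".*?"                                        # lines in between (non-greedy)
            r"^[^\n]*" + re.escape(end) + r"[^\n]*\n?",   # the whole closing line
            flags=re.MULTILINE | re.DOTALL,
        )
        text = pattern.sub("", text)
    return text


data = (
    "This\nis\na\ntestfile \nwj5j keyword 1\nWFEWF\nO%LWJZ keyword 2\n"
    "which\nshould\nlpokpij keyword 3\n123123das\nkpmnvf keyword 4\n"
    "just\ncontain\nthe \nfollowing\nlines.\n"
)

cleaned = remove_blocks(data, [("keyword 1", "keyword 2"), ("keyword 3", "keyword 4")])
print(cleaned, end="")
```

Like the split version, this reads the whole file into memory first, so it is better suited to small and medium files.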

python

text-files