Python remove file lines without matches from another file
I have two files: the first contains the necessary data (1st file), and the second contains the list of lines to keep (2nd file).
I tried to filter them with this Python code:
import os.path
# loading the input files
output = open('descmat.txt', 'w+')
input = open('descmat_all.txt', 'r')
lists = open('training_lines.txt', 'r')
print "Test1"
# reading the input files
list_lines = lists.readlines()
list_input = input.readlines()
print "Test2"
output.write(list_input[0])
for i in range(len(list_lines)):
    for ii in range(len(list_input)):
        position = list_input[ii].find(list_lines[i][:-1])
        if position > -1:
            output.write(list_input[ii])
            break
print "Test3"
output.close()
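However, this script does not find any matches. What is the simplest way to keep only the lines of the first file that match lines in the second file?
For this kind of problem, Python has the set data type: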
# prepare a set of normalised training lines
# stripping new lines avoids possible problems with the last line
OK_lines = set(line.rstrip('\n') for line in open('training_lines.txt'))
# when you leave a with block, all the resources are released
# i.e., no need for file.close()
with open('descmat_all.txt') as infile:
    with open('descmat.txt', 'w') as outfile:
        for line in infile:
            # OK_lines have been stripped, input lines must be stripped as well
            if line.rstrip('\n') in OK_lines:
                outfile.write(line)
A simple test:
boffi@debian:~/Documents/tmp$ cat check.py
# prepare a set of normalised training lines
# stripping new lines avoids possible problems with the last line
OK_lines = set(line.rstrip('\n') for line in open('training_lines.txt'))
# when you leave a with block, all the resources are released
# i.e., no need for file.close()
with open('descmat_all.txt') as infile:
    with open('descmat.txt', 'w') as outfile:
        for line in infile:
            # OK_lines have been stripped, input lines must be stripped as well
            if line.rstrip('\n') in OK_lines:
                outfile.write(line)
boffi@debian:~/Documents/tmp$ cat training_lines.txt
ada
bob
boffi@debian:~/Documents/tmp$ cat descmat_all.txt
bob
doug
ada
doug
eddy
ada
bob
boffi@debian:~/Documents/tmp$ python check.py
boffi@debian:~/Documents/tmp$ cat descmat.txt
bob
ada
ada
bob
boffi@debian:~/Documents/tmp$
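The shell session above can also be reproduced from Python itself. What follows is only a minimal sketch, assuming Python 3; the names filter_lines and demo are hypothetical and not part of the answer, and the sample data is the same ada/bob test used above.

```python
import os
import tempfile


def filter_lines(data_path, keep_path, out_path):
    """Copy to out_path the lines of data_path that also occur in keep_path.

    Returns the number of lines written. (Hypothetical helper.)
    """
    # Build the set of allowed lines, without trailing newlines.
    with open(keep_path) as keep_file:
        ok_lines = set(line.rstrip('\n') for line in keep_file)

    kept = 0
    with open(data_path) as infile, open(out_path, 'w') as outfile:
        for line in infile:
            # Compare stripped text, but write the original line unchanged.
            if line.rstrip('\n') in ok_lines:
                outfile.write(line)
                kept += 1
    return kept


def demo():
    """Recreate the test files in a temporary directory and filter them."""
    tmp = tempfile.mkdtemp()
    data_path = os.path.join(tmp, 'descmat_all.txt')
    keep_path = os.path.join(tmp, 'training_lines.txt')
    out_path = os.path.join(tmp, 'descmat.txt')

    with open(keep_path, 'w') as f:
        f.write('ada\nbob\n')
    with open(data_path, 'w') as f:
        f.write('bob\ndoug\nada\ndoug\neddy\nada\nbob\n')

    filter_lines(data_path, keep_path, out_path)
    with open(out_path) as f:
        return f.read().splitlines()


if __name__ == '__main__':
    print(demo())
```

Because membership in a set is checked by hashing, each input line costs roughly constant time, so the whole run scales with the size of the two files rather than with their product.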
If you read both files into lists, you can simply compare the lists. out should then contain the list of strings that could be matched.
# readlines() keeps the trailing newline, so strip it from each key before matching
out = [e for e in list_input for i in list_lines if e.startswith(i.rstrip('\n'))]
output.writelines(out)
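Lines returned by readlines() keep their trailing newline, so a key only works as a prefix of a longer line once that newline is removed. A minimal sketch of the prefix comparison, assuming the keys sit at the start of each data line; keep_by_prefix is a hypothetical name, and it relies on str.startswith accepting a tuple of prefixes:

```python
def keep_by_prefix(data_lines, key_lines):
    """Return the data lines that start with any of the keys.

    Hypothetical helper; both arguments are lists as produced by readlines().
    """
    # Strip newlines from the keys and skip blank keys, because an empty
    # prefix would match every line.
    prefixes = tuple(k.rstrip('\n') for k in key_lines if k.strip())
    # str.startswith accepts a tuple, so each data line is tested once
    # and can never be emitted twice.
    return [line for line in data_lines if line.startswith(prefixes)]


if __name__ == '__main__':
    data = ['ada 1\n', 'doug 2\n', 'bob 3\n']
    keys = ['ada\n', '\n', 'bob']
    print(keep_by_prefix(data, keys))
```

Prefix matching is looser than an exact comparison: a key such as ada would also keep a line starting with adam, so this only suits data where no key is a prefix of another record's identifier.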
Replacing this part of the code:
for i in range(len(list_lines)):
    for ii in range(len(list_input)):
        position = list_input[ii].find(list_lines[i][:-1])
        if position > -1:
            output.write(list_input[ii])
            break
with this:
for i in range(len(list_lines)):
    for ii in range(len(list_input)):
        if list_input[ii][:26] == list_lines[i][:-1]:
            output.write(list_input[ii])
does exactly what I need.
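The fixed-width comparison can be combined with the set idea from the other answer, so that the nested loop is no longer needed. This is only a sketch under assumptions: the width of 26 is taken from the snippet above, the actual layout of the data files is not shown in this thread, and keep_by_key is a hypothetical name.

```python
def keep_by_key(data_lines, key_lines, width=26):
    """Return the data lines whose first `width` characters equal a key.

    Hypothetical helper; both arguments are lists as produced by readlines().
    """
    # rstrip('\n') instead of [:-1], so a last key without a trailing
    # newline does not lose its final character.
    keys = set(k.rstrip('\n') for k in key_lines)
    # One set lookup per data line replaces the inner loop over all keys.
    return [line for line in data_lines if line.rstrip('\n')[:width] in keys]


if __name__ == '__main__':
    data = ['ada 1\n', 'doug 2\n', 'bob 3\n']
    keys = ['ada\n', 'bob']
    # width=3 only because the sample keys are three characters long
    print(keep_by_key(data, keys, width=3))
```

Unlike the loop above, this keeps the lines in the order of the data file rather than the order of the key file, and it does not write the header line separately; the header would still need the output.write(list_input[0]) call from the original script.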