Remove lines from a file that do not appear in another file, error

Question

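I have two files, similar to the ones below:

File 1 - phenotype information; the first column is the individual ID. The original file has 400 lines: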

215 2 25 13.8354303 15.2841303
222 2 25.2 15.8507278 17.2994278
216 2 28.2 13.0482192 14.4969192
223 11 15.4 9.2714745 11.6494745

File 2 - SNP information. The original file has 400 lines, each with 42,000 characters:

215          20211111201200125201212202220111202005111102
222          20111011212200025002211001111120211015112111
216          20210005201100025210212102210212201005101001
223          20222120201200125202202102210121201005010101
217          20211010202200025201202102210121201005010101
218          02022000252012021022101212010050101012021101

I need to remove from file 2 the 2 individuals that do not appear in file 1, for example:

215          20211111201200125201212202220111202005111102
222          20111011212200025002211001111120211015112111
216          20210005201100025210212102210212201005101001
223          20222120201200125202202102210121201005010101

I can do this with the following command:

awk 'NR==FNR{a[$1]; next} $1 in a{print $0}' file1 file2 > file3

However, when I run the main analysis on the resulting file, I get the following errors:

*** Error in `./airemlf90': free(): invalid size: 0x00007f5041cc2010 ***
*** Error in `./postGSf90': free(): invalid size: 0x00007fec4a04f010 ***

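airemlf90 and postGSf90 are programs. When I use the original files, the problem does not occur. Is the command I used to remove the individuals adequate? One detail I did not mention is that some individuals have 4-character IDs; could that be the cause of the error?

Thanks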

Answer 1

I wrote a small Python script in a few minutes. It works well; I tested it with lines of 42,000 characters.

import sys

# rudimentary argument parsing: key file, file to filter, output file
file1 = sys.argv[1]
file2 = sys.argv[2]
file3 = sys.argv[3]

present = set()

# first read file 1, discard all fields except the first one (the key)
with open(file1, "r") as f1:
    for l in f1:
        toks = l.split()    # whitespace split, same as awk's default fields
        if toks:   # the list is empty for blank lines, so they are skipped
            present.add(toks[0])

# now read the second file and write a line to the third one only if its id is in the set
with open(file2, "r") as f2:
    with open(file3, "w") as f3:
        for l in f2:
            toks = l.split()
            if toks and toks[0] in present:
                f3.write(l)

(Install Python first if it is not already present.)

Save the example script as mytool.py and run it like this:

python mytool.py file1.txt file2.txt file3.txt
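
Before passing the filtered file to the analysis, it may be worth checking that it lines up with the phenotype file. The sketch below is not part of the script above, and `summarize` and `compare` are hypothetical helper names. It assumes both files are whitespace-delimited with the individual ID in the first column. It reports IDs found in only one of the two files, whether both files list the IDs in the same order, and the distinct line lengths in the second file (a 4-character ID would show up there if the columns are not padded to a fixed width). It cannot tell you whether any of these is what triggers the free() error.

```python
def summarize(path):
    """Return (ids, line_lengths) for a whitespace-delimited file keyed on column 1."""
    ids, lengths = [], set()
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields:  # skip blank lines
                continue
            ids.append(fields[0])
            lengths.add(len(line.rstrip("\n")))
    return ids, lengths


def compare(pheno_path, geno_path):
    """Report ID differences between two files and the line widths of the second."""
    pheno_ids, _ = summarize(pheno_path)
    geno_ids, geno_lengths = summarize(geno_path)
    return {
        "only_in_pheno": sorted(set(pheno_ids) - set(geno_ids)),
        "only_in_geno": sorted(set(geno_ids) - set(pheno_ids)),
        "same_order": pheno_ids == geno_ids,
        "geno_line_lengths": sorted(geno_lengths),
    }
```

For example, `print(compare("file1.txt", "file3.txt"))` prints the report for the phenotype file and the filtered SNP file.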

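Processing several files from a bash script (to replace the original solution) is easy, although not optimal, since the loop could be done quickly in Python itself: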

<whatever the for loop you need>; do
  python mytool.py <file1> <file2> <file3>
done

just as you would call awk with 3 files.
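
As a rough sketch of doing the loop in Python instead of bash, the filtering step can be wrapped in a function and applied to a list of file triples. The names `filter_by_ids` and `filter_many` and the file names in the usage note are hypothetical; the logic is the same first-field match as in the script above.

```python
def filter_by_ids(id_path, data_path, out_path):
    """Copy the lines of data_path whose first field is a first field in id_path.

    Returns the number of lines written to out_path.
    """
    with open(id_path) as f:
        # Collect the first field of every non-blank line as the set of wanted IDs.
        wanted = {fields[0] for fields in map(str.split, f) if fields}
    kept = 0
    with open(data_path) as src, open(out_path, "w") as dst:
        for line in src:
            fields = line.split()
            if fields and fields[0] in wanted:
                dst.write(line)  # write the line unchanged, spacing included
                kept += 1
    return kept


def filter_many(jobs):
    """Run filter_by_ids on each (id_path, data_path, out_path) triple."""
    return [filter_by_ids(*job) for job in jobs]
```

A call such as `filter_many([("pheno_a.txt", "snp_a.txt", "out_a.txt"), ("pheno_b.txt", "snp_b.txt", "out_b.txt")])` returns the number of lines kept for each triple, which gives a quick count to compare against the number of individuals expected.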
