How to get the pivot lines from two tab-separated files?

Question

Given two files, file1.txt:

abc def \t 123 456
jkl mno \t 987 654
foo bar \t 789 123
bar bar \t 432

and file2.txt:

foo bar \t hello world
abc def \t good morning
xyz \t 456

The task is to extract the lines whose first column appears in both files and produce:

abc def \t 123 456 \t good morning
foo bar \t 789 123 \t hello world

I can do this in Python:

from io import StringIO

file1 = """abc def \t 123 456
jkl mno \t 987 654
foo bar \t 789 123
bar bar \t 432"""


file2 = """foo bar \t hello world
abc def \t good morning
xyz \t 456"""

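map1, map2 = {}, {}

# Map the first column to the second column for each file,
# reading from the file-like objects rather than re-splitting the strings.
with StringIO(file1) as fin1:
    for line in fin1:
        one, two = line.strip().split('\t')
        map1[one] = two

with StringIO(file2) as fin2:
    for line in fin2:
        one, two = line.strip().split('\t')
        map2[one] = two

# Print the rows whose first column is present in both files.
for k in set(map1).intersection(map2):
    print('\t'.join([k, map1[k], map2[k]]))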

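The actual files have billions of lines. Is there a faster solution that does not load everything and keep hashmaps/dictionaries in memory?

Maybe using unix/bash commands? Would pre-sorting the files help?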

Answer 1

You can try this awk:

awk '{key = $1 FS $2} FNR==NR {sub(/^([^[:blank:]]+[[:blank:]]+){2}/, ""); map[key] = $0; next} key in map {print $0, map[key]}' file2.txt file1.txt

abc def \t 123 456 \t good morning
foo bar \t 789 123 \t hello world

A more readable version:

awk '{
   key = $1 FS $2
}
FNR == NR {
   sub(/^([^[:blank:]]+[[:blank:]]+){2}/, "")
   map[key] = $0
   next
}
key in map {
   print $0, map[key]
}' file2.txt file1.txt

It only loads the data of file2 into memory and processes the records of file1 line by line.
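For comparison, the same idea (hold only the smaller file in a dictionary and stream the larger one) can be sketched in Python. This is a minimal sketch, not the answer's own code: the function name join_streaming and its parameters are hypothetical, and it assumes the key is everything before the first tab and that each key occurs once in the smaller file.

```python
import io


def join_streaming(small, large, sep="\t"):
    """Yield joined rows for keys present in both inputs.

    Only `small` is held in memory; `large` is read one line at a time.
    Assumption: the key is the text before the first separator.
    """
    lookup = {}
    for line in small:
        key, _, value = line.rstrip("\n").partition(sep)
        lookup[key.strip()] = value.strip()

    for line in large:
        key, _, value = line.rstrip("\n").partition(sep)
        key = key.strip()
        if key in lookup:
            yield sep.join([key, value.strip(), lookup[key]])


# In-memory stand-ins for file1.txt and file2.txt from the question.
file1 = io.StringIO(
    "abc def\t123 456\njkl mno\t987 654\nfoo bar\t789 123\nbar bar\t432\n"
)
file2 = io.StringIO("foo bar\thello world\nabc def\tgood morning\nxyz\t456\n")

rows = list(join_streaming(small=file2, large=file1))
for row in rows:
    print(row)
```

With real files, the StringIO objects would be replaced by open file handles. Memory use then grows with the smaller file only, which may still be too much if both files have billions of lines.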

Answer 2

The join command is sometimes hard to use, but here it is simple:

join -t $'\t' <(sort file1.txt) <(sort file2.txt)

This uses bash's ANSI-C quoting to specify the tab separator, and process substitutions to treat program output as a file.

To inspect the output, pipe the above into cat -A to see the tabs represented as ^I:

abc def^I123 456^Igood morning$
foo bar^I789 123^Ihello world$
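To illustrate why pre-sorting helps, here is a rough Python sketch of a streaming merge join over two key-sorted inputs, which is the general technique a sorted join relies on. It is not the implementation of join itself. The names split_row and merge_join are hypothetical, and the sketch assumes each key appears at most once per input and that both inputs are sorted in the same order that Python uses to compare strings.

```python
def split_row(line, sep="\t"):
    """Split a line into (key, rest) at the first separator."""
    key, _, value = line.rstrip("\n").partition(sep)
    return key, value


def merge_join(sorted1, sorted2, sep="\t"):
    """Join two key-sorted line iterators, keeping one line of each in memory.

    Assumptions: keys are unique within each input, and both inputs are
    sorted by key using the same ordering as Python string comparison.
    """
    it1, it2 = iter(sorted1), iter(sorted2)
    line1, line2 = next(it1, None), next(it2, None)
    while line1 is not None and line2 is not None:
        key1, val1 = split_row(line1, sep)
        key2, val2 = split_row(line2, sep)
        if key1 == key2:
            yield sep.join([key1, val1, val2])
            line1, line2 = next(it1, None), next(it2, None)
        elif key1 < key2:
            line1 = next(it1, None)  # key1 cannot match anything later in input 2
        else:
            line2 = next(it2, None)  # key2 cannot match anything later in input 1


def first_column(line):
    return line.partition("\t")[0]


# Small in-memory demo; sorted() here stands in for an external sort.
lines1 = sorted(
    ["abc def\t123 456\n", "jkl mno\t987 654\n", "foo bar\t789 123\n", "bar bar\t432\n"],
    key=first_column,
)
lines2 = sorted(
    ["foo bar\thello world\n", "abc def\tgood morning\n", "xyz\t456\n"],
    key=first_column,
)

joined = list(merge_join(lines1, lines2))
for row in joined:
    print(row)
```

Once both inputs are sorted, the join itself needs only constant memory. For files with billions of lines the sorting would have to happen on disk rather than with sorted().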

Answer 3

Using Miller (https://github.com/johnkerl/miller) and its join verb:

mlr --tsv --implicit-csv-header --headerless-csv-output join -j 1 --rp 2 -f file1.txt file2.txt >output.tsv

The output will be (this is only a preview; the actual output is tab-separated):

| foo bar | 789 123 | hello world  |
| abc def | 123 456 | good morning |
