Print lines indexed by a second file

Question

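I have two files:

A file with strings (newline-terminated)
A file with integers (one per line)

I would like to print the lines of the first file that are indexed by the lines of the second file. My current solution is this: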

while read index
do
    sed -n ${index}p $file1
done < $file2

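It essentially reads the index file line by line and runs sed to print that specific line. The problem is that it is slow for large index files (thousands to tens of thousands of lines), because every index starts a new sed process that scans the strings file from the top.

Is it possible to do this faster? I suspect awk can be useful here.

I searched as best I could, but could only find people trying to print line ranges rather than indexing through a second file.

Update

The indices are generally not shuffled. The lines are expected to appear in the order defined by the indices in the index file.

Example

File 1: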

this is line 1
this is line 2
this is line 3
this is line 4

File 2:

3
2

The expected output is:

this is line 3
this is line 2

Answer 1

This awk script does what you want:

$ cat lines
1
3
5
$ cat strings 
string 1
string 2
string 3
string 4
string 5
$ awk 'NR==FNR{a[$0];next}FNR in a' lines strings
string 1
string 3
string 5

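The first block runs only for the first file, where the line number within the current file, FNR, equals the total line number, NR. It sets a key in the array a for each line number that should be printed. next skips the remaining instructions. For the file containing the strings, the default action is performed (so the line is printed) if the line number is in the array.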
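One consequence of the explanation above is worth seeing in action: the index file only creates array keys, so the output follows the order of the strings file and repeated indices collapse into one. The following is a minimal sketch using the sample data from the question; the function name print_selected and the temporary file names are made up for illustration.

```shell
#!/bin/sh
# print_selected INDEXFILE DATAFILE
# (hypothetical helper name; wraps the one-liner from this answer)
print_selected() {
    awk 'NR == FNR { a[$0]; next } FNR in a' "$1" "$2"
}

# Sample data from the question, with index 3 deliberately repeated.
tmp=$(mktemp -d)
printf '%s\n' 3 2 3 > "$tmp/lines"
printf '%s\n' 'this is line 1' 'this is line 2' \
              'this is line 3' 'this is line 4' > "$tmp/strings"

print_selected "$tmp/lines" "$tmp/strings"
# prints, in data-file order and without the duplicate:
# this is line 2
# this is line 3

rm -rf "$tmp"
```

If the lines must come out in index order (3 before 2), this form is not enough on its own.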

Answer 2

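If I understand correctly, then

awk 'NR == FNR { selected[$0] = 1; next } selected[FNR]' indexfile datafile

should work, provided that the indices are sorted in ascending order, or that you want the lines printed in the order they have in the data file regardless of how the indices are ordered. It works as follows: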

NR == FNR {         # while processing the first file
  selected[$0] = 1  # remember if an index was seen
  next              # and do nothing else
}
selected[FNR]       # after that, select (print) the selected lines.

If the indices are not sorted and the lines should be printed in the order in which they appear in the index:

NR == FNR {               # processing the index:
  ++counter
  idx[$0] = counter       # remember that and at which position you saw
  next                    # the index
}
FNR in idx {              # when processing the data file: 
  lines[idx[FNR]] = $0    # remember selected lines by the position of
}                         # the index
END {                     # and at the end: print them in that order.
  for(i = 1; i <= counter; ++i) {
    print lines[i]
  }
}

This can be inlined as well (with semicolons after ++counter and idx[$0] = counter), but I'd probably put it in a file, say foo.awk, and run awk -f foo.awk indexfile datafile. With an index file

1
4
3

and a data file

line1
line2
line3
line4

this will print

line1
line4
line3
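For reference, the inlined form mentioned above might look roughly like the sketch below. The function name print_in_index_order and the temporary files are assumptions made for this example; the awk program itself is the script from this answer with semicolons added.

```shell
#!/bin/sh
# print_in_index_order INDEXFILE DATAFILE
# (hypothetical helper name; inlined version of the foo.awk script)
print_in_index_order() {
    awk 'NR == FNR { ++counter; idx[$0] = counter; next }
         FNR in idx { lines[idx[FNR]] = $0 }
         END { for (i = 1; i <= counter; ++i) print lines[i] }' "$1" "$2"
}

# Same index and data files as in the example above.
tmp=$(mktemp -d)
printf '%s\n' 1 4 3 > "$tmp/indexfile"
printf '%s\n' line1 line2 line3 line4 > "$tmp/datafile"

print_in_index_order "$tmp/indexfile" "$tmp/datafile"
# prints:
# line1
# line4
# line3

rm -rf "$tmp"
```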

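The remaining caveat is that this assumes the entries in the index are unique. If that, too, is a problem, you'll have to remember a list of index positions, split it while scanning the data file, and remember the line for each position. That is: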

NR == FNR {               
  ++counter
  idx[$0] = idx[$0] " " counter  # remember a list here
  next
}
FNR in idx {              
  split(idx[FNR], pos)    # split that list
  for(p in pos) {
    lines[pos[p]] = $0    # and remember the line for
                          # all positions in them.
  }
}
END {
  for(i = 1; i <= counter; ++i) {
    print lines[i]
  }
}

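This, finally, is the functional equivalent of the code in the question. How complicated you need to go for your use case is for you to decide.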
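As a point of comparison, and not something proposed in this answer, a different technique handles both unsorted and repeated indices with less bookkeeping: read the data file first, keep every line in an array, then look each index up as it is read. The sketch below assumes the data file fits in memory; the function name print_by_lookup is hypothetical.

```shell
#!/bin/sh
# print_by_lookup INDEXFILE DATAFILE
# (hypothetical helper name; note that awk is given the DATA file first)
print_by_lookup() {
    awk 'NR == FNR { line[FNR] = $0; next }   # data file: store every line
         ($0 in line) { print line[$0] }      # index file: print on lookup
        ' "$2" "$1"
}

tmp=$(mktemp -d)
printf '%s\n' 3 2 3 > "$tmp/indexfile"
printf '%s\n' 'this is line 1' 'this is line 2' \
              'this is line 3' 'this is line 4' > "$tmp/datafile"

print_by_lookup "$tmp/indexfile" "$tmp/datafile"
# prints:
# this is line 3
# this is line 2
# this is line 3

rm -rf "$tmp"
```

The trade-off is memory: the scripts above store only the selected lines, while this variant stores the whole data file. Unlike the END loop above, which prints an empty line for an index with no matching data line, this sketch skips such indices silently.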

Answer 3

To complement the answers that use awk, here is a solution in Python that you can use from your bash script:

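cat << EOF | python3
with open("$file2") as f:
    lines = {int(line) for line in f}   # a set gives fast membership tests

with open("$file1") as f:
    for i, line in enumerate(f, start=1):   # i is the 1-based line number
        if i in lines:
            print(line, end="")             # line already ends with a newline
EOF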

The only advantage here is that Python is easier to understand than awk :).

Answer 4

Use nl to number the lines in your strings file, then use join to merge the two:

~ $ cat index
1
3
5

~ $ cat strings
a
b
c
d
e

~ $ join index <(nl strings)
1 a
3 c
5 e

If you want the inverse (show the lines that are NOT in your index):

$ join -v 2 index <(nl strings)
2 b
4 d

Mind also the comment by @glennjackman: if your files are not lexically sorted, you need to sort them before passing them in:

$ join <(sort index) <(nl strings | sort -b)
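Sorting for join discards the order of the index file, which the question wants to keep. One possible way to restore it, sketched here under my own assumptions rather than taken from this answer, is to tag every index entry with its position, join, and then sort numerically on that tag. The function name join_in_index_order is hypothetical; awk does the numbering instead of nl so that empty lines are counted too, and a tab separator keeps spaces inside the lines intact.

```shell
#!/bin/sh
# join_in_index_order INDEXFILE DATAFILE
# (hypothetical helper name; runs in a subshell so LC_ALL stays local)
join_in_index_order() (
    export LC_ALL=C                  # same collation for sort and join
    tab=$(printf '\t')
    work=$(mktemp -d) || exit 1

    # index file -> "index<TAB>position", sorted on the index
    awk -v OFS='\t' '{ print $0, NR }' "$1" |
        sort -t "$tab" -k1,1 > "$work/index"
    # data file -> "linenumber<TAB>line", sorted the same way
    awk -v OFS='\t' '{ print NR, $0 }' "$2" |
        sort -t "$tab" -k1,1 > "$work/data"

    # join on the line number, restore index order, drop the two helper fields
    join -t "$tab" "$work/index" "$work/data" |
        sort -t "$tab" -k2,2n |
        cut -f3-

    rm -rf "$work"
)

tmp=$(mktemp -d)
printf '%s\n' 3 2 > "$tmp/index"
printf '%s\n' 'this is line 1' 'this is line 2' \
              'this is line 3' 'this is line 4' > "$tmp/strings"

join_in_index_order "$tmp/index" "$tmp/strings"
# prints:
# this is line 3
# this is line 2

rm -rf "$tmp"
```

This costs two extra sorts, so for the sizes mentioned in the question the awk answers are likely the simpler choice.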
