How to give the same rank to a list of ids by comparing them to another data file in LINUX?
I have a list of IDs (column 2) that I ranked from 1 to 600 according to their values (column 3). I have another list of the same IDs, but in a different order because their values differ. How can I look up, for each ID in file2, the rank it has in file1? For example:
file1:
rank list-of-ids values
1 HOUSAM69708729 0.4468
2 HOCANM106363549 0.4434
3 HOCANM10845509 0.4268
4 HOCANM11098662 0.4203
5 HOUSAM68571374 0.3896
6 HOUSAM69990251 0.3895
7 HONLDM716072164 0.3893
8 HOUSAM69756113 0.3656
9 HOCANM11098658 0.3593
10 HOUSAM66626020 0.3538
file2:
list-of-ids values
HOCANM106363549 0.4832
HOUSAM69708729 0.4199
HOCANM10845509 0.4143
HOUSAM69990251 0.3887
HOCANM11098662 0.3792
HOUSAM69756113 0.365
HOUSAM68571374 0.3649
HONLDM716072164 0.3600
HOUSAM66626020 0.3593
HOCANM11098658 0.3545
The output file should be file2, with the ranks taken from file1:
output:
rank list-of-ids values
2 HOCANM106363549 0.4832
1 HOUSAM69708729 0.4199
3 HOCANM10845509 0.4143
6 HOUSAM69990251 0.3887
4 HOCANM11098662 0.3792
8 HOUSAM69756113 0.365
5 HOUSAM68571374 0.3649
7 HONLDM716072164 0.3600
10 HOUSAM66626020 0.3593
9 HOCANM11098658 0.3545
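Any suggestions? Note that the real data has no header, so the output should not have one either.
awk solution:
awk 'NR==FNR{ a[$2]=$1; next }{ print a[$1],$1,$2 }' file1 file2
NR==FNR - true only while the first input file (i.e. file1) is being processed
a[$2]=$1 - stores the rank value (first field, $1) in array a, indexed by the corresponding list-of-ids value (second field, $2)
next - skips to the next record (of file1)
print a[$1],$1,$2 - prints the fields of the second input file, file2 ($1, $2), preceded by the corresponding rank value a[$1]
Output: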
2 HOCANM106363549 0.4832
1 HOUSAM69708729 0.4199
3 HOCANM10845509 0.4143
6 HOUSAM69990251 0.3887
4 HOCANM11098662 0.3792
8 HOUSAM69756113 0.365
5 HOUSAM68571374 0.3649
7 HONLDM716072164 0.3600
10 HOUSAM66626020 0.3593
9 HOCANM11098658 0.3545
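If file2 could ever contain an ID that is missing from file1, the one-liner above prints an empty rank for it. Here is a minimal sketch of one way to make that case visible. The array name rank, the NA placeholder, and the ID HOXXXM00000000 are made up for illustration; the other rows are taken from the question.

```shell
# Build small sample files in a scratch directory (no headers, as in the real data).
tmp=$(mktemp -d)
printf '%s\n' '1 HOUSAM69708729 0.4468' \
              '2 HOCANM106363549 0.4434' \
              '3 HOCANM10845509 0.4268' > "$tmp/file1"
# HOXXXM00000000 is a hypothetical ID that does not occur in file1.
printf '%s\n' 'HOCANM106363549 0.4832' \
              'HOUSAM69708729 0.4199' \
              'HOXXXM00000000 0.1000' > "$tmp/file2"

# First file: remember the rank of every ID.
# Second file: print the stored rank, or NA when the ID was never seen in file1.
out=$(awk 'NR==FNR { rank[$2] = $1; next }
           { r = ($1 in rank) ? rank[$1] : "NA"; print r, $1, $2 }' \
          "$tmp/file1" "$tmp/file2")

printf '%s\n' "$out"
```

With these sample files the third line comes out as "NA HOXXXM00000000 0.1000", while the first two lines carry ranks 2 and 1 from file1.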
Another option, using 'join':
$ join -1 2 -2 1 -o 1.1,2.1,2.2 <(sort -k 2 file1) <(sort -k 1 file2)
2 HOCANM106363549 0.4832
3 HOCANM10845509 0.4143
9 HOCANM11098658 0.3545
4 HOCANM11098662 0.3792
7 HONLDM716072164 0.3600
10 HOUSAM66626020 0.3593
5 HOUSAM68571374 0.3649
1 HOUSAM69708729 0.4199
8 HOUSAM69756113 0.365
6 HOUSAM69990251 0.3887
ranks list-of-ids values
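Admittedly, this does not handle the header well. You have already accepted a solution, but I like this utility and not many people know about it ;)
Edit: if the source data has no headers, this command works well: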
$ cat file1
1 HOUSAM69708729 0.4468
2 HOCANM106363549 0.4434
3 HOCANM10845509 0.4268
4 HOCANM11098662 0.4203
5 HOUSAM68571374 0.3896
6 HOUSAM69990251 0.3895
7 HONLDM716072164 0.3893
8 HOUSAM69756113 0.3656
9 HOCANM11098658 0.3593
10 HOUSAM66626020 0.3538
$ cat file2
HOCANM106363549 0.4832
HOUSAM69708729 0.4199
HOCANM10845509 0.4143
HOUSAM69990251 0.3887
HOCANM11098662 0.3792
HOUSAM69756113 0.365
HOUSAM68571374 0.3649
HONLDM716072164 0.3600
HOUSAM66626020 0.3593
HOCANM11098658 0.3545
$ join -1 2 -2 1 -o 1.1,2.1,2.2 <(sort -k 2 file1) <(sort -k 1 file2)
2 HOCANM106363549 0.4832
3 HOCANM10845509 0.4143
9 HOCANM11098658 0.3545
4 HOCANM11098662 0.3792
7 HONLDM716072164 0.3600
10 HOUSAM66626020 0.3593
5 HOUSAM68571374 0.3649
1 HOUSAM69708729 0.4199
8 HOUSAM69756113 0.365
6 HOUSAM69990251 0.3887
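Note that the join output above comes back sorted by ID, not in the original order of file2. If the file2 order matters, one possible workaround is to tag each file2 line with its line number before sorting and to sort on that number again after the join. This is only a sketch under a few assumptions: the scratch file names f1.sorted and f2.sorted are made up, temporary files replace the process substitution so it also runs in a plain POSIX shell, and LC_ALL=C is set so that sort and join agree on the collation order.

```shell
LC_ALL=C; export LC_ALL

# Sample data: the first four rows of each file from the question, no headers.
tmp=$(mktemp -d)
printf '%s\n' '1 HOUSAM69708729 0.4468' \
              '2 HOCANM106363549 0.4434' \
              '3 HOCANM10845509 0.4268' \
              '4 HOCANM11098662 0.4203' > "$tmp/file1"
printf '%s\n' 'HOCANM106363549 0.4832' \
              'HOUSAM69708729 0.4199' \
              'HOCANM10845509 0.4143' \
              'HOCANM11098662 0.3792' > "$tmp/file2"

# file1: sort on the ID only (field 2).
sort -k2,2 "$tmp/file1" > "$tmp/f1.sorted"
# file2: prepend the line number, then sort on the ID (now field 2).
awk '{ print NR, $0 }' "$tmp/file2" | sort -k2,2 > "$tmp/f2.sorted"

# Join on the ID, put file2's line number first, restore that order,
# then cut the helper column away again.
out=$(join -1 2 -2 2 -o 2.1,1.1,2.2,2.3 "$tmp/f1.sorted" "$tmp/f2.sorted" |
      sort -n -k1,1 | cut -d' ' -f2-)

printf '%s\n' "$out"
```

For this sample the ranks come out as 2, 1, 3, 4, in the same row order as file2.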
If either of your files does contain headers, you can grep them out before the 'sort':
$ cat file1
ranks list-of-ids values
1 HOUSAM69708729 0.4468
2 HOCANM106363549 0.4434
3 HOCANM10845509 0.4268
4 HOCANM11098662 0.4203
5 HOUSAM68571374 0.3896
6 HOUSAM69990251 0.3895
7 HONLDM716072164 0.3893
8 HOUSAM69756113 0.3656
9 HOCANM11098658 0.3593
10 HOUSAM66626020 0.3538
$ cat file2
list-of-ids values
HOCANM106363549 0.4832
HOUSAM69708729 0.4199
HOCANM10845509 0.4143
HOUSAM69990251 0.3887
HOCANM11098662 0.3792
HOUSAM69756113 0.365
HOUSAM68571374 0.3649
HONLDM716072164 0.3600
HOUSAM66626020 0.3593
HOCANM11098658 0.3545
$ join -1 2 -2 1 -o 1.1,2.1,2.2 <(grep -v "list-of-ids" file1 | sort -k 2) <(grep -v "list-of-ids" file2 | sort -k 1)
2 HOCANM106363549 0.4832
3 HOCANM10845509 0.4143
9 HOCANM11098658 0.3545
4 HOCANM11098662 0.3792
7 HONLDM716072164 0.3600
10 HOUSAM66626020 0.3593
5 HOUSAM68571374 0.3649
1 HOUSAM69708729 0.4199
8 HOUSAM69756113 0.365
6 HOUSAM69990251 0.3887
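The grep above depends on the header containing the text "list-of-ids". If the header wording can vary but it is always the first line, dropping that line by position may be more robust. A small sketch under that assumption, again with made-up scratch file names and temporary files instead of process substitution:

```shell
LC_ALL=C; export LC_ALL

# Sample files with one header line each, using rows from the question.
tmp=$(mktemp -d)
printf '%s\n' 'ranks list-of-ids values' \
              '1 HOUSAM69708729 0.4468' \
              '2 HOCANM106363549 0.4434' > "$tmp/file1"
printf '%s\n' 'list-of-ids values' \
              'HOCANM106363549 0.4832' \
              'HOUSAM69708729 0.4199' > "$tmp/file2"

# tail -n +2 prints everything from line 2 onward, i.e. it skips the header.
tail -n +2 "$tmp/file1" | sort -k2,2 > "$tmp/f1.sorted"
tail -n +2 "$tmp/file2" | sort -k1,1 > "$tmp/f2.sorted"

out=$(join -1 2 -2 1 -o 1.1,2.1,2.2 "$tmp/f1.sorted" "$tmp/f2.sorted")

printf '%s\n' "$out"
```

As with the other join commands, the result is in ID order and carries no header line.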