Joining selected columns from multiple TSV files in bash
I have a bunch of tab-delimited text files that look like this:
"gene_id" "Pattern1" "Pattern2" "Pattern3" "Pattern4" "Pattern5" "MAP" "PPDE"
"ENSG00000119771.13" 3.11528786599051e-18 2.52650109640992e-13 6.25109524320237e-09 0.345846257420197 0.654153736328455 "Pattern5" 1
"ENSG00000123700.4" 1.75016991626305e-36 3.98804090894939e-19 0.63423772228367 3.8159144080782e-21 0.36576227771633 "Pattern3" 1
"ENSG00000128567.15" 1.10722918612618e-23 7.62691311068806e-07 5.77031364194955e-06 5.13675840911147e-21 0.999993466995047 "Pattern5" 1
"ENSG00000130182.6" 9.75717082221716e-22 1.27675651077242e-12 0.469972541094369 1.13677117238758e-12 0.530027458903217 "Pattern5" 1
"ENSG00000131914.9" 3.1627489688037e-41 1.00274706758683e-22 0.0578584524816503 6.98718794692175e-22 0.94214154751835 "Pattern5" 1
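Now I want to merge them into a single file so that I get
"gene_id" "Pattern5" "Pattern5" "Pattern5" "Pattern5" "Pattern5"
where each Pattern5 column comes from a different file.
I tried a few things with
cut -f 6 <file>
and
paste <file1> <file2> ...
but I could not combine them correctly.
Thanks for your help!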
Update:
Here is a testable example as input:
<file1>
gene_id Pattern1 Pattern2 Pattern3 Pattern4 Pattern5
ENSG00000119771 1 2 3 4 5
ENSG00000123700 1 2 3 4 5
ENSG00000128567 1 2 3 4 5
ENSG00000130182 1 2 3 4 5
ENSG00000131914 1 2 3 4 5
<file2>
gene_id Pattern1 Pattern2 Pattern3 Pattern4 Pattern5
ENSG00000119771 6 7 8 9 10
ENSG00000123700 6 7 8 9 10
ENSG00000128567 6 7 8 9 10
ENSG00000130182 6 7 8 9 10
ENSG00000131914 6 7 8 9 10
<file3>
gene_id Pattern1 Pattern2 Pattern3 Pattern4 Pattern5
ENSG00000119771 11 12 13 14 15
ENSG00000123700 11 12 13 14 15
ENSG00000128567 11 12 13 14 15
ENSG00000130182 11 12 13 14 15
ENSG00000131914 11 12 13 14 15
The desired output would be
gene_id Pattern5_file1 Pattern5_file2 Pattern5_file3
ENSG00000119771 5 10 15
ENSG00000123700 5 10 15
ENSG00000128567 5 10 15
ENSG00000130182 5 10 15
ENSG00000131914 5 10 15
Update 2:
I tried Ed Morton's approach:
awk '
BEGIN { FS=OFS="\t" } FNR==1{ARGIND++}
# $5 matches the Pattern4 values shown in the output below
{ genes[$1]; val[$1,ARGIND] = $5 }
END {
for (gene in genes) {
printf "%s%s", gene, OFS
for (file=1; file<=ARGIND; file++) {
printf "%s%s", val[gene,file], (file<ARGIND?OFS:ORS)
}
}
} ' $files
But the output is not formatted correctly:
ENSG00000128567 4 9 14
ENSG00000130182 4 9 14
ENSG00000119771 4 9 14
gene_id Pattern4 Pattern4 Pattern4
ENSG00000131914 4 9 14
ENSG00000123700 4 9 14
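The rows above come out scrambled because awk visits the keys of `for (gene in genes)` in an unspecified order, so the header can land anywhere. Below is a minimal sketch of an order-preserving variant. The function name `join_column` and the variables `nfiles`, `seen` and `order` are illustrative and not part of the original approach; the sketch keeps its own file counter instead of touching `ARGIND`, which is a gawk built-in, and it assumes tab-separated input with the key in column 1.

```shell
# Hypothetical helper: join_column COL FILE...
# Prints each key (column 1) once, in first-seen order, followed by
# column COL from every input file, tab-separated.
join_column() {
    local col=$1
    shift
    awk -v col="$col" '
    BEGIN { FS = OFS = "\t" }
    FNR == 1 { nfiles++ }                        # a new input file starts
    !($1 in seen) { seen[$1]; order[++n] = $1 }  # remember key order
    { val[$1, nfiles] = $col }                   # store the selected column
    END {
        for (i = 1; i <= n; i++) {
            key = order[i]
            printf "%s", key
            for (f = 1; f <= nfiles; f++)
                printf "%s%s", OFS, val[key, f]
            printf "%s", ORS
        }
    }' "$@"
}
```

Called as `join_column 6 file1 file2 file3`, this should print the header line first and the genes in the order they appear in the first file. A gene missing from one file gets an empty field for that file.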
for f in file1 file2 file3; do
cut -f 6 "$f"; done |
awk '{if ($1~/Pattern5/) {printf("\n%s\t",$1)} else {printf("%s\t",$1)} };END{print ""}' |
tail -n +2
"Pattern5" 0.654153736328455 0.36576227771633 0.999993466995047
"Pattern5" 0.654153736328455 0.36576227771633 0.999993466995047
"Pattern5" 0.654153736328455 0.36576227771633 0.999993466995047
(I just used the same data for files 1-3.)
You can also specify the input files with a glob if they are named regularly, e.g. for f in myfiles*.
Try this:
#!/bin/bash
paste file1 file2 file3 | awk -v patternIdx=6 '
function printPattern(idx, isFirstLine) {
for (i = 1; i <= NF; ++i) {
if (i == 1)
printf "%s ", $i;
else if (isFirstLine && i % patternIdx == 0)
printf "%s_file%d ", $i, i / patternIdx;
else if (i % patternIdx == 0)
printf "%s ", $i;  # %s keeps non-integer values intact
}
printf "\n"
}
{
if (NR == 1)
printPattern(patternIdx, 1);
else
printPattern(patternIdx, 0);
}'
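patternIdx is the column index of Pattern5. Note that the script selects every patternIdx-th field of the pasted line, so it only works when Pattern5 is the last column of every file, as in the test example.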
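When the files have extra columns after Pattern5 (as in the real data with MAP and PPDE), one way around the every-Nth-field restriction is to extract the column from each file first and paste the results. The following is a rough sketch, assuming all files list the same genes on the same rows; the function name `paste_column` and the temporary file naming are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: paste_column COL FILE...
# Takes the key column from the first file, then column COL from each file,
# and labels each header cell with the file name (e.g. Pattern5_file1).
paste_column() {
    local col=$1
    shift
    local tmp f i=0
    tmp=$(mktemp -d) || return 1
    cut -f 1 "$1" > "$tmp/000000"        # key column, taken from the first file
    for f in "$@"; do
        i=$((i + 1))
        # Zero-padded names keep the glob below in argument order.
        awk -v c="$col" -v name="$(basename "$f")" '
            BEGIN { FS = "\t" }
            { print (NR == 1 ? $c "_" name : $c) }
        ' "$f" > "$(printf '%s/%06d' "$tmp" "$i")"
    done
    paste "$tmp"/*
    rm -rf "$tmp"
}
```

For the test files, `paste_column 6 file1 file2 file3` should produce the header `gene_id Pattern5_file1 Pattern5_file2 Pattern5_file3` (tab-separated) followed by one row per gene. Unlike a key-based join, this relies purely on row position, so the files should be sorted identically beforehand.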