Unix/Bash: Uniq on a cell
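I have a tab-separated file A in which column 12 (counting from 1) contains several comma-separated identifiers. Some of them may occur more than once within the same row: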
GO:0042302, GO:0042302, GO:0042302
GO:0004386,GO:0005524,GO:0006281, GO:0004386,GO:0005524,GO:0006281
....
....
(Some have whitespace after the comma, some don't.)
I want to keep only the unique identifiers, removing the repeats within each row of column 12:
GO:0042302
GO:0004386,GO:0005524,GO:0006281
....
....
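This is what I have so far:
for row in `fileA`
do
cut -f12 $row | sed "s/,/\n/" | sort | uniq | paste fileA - | \
awk 'BEGIN {OFS=FS="\t"}{print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $13}'
done > out
My idea was to go through the rows one at a time, cut out column 12, replace all commas with newlines, then sort and use uniq to remove the duplicates, paste the result back, and print the columns in the right order, skipping the original identifier column.
However, this doesn't seem to work. Any ideas?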
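The loop above fails for several independent reasons: the backticks try to execute fileA as a command, cut -f12 $row treats each word as a file name, sed "s/,/\n/" has no g flag so it replaces only the first comma, and paste fileA - pastes the whole file on every pass. As a rough sketch of how the same cut/sort/uniq/paste idea could be made to work, assuming a POSIX shell and the hypothetical helper names dedup_cell and dedup_col12 (neither is part of the original post):

```shell
# dedup_cell (hypothetical helper): print the sorted, de-duplicated,
# comma-joined identifiers of ONE cell passed as "$1".
# Steps: one identifier per line, trim surrounding spaces,
# sort and drop duplicates, join the lines back with commas.
dedup_cell() {
  printf '%s\n' "$1" |
    tr ',' '\n' |
    sed 's/^ *//; s/ *$//' |
    sort -u |
    paste -sd, -
}

# dedup_col12 (hypothetical helper): rewrite column 12 of the
# tab-separated file "$1" and print the result to standard output.
# The cleaned cells are pasted on as an extra last column, copied
# into field 12, and then the extra column is stripped again.
dedup_col12() {
  cut -f12 "$1" |
    while IFS= read -r cell; do dedup_cell "$cell"; done |
    paste "$1" - |
    awk 'BEGIN { FS = OFS = "\t" } { $12 = $NF; sub(/\t[^\t]*$/, ""); print }'
}
```

It would be called as dedup_col12 fileA > out. It starts several processes per row, so it is slow on large files, and unlike some of the awk answers it sorts the identifiers rather than keeping their original order.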
If I understand you correctly, with awk:
awk -F '\t' 'BEGIN { OFS = FS } { delete b; n = split($12, a, /, */); $12 = ""; for(i = 1; i <= n; ++i) { if(!(a[i] in b)) { b[a[i]]; $12 = $12 a[i] "," } } sub(/,$/, "", $12); print }' filename
It works as follows:
BEGIN { OFS = FS }                 # output FS same as input FS
{
  delete b                         # clear dirty table from last pass
  n = split($12, a, /, */)         # split 12th field into tokens,
  $12 = ""                         # then clear it out for reassembly
  for(i = 1; i <= n; ++i) {        # wade through those tokens
    if(!(a[i] in b)) {             # those that haven't been seen yet:
      b[a[i]]                      # remember that they were seen
      $12 = $12 a[i] ","           # append to result
    }
  }
  sub(/,$/, "", $12)               # remove trailing comma from resulting field
  print                            # print the transformed line
}
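Note that delete b; has been POSIX-conforming for only a short while, so if you are working with an old, old awk and it fails for you, see @MarkReed's comment for another way that ancient awks should accept.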
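For an awk that rejects delete b, one commonly used workaround is to clear the array with split("", b). The sketch below wraps the same program in a shell function so it can be tried on a made-up row without a real fileA; the function name dedup12 and the filler columns c1..c11 are assumptions for illustration only.

```shell
# dedup12 (hypothetical name): same logic as the awk program above,
# but clearing the lookup array with split("", b) instead of delete b.
# Reads the named files, or standard input if none are given.
dedup12() {
  awk -F '\t' 'BEGIN { OFS = FS } {
    split("", b)                     # portable way to empty array b
    n = split($12, a, /, */)         # tokens of the 12th field
    $12 = ""
    for (i = 1; i <= n; ++i)
      if (!(a[i] in b)) {            # first time this token is seen
        b[a[i]]
        $12 = $12 a[i] ","
      }
    sub(/,$/, "", $12)               # drop the trailing comma
    print
  }' "$@"
}

# Try it on one made-up row: 11 filler columns plus the 12th field.
printf 'c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc10\tc11\tGO:0042302, GO:0042302, GO:0042302\n' |
  dedup12 | cut -f12
# prints: GO:0042302
```

This variant keeps the identifiers in first-occurrence order and leaves every other column untouched.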
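Use this awk:
awk -F '\t' -v OFS='\t' '{
    delete seen
    s = ""                               # reset the result for every line
    n = split($12, a, /[,; ]+/)
    for (i = 1; i <= n; i++) {
        if (!(a[i] in seen)) {
            seen[a[i]]
            s = s (s == "" ? "" : ",") a[i]   # comma only between items
        }
    }
    $12 = s
} 1' file
For the sample rows, the 12th field of the output is:
GO:0042302
GO:0004386,GO:0005524,GO:0006281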
Using field 2 instead of field 12:
$ cat tst.awk
BEGIN{ FS=OFS="\t" }
{
    split($2,f,/ *, */)
    $2 = ""
    delete seen
    for (i=1;i in f;i++) {
        if ( !seen[f[i]]++ ) {
            $2 = $2 (i>1?",":"") f[i]
        }
    }
    print
}
$ cat file
a,a,a GO:0042302, GO:0042302, GO:0042302 b,b,b
c,c,c GO:0004386,GO:0005524,GO:0006281, GO:0004386,GO:0005524,GO:0006281 d,d,d
$ awk -f tst.awk file
a,a,a GO:0042302 b,b,b
c,c,c GO:0004386,GO:0005524,GO:0006281 d,d,d
If your awk doesn't support delete seen, you can use split("",seen) instead.
For completeness, and because I personally prefer Perl over Awk for this sort of thing, here is a Perl one-liner:
perl -F'\t' -lane '%u=();@k=split/,\s*/,$F[11];@u{@k}=@k;$F[11]=join",",sort keys%u;print join"\t",@F' fileA
Explanation:
-F'\t'  split each input line into fields at tabs (stored in @F)
-a -n   turn on that autosplit and loop over the input lines (both implied by -F since Perl 5.20)
-l      automatically remove newlines from input and append them on output
-e      get code to execute from the next argument instead of standard input
%u = ();                          # clear out the hash variable %u
@k = split /,\s*/, $F[11];        # split 12th field (1st is 0) on comma plus optional whitespace into array @k
@u{@k} = @k;                      # copy the contents of @k into %u as key/value pairs
Because hash keys are unique, that last step means the keys of %u are now a de-duplicated copy of @k.
$F[11] = join ",", sort keys %u;  # replace the 12th field with the sorted unique list
print join "\t", @F;              # and print out the modified line
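In your sample data, a comma followed by a space acts as the delimiter within the 12th field. Every subfield after the first is just a repeat of the first one, and the subfields appear to be sorted already.
GO:0042302, GO:0042302, GO:0042302
            ^^^dup1^^^  ^^^dup2^^^
GO:0004386,GO:0005524,GO:0006281, GO:0004386,GO:0005524,GO:0006281
                                  ^^^^^^^^^^^^^^^dup1^^^^^^^^^^^^^
Based on that, you could simply keep the first subfield and throw away the rest (OFS is set as well, so the rebuilt line stays tab-separated):
awk 'BEGIN{FS=OFS="\t"} {sub(/, .*/, "", $12)} 1' fileA
On the other hand, you may have different sets of repeated subfields whose keys are not sorted like this: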
GO:0042302, GO:0042302, GO:0042302, GO:0062122,GO:0055000, GO:0055001, GO:0062122,GO:0055000
GO:0004386,GO:0005524,GO:0006281, GO:0005525, GO:0004386,GO:0005524,GO:0006281
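If you are stuck with the default MacOS awk, you can bring your own sort/uniq functions along in an executable awk script:
#!/usr/bin/awk -f
BEGIN { FS = OFS = "\t" }   # set OFS too, so the rebuilt line stays tab-separated
{
    c = uniq(a, split($12, a, /, |,/))
    sort(a, c)
    s = a[1]
    for(i=2; i<=c; i++) { s = s "," a[i] }
    $12 = s
}
47 # any non-zero constant pattern is true: print out the modified line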
# take an indexed arr as from split and de-dup it
# (i, k and uarr are local variables)
function uniq(arr, len,    i, k, uarr) {
    for(i=len; i>=1; i--) { uarr[arr[i]] }
    delete arr
    for(k in uarr) { arr[++i] = k }   # i is 0 here, so arr is refilled from index 1
    return( i )
}
# slightly modified from
# http://rosettacode.org/wiki/Sorting_algorithms/Bubble_sort#AWK
function sort(arr, len,    haschanged, tmp, i)
{
    haschanged = 1
    while( haschanged==1 ) {
        haschanged = 0
        for(i=1; i<=(len-1); i++) {
            if( arr[i] > arr[i+1] ) {
                tmp = arr[i]
                arr[i] = arr[i + 1]
                arr[i + 1] = tmp
                haschanged = 1
            }
        }
    }
}
If you have GNU awk, I think you could replace the sort(a, c) call with asort(a) and remove the bubble-sort function entirely.
I get the following for the 12th field:
GO:0042302,GO:0055000,GO:0055001,GO:0062122
GO:0004386,GO:0005524,GO:0005525,GO:0006281
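The script above returns the identifiers sorted. If the order of first appearance matters instead, a minimal sketch along the following lines could be used; the helper name dedup_cell_ordered is hypothetical and it works on a single cell passed as a string, not on the whole file.

```shell
# dedup_cell_ordered (hypothetical helper): de-duplicate the
# comma-separated identifiers in "$1", keeping first-occurrence order.
dedup_cell_ordered() {
  printf '%s\n' "$1" | awk '{
    n = split($0, a, / *, */)        # commas with optional spaces around them
    split("", seen)                  # empty the lookup table
    out = ""
    for (i = 1; i <= n; i++)
      if (!seen[a[i]]++)             # true only the first time a[i] shows up
        out = out (out == "" ? "" : ",") a[i]
    print out
  }'
}

dedup_cell_ordered 'GO:0042302, GO:0042302, GO:0042302, GO:0062122,GO:0055000, GO:0055001, GO:0062122,GO:0055000'
# prints: GO:0042302,GO:0062122,GO:0055000,GO:0055001
```

Compared with the sorted result shown above, GO:0062122 stays in second position here because that is where it first occurs in the input.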