如果第 4 列匹配，sed 替换第 9 列； else 用其他东西替换第 9 列

Question

当读取一个非常大的制表符分隔文件时，如下所示：

.   .   .   A   .   .   .   .   3:.:.:20
.   .   .   B   .   .   .   .   4:.:30
.   .   .   C   .   .   .   .   5:.:.:40:.
.   .   .   D   .   .   .   .   .:.:.:.
.   .   .   A   .   .   .   .   7:.:.:21
.   .   .   B   .   .   .   .   .:.:.:.
.   .   .   D   .   .   .   .   .:.:.:.
.   .   .   C   .   .   .   .   .:.:.

我想将第 9 列保留为 .:.:.:。仅当第 4 列 = D 时。对于所有其他第 4 列值，我有一个单独的 sed 替换查询。所需的输出将如下所示：

.   .   .   A   .   .   .   .   1:.:.:80
.   .   .   B   .   .   .   .   1:.:80
.   .   .   C   .   .   .   .   1:.:.:80:.
.   .   .   D   .   .   .   .   .:.:.:.
.   .   .   A   .   .   .   .   1:.:.:80
.   .   .   B   .   .   .   .   1:.:.:80
.   .   .   D   .   .   .   .   .:.:.:.
.   .   .   C   .   .   .   .   1:.:80

我当前的伪代码是这样的：

if [column 4 = D]
    then
        # either replace or just keep original entry
        sed '/some replacement query/g' {file}
    else
        sed '/some other replacement query/g' {file}
fi

我在想如果我可以逐行读取文件，也许我可以采用这种方法？但是后来我不断地写信更新一个新文件。

我不确定如何在 shell 环境中有效地完成这段时间。由于最终目标是创建一个大型最终文件，因此我当前的路线创建了两个不正确的不同版本。也许有一种方法可以在单个 sed 查询中做我想做的事？但这超出了我的专业范围。

编辑：sed 替换查询当前查看第 9 列并确定字段分隔符（“:”）的数量并将其替换为相应的值，即 3:.:.:20 变为 1:.:.:80 ，4:.:30 变为 1:.:80 ，并且 5:.:.:40:. 变为 1:.:.:80:. 字段分隔符之间的值可以是数字 0-99 或“.”价值。但是，如果第 4 列 = N，则第 9 列必须变成（或者在本例中，保持不变）.:.:.:. 我当前的替换查询将 .:.:.:. 变成 1:.:.:80 而我不知道不想。

sed 's/\t\(\.\|[0-9]\):\(\.\|\([0-9]\+\)\):\(\.\|\([0-9]\+\)\):\(\.\|\([0-9]\+\)\)$/\t1:\.:\.:80/g; 
s/\t\(\.\|[0-9]\):\([0-9],[0-9]\):\(\.\|\([0-9]\+\)\):\(\.\|\([0-9]\+\)\):\(\.\|\([0-9]\+\)\)$/\t1:\.:\.:80:\./g; 
s/\t\(\.\|[0-9]\):\(\.\|\([0-9]\+\)\):\(\.\|\([0-9]\+\)\)$/\t1:\.:80/g'

Answer 1

这是一个可能的解决方案，使用 awk。

cat file | awk '{if ( != "D") { sub(, "Replaced the ninth field" ); print } else { print }}''

如果第 4 个字段 ($4) 不是 D。

，这将用字符串替换第 9 个字段 (

</code>)
<p>如果您足够了解最后一个字段的形状，例如 <code>/\d\/\.\/\d$

之类，您也可以用正则表达式替换

</code>。不确定您是否还需要在那里进行反向引用替换。</p>
<p>但这可能是一个开始。</p>
<pre><code>cat out.tsv | awk '{if ( != "D") { sub(, "Replaced the ninth field" ); print } else { print }}'
. . . A . . . . Replaced the ninth field
.   .   .   B   .   .   .   .   Replaced the ninth field
.   .   .   C   .   .   .   .   Replaced the ninth field
.   .   .   D   .   .   .   .   ./././.
.   .   .   A   .   .   .   .   Replaced the ninth field
.   .   .   B   .   .   .   .   Replaced the ninth field
.   .   .   D   .   .   .   .   ./././.
.   .   .   C   .   .   .   .   Replaced the ninth field

Answer 2

使用您展示的示例，请尝试执行以下 awk 程序。

awk '!="D"{sub(/^[0-9]+|^\./,"1",);sub(/[0-9]+$|\.$/,"80",)} 1' Input_file

OR 如果 Input_file 中有制表符分隔值，则将 BEGIN 部分添加到上述程序如下：

awk '
BEGIN{FS=OFS="\t"}
!="D"{
  sub(/^[0-9]+|^\./,"1",)
  sub(/[0-9]+$|\.$/,"80",)
}
1
' Input_file

Answer 3

awk 通常更擅长面向领域的东西，但也可以用 sed 完成：除非第 4 列是 D 提取第 9 列，编辑和附加为第 10 列，最后删除第 9 列。

编辑：更新为使用 : 作为子字段分隔符（原为 /）。
EDIT2：澄清 \t 不是 POSIX 或 macOS。

sed -E -e '
/^([^\t]*\t){3}D\t/b            # special case: dont edit if D in column 4
h                               # copy line to hold space
s/^([^\t]*\t){8}([^\t]*).*//  # set pattern space = column 9
s/^[^:]+/1/                     # set first subfield = 1
/([0-9]+){2}|[^:]+$/ s//80/     # set 2nd digit group or last subfield = 80
x                               # exchange pattern and hold spaces
G                               # append \n+hold space to pattern space
s/\n/\t/                        # replace \n with field separator
s/[^\t]*\t//9                   # delete (old) column 9
' -- file

补充评论：

tab是字段分隔符；如果你的 sed 不明白 \t （GNU sed 可以，但可能不在 macos 上） - 或者使脚本 POSIX 完全正确 - 将 \t 替换为文字制表符
s//80/ 中的空正则表达式重新应用上次使用的正则表达式（在 /…/ 中）
使用 -E 选项 ERE 扩展正则表达式

Answer 4

我不确定你真的需要对这些映射进行硬编码，但你的问题中没有其他解释如何提出它们，所以假设你这样做，那么这将在每个 Unix 机器上使用任何 shell 中的任何 awk 工作：

$ cat tst.awk
BEGIN {
    FS=OFS="\t"

    str2str[".:.:."]      = "1:.:80"
    str2str[".:.:.:."]    = "1:.:.:80"
    str2str[".:.:.:.:."]  = "1:.:.:80:."

    for (str in str2str) {
        re = str
        gsub(/\./,"(.|[0-9]+)",re)
        re2str["^("re")$"] = str2str[str]
    }
}
 != "D" {
    for (re in re2str) {
        if ( ~ re) {
             = re2str[re]
        }
    }
}
{ print }

$ awk -f tst.awk file
.       .       .       A       .       .       .       .       1:.:.:80
.       .       .       B       .       .       .       .       1:.:80
.       .       .       C       .       .       .       .       1:.:.:80:.
.       .       .       D       .       .       .       .       .:.:.:.
.       .       .       A       .       .       .       .       1:.:.:80
.       .       .       B       .       .       .       .       1:.:.:80
.       .       .       D       .       .       .       .       .:.:.:.
.       .       .       C       .       .       .       .       1:.:80

如果第 4 列匹配，sed 替换第 9 列； else 用其他东西替换第 9 列

If column 4 is a match, sed replace column 9; else replace column 9 with something else

unix

shell

awk

sed