Count the max number of _ and add additional ; if missing

I have a file with several fields, like this:

deme_Fort_Email_am;04/02/2015;Deme_Fort_Postal
deme_faible_Email_am;18/02/2015;deme_Faible_Email_Relance_am
equi_Fort_Email_am;23/02/2015;trav_Fort_Email_am
trav_Faible_Email_pm;18/02/2015;trav_Faible_Email_Relance_pm
trav_Fort_Email_am;12/02/2015;Trav_Fort_Postal
voya_Faible_Email_am;29/01/2015;voya_Faible_Email_Relance_am

The goal is to get this:

deme;Fort;Email;am;04/02/2015;Deme;Fort;Postal;;
deme;faible;Email;am;18/02/2015;deme;Faible;Email;Relance;am
equi;Fort;Email;am;23/02/2015;trav;Fort;Email;am;
trav;Faible;Email;pm;18/02/2015;trav;Faible;Email;Relance;pm
trav;Fort;Email;am;12/02/2015;Trav;Fort;Postal;;
voya;Faible;Email;am;29/01/2015;voya;Faible;Email;Relance;am

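I want to find the maximum number of underscores in any one line, change the underscores to semicolons, and then add extra semicolons to every line that has fewer than the maximum found across all lines.

I thought of using awk for this, but with the command line below I would only change everything after the first field. My aim is also to add the extra semicolons.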

awk 'BEGIN{FS=OFS=";"} {for (i=1;i<=NF;i++) gsub(/_/,";", $i) } 1' file

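Note: since awk processes the file line by line, I am not sure this can be done, but I am asking just in case. If it cannot be done, please let me know and I will find another way.

Thanks.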

This should do:

awk -F_ '{for (i=1;i<=NF;i++) a[NR FS i]=$i;c=NF>c?NF:c} END {for (j=1;j<=NR;j++) {for (i=1;i<c;i++) printf "%s;",a[j FS i];print a[j FS c]}}' file
deme;Fort;Email;am;04/02/2015;Deme;Fort;Postal;;
deme;faible;Email;am;18/02/2015;deme;Faible;Email;Relance;am
equi;Fort;Email;am;23/02/2015;trav;Fort;Email;am;
trav;Faible;Email;pm;18/02/2015;trav;Faible;Email;Relance;pm
trav;Fort;Email;am;12/02/2015;Trav;Fort;Postal;;
voya;Faible;Email;am;29/01/2015;voya;Faible;Email;Relance;am

How it works:

awk -F_ '                               # Set the field separator to "_"
    {for (i=1;i<=NF;i++)                # Loop through the fields one by one
        a[NR FS i]=$i                   # Store the field in array "a", keyed by row (NR) and column (i)
    c=NF>c?NF:c}                        # Track the largest number of fields seen so far in "c"
END {                                   # Once the whole file has been read
    for (j=1;j<=NR;j++) {               # Loop through all rows
        for (i=1;i<c;i++)               # Loop through all columns except the last
            printf "%s;",a[j FS i]      # Print each field followed by ";" (missing fields print as empty)
        print a[j FS c]                 # Print the last field and end the line
        }
    }
' file                                  # read the file
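For comparison, here is a minimal sketch of the same "remember everything, pad at the end" idea that stores whole lines instead of individual fields. It is not the code from the answer above; it relies on the fact that gsub returns the number of replacements it made. The file name sample.txt and the function name pad_lines are made up for illustration, and the data is the first three lines from the question.

```shell
# Hypothetical sample file: the first three lines from the question
cat > sample.txt <<'EOF'
deme_Fort_Email_am;04/02/2015;Deme_Fort_Postal
deme_faible_Email_am;18/02/2015;deme_Faible_Email_Relance_am
equi_Fort_Email_am;23/02/2015;trav_Fort_Email_am
EOF

# pad_lines is an illustrative name, not part of the original answer
pad_lines() {
    awk '
    {
        n[NR] = gsub(/_/, ";")        # gsub returns how many "_" it replaced in $0
        line[NR] = $0                 # keep the converted line until the maximum is known
        if (n[NR] > max) max = n[NR]  # track the largest count seen so far
    }
    END {
        for (r = 1; r <= NR; r++) {
            pad = ""
            for (k = n[r]; k < max; k++) pad = pad ";"   # one ";" per missing "_"
            print line[r] pad
        }
    }' "$1"
}

pad_lines sample.txt
```

On this sample the first line gets two extra ; appended and the third line gets one. Like the answer above, it holds the whole file in memory, so it is only a reasonable choice for files that fit comfortably in RAM.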

Here is a two-pass solution. Note that you need to put the data file on the command line twice when running awk:

$ cat mu.awk
BEGIN { FS="_"; OFS=";" }
NR == FNR { if (max < NF) max = NF; next }
{ $1=$1; i = max; j = NF; while (i-- > j) $0 = $0 OFS }1

$ awk -f mu.awk mu.txt mu.txt
deme;Fort;Email;am;04/02/2015;Deme;Fort;Postal;;
deme;faible;Email;am;18/02/2015;deme;Faible;Email;Relance;am
equi;Fort;Email;am;23/02/2015;trav;Fort;Email;am;
trav;Faible;Email;pm;18/02/2015;trav;Faible;Email;Relance;pm
trav;Fort;Email;am;12/02/2015;Trav;Fort;Postal;;
voya;Faible;Email;am;29/01/2015;voya;Faible;Email;Relance;am

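The BEGIN block sets the input and output field separators.

The NR == FNR block makes the first pass over the file and records the maximum number of fields.

The last block makes the second pass over the file. First it rebuilds the line so that it uses the output field separator, then it appends one extra ; for each field the line is short of the maximum.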
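If the NR == FNR test is unfamiliar, the small sketch below may help. NR counts records across all input files, while FNR restarts at 1 for each file, so the two are equal only while the first copy of the file is being read. The file name demo.txt and the function name two_pass_demo are made up for this illustration.

```shell
# Hypothetical two-line file, just to show the counters
printf 'a_b\nc_d_e\n' > demo.txt

# two_pass_demo is an illustrative name; it reads the same file twice
two_pass_demo() {
    awk -F_ '
    NR == FNR { print "pass 1:", $0, "NF=" NF; next }      # true only for the first file argument
              { print "pass 2:", $0, "NR=" NR, "FNR=" FNR } # reached only on the second read
    ' "$1" "$1"
}

two_pass_demo demo.txt
# pass 1: a_b NF=2
# pass 1: c_d_e NF=3
# pass 2: a_b NR=3 FNR=1
# pass 2: c_d_e NR=4 FNR=2
```

Because the file is read twice, nothing has to be kept in memory apart from the maximum, but the input must be a real file rather than a pipe.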

Edit

This version answers the updated question and only affects the fields after field 7:

$ cat mu2.awk
BEGIN { OFS=FS=";" }

# First pass, find the max number of "_"
NR == FNR { gsub("[^_]",""); if (max < length()) max = length(); next }

# Second pass:
{
    # count number of "_" less than the max
    line = $0
    gsub("[^_]","", line)
    n = max - length(line)

    # replace "_" with ";" after field 7
    for (i=8; i<=NF; ++i) gsub("_", ";", $i);

    # add an extra ";" for each "_" less than max
    while (n-- > 0) $0 = $0 ";"
}1

$ awk -f mu2.awk mu2.txt mu2.txt
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;deme;Fort;Email;am;04/02/2015;Deme;Fort;Postal;;
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;deme;faible;Email;am;18/02/2015;deme;Faible;Email;Relance;am
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;equi;Fort;Email;am;23/02/2015;trav;Fort;Email;am;
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;trav;Faible;Email;pm;18/02/2015;trav;Faible;Email;Relance;pm
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;trav;Fort;Email;am;12/02/2015;Trav;Fort;Postal;;
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;voya;Faible;Email;am;29/01/2015;voya;Faible;Email;Relance;am
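
One thing to keep in mind: mu2.awk counts every "_" on the line, including those in fields 1 to 7, which works for the sample because those fields contain the same number of underscores on every line. If that is not guaranteed in your data, a possible variation is to count only from field 8 onward. The sketch below is an assumption about what you might want, not part of the answer above; the file name mu2_sample.txt and the function name pad_after_7 are made up.

```shell
# Hypothetical sample: the first seven fields differ in their number of "_"
cat > mu2_sample.txt <<'EOF'
xxx;x_x_x;xxx;xxx;xxx;xxx;xxx;deme_Fort_Email_am;04/02/2015;Deme_Fort_Postal
xxx;xxx;xxx;xxx;xxx;xxx;xxx;deme_faible_Email_am;18/02/2015;deme_Faible_Email_Relance_am
EOF

# pad_after_7 is an illustrative name
pad_after_7() {
    awk '
    BEGIN { OFS = FS = ";" }

    # First pass: count "_" in fields 8..NF only (gsub returns the match count)
    NR == FNR {
        n = 0
        for (i = 8; i <= NF; ++i) n += gsub("_", "_", $i)
        if (n > max) max = n
        next
    }

    # Second pass: replace "_" after field 7 and pad up to the maximum
    {
        n = 0
        for (i = 8; i <= NF; ++i) n += gsub("_", ";", $i)
        while (n++ < max) $0 = $0 ";"
    }
    1' "$1" "$1"
}

pad_after_7 mu2_sample.txt
```

Here the first line keeps x_x_x in field 2 and still gets two extra ; at the end, because only the underscores after field 7 are counted.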