Delete and combine rows in CSV file for special condition on columns
I have the following in a CSV file:
ID | Contact_id | Tags_id |
---|---|---|
"id114" | "" | "Tags_id3" |
"id12" | "" | "Tags_id1" |
"" | "" | "Tags_id3" |
"id3353" | "contact_id8764" | "Tags_id5" |
"id355" | "contact_id16" | "Tags_id6" |
"" | "" | "Tags_id7" |
"" | "" | "Tags_id3" |
"" | "contact_id564" | "Tags_id2" |
"" | "" | "Tags_id12" |
"id12076" | "contact_id137" | "Tags_id7" |
"" | "" | "Tags_id3" |
"" | "" | "Tags_id5" |
"" | "" | "Tags_id1" |
... | ... | ... |
Plain text for testing:
ID,Contact_id,Tags_id
"id114","","Tags_id3"
"id12","","Tags_id1"
"","","Tags_id3"
"id3353","contact_id8764","Tags_id5"
"id355","contact_id16","Tags_id6"
"","","Tags_id7"
"","","Tags_id3"
"","contact_id564","Tags_id2"
"","","Tags_id12"
"id12076","contact_id137","Tags_id7"
"","","Tags_id3"
"","","Tags_id5"
"","","Tags_id1"
Expected result:
Contact_id | Tags_id |
---|---|
"contact_id8764" | "Tags_id5" |
"contact_id16" | "Tags_id6,Tags_id7,Tags_id3,Tags_id2,Tags_id12" |
"contact_id564" | "Tags_id6,Tags_id7,Tags_id3,Tags_id2,Tags_id12" |
"contact_id137" | "Tags_id7,Tags_id3,Tags_id5,Tags_id1" |
... | ... |
Expected result as plain text:
Contact_id,Tags_id
"contact_id8764","Tags_id5"
"contact_id16","Tags_id6,Tags_id7,Tags_id3,Tags_id2,Tags_id12"
"contact_id564","Tags_id6,Tags_id7,Tags_id3,Tags_id2,Tags_id12"
"contact_id137","Tags_id7,Tags_id3,Tags_id5,Tags_id1"
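- First, delete every row that has an ID but no Contact_id (for example, the row with id114).
- Second, delete all rows below an ID that has no Contact_id (such as the rows under id12), up to the next ID (id3353).
- Third, if both ID and Contact_id are present, collect the tags from the rows below until the next ID. Add the same set of tags to every Contact_id under that ID (contact_id16 and contact_id564 get the same tags, which belong to id355).
- Fourth, remove the ID column.
I tried this with vim macros, without success. I know awk is probably better suited to the task, but I am still learning it and could not get it to work either. Maybe there is another way to solve this. I hope someone can help.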
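Since the question leaves the choice of tool open, the four rules can also be sketched in Python with the standard `csv` module. This is only an illustration of the rules as stated, not one of the posted answers: the names `collect_tags`, `to_csv` and `SAMPLE` are made up for the example, and it assumes a non-empty ID always starts a new block.

```python
import csv
import io

SAMPLE = '''\
ID,Contact_id,Tags_id
"id114","","Tags_id3"
"id12","","Tags_id1"
"","","Tags_id3"
"id3353","contact_id8764","Tags_id5"
"id355","contact_id16","Tags_id6"
"","","Tags_id7"
"","","Tags_id3"
"","contact_id564","Tags_id2"
"","","Tags_id12"
"id12076","contact_id137","Tags_id7"
"","","Tags_id3"
"","","Tags_id5"
"","","Tags_id1"
'''


def collect_tags(csv_text):
    """Return [(contact_id, [tag, ...]), ...] in input order."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)           # skip the ID,Contact_id,Tags_id header
    results = []                 # finished (contact, tags) pairs
    contacts, tags = [], []      # buffer for the ID block being read
    keep = False                 # True while inside a block whose ID row has a contact

    def flush():
        # Every contact in the block receives the block's full tag list.
        results.extend((c, list(tags)) for c in contacts)
        contacts.clear()
        tags.clear()

    for row in reader:
        if len(row) < 3:
            continue             # ignore blank or malformed lines
        row_id, contact, tag = row[:3]
        if row_id:               # assumption: a non-empty ID starts a new block
            flush()
            keep = bool(contact)  # rules 1 and 2: drop blocks whose ID row has no contact
        if not keep:
            continue
        if contact:
            contacts.append(contact)
        if tag:
            tags.append(tag)     # rule 3: collect tags until the next ID
    flush()
    return results


def to_csv(results):
    """Format like the expected output: fields quoted, tags comma-joined, no ID column."""
    out = io.StringIO()
    out.write("Contact_id,Tags_id\n")   # header stays unquoted, as in the question
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for contact, tag_list in results:
        writer.writerow([contact, ",".join(tag_list)])
    return out.getvalue()


print(to_csv(collect_tags(SAMPLE)), end="")
```

The buffer is flushed whenever a new ID appears, so contacts come out in the order they occur in the file.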
The following may be ugly, but try this.
awk -F ',' -f foo.awk input.txt
foo.awk:
NR == 1 { print $2 "," $3; next }                  # header: drop the ID column
{ gsub(/"/, "", $0) }                               # remove double quotes (fields are re-split)
$1 && ($1 != id) { flush(); id = $2 ? $1 : "" }     # when the ID changes, print out the 'buffer'
!id { next }                                        # an ID without a Contact_id was set to "" on the previous line
$2 { cids[i++] = $2 }                               # add the Contact_id if present
{ tags = tags "," $3 }                              # add the tag
END { flush() }
function flush(    k) {
    # walk the indices in order so contacts are printed in input order
    for (k = 0; k < i; k++) { print "\"" cids[k] "\",\"" substr(tags, 2) "\"" }
    delete cids; tags = ""; i = 0                   # wipe out the buffer
}
Another option:
$ awk -F, 'BEGIN {e="\"\""; OFS=FS}
NR==1 {print $2,$3; next}
$1!=e && $2==e {s=0; next}    # ID without contact: stop collecting
$1!=e && $2!=e {s=1; id=$1; c=$2; idc[id]=c; idt[id]=$3; next}
s && $2!=e {idc[id]=idc[id] FS $2}
s && $3!=e {idt[id]=idt[id] FS $3}
END {for(id in idc)
{n=split(idc[id],cs);
for(i=1;i<=n;i++) print cs[i], idt[id]}}' file
Contact_id,Tags_id
"contact_id16","Tags_id6","Tags_id7","Tags_id3","Tags_id2","Tags_id12"
"contact_id564","Tags_id6","Tags_id7","Tags_id3","Tags_id2","Tags_id12"
"contact_id8764","Tags_id5"
"contact_id137","Tags_id7","Tags_id3","Tags_id5","Tags_id1"
With this approach the order of the contacts is not preserved; if that matters, some extra bookkeeping is needed.
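The extra bookkeeping could be as small as recording each ID the first time it is seen and walking that record at the end, instead of iterating over the array keys (awk's `for (id in idc)` visits keys in an unspecified order). A minimal Python sketch of the idea, mirroring the `idc` and `idt` arrays above: the `order` list and the name `group_in_order` are illustrative additions rather than part of the answer, and IDs are assumed to be unique. Python dicts already keep insertion order, so the explicit list is there only to show what an awk version would have to track.

```python
def group_in_order(rows):
    """rows: iterable of (id, contact_id, tag) tuples with the quotes already removed."""
    idc = {}        # id -> list of contact ids (plays the role of idc in the awk script)
    idt = {}        # id -> list of tags        (plays the role of idt in the awk script)
    order = []      # ids in first-seen order: the extra bookkeeping
    current = None  # id of the block being collected, or None while skipping

    for row_id, contact, tag in rows:
        if row_id:
            if contact:
                current = row_id
                order.append(row_id)   # assumption: each id occurs only once
                idc[row_id] = []
                idt[row_id] = []
            else:
                current = None         # ID without a contact: skip this block
        if current is None:
            continue
        if contact:
            idc[current].append(contact)
        if tag:
            idt[current].append(tag)

    # Walk `order`, not the dict keys, so the output follows the input order.
    return [(c, idt[i]) for i in order for c in idc[i]]


rows = [
    ("id12", "", "Tags_id1"),
    ("id3353", "contact_id8764", "Tags_id5"),
    ("id355", "contact_id16", "Tags_id6"),
    ("", "contact_id564", "Tags_id2"),
]
for contact, tag_list in group_in_order(rows):
    print(contact, ",".join(tag_list))
```

In awk the same effect would need a second, numerically indexed array filled as each new ID is met and then looped over by index in the `END` block.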