How to replace a delimiter that appears inside quoted fields as part of the data in a file

I want to replace the delimiter wherever it appears as part of the data in a record. For example:

echo '"hi","how,are,you","bye"' | sed -nE 's/"([^,]*),([^,]*),([^,]*)"/"\1;\2;\3"/gp'

Output:

"hi","how;are;you","bye"

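So I am able to replace the delimiter inside the data (a comma in this case) with a semicolon. The challenge is that we don't know how many times the delimiter will occur in real data, and it can also occur in more than one field. For example: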

"1","2,3,4,5","6","7,8"

"1","2,4,5","6","7,8,9"

"1","4,5","6","7,8,9.2"

These are all valid records. Can someone help me out here? How can we write generic code to handle this?

Assuming the data does not contain embedded double quotes...

Sample data:

$ cat delim.dat
"hi","how,are,you","bye"
"1","2,3,4,5","6","7,8"
"1","2,4,5","6","7,8,9"
"1","4,5","6","7,8,9.2"

One awk idea, where we replace , with ; in the even-numbered fields:

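awk '
BEGIN { FS=OFS="\"" }                               # split and rejoin records on the double quote
      { for (i=2;i<=NF;i=i+2) gsub(",",";",$i) }    # even-numbered fields hold the quoted values
1                                                   # print the rebuilt record
' delim.dat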

This generates:

"hi","how;are;you","bye"
"1","2;3;4;5","6","7;8"
"1","2;4;5","6","7;8;9"
"1","4;5","6","7;8;9.2"

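To see why the even-numbered fields are the ones to edit, it can help to print how awk splits a record when the double quote is the field separator. This is only a minimal sketch; `show_fields` is a hypothetical helper name used for illustration, and it relies on the same assumption as above (no embedded double quotes in the data):

```shell
# Hypothetical helper: print every field awk sees when FS is the double quote.
show_fields() {
  awk -F'"' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }'
}

printf '%s\n' '"hi","how,are,you","bye"' | show_fields
# 1=[]
# 2=[hi]
# 3=[,]
# 4=[how,are,you]
# 5=[,]
# 6=[bye]
# 7=[]
```

The odd-numbered fields hold only what sits between quoted values (the real separators, plus the empty strings before the first quote and after the last), so running gsub on the even-numbered fields alone leaves the record structure untouched.
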
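For anything but the most trivial CSV data, I prefer to use something that understands the format directly instead of messing around with regular expressions to try to handle things like quoted fields. For example (warning: blatant self-promotion ahead!), my tcl-based awk-like utility tawk, which I wrote in part to make it easier to manipulate CSV files: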

 $ tawk -csv -quoteall '
  line {
    for {set n 1} {$n <= $NF} {incr n} {
      set F($n) [string map {, \;} $F($n)]
    }
    print
  }' input.csv
"hi","how;are;you","bye"
"1","2;3;4;5","6","7;8"
"1","2;4;5","6","7;8;9"
"1","4;5","6","7;8;9.2"

Or a perl approach using the Text::CSV_XS module:

$ perl -MText::CSV_XS -e '
  my $csv = Text::CSV_XS->new({binary=>1, always_quote=>1});
  while (my $row = $csv->getline(\*STDIN)) {
    tr/,/;/ foreach @$row;
    $csv->say(\*STDOUT, $row);
  }' < input.csv
"hi","how;are;you","bye"
"1","2;3;4;5","6","7;8"
"1","2;4;5","6","7;8;9"
"1","4;5","6","7;8;9.2"