SED: remove the newline on lines containing a pattern

Question

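I want to remove (with sed or awk) the newline on every line that contains the character " exactly once. Once a line's newline has been removed, though, the newline of the line joined onto it should be left alone.

Here is an example: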

line1"test 2015"
line2"test
2015"
line3"test 2020"
line4"test
2017"

It should be transformed into:

line1"test 2015"
line2"test2015"
line3"test 2020"
line4"test2017"

Answer 1

With GNU awk:

awk -v RS='"' 'NR % 2 == 0 { gsub(/\n/, ""); } { printf("%s%s", $0, RT) }' filename

This is the most straightforward way. With " as the record separator,

NR % 2 == 0 {             # in every other record (those inside quotes)
  gsub(/\n/, "")          # remove the newlines
}
{ 
  printf("%s%s", $0, RT)  # then print the line terminated by the same thing
                          # as in the input (to avoid an extra quote at the
                          # end of the output)
}

RT is a GNU extension, which is why this requires gawk.
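To see why the even-numbered records are the quoted ones, it can help to print the records that RS='"' produces. The following is only a sketch: it sticks to a single-character RS, so it should not need gawk, and the function name show_records, the sample input and the [NL] marker are made up for illustration.

```shell
# Split the input on double quotes and label each record.
# [NL] is just a visible stand-in for an embedded newline.
show_records() {
  awk -v RS='"' '{
    gsub(/\n/, "[NL]")
    print NR, (NR % 2 == 0 ? "inside" : "outside"), $0
  }'
}

# Sample input: one quoted value that spans two lines.
printf 'a"b\nc"d' | show_records
```

Record 2, the text between the two quotes, is the only one that contains a newline, which is the record the gsub in the answer targets.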

With sed

The difficulty in doing this with sed is that there may be more than one newline between the quotes, e.g.

line2"test
123
2015"

This makes it brittle to fetch just one line after the condition matches. Therefore:

sed '/^[^"]*"[^"]*$/ { :a /\n.*"/! { N; ba; }; s/\n//g; }' filename

That is:

/^[^"]*"[^"]*$/ {   # When a line contains only one quote
  :a                # jump label for looping
  /\n.*"/! {        # until there appears another quote
    N               # fetch more lines
    ba
  }
  s/\n//g           # once done, remove the newlines.
}

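As a one-liner, this requires GNU sed because BSD sed is picky about the format of branching instructions. However, it should be possible to put the expanded form of the code into a file, say foo.sed, and run sed -f foo.sed filename with BSD sed.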
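As a rough sketch of that file-based form, the script can be written out with the label and the branch on lines of their own. The temporary directory, the input file name and the sample data are assumptions for the demonstration; only foo.sed comes from the text above.

```shell
# Write the expanded script to a file; the label and the branch each get
# their own line, with comments kept on separate lines.
workdir=$(mktemp -d)
cat > "$workdir/foo.sed" <<'EOF'
# When a line contains exactly one quote ...
/^[^"]*"[^"]*$/ {
  :a
  # ... fetch lines until another quote shows up ...
  /\n.*"/! {
    N
    ba
  }
  # ... then remove the embedded newlines.
  s/\n//g
}
EOF

# Sample input with a quoted value that spans three lines.
printf '%s\n' 'line1"test 2015"' 'line2"test' '123' '2015"' > "$workdir/input.txt"
joined=$(sed -f "$workdir/foo.sed" "$workdir/input.txt")
printf '%s\n' "$joined"
rm -r "$workdir"
```

The three-line value is joined into a single line, while the line that already has both quotes passes through unchanged.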

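Note that this code assumes that, after the opening quote, the next line with a quote in it contains only that one quote. If necessary, a way to work around that is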

sed ':a h; s/[^"]//g; s/""//g; /"/ { x; N; s/\n//; ba }; x' filename

...but this is arguably beyond the scope of what should reasonably be done with sed. It works like this:

:a           # jump label for looping
h            # make a copy of the line
s/[^"]//g    # isolate quotes
s/""//g      # remove pairs of quotes
/"/ {        # if there is a quote left (number of quotes is odd)
  x          # swap the unedited text back into the pattern space
  N          # fetch a new line
  s/\n//     # remove the newline between them
  ba         # loop
}
x            # swap the text back in before printing.
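The parity test at the heart of that script can be tried in isolation. This is a small sketch with a hypothetical helper name, quote_parity, and made-up sample strings; it runs only the two substitutions that isolate the quotes and cancel them in pairs.

```shell
# Reduce a line to nothing (even number of quotes) or a single " (odd number).
quote_parity() {
  printf '%s\n' "$1" | sed 's/[^"]//g; s/""//g'
}

quote_parity 'line2"test'        # one quote: prints "
quote_parity 'a"b" c"d" e"f'     # five quotes: prints "
quote_parity 'line1"test 2015"'  # two quotes: prints an empty line
```

A leftover " is what the /"/ address in the script reacts to: the quoted value is still open, so another line has to be fetched.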

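With non-GNU awk

The case of multiple quotes in a line is easier to handle in awk than in sed. The GNU awk code above handles it implicitly; with non-GNU awk, it takes a little more work (but not much):

awk -F '"' '{ n = 0; line = ""; do { n += NF != 0 ? NF - 1 : 0; line = line $0 } while(n % 2 == 1 && getline == 1); print line }' filename

The main trick is to use " as the field separator, so that the number of fields tells us how many quotes the line contains. Then: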

{
                                       # reset state
  n = 0                                # n is the number of quotes we have
                                       # seen so far
  line = ""                            # line is where we assemble the output
                                       # line

  do {
    n += NF != 0 ? NF - 1 : 0;         # add the number of quotes in the line
                                       # (special handling for empty lines
                                       # where NF == 0)
    line = line $0                     # append the line to the output
  } while(n % 2 == 1 && getline == 1)  # while the number of quotes is odd
                                       # and there's more input, get new lines
                                       # and loop

  print line                           # once done, print the combined result.
}
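The loop above can be exercised on a quoted value that spans three lines. This is a sketch under a few assumptions: the function name join_quoted and the sample input are invented for the demonstration, and getline is parenthesized only to keep the comparison unambiguous across awk implementations.

```shell
# Join lines until the running count of double quotes is even.
join_quoted() {
  awk -F '"' '{
    n = 0
    line = ""
    do {
      n += (NF != 0 ? NF - 1 : 0)   # quotes in this line = fields - 1
      line = line $0
    } while (n % 2 == 1 && (getline) == 1)
    print line
  }'
}

printf '%s\n' 'line2"test' '123' '2015"' 'line3"ok"' | join_quoted
```

The middle line 123 contains no quote, so the count stays odd and the loop keeps reading until the closing quote arrives.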

Answer 2

With sed:

sed '/[^"]$/{N;s/\n//}' file

Output:

line1"test 2015"
line2"test2015"
line3"test 2020"
line4"test2017"

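Search (//) for lines whose last character ($) is not (^) a ". Only for these lines ({}): append the next line (N) to sed's pattern space (the current line), then use sed's search and replace (s///) to find the newline (\n) now embedded in the pattern space and replace it with nothing.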
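To check which lines that address selects, it can be run with -n and p before the full command is applied. A minimal sketch, with a hypothetical sample function that prints part of the question's input:

```shell
# Part of the sample data from the question.
sample() {
  printf '%s\n' 'line1"test 2015"' 'line2"test' '2015"' 'line3"test 2020"'
}

# Print only the lines the address selects: those not ending in a double quote.
sample | sed -n '/[^"]$/p'

# Run the full command on the same input.
sample | sed '/[^"]$/{N;s/\n//}'
```

Note that the address looks only at the last character of a line, not at how many quotes the line contains, and N fetches a single line, so this relies on quoted values spanning at most two lines.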

Answer 3

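This might work for you (GNU sed):

sed -r ':a;N;s/^([^\n"]*"[^\n"]*)\n/\1 /;ta;P;D' file

This replaces the newline between two lines with a space, where the first line contains only one double quote.

N.B. The space could also be dropped, although the data suggests one.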
