Regex to remove newlines not followed by specific string
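I have a delimited data file with user entries that needs cleaning up. Specifically:
- I want to remove the newlines embedded in free-text fields
- The number of columns can vary from one line to the next
- The first field of every line should always start with the pattern "INC\d{12}" (the double quotes are part of the pattern).
- Every \n that is not immediately followed by the pattern "INC\d{12}" should be replaced with a single space
- I'm currently using Perl in cygwin (preferred), but awk or sed answers are acceptable too.
Here is some mock input data (I saved it to a file named test_input_so.txt):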
"INC000111111111", "field2", "field3"
"INC000222222222", "field2", "field3","INC000123456789 blahblah"
"INC000444444444", "fie"""ld2", "field3"
"INC000123
456789", "field2", "field3",
"INC000333333333", "INC000123456789", "field3""
"INC000555555555", "field2", "fiel
d3","field4"
Here is the desired output for the data above:
"INC000111111111", "field2", "field3"
"INC000222222222", "field2", "field3","INC000123456789 blahblah"
"INC000444444444", "fie"""ld2", "field3"
"INC000123456789", "field2", "field3",
"INC000333333333", "INC000123456789", "field3""
"INC000555555555", "field2", "field3","field4"
I've tried several combinations of negative lookaheads/lookbehinds, but I'm not sure why they don't work.
Here is one example:
perl -pe 's/\n(?!"INC\d{12})/ /g;' test_input_so.txt
It removes every \n, including, incorrectly, the ones followed by "INC123456789012", which should stay in place.
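perl -pe ... processes one line at a time, so a multi-line regex doesn't do you any good here.
Perl's -0 switch changes the input record separator (Perl's notion of a line); -0777 in particular makes Perl read the whole input as a single string that you can operate on: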
perl -0777 -pe 's/\n(?!"INC\d{12})/ /g;' test_input_so.txt
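To make the difference between the two modes concrete, here is a minimal Python sketch (not the Perl from the answer) that applies the same lookahead pattern once per line and once to the whole input. The names DATA, per_line and whole_input are made up for this illustration. One deliberate deviation: the final newline is set aside before substituting, because nothing follows it, so the negative lookahead would otherwise turn it into a space as well.

```python
import re

# Same mock data as test_input_so.txt, held in a string for the demo.
DATA = (
    '"INC000111111111", "field2", "field3"\n'
    '"INC000222222222", "field2", "field3","INC000123456789 blahblah"\n'
    '"INC000444444444", "fie"""ld2", "field3"\n'
    '"INC000123\n'
    '456789", "field2", "field3",\n'
    '"INC000333333333", "INC000123456789", "field3""\n'
    '"INC000555555555", "field2", "fiel\n'
    'd3","field4"\n'
)

# A newline that is NOT followed by the start of a new record.
PATTERN = re.compile(r'\n(?!"INC\d{12})')


def per_line(text: str) -> str:
    """Roughly what `perl -pe` does: the regex sees one line at a time."""
    # Each chunk ends in "\n" with nothing after it, so the lookahead
    # can never find an INC id and every newline is replaced.
    return "".join(PATTERN.sub(" ", line) for line in text.splitlines(keepends=True))


def whole_input(text: str) -> str:
    """Roughly what `perl -0777 -pe` does: the regex sees the whole input."""
    body = text.rstrip("\n")  # deviation: protect the final newline
    return PATTERN.sub(" ", body) + "\n"


if __name__ == "__main__":
    print(repr(per_line(DATA)))  # no newlines survive
    print(whole_input(DATA), end="")
```

On this particular mock data the whole-input version still does not reproduce the desired output exactly: the record whose id is itself broken across two lines ("INC000123 / 456789") does not match "INC\d{12}", so the newline in front of it is replaced too, and the joins get a space where the desired output has none.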
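I think what you really want to do is remove the newlines inside quoted fields.
First, you need to fix a few stray quotes so that your data is valid CSV:
- line 7: "fie"""ld2" must be "fie""ld2"
- line 11: ends with 2 double quotes
Second, don't put a space after the commas between fields: not a, b but a,b
Once you've fixed those things, you can use the Text::CSV module. The structure of this code is taken from the Text::CSV perldoc.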
perl -E '
    use Text::CSV;
    # binary => 1 allows embedded newlines; always_quote => 1 quotes every output field
    my $csv = Text::CSV->new ({ binary => 1, always_quote => 1 })
        or die "Cannot use CSV: ".Text::CSV->error_diag ();
    my $file = shift @ARGV;
    open my $fh, "<:encoding(utf8)", $file or die "Cannot open $file: $!";
    while ( my $row = $csv->getline( $fh ) ) {
        # Delete the newlines embedded in each field
        my @row = map { s/\n//g; $_ } @$row;
        $csv->combine(@row);
        my $line = $csv->string();
        # Skip rows that consist of a single empty field
        say $line if $line ne q{""};
    }
    $csv->eof or $csv->error_diag();
    close $fh;
' test_input_so.txt
"INC000111111111","field2","field3"
"INC000222222222","field2","field3","INC000123456789 blahblah"
"INC000444444444","fie""ld2","field3"
"INC000123456789","field2","field3",""
"INC000333333333","INC000123456789","field3"
"INC000555555555","field2","field3","field4"
Another Perl solution:
$ perl -0777 -ne ' while( /(^"INC00.+?)(\n"INC.*|\Z)/msg ) { $x=$1; $_=$2; $x=~s/\n//g; print "$x\n" } ' test_input_so.txt
"INC000111111111", "field2", "field3"
"INC000222222222", "field2", "field3","INC000123456789 blahblah"
"INC000444444444", "fie"""ld2", "field3"
"INC000123456789", "field2", "field3",
"INC000333333333", "INC000123456789", "field3""
"INC000555555555", "field2", "field3","field4"
$
Input:
$ cat test_input_so.txt
"INC000111111111", "field2", "field3"
"INC000222222222", "field2", "field3","INC000123456789 blahblah"
"INC000444444444", "fie"""ld2", "field3"
"INC000123
456789", "field2", "field3",
"INC000333333333", "INC000123456789", "field3""
"INC000555555555", "field2", "fiel
d3","field4"
$
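The record-splitting idea behind this one-liner can also be sketched in Python. This is a loose analogue rather than a translation: it splits the slurped text at every newline that is directly followed by "INC (quote included) and then deletes whatever newlines remain inside each record. DATA and join_records are names invented for the example.

```python
import re

# Same mock data as test_input_so.txt, held in a string for the demo.
DATA = (
    '"INC000111111111", "field2", "field3"\n'
    '"INC000222222222", "field2", "field3","INC000123456789 blahblah"\n'
    '"INC000444444444", "fie"""ld2", "field3"\n'
    '"INC000123\n'
    '456789", "field2", "field3",\n'
    '"INC000333333333", "INC000123456789", "field3""\n'
    '"INC000555555555", "field2", "fiel\n'
    'd3","field4"\n'
)


def join_records(text: str) -> str:
    """Split at newlines that start a new record, then drop the other newlines."""
    # The lookahead keeps the "INC prefix attached to the record that follows.
    records = re.split(r'\n(?="INC)', text.rstrip("\n"))
    return "".join(record.replace("\n", "") + "\n" for record in records)


if __name__ == "__main__":
    print(join_records(DATA), end="")
```

Because the split only looks for "INC and not for twelve digits, the record whose id is broken across two lines is still recognized as the start of a record. The trade-off is that a free-text field containing a newline immediately followed by "INC would be split incorrectly.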