Extract the strings before and after a keyword in Linux, using a special character as the start and end delimiter

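I have a log file like the one shown below. Its formatting is not consistent. I have been able to remove all the unnecessary newlines so that each warning is on its own line.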

Warning: Variants 'aaa8212' and 'bbb2388_ver2' have the same position.
Warning: Variants 'ccc9186' and 
'ddd225581' have the same position.
Warning: Variants 'e223513' 
and 'ffff13855' have the same position.
Warning: Variants 'gg08395' and 'hhh34224' have the
same 
position.
Warning: Variants 'iii454353428' and 'jjjjjj82428' have the same
position.
Warning: 6000 het. haploid genotypes present (see Tet_merged.hh ); many
commands treat these as missing.
Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands 
treat these as missing.

My output looks like this:

Warning: Variants 'aaa8212' and 'bbb2388_ver2' have the same position.
Warning: Variants 'ccc9186' and 'ddd225581' have the same position.
Warning: Variants 'e223513' and 'ffff13855' have the same position.
Warning: Variants 'gg08395' and 'hhh34224' have the same position.
Warning: Variants 'iii454353428' and 'jjjjjj82428' have the same position.

To get this output, I used the following command:

cat Test_lines.txt | grep "'" | awk '/position\.$/ {print; next} {printf "%s ", $0}' Test_lines.txt

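First, I extracted the warning lines that contain a single quote (the ones I am interested in). Then I printed the lines that end with "position." as they are and removed the extra newlines from all the others.

However, for each warning line I want to extract the strings between single quotes before and after the word "and". In this case, the desired output would be: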

'aaa8212' and 'bbb2388_ver2'
'ccc9186' and 'ddd225581'
'e223513' and 'ffff13855'
'gg08395' and 'hhh34224'
'iii454353428' and 'jjjjjj82428'

For this last step, I tried the following syntax:

cat Test_lines.txt | grep "'" | grep -o -P '.{0,3} and .{0,4}'

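But this syntax relies on character positions rather than on delimiters such as the single quote. Is there any way to replace the fixed positions with the nth occurrence of a specific delimiter, in this case the single quote?

Thank you very much,

Best,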

Yatrosin

You can pipe the output of awk to grep -o "'.*'", so the command becomes:

awk '/position\.$/ {print; next} {printf "%s ", $0}' Test_lines.txt | grep -o "'.*'"

Full example:

echo "Warning: Variants 'aaa8212' and 'bbb2388_ver2' have the same position.
Warning: Variants 'ccc9186' and 'ddd225581' have the same position.
Warning: Variants 'e223513' and 'ffff13855' have the same position.
Warning: Variants 'gg08395' and 'hhh34224' have the same position.
Warning: Variants 'iii454353428' and 'jjjjjj82428' have the same position.
" | grep "'" | awk '/position\.$/ {print; next} {printf "%s ", $0}' | grep -o "'.*'"

Output:

'aaa8212' and 'bbb2388_ver2'
'ccc9186' and 'ddd225581'
'e223513' and 'ffff13855'
'gg08395' and 'hhh34224'
'iii454353428' and 'jjjjjj82428'
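
Note that '.*' is greedy: it matches from the first single quote on a line to the last one, which works here because each warning contains exactly two quoted names. If you would rather pick the names by counting delimiters, as the question asks, one possible sketch is to split each line on the single quote with awk -F. This assumes every warning is already on one line; the function name quoted_pair is only illustrative and is not part of the original commands.

```shell
# Hypothetical helper: print the two quoted names from each one-line warning.
# With the single quote as field separator, $2 is the first name and $4 the second.
# NF >= 5 skips lines that do not contain two complete quoted strings.
quoted_pair() {
    awk -F"'" 'NF >= 5 { printf "\047%s\047 and \047%s\047\n", $2, $4 }' "$@"
}

# Small demonstration on two sample lines (the second has no quotes and is skipped).
printf '%s\n' \
  "Warning: Variants 'aaa8212' and 'bbb2388_ver2' have the same position." \
  "Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands treat these as missing." \
  | quoted_pair
```

Here \047 is the octal escape for the single quote, which avoids quoting problems inside the single-quoted awk program.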

Using a single GNU awk command:

awk -v RS='\\<position\\.' \
'/\047/{ gsub(/^[^\047]+|\n+|[^\047]+$/, ""); print $0 }' Test_lines.txt

Output:

'aaa8212' and 'bbb2388_ver2'
'ccc9186' and 'ddd225581'
'e223513' and 'ffff13855'
'gg08395' and 'hhh34224'
'iii454353428' and 'jjjjjj82428'
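
If GNU awk is not available (a regular expression as RS is an extension), a rough portable sketch is to start a new record at every line beginning with "Warning:", join the wrapped lines, and then cut out the quoted pair with match(). This assumes every warning starts with "Warning:" at the beginning of a line, as in the sample log; the names extract_pairs, emit, rec and q are made up for this illustration.

```shell
# Hypothetical helper: works on the raw, wrapped log (file arguments or stdin).
extract_pairs() {
    awk -v q="'" '
        # Print the quoted pair from the accumulated record, if it has one.
        function emit() {
            gsub(/[[:space:]]+/, " ", rec)      # collapse whitespace left over from joining
            pat = q "[^" q "]*" q " and " q "[^" q "]*" q
            if (match(rec, pat))
                print substr(rec, RSTART, RLENGTH)
            rec = ""
        }
        /^Warning:/ { if (rec != "") emit() }   # a new warning begins: flush the previous one
        { rec = rec " " $0 }                    # append the current line to the record
        END { if (rec != "") emit() }           # flush the last warning
    ' "$@"
}

# Small demonstration on a wrapped warning.
printf '%s\n' \
  "Warning: Variants 'e223513' " \
  "and 'ffff13855' have the same position." \
  | extract_pairs
```

On the real file this would be called as extract_pairs Test_lines.txt. Passing the quote in through -v q="'" keeps the awk program free of literal single quotes.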