Extract string before and after keyword using special character as start and end positions to be extracted in Linux

Question

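I have a log file like the one shown below. Its formatting is not consistent at all. I have been able to remove all the unnecessary newlines so that each warning sits on a single line.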

Warning: Variants 'aaa8212' and 'bbb2388_ver2' have the same position.
Warning: Variants 'ccc9186' and 
'ddd225581' have the same position.
Warning: Variants 'e223513' 
and 'ffff13855' have the same position.
Warning: Variants 'gg08395' and 'hhh34224' have the
same 
position.
Warning: Variants 'iii454353428' and 'jjjjjj82428' have the same
position.
Warning: 6000 het. haploid genotypes present (see Tet_merged.hh ); many
commands treat these as missing.
Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands 
treat these as missing.

My output looks like this:

Warning: Variants 'aaa8212' and 'bbb2388_ver2' have the same position.
Warning: Variants 'ccc9186' and 'ddd225581' have the same position.
Warning: Variants 'e223513' and 'ffff13855' have the same position.
Warning: Variants 'gg08395' and 'hhh34224' have the same position.
Warning: Variants 'iii454353428' and 'jjjjjj82428' have the same position.

To get this output, I used the following command:

cat Test_lines.txt | grep "'" | awk '/position\.$/ {print; next} {printf "%s ", $0}' Test_lines.txt

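First, I extracted the warning lines that contain a single quote (the ones I am interested in). Then I printed the lines ending in "position." and removed all the other extra newlines.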

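However, for each warning line I would like to extract the strings between single quotes before and after the word "and". In this case, the desired output would be: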

'aaa8212' and 'bbb2388_ver2'
'ccc9186' and 'ddd225581'
'e223513' and 'ffff13855'
'gg08395' and 'hhh34224'
'iii454353428' and 'jjjjjj82428'

For this last step, I tried the following syntax:

cat Test_lines.txt | grep "'" | grep -o -P '.{0,3} and .{0,4}'

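But this syntax relies on character positions rather than on delimiters such as the single quote. Is there a way to replace the fixed positions with the nth occurrence of a specific delimiter, in this case the single quote?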

Thanks a lot,

Best,

Yatrosin

Answer 1

You can pipe the output of awk into grep -o "'.*'", so the command becomes:

awk '/position\.$/ {print; next} {printf "%s ", $0}' Test_lines.txt | grep -o "'.*'"

Full example:

echo "Warning: Variants 'aaa8212' and 'bbb2388_ver2' have the same position.
Warning: Variants 'ccc9186' and 'ddd225581' have the same position.
Warning: Variants 'e223513' and 'ffff13855' have the same position.
Warning: Variants 'gg08395' and 'hhh34224' have the same position.
Warning: Variants 'iii454353428' and 'jjjjjj82428' have the same position.
" -n | grep "'" | awk '/position\.$/ {print; next} {printf "%s ", $0}' | grep -o "'.*'"

Output:

'aaa8212' and 'bbb2388_ver2'
'ccc9186' and 'ddd225581'
'e223513' and 'ffff13855'
'gg08395' and 'hhh34224'
'iii454353428' and 'jjjjjj82428'
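Note that `'.*'` is greedy: it matches from the first single quote on a line to the last one. That is fine here because each joined warning contains exactly two quoted names. If a line could contain further quotes, a stricter pattern anchored on the delimiters may be safer. The following is only a minimal sketch, assuming the warnings have already been joined onto single lines; `extract_pairs` is a hypothetical helper name introduced for illustration.

```shell
# extract_pairs: hypothetical helper, not part of the original answer.
# [^']* matches any run of characters that are not a single quote, so each
# quoted name stops at its own closing quote instead of the last quote on
# the line.
extract_pairs() {
    grep -o "'[^']*' and '[^']*'" "$@"
}

# Usage: read joined warnings from stdin (or pass file names as arguments).
printf '%s\n' \
    "Warning: Variants 'aaa8212' and 'bbb2388_ver2' have the same position." \
    "Warning: Variants 'ccc9186' and 'ddd225581' have the same position." |
    extract_pairs
```

Lines without a quoted pair around " and " produce no output, so a separate grep "'" filter is not needed with this pattern.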

Answer 2

Using a single GNU awk command:

awk -v RS='\<position\.' \
'/\047/{ gsub(/^[^\047]+|\n+|[^\047]+$/, ""); print $0 }' Test_lines.txt

Output:

'aaa8212' and 'bbb2388_ver2'
'ccc9186' and 'ddd225581'
'e223513' and 'ffff13855'
'gg08395' and 'hhh34224'
'iii454353428' and 'jjjjjj82428'
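The question also asks about addressing the nth occurrence of a delimiter. One way to sketch that idea is to make the single quote the awk field separator: the text between the 1st and 2nd quote is then field 2, and the text between the 3rd and 4th quote is field 4. This is only an illustration under the assumption that each warning is already on one line; `quoted_fields` is a hypothetical helper name.

```shell
# quoted_fields: hypothetical helper, not part of the original answers.
# -F"'" splits each line on single quotes; a line with four quotes has
# five fields, and $2 and $4 hold the two quoted names.
# \047 is the octal escape for a single quote inside the awk program.
quoted_fields() {
    awk -F"'" 'NF >= 5 { printf "\047%s\047 and \047%s\047\n", $2, $4 }' "$@"
}

# Usage: read joined warnings from stdin (or pass file names as arguments).
printf '%s\n' \
    "Warning: Variants 'gg08395' and 'hhh34224' have the same position." \
    "Warning: Nonmissing nonmale Y chromosome genotype(s) present." |
    quoted_fields
```

The `NF >= 5` guard skips lines with fewer than four single quotes, such as the warnings that do not mention variants.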
