Extract the strings before and after a keyword in Linux, using a special character as the start and end delimiter
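I have a log file that looks like the one below. Its formatting is inconsistent. I have been able to remove all the unnecessary newlines so that each warning is on its own line.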
Warning: Variants 'aaa8212' and 'bbb2388_ver2' have the same position.
Warning: Variants 'ccc9186' and
'ddd225581' have the same position.
Warning: Variants 'e223513'
and 'ffff13855' have the same position.
Warning: Variants 'gg08395' and 'hhh34224' have the
same
position.
Warning: Variants 'iii454353428' and 'jjjjjj82428' have the same
position.
Warning: 6000 het. haploid genotypes present (see Tet_merged.hh ); many
commands treat these as missing.
Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands
treat these as missing.
My output looks like this:
Warning: Variants 'aaa8212' and 'bbb2388_ver2' have the same position.
Warning: Variants 'ccc9186' and 'ddd225581' have the same position.
Warning: Variants 'e223513' and 'ffff13855' have the same position.
Warning: Variants 'gg08395' and 'hhh34224' have the same position.
Warning: Variants 'iii454353428' and 'jjjjjj82428' have the same position.
To get this output, I used the following command:
cat Test_lines.txt | grep "'" | awk '/position\.$/ {print; next} {printf "%s ", $0}' Test_lines.txt
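First, I extract the warning lines that contain a single quote (the ones I am interested in). Then I print unchanged only the lines that end with "position." and strip the extra newline from all the others.
However, for each warning line I would like to extract the strings between single quotes before and after the "and" string. In this case, the desired output would be: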
'aaa8212' and 'bbb2388_ver2'
'ccc9186' and 'ddd225581'
'e223513' and 'ffff13855'
'gg08395' and 'hhh34224'
'iii454353428' and 'jjjjjj82428'
For this last purpose, I tried the following syntax:
cat Test_lines.txt | grep "'" | grep -o -P '.{0,3} and .{0,4}'
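But this syntax depends on character positions rather than on delimiters such as the single quote. Is there any way to replace the fixed positions with the nth occurrence of a specific delimiter, in this case the single quote?
Thanks a lot,
Best,
Yatrosin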
You can pipe the output of awk to grep -o "'.*'", so the command becomes:
cat Test_lines.txt | grep "'" |
awk '/position\.$/ {print; next} {printf "%s ", $0}' Test_lines.txt |
grep -o "'.*'"
Full example:
echo "Warning: Variants 'aaa8212' and 'bbb2388_ver2' have the same position.
Warning: Variants 'ccc9186' and 'ddd225581' have the same position.
Warning: Variants 'e223513' and 'ffff13855' have the same position.
Warning: Variants 'gg08395' and 'hhh34224' have the same position.
Warning: Variants 'iii454353428' and 'jjjjjj82428' have the same position.
" | grep "'" | awk '/position\.$/ {print; next} {printf "%s ", $0}' | grep -o "'.*'"
Output:
'aaa8212' and 'bbb2388_ver2'
'ccc9186' and 'ddd225581'
'e223513' and 'ffff13855'
'gg08395' and 'hhh34224'
'iii454353428' and 'jjjjjj82428'
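The question asks for extraction keyed on the delimiter rather than on character positions. One way to sketch that, assuming each warning is already on a single line, is to make the single quote awk's field separator: $2 is then the text between the 1st and 2nd quote, and $4 the text between the 3rd and 4th. The function name extract_pairs is hypothetical, chosen only for illustration.

```shell
# extract_pairs is a hypothetical helper name, not part of any tool.
# It expects one warning per line on stdin.
extract_pairs() {
  # -F"'" splits every line on single quotes, so $2 and $4 are the two
  # quoted names. NF >= 5 skips lines with fewer than four quotes.
  # \047 is the octal escape for a single quote inside an awk string.
  awk -F"'" 'NF >= 5 { printf "\047%s\047 and \047%s\047\n", $2, $4 }'
}

printf '%s\n' \
  "Warning: Variants 'aaa8212' and 'bbb2388_ver2' have the same position." \
  "Warning: 6000 het. haploid genotypes present; many commands treat these as missing." |
  extract_pairs
# prints: 'aaa8212' and 'bbb2388_ver2'
```

Choosing a different field number selects a different quoted string, which is the "nth occurrence of a delimiter" behaviour the question is after.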
Using a single GNU awk command:
awk -v RS='\<position\.' \
'/\047/{ gsub(/^[^\047]+|\n+|[^\047]+$/, ""); print $0 }' Test_lines.txt
Output:
'aaa8212' and 'bbb2388_ver2'
'ccc9186' and 'ddd225581'
'e223513' and 'ffff13855'
'gg08395' and 'hhh34224'
'iii454353428' and 'jjjjjj82428'
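If GNU awk is not available, the joining and the extraction can also be sketched in POSIX awk. This is an alternative rather than the approach of the answer above, and it assumes that every warning starts with "Warning:" at the beginning of a line. The names join_and_extract and emit are hypothetical.

```shell
# join_and_extract is a hypothetical helper name.
# It reads the raw, wrapped log on stdin.
join_and_extract() {
  awk '
    # Print the two quoted names of the buffered warning, if it has any.
    function emit(   n, parts) {
      n = split(buf, parts, "\047")   # \047 is a single quote
      if (n >= 5)
        printf "\047%s\047 and \047%s\047\n", parts[2], parts[4]
      buf = ""
    }
    # A new warning starts: flush the previous one, then start a new buffer.
    /^Warning:/ { emit(); buf = $0; next }
    # Any other line is a continuation of the current warning.
    { buf = buf " " $0 }
    END { emit() }
  '
}

join_and_extract <<'EOF'
Warning: Variants 'ccc9186' and
'ddd225581' have the same position.
Warning: Variants 'e223513'
and 'ffff13855' have the same position.
Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands
treat these as missing.
EOF
# prints:
# 'ccc9186' and 'ddd225581'
# 'e223513' and 'ffff13855'
```

Because the names are taken from the split fields rather than from the joined text, it does not matter where inside the warning the line break fell, as long as it is not inside a quoted name.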