How can I extract a specific part of a text that is separated by "----" using regex in Python?
'----
Airport SPQU :S16:20:25.6431 W071:34:22.3800 8338ft
Country Name="Peru"
State Name=""
City Name="Arequipa"
Airport Name="Rodriguez Ballon"
in file: ORBX\FTX_VECTOR\FTX_VECTOR_AEC\scenery\AEC_SPQU.bgl
----
Airport SPRF :S14:15:59.9484 W070:27:59.9997 14419ft
Country Name="Peru"
State Name=""
City Name="San Rafael"
Airport Name="San Rafael"
in file: Scenery04\scenery\APX29370.bgl
Start 12 : S14:15:40.9653 W070:28:38.3900 14419ft Hdg: 117.0T, Length 8760ft
Start 30 : S14:16:18.9314 W070:27:21.6092 14419ft Hdg: 297.0T, Length 8760ft
0120 Lat -14.261198 Long -70.477715 Alt 14419 Hdg 120 Len 8760 Wid 98
0300 Lat -14.272106 Long -70.455620 Alt 14419 Hdg 300 Len 8760 Wid 98
----
Airport TNCB :N12:08:25.5567 W068:16:34.3503 20ft
Country Name="Netherlands Antilles"
State Name=""
City Name="Bonaire I"
Airport Name="Flamingo"
in file: Scenery03\scenery\APX29270.bgl
Start 10 : N12:08:23.2891 W068:17:16.0525 20ft Hdg: 92.0T, Length 9448ft
Start 28 : N12:08:20.1144 W068:15:43.9767 20ft Hdg: 272.0T, Length 9448ft
0100 Lat 12.139818 Long -68.288246 Alt 20 Hdg 100 Len 9448 Wid 148
0280 Lat 12.138905 Long -68.261757 Alt 20 Hdg 280 Len 9448 Wid 148
----
Airport TNCC :N12:11:20.0649 W068:57:34.8897 29ft
Country Name="Netherlands Antilles"
State Name=""
City Name="Curacao I"
Airport Name="Willemstad-Hato Intl."
in file: Scenery03\scenery\APX29270.bgl
Start 11 : N12:11:30.5607 W068:58:24.9607 29ft Hdg: 102.1T, Length 11186ft
Start 29 : N12:11:08.2410 W068:56:38.2654 29ft Hdg: 282.1T, Length 11186ft
0110 Lat 12.191923 Long -68.974129 Alt 29 Hdg 111 Len 11186 Wid 197 ILS 111.90, Flags: GS DME BC
0290 Lat 12.185513 Long -68.943428 Alt 29 Hdg 291 Len 11186 Wid 197
----
Airport TNCE :N17:29:32.4738 W062:58:29.8992 129ft
Country Name="Netherlands Antilles"
State Name=""
City Name="St Eustatius I"
Airport Name="F.D. Roosevelt"
in file: ORBX\FTX_OLC\FTX_VECTOR_FixedAPT\scenery\APT_TNCE.BGL
Start 6 : N17:29:35.1949 W062:59:02.6666 129ft Hdg: 50.3T, Length 4268ft
Start 24 : N17:30:00.9808 W062:58:30.1439 129ft Hdg: 230.2T, Length 4268ft
0060 Lat 17.492956 Long -62.984272 Alt 129 Hdg 63 Len 4268 Wid 98
0240 Lat 17.500425 Long -62.974819 Alt 129 Hdg 243 Len 4268 Wid 98
----
Airport TNCM :N18:02:27.0378 W063:06:34.2595 13ft
Country Name="Netherlands Antilles"
State Name=""
City Name="St Maarten I"
Airport Name="Princess Juliana Intl"
in file: Scenery03\scenery\APX31250.bgl
Start 9 : N18:02:21.9843 W063:07:08.8215 13ft Hdg: 81.7T, Length 7150ft
Start 27 : N18:02:31.8322 W063:05:57.8823 13ft Hdg: 261.7T, Length 7150ft
0090 Lat 18.039392 Long -63.119469 Alt 13 Hdg 95 Len 7150 Wid 148
0270 Lat 18.042223 Long -63.099060 Alt 13 Hdg 275 Len 7150 Wid 148
----'
This is part of my text. I am trying to extract this section:
'----
Airport TNCB :N12:08:25.5567 W068:16:34.3503 20ft
Country Name="Netherlands Antilles"
State Name=""
City Name="Bonaire I"
Airport Name="Flamingo"
in file: Scenery03\scenery\APX29270.bgl
Start 10 : N12:08:23.2891 W068:17:16.0525 20ft Hdg: 92.0T, Length 9448ft
Start 28 : N12:08:20.1144 W068:15:43.9767 20ft Hdg: 272.0T, Length 9448ft
0100 Lat 12.139818 Long -68.288246 Alt 20 Hdg 100 Len 9448 Wid 148
0280 Lat 12.138905 Long -68.261757 Alt 20 Hdg 280 Len 9448 Wid 148
----'
I tried this regex pattern, but it matches from the very beginning of the text up to the end of the section I want:
----.+?TNCB.+?----
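As I said, the match runs from the start of the text to the end of the expected result. It correctly stops at the first "----" after the string "TNCB", but it does not start at the nearest "----" before it. How can I fix this? How can I make the match start at the four "-" characters immediately before "TNCB"?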
import re

airport_tuple = ('TNCB', 'RPUJ', '00IS', 'WALQ')


def read_text():
    # Read the file as one string. Joining readlines() with " " would put a
    # space in front of every line, so ^---- would no longer match.
    with open("symbols.txt", "r") as f:
        return f.read()


def main():
    text = read_text()
    print(re.findall(r"(?m)^----\n(Airport\s+TNCB.*(?:\n.*)*?)\n----", text))


if __name__ == "__main__":
    main()
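You can use
(?m)^----\n(Airport\s+TNCB.*(?:\n.*)*?)\n----
See the regex demo.
Details:
(?m)^ - start of a line ((?m) is equivalent to re.M / re.MULTILINE)
----\n - ---- and a newline
(Airport\s+TNCB.*(?:\n.*)*?) - Group 1:
    Airport\s+TNCB - Airport, one or more whitespace characters, TNCB
    .* - the rest of the line
    (?:\n.*)*? - zero or more occurrences (as few as possible) of a newline followed by the rest of that line
\n---- - a newline and the ---- substring.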
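To make the difference from the original attempt concrete, here is a minimal sketch on a shortened sample. The `sample` string is a trimmed, made-up excerpt of the question's data, and the sketch assumes the original pattern was run with `re.S` so that `.` can cross newlines.

```python
import re

# Shortened sample built from the question's data (for illustration only).
sample = (
    "----\n"
    "Airport SPQU :S16:20:25.6431 W071:34:22.3800 8338ft\n"
    'City Name="Arequipa"\n'
    "----\n"
    "Airport TNCB :N12:08:25.5567 W068:16:34.3503 20ft\n"
    'City Name="Bonaire I"\n'
    "----\n"
    "Airport TNCC :N12:11:20.0649 W068:57:34.8897 29ft\n"
    "----\n"
)

# Original attempt: the engine starts at the leftmost "----", and the lazy
# ".+?" keeps expanding until it reaches "TNCB", so the SPQU block is included.
lazy = re.search(r"----.+?TNCB.+?----", sample, re.S)

# Anchored pattern: "Airport\s+TNCB" must come directly after a delimiter line.
anchored = re.search(r"^----\n(Airport\s+TNCB.*(?:\n.*)*?)\n----", sample, re.M)

print(lazy.group(0).startswith("----\nAirport SPQU"))  # True
print(anchored.group(1))
```

A lazy quantifier only limits how far the match extends to the right; it does not move the starting position, which is why the match has to be tied to the delimiter line that directly precedes `Airport TNCB`.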
In Python, you can use
re.findall(r'^----\n(Airport\s+TNCB.*(?:\n.*)*?)\n----', text, re.M)
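Since the question's code defines a tuple of several airport codes, the same pattern could be built per code. The following is only a sketch under stated assumptions: `extract_airports` is a hypothetical helper name, the `\b` after the code is an addition not present in the pattern above, and the `text` string is a made-up stand-in for the contents of symbols.txt.

```python
import re


def extract_airports(text, codes):
    """Return {code: block} for every code whose block is found in text."""
    blocks = {}
    for code in codes:
        # re.escape guards against codes that contain regex metacharacters;
        # \b keeps a code from matching as the prefix of a longer identifier.
        pattern = rf"^----\n(Airport\s+{re.escape(code)}\b.*(?:\n.*)*?)\n----"
        match = re.search(pattern, text, re.M)
        if match:
            blocks[code] = match.group(1)
    return blocks


# Made-up stand-in for the file contents.
text = (
    "----\n"
    "Airport SPQU :S16:20:25.6431 W071:34:22.3800 8338ft\n"
    'City Name="Arequipa"\n'
    "----\n"
    "Airport TNCB :N12:08:25.5567 W068:16:34.3503 20ft\n"
    'City Name="Bonaire I"\n'
    "----\n"
)

result = extract_airports(text, ('TNCB', 'RPUJ', '00IS', 'WALQ'))
for code, block in result.items():
    print(code)
    print(block)
```

Codes that do not occur in the text are simply left out of the returned dictionary. Using `re.search` once per code, rather than one `re.findall` over an alternation, avoids missing adjacent blocks that share a `----` delimiter line.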