How can I extract a specific part of a text that is separated by "----" using regex in Python?

'----
Airport SPQU :S16:20:25.6431  W071:34:22.3800  8338ft
Country Name="Peru"
State Name=""
City Name="Arequipa"
Airport Name="Rodriguez Ballon"
in file: ORBX\FTX_VECTOR\FTX_VECTOR_AEC\scenery\AEC_SPQU.bgl
----
Airport SPRF :S14:15:59.9484  W070:27:59.9997  14419ft
Country Name="Peru"
State Name=""
City Name="San Rafael"
Airport Name="San Rafael"
in file: Scenery04\scenery\APX29370.bgl
Start 12 : S14:15:40.9653  W070:28:38.3900  14419ft Hdg: 117.0T, Length 8760ft 
Start 30 : S14:16:18.9314  W070:27:21.6092  14419ft Hdg: 297.0T, Length 8760ft 
0120 Lat -14.261198 Long -70.477715 Alt 14419 Hdg 120 Len 8760 Wid 98
0300 Lat -14.272106 Long -70.455620 Alt 14419 Hdg 300 Len 8760 Wid 98
----
Airport TNCB :N12:08:25.5567  W068:16:34.3503  20ft
Country Name="Netherlands Antilles"
State Name=""
City Name="Bonaire I"
Airport Name="Flamingo"
in file: Scenery03\scenery\APX29270.bgl
Start 10 : N12:08:23.2891  W068:17:16.0525  20ft Hdg: 92.0T, Length 9448ft 
Start 28 : N12:08:20.1144  W068:15:43.9767  20ft Hdg: 272.0T, Length 9448ft 
0100 Lat 12.139818 Long -68.288246 Alt 20 Hdg 100 Len 9448 Wid 148
0280 Lat 12.138905 Long -68.261757 Alt 20 Hdg 280 Len 9448 Wid 148
----
Airport TNCC :N12:11:20.0649  W068:57:34.8897  29ft
Country Name="Netherlands Antilles"
State Name=""
City Name="Curacao I"
Airport Name="Willemstad-Hato Intl."
in file: Scenery03\scenery\APX29270.bgl
Start 11 : N12:11:30.5607  W068:58:24.9607  29ft Hdg: 102.1T, Length 11186ft 
Start 29 : N12:11:08.2410  W068:56:38.2654  29ft Hdg: 282.1T, Length 11186ft 
0110 Lat 12.191923 Long -68.974129 Alt 29 Hdg 111 Len 11186 Wid 197 ILS 111.90, Flags: GS DME BC
0290 Lat 12.185513 Long -68.943428 Alt 29 Hdg 291 Len 11186 Wid 197
----
Airport TNCE :N17:29:32.4738  W062:58:29.8992  129ft
Country Name="Netherlands Antilles"
State Name=""
City Name="St Eustatius I"
Airport Name="F.D. Roosevelt"
in file: ORBX\FTX_OLC\FTX_VECTOR_FixedAPT\scenery\APT_TNCE.BGL
Start 6 : N17:29:35.1949  W062:59:02.6666  129ft Hdg: 50.3T, Length 4268ft 
Start 24 : N17:30:00.9808  W062:58:30.1439  129ft Hdg: 230.2T, Length 4268ft 
0060 Lat 17.492956 Long -62.984272 Alt 129 Hdg 63 Len 4268 Wid 98
0240 Lat 17.500425 Long -62.974819 Alt 129 Hdg 243 Len 4268 Wid 98
----
Airport TNCM :N18:02:27.0378  W063:06:34.2595  13ft
Country Name="Netherlands Antilles"
State Name=""
City Name="St Maarten I"
Airport Name="Princess Juliana Intl"
in file: Scenery03\scenery\APX31250.bgl
Start 9 : N18:02:21.9843  W063:07:08.8215  13ft Hdg: 81.7T, Length 7150ft 
Start 27 : N18:02:31.8322  W063:05:57.8823  13ft Hdg: 261.7T, Length 7150ft 
0090 Lat 18.039392 Long -63.119469 Alt 13 Hdg 95 Len 7150 Wid 148
0270 Lat 18.042223 Long -63.099060 Alt 13 Hdg 275 Len 7150 Wid 148
----'

This is part of my text. I am trying to extract this part:

'----
Airport TNCB :N12:08:25.5567  W068:16:34.3503  20ft
Country Name="Netherlands Antilles"
State Name=""
City Name="Bonaire I"
Airport Name="Flamingo"
in file: Scenery03\scenery\APX29270.bgl
Start 10 : N12:08:23.2891  W068:17:16.0525  20ft Hdg: 92.0T, Length 9448ft 
Start 28 : N12:08:20.1144  W068:15:43.9767  20ft Hdg: 272.0T, Length 9448ft 
0100 Lat 12.139818 Long -68.288246 Alt 20 Hdg 100 Len 9448 Wid 148
0280 Lat 12.138905 Long -68.261757 Alt 20 Hdg 280 Len 9448 Wid 148
----'

I tried this regex pattern, but it extracts everything from the beginning of the text to the end of the part I want:

----.+?TNCB.+?----

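As I said, the match runs from the beginning of the text to the end of the expected result. The pattern does check for a single occurrence of "----" after the string "TNCB", but it does not limit the match to a single "----" before that string. How can I solve this? How can I make the match start at the four "-" characters immediately before "TNCB"?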

import re

airport_tuple = ('TNCB', 'RPUJ', '00IS', 'WALQ')

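def read_text():
    # Read the whole file as one string. Joining readlines() with " " would
    # put a space at the start of every line after the first, so the pattern
    # "^----\n(Airport" could never match.
    with open("symbols.txt", "r") as f:
        return f.read()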

def main():
    text = read_text()
    print(re.findall(r"(?m)^----\n(Airport\s+TNCB.*(?:\n.*)*?)\n----", text))
    

if __name__ == "__main__":
    main()

You can use

(?m)^----\n(Airport\s+TNCB.*(?:\n.*)*?)\n----

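See the regex demo.

Details:

  • (?m)^ - start of a line ((?m) is equivalent to re.M / re.MULTILINE)
  • ----\n - ---- and a newline
  • (Airport\s+TNCB.*(?:\n.*)*?) - Group 1:
    • Airport\s+TNCB - Airport, one or more whitespace characters, TNCB
    • .* - the rest of the line
    • (?:\n.*)*? - zero or more occurrences, as few as possible, of a newline followed by the rest of that line
  • \n---- - a newline and the ---- substring.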
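
To see why the pattern from the question starts too early, it may help to compare both patterns on a shortened sample. This is only a sketch: the sample text is trimmed from the question's data, and it assumes the original pattern was run with re.DOTALL (re.S) so that "." can cross newlines. A lazy quantifier only limits how far a match extends to the right; the regex engine still reports the leftmost possible starting point, which here is the first "----" in the text.

```python
import re

# Shortened sample in the same layout as the question's data.
text = (
    "----\n"
    "Airport SPQU :S16:20:25.6431  W071:34:22.3800  8338ft\n"
    'City Name="Arequipa"\n'
    "----\n"
    "Airport TNCB :N12:08:25.5567  W068:16:34.3503  20ft\n"
    'City Name="Bonaire I"\n'
    "----\n"
    "Airport TNCC :N12:11:20.0649  W068:57:34.8897  29ft\n"
    'City Name="Curacao I"\n'
    "----\n"
)

# Assumption: the question's pattern was used with re.S so "." matches newlines.
lazy = re.search(r"----.+?TNCB.+?----", text, re.S)
# The match begins at the very first "----", so the SPQU block is included.
starts_too_early = "SPQU" in lazy.group(0)

# Anchoring "----" to a line start and requiring "Airport TNCB" on the next
# line forces the match to begin at the delimiter directly before TNCB.
anchored = re.search(r"^----\n(Airport\s+TNCB.*(?:\n.*)*?)\n----", text, re.M)
block = anchored.group(1)

print(starts_too_early)
print(block)
```
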

In Python, you can use

re.findall(r'^----\n(Airport\s+TNCB.*(?:\n.*)*?)\n----', text, re.M)
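
Since the question's code also defines a tuple of several airport codes, the same pattern could be wrapped in a small helper that takes the code as a parameter. The following is a minimal sketch, not part of the original answer: the function name extract_airport_block and the sample string are hypothetical, and the added \b is an assumption meant to keep a shorter code such as "TNC" from matching "TNCB".

```python
import re

def extract_airport_block(text, code):
    """Return the block for one airport code, or None if it is not present."""
    # re.escape guards against codes containing regex metacharacters;
    # \b (an addition to the answer's pattern) requires the code to end there.
    pattern = rf"^----\n(Airport\s+{re.escape(code)}\b.*(?:\n.*)*?)\n----"
    match = re.search(pattern, text, re.M)
    return match.group(1) if match else None

# Shortened sample in the same layout as the question's data.
sample = (
    "----\n"
    "Airport SPRF :S14:15:59.9484  W070:27:59.9997  14419ft\n"
    'City Name="San Rafael"\n'
    "----\n"
    "Airport TNCB :N12:08:25.5567  W068:16:34.3503  20ft\n"
    'City Name="Bonaire I"\n'
    "----\n"
)

airport_tuple = ('TNCB', 'RPUJ', '00IS', 'WALQ')
for code in airport_tuple:
    # Codes that do not occur in the sample yield None.
    print(code, extract_airport_block(sample, code))
```

Codes that are missing from the text simply produce None, so the caller can decide whether to skip them or report them.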