Regex .search with grouping is not collecting groups

I am trying to search the following list:

/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/2_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/3_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/6_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/7_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/8_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/2_p/

using this code:

next_page = re.compile(r'/(\d+)_p/$')
matches = list(filter(next_page.search, href_search)) #search or .match

for match in matches:
    #refining_nextpage = re.compile()
    print(match.group())

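and I get the following error: AttributeError: 'str' object has no attribute 'group'

I thought the parentheses around \d+ would group the one or more digits. My goal is to get the number before "_p/" at the end of each string.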

You can try this (assuming href_search is a single multi-line string rather than a list):

import re

# add re.M to match the end of each line
next_page = re.compile(r'/(\d+)_p/$',  re.M)
matches = next_page.findall(href_search)
print(matches)

This gives:

['2', '3', '6', '7', '8', '2']

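You are filtering the original list, so what you get back are the original strings, not match objects. If you want match objects, you need to map the search over the list and then filter out the None results. For example: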

next_page = re.compile(r'/(\d+)_p/$')
matches = filter(lambda m:m is not None, map(next_page.search, href_search))

for match in matches:
    #refining_nextpage = re.compile()
    print(match.group())

Output:

/2_p/
/3_p/
/6_p/
/7_p/
/8_p/
/2_p/

If you only want the numeric part of the match, use match.group(1) instead of match.group().
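Putting the two points together, here is a minimal sketch, assuming href_search is a list of strings as in the question. The shortened paths below are made up for illustration and are not the asker's real data:

```python
import re

# Hypothetical, shortened sample data standing in for href_search
href_search = [
    "/for_sale/9_zm/2_p/",
    "/for_sale/9_zm/3_p/",
    "/for_sale/9_zm/",       # no page suffix, so search() returns None
    "/for_sale/9_zm/6_p/",
]

next_page = re.compile(r"/(\d+)_p/$")

# map() yields a match object or None per string; drop the None results
matches = [m for m in map(next_page.search, href_search) if m is not None]

# group(1) is the first capture group, i.e. only the digits
pages = [m.group(1) for m in matches]
print(pages)
```

The captured values are strings, so wrap them in int() if you need to do arithmetic on the page numbers.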

I think re.findall should do it, provided href_search is one multi-line string and next_page was compiled with re.M:

next_page.findall(href_search)  # ['2', '3', '6', '7', '8', '2']

Alternatively, you can split the string into lines and search each one individually:

matches = []
for line in href_search.splitlines():
    match = next_page.search(line)
    if match:
        matches.append(match.group(1))

matches  # ['2', '3', '6', '7', '8', '2']

The filter function only drops the items that do not match the regex, and it returns strings, for example:

>>> example = ["abc", "def", "ghi", "123"]
>>> my_match = re.compile(r"\d+$")
>>> list(filter(my_match.search, example))
['123']

If you want match objects, a list comprehension will do the trick:

>>> example = ["abc", "def45", "67ghi", "123"]
>>> my_match = re.compile(r"\d+$")
>>> [my_match.search(line) for line in example]  # Get the matches
[None,
 <re.Match object; span=(3, 5), match='45'>,
 None,
 <re.Match object; span=(0, 3), match='123'>]
>>> [match.group() for match in [my_match.search(line) for line in example] if match is not None]  # Filter None values
['45', '123']
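The last line above runs the search inside a nested comprehension. As a possible alternative, assuming Python 3.8 or later, an assignment expression lets you search each string once and filter in the same comprehension. This is only a sketch of that variant:

```python
import re

example = ["abc", "def45", "67ghi", "123"]
my_match = re.compile(r"\d+$")

# Search each string once; keep the matched text only when there is a match
found = [m.group() for line in example if (m := my_match.search(line))]
print(found)
```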

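You can use the regex (?<=\/)\d+(?=\_p\/$). See regex101 for an example.

Explanation:

(?<=\/) : lookbehind for /

\d+ : match one or more digits

(?=\_p\/$) : lookahead for _p/ at the end of the string

If it matches, only the \d+ value is returned.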
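As a quick check of that behaviour, here is a small sketch on a single string. The path is a made-up, shortened version of the ones in the question, and the backslashes before / and _ are dropped because Python does not need them:

```python
import re

# Lookbehind for "/", digits, lookahead for "_p/" at the end of the string
pattern = re.compile(r"(?<=/)\d+(?=_p/$)")

# Hypothetical shortened path; it contains other digit runs that must not match
url = "/for_sale/0-150000_price/9_zm/7_p/"

m = pattern.search(url)
# The lookarounds are not part of the match, so group() is just the digits
page = m.group() if m else None
print(page)
```

Because nothing is captured in a group, match.group() and re.findall both give the bare number here.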

You can write code that grabs all the data at once, or you can loop over the lines one by one and pick out what you need.

Here is code for both:

text_line = '''/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/2_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/3_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/6_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/7_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/8_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/2_p/'''

import re
for txt in text_line.split('\n'):
    t = re.findall(r'(?<=\/)\d+(?=\_p\/$)',txt)
    print (t)

t = re.findall(r'(?<=\/)\d+(?=\_p\/)',text_line)
print (t)

The first part works line by line; the second grabs everything in one go.

The output of both is:

Line by line:

['2']
['3']
['6']
['7']
['8']
['2']

All at once:

['2', '3', '6', '7', '8', '2']

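For the second one I left out the $ sign because we need to grab all of them at once: without re.M, $ only matches at the very end of the whole string, so only the last line would be found.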
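If you would rather keep the $ anchor when scanning the whole text, passing re.M is another option. The sketch below uses made-up, shortened paths to compare the two behaviours:

```python
import re

# Hypothetical shortened sample: three paths in one multi-line string
text = "/a/2_p/\n/a/3_p/\n/a/6_p/"

pattern = r"/(\d+)_p/$"

# Without re.M, $ only matches at the end of the whole string
without_flag = re.findall(pattern, text)

# With re.M, $ also matches at the end of every line
with_flag = re.findall(pattern, text, re.M)

print(without_flag)
print(with_flag)
```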