Regex .search with grouping is not collecting groups
I am trying to search the following list:
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/2_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/3_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/6_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/7_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/8_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/2_p/
using this code:
next_page = re.compile(r'/(\d+)_p/$')
matches = list(filter(next_page.search, href_search))  # search or .match
for match in matches:
    # refining_nextpage = re.compile()
    print(match.group())
and I get the following error: AttributeError: 'str' object has no attribute 'group'.
I thought the parentheses around \d+ would group the one or more digits. My goal is to get the number just before "_p/" at the end of each string.
You can try this (it assumes href_search is a single multi-line string, not a list):
import re
# add re.M so that $ matches at the end of each line
next_page = re.compile(r'/(\d+)_p/$', re.M)
matches = next_page.findall(href_search)
print(matches)
This gives:
['2', '3', '6', '7', '8', '2']
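To see what re.M changes, here is a minimal sketch. The string href_text is a hypothetical, shortened stand-in for the real data, and it assumes the URLs are joined by newlines into one string.

```python
import re

# Hypothetical, shortened stand-in for a multi-line href_search string.
href_text = "/for_sale/9_zm/2_p/\n/for_sale/9_zm/3_p/\n/for_sale/9_zm/6_p/"

# Without re.M, `$` only matches at the very end of the whole string.
without_flag = re.findall(r'/(\d+)_p/$', href_text)

# With re.M, `$` also matches just before each newline.
with_flag = re.findall(r'/(\d+)_p/$', href_text, re.M)

print(without_flag)  # ['6']
print(with_flag)     # ['2', '3', '6']
```

Because the pattern has exactly one capture group, findall returns the captured digits rather than the whole matched text.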
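You are filtering the original list, so what comes back are the original strings, not match objects. If you want match objects, you need to map the search over the list and then filter out the failed matches. For example:
next_page = re.compile(r'/(\d+)_p/$')
matches = filter(lambda m: m is not None, map(next_page.search, href_search))
for match in matches:
    # refining_nextpage = re.compile()
    print(match.group())
Output: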
/2_p/
/3_p/
/6_p/
/7_p/
/8_p/
/2_p/
If you only want the numeric part of the match, use match.group(1) instead of match.group().
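As a rough sketch of the difference between group() and group(1), assuming href_search is a list of strings as in the question (href_list below is a hypothetical, shortened stand-in, with one entry that deliberately does not match):

```python
import re

# Hypothetical, shortened stand-in for the href_search list.
href_list = ["/for_sale/9_zm/2_p/", "/for_sale/9_zm/3_p/", "/for_sale/other/"]

next_page = re.compile(r'/(\d+)_p/$')

# Run the search on every string, then keep only the successful matches.
found = [m for m in map(next_page.search, href_list) if m is not None]

whole = [m.group() for m in found]   # the whole matched text
pages = [m.group(1) for m in found]  # only the first capture group

print(whole)  # ['/2_p/', '/3_p/']
print(pages)  # ['2', '3']
```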
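I think re.findall should do it, provided href_search is one multi-line string and next_page is compiled with re.M:
next_page.findall(href_search)  # ['2', '3', '6', '7', '8', '2']
Alternatively, you can split the string into lines and search each line separately: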
matches = []
for line in href_search.splitlines():
    match = next_page.search(line)
    if match:
        matches.append(match.group(1))
matches  # ['2', '3', '6', '7', '8', '2']
The filter function only drops the lines that do not match the regex and returns the remaining strings, for example:
>>> example = ["abc", "def", "ghi", "123"]
>>> my_match = re.compile(r"\d+$")
>>> list(filter(my_match.search, example))
['123']
If you want the match objects, a list comprehension will do the trick:
>>> example = ["abc", "def45", "67ghi", "123"]
>>> my_match = re.compile(r"\d+$")
>>> [my_match.search(line) for line in example] # Get the matches
[None,
<re.Match object; span=(3, 5), match='45'>,
None,
<re.Match object; span=(0, 3), match='123'>]
>>> [match.group() for match in [my_match.search(line) for line in example] if match is not None] # Filter None values
['45', '123']
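You can use the regex (?<=\/)\d+(?=\_p\/$). See the regex101 example.
Explanation:
(?<=\/) : look behind for /
\d+ : match one or more digits
(?=\_p\/$) : look ahead for _p/ at the end of the string
If it matches, only the \d+ value is returned.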
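A small sketch comparing the lookaround pattern with the capture-group pattern from the question may make this clearer. The URL is a hypothetical, shortened example; lookarounds are zero-width, so they check the surrounding text without including it in the match.

```python
import re

url = "/for_sale/9_zm/7_p/"  # hypothetical, shortened URL

# Lookbehind and lookahead only assert context; the match is the digits alone.
lookaround = re.search(r'(?<=/)\d+(?=_p/$)', url)

# The capture-group version matches the slashes and "_p" too.
capture = re.search(r'/(\d+)_p/$', url)

print(lookaround.group())  # 7
print(capture.group())     # /7_p/
print(capture.group(1))    # 7
```

Either approach gives the same digits; the lookaround form just saves the call to group(1).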
You can write code that gets all the data at once, or go through it line by line and pick out what you need.
Below is code for both:
text_line = '''/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/2_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/3_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/6_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/7_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/8_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/2_p/'''
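import re
# line by line: $ anchors the end of each individual line
for txt in text_line.split('\n'):
    t = re.findall(r'(?<=\/)\d+(?=\_p\/$)', txt)
    print(t)
# all at once: no $, so every /N_p/ segment in the text matches
t = re.findall(r'(?<=\/)\d+(?=\_p\/)', text_line)
print(t)
The first part works line by line; the second grabs everything at once.
The output of both is:
Line by line: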
['2']
['3']
['6']
['7']
['8']
['2']
All at once:
['2', '3', '6', '7', '8', '2']
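For the second one, I left out the $ sign: without re.M it would match only at the very end of the text, and we need to grab every line.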
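One caveat with dropping $: the pattern then matches a "/N_p/" segment anywhere in a URL, not only at the end. The sketch below uses a hypothetical text where one URL contains such a segment mid-path, and shows that keeping $ together with re.M is one way to stay anchored to the end of each line.

```python
import re

# Hypothetical text: the first URL also contains "/5_p/" in the middle.
text = "/a/5_p/b/2_p/\n/a/3_p/"

# No anchor: every "/N_p/" segment is picked up.
unanchored = re.findall(r'(?<=/)\d+(?=_p/)', text)

# `$` plus re.M: only segments at the end of a line are picked up.
anchored = re.findall(r'(?<=/)\d+(?=_p/$)', text, re.M)

print(unanchored)  # ['5', '2', '3']
print(anchored)    # ['2', '3']
```

For the URLs in the question both variants give the same result, since "_p/" only appears at the end of each line.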