How can I extract a specific img src url format using regex?
My string:
Russia's National Settlement Depository discusses why it believes the biggest blockchain opportunities have yet to be uncovered.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw" width="1" />|One of the co-founder of digital currency startup Stellar announced their resignation today.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0" width="1" />|The editorial board for Bloomberg News has called for a permissive regulatory environment for blockchain development.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/ooQYB2iDxP8" width="1" />|
I want to get these 3 links into a list:
http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw
http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0
http://feeds.feedburner.com/~r/CoinDesk/~4/ooQYB2iDxP8
They follow this pattern:
src="http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw"
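I know I should use re.findall(pattern, string) to do this.
But the big question is: how do I build a pattern that works here?
I'm not very good at writing regex patterns and I always get confused. The one that almost does the job is this: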
pattern = 'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
But all I get is this list:
[u'http://feeds.feedburner.com/', u'http://feeds.feedburner.com/', u'http://feeds.feedburner.com/']
It looks like the problem is the ~r part and everything after it.
Your regex is missing the ~ character:
http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+~]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+
By the way: this is a great way to test regular expressions in Python!
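To see why the single added ~ makes the difference, note that $-_ inside the character class is a range from $ to _, which already covers /, -, ., digits and uppercase letters, but not ~ or the double quote. A minimal sketch comparing both patterns follows; the text variable is an abbreviated stand-in for the string in the question (the surrounding sentences are placeholders, the URLs are the original ones).

```python
import re

# Abbreviated stand-in for the question's string (same three URLs).
text = (
    'first item.<img alt="" height="1" '
    'src="http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw" width="1" />|'
    'second item.<img alt="" height="1" '
    'src="http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0" width="1" />|'
    'third item.<img alt="" height="1" '
    'src="http://feeds.feedburner.com/~r/CoinDesk/~4/ooQYB2iDxP8" width="1" />|'
)

# Pattern from the question: the match stops at the first "~".
original = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
# Same pattern with "~" added to the character class.
fixed = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+~]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

truncated = re.findall(original, text)
urls = re.findall(fixed, text)

print(truncated)  # each match ends right after "feedburner.com/"
print(urls)       # full URLs, stopping before the closing double quote
```

Both patterns are written as raw strings here so that the \( and \) escapes reach the regex engine unchanged.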
Try this script:
text1="""Russia's National Settlement Depository discusses why it believes the biggest blockchain opportunities have yet to be uncovered.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw" width="1" />|One of the co-founder of digital currency startup Stellar announced their resignation today.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0" width="1" />|The editorial board for Bloomberg News has called for a permissive regulatory environment for blockchain development.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/ooQYB2iDxP8" width="1" />|"""
import re
print(re.findall(r'(https?://\S+)', text1))
The result is (each match keeps the closing double quote, because \S+ also matches "):
['http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw"', 'http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0"', 'http://feeds.feedburner.com/~r/CoinDesk/~4/ooQYB2iDxP8"']
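One possible way to drop that trailing quote is to stop the match at whitespace or a double quote instead of whitespace only. This is a sketch, not part of the original answer; sample is a shortened, hypothetical stand-in for text1 with the same URLs.

```python
import re

# Shortened stand-in for text1 (same URLs, placeholder sentences).
sample = (
    'first item.<img alt="" height="1" '
    'src="http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw" width="1" />|'
    'second item.<img alt="" height="1" '
    'src="http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0" width="1" />|'
)

# [^\s"]+ matches one or more characters that are neither whitespace
# nor a double quote, so the closing quote of the attribute is excluded.
links = re.findall(r'https?://[^\s"]+', sample)
print(links)
```

This assumes the attribute values are always delimited by double quotes, as they are in the question's data.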
Try this:
(?:src=)(".*?")
and take group \1 (note that the captured group includes the surrounding double quotes).
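In Python, re.findall returns the contents of the capture group directly when the pattern has exactly one group. A small sketch, assuming you want the URLs without quotes: the variant below moves the quotes outside the group. The snippet variable is a shortened, hypothetical stand-in for the question's string.

```python
import re

# Shortened stand-in for the question's string (same URLs).
snippet = (
    'first item.<img alt="" height="1" '
    'src="http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw" width="1" />|'
    'second item.<img alt="" height="1" '
    'src="http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0" width="1" />|'
)

# Pattern as suggested above: the group captures the quotes too.
quoted = re.findall(r'(?:src=)(".*?")', snippet)

# Variant with the quotes outside the group: only the URL is captured.
unquoted = re.findall(r'src="(.*?)"', snippet)

print(quoted)
print(unquoted)
```

The lazy .*? stops at the first closing quote after src=, so the following width attribute is not swallowed.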
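Where does this data come from? I would suggest using an HTML parser rather than trying to extract the values with regex. If the data comes from HTML, you can read the full attribute value straight from the tags.
Below I put your string into test.html (for Python, using BeautifulSoup as an example):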
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open(r'A:\test.html'), 'html.parser')
>>> [x['src'] for x in soup.find_all('img')]
['http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw', 'http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0', 'http://feeds.feedburner.com/~r/CoinDesk/~4/ooQYB2iDxP8']
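If installing BeautifulSoup is not an option, the same parser-based idea can be sketched with the standard library's html.parser module, reading from a string instead of a file. The class name ImgSrcCollector and the html variable are hypothetical names chosen for this illustration; html is a shortened stand-in for the question's string.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <img ... />, because the
        # default handle_startendtag delegates to handle_starttag.
        if tag == 'img':
            for name, value in attrs:
                if name == 'src' and value is not None:
                    self.sources.append(value)


# Shortened stand-in for the question's string (same URLs).
html = (
    'first item.<img alt="" height="1" '
    'src="http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw" width="1" />|'
    'second item.<img alt="" height="1" '
    'src="http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0" width="1" />|'
)

collector = ImgSrcCollector()
collector.feed(html)
collector.close()
print(collector.sources)
```

Unlike the regex approaches, this does not depend on the quoting style or the order of the attributes inside the tag.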