Get href string of certain links

Question

Using BeautifulSoup, I want to get the strings associated with certain hrefs, namely those containing "/genre/". For example, I got the following hrefs with this command:

soup.find_all('a', href=True)

The output is:

 <a href="/genre/Animation?ref_=tt_stry_gnr"> Animation</a>,
 <a href="/genre/Adventure?ref_=tt_stry_gnr"> Adventure</a>,
 <a href="/genre/Family?ref_=tt_stry_gnr"> Family</a>,
 <a href="/title/tt0235917/parentalguide?ref_=tt_stry_pg#certification"> See all certifications</a>,
 <a href="/title/tt0235917/parentalguide?ref_=tt_stry_pg" itemprop="url"> View content advisory</a>,

However, I want to select only the "genre" links and get this output:

Animation
Adventure
Family

I tried:

import re
imdb_page.find_all('a', {'href': re.compile(r'/genre/\d.*')})

But I get an empty list. Any ideas?

Answer 1

Your regex is wrong. It should be:

>>> for a in soup.find_all('a', {'href': re.compile(r'^/genre/.*')}):
...     print(a.text)
... 
 Animation
 Adventure
 Family
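
As a self-contained sketch of the same approach in Python 3: the `html` string below is copied from the question's output, and the names `genre_links` and `genres` are only illustrative. Passing `strip=True` to `get_text` drops the leading space that appears in the output above.

```python
import re
from bs4 import BeautifulSoup

# Sample markup copied from the output shown in the question.
html = """
<a href="/genre/Animation?ref_=tt_stry_gnr"> Animation</a>
<a href="/genre/Adventure?ref_=tt_stry_gnr"> Adventure</a>
<a href="/genre/Family?ref_=tt_stry_gnr"> Family</a>
<a href="/title/tt0235917/parentalguide?ref_=tt_stry_pg#certification"> See all certifications</a>
<a href="/title/tt0235917/parentalguide?ref_=tt_stry_pg" itemprop="url"> View content advisory</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only the <a> tags whose href starts with /genre/.
genre_links = soup.find_all("a", href=re.compile(r"^/genre/"))

# get_text(strip=True) removes the surrounding whitespace from each link text.
genres = [a.get_text(strip=True) for a in genre_links]
print(genres)  # ['Animation', 'Adventure', 'Family']
```

The trailing `.*` is optional here, since BeautifulSoup only needs the pattern to match somewhere in the attribute value.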

Regex explanation

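^ anchors the pattern at the start of the string,
/genre/ matches the literal text /genre/
.* matches anything (any run of characters other than a newline)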

What's wrong with /genre/\d.*

\d matches any digit. That is, you are trying to match a digit right after /genre/ (for example, href="/genre/1qwert").

But in the input, no href follows this pattern.

Hence you get an empty list.
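
The difference between the two patterns can be checked with the `re` module alone, without BeautifulSoup. This is a minimal sketch: the `hrefs` list is taken from the question's output, and the variable names are illustrative.

```python
import re

# href values taken from the question's output.
hrefs = [
    "/genre/Animation?ref_=tt_stry_gnr",
    "/genre/Adventure?ref_=tt_stry_gnr",
    "/genre/Family?ref_=tt_stry_gnr",
    "/title/tt0235917/parentalguide?ref_=tt_stry_pg#certification",
]

with_digit = re.compile(r"/genre/\d.*")  # requires a digit right after /genre/
anchored = re.compile(r"^/genre/.*")     # only requires the /genre/ prefix

digit_matches = [h for h in hrefs if with_digit.search(h)]
anchored_matches = [h for h in hrefs if anchored.search(h)]

print(digit_matches)     # [] because every genre name starts with a letter
print(anchored_matches)  # the three /genre/ hrefs

# The original pattern only matches when a digit follows /genre/.
print(bool(with_digit.search("/genre/1qwert")))  # True
```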
