Parsing HTML with Beautiful Soup to get the link after href
This is the result of find_all('a') (it is long):
</a>, <a class="btn text-default text-dark clear_filters pull-right group-ib" href="#" id="export_dialog_close" title="Cancel"><span class="glyphicon glyphicon-remove"></span><span>Cancel</span></a>, <a href="/ais/index/port_moves/all/include_anchs:no/ship_type:7/_:3525d580eade08cfdb72083b248185a9/in_transit:no/time_interval:1474948800.0_1475035200.00/per_page:50/portname:NOVOROSSIYSK/cb:6651/move_type:1/sort:SHIPNAME/direction:asc">Vessel Name</a>, <a href="/ais/index/port_moves/all/include_anchs:no/ship_type:7/_:3525d580eade08cfdb72083b248185a9/in_transit:no/time_interval:1474948800.0_1475035200.00/per_page:50/portname:NOVOROSSIYSK/cb:6651/move_type:1/sort:TIMESTAMP_UTC/direction:asc">Timestamp</a>, <a href="/ais/index/port_moves/all/include_anchs:no/ship_type:7/_:3525d580eade08cfdb72083b248185a9/in_transit:no/time_interval:1474948800.0_1475035200.00/per_page:50/portname:NOVOROSSIYSK/cb:6651/move_type:1/sort:PORT_NAME/direction:asc">Port</a>, <a href="/ais/index/port_moves/all/include_anchs:no/ship_type:7/_:3525d580eade08cfdb72083b248185a9/in_transit:no/time_interval:1474948800.0_1475035200.00/per_page:50/portname:NOVOROSSIYSK/cb:6651/move_type:1/sort:MOVE_TYPE_NAME/direction:asc">Port Call type</a>, <a href="/ais/index/port_moves/all/include_anchs:no/ship_type:7/_:3525d580eade08cfdb72083b248185a9/in_transit:no/time_interval:1474948800.0_1475035200.00/per_page:50/portname:NOVOROSSIYSK/cb:6651/move_type:1/sort:ELAPSED/direction:asc">Time Elapsed</a>, <a href="/en/ais/details/ships/shipid:465271/imo:9495595/mmsi:373571000/vessel:SIDER%20LUCK/_:3525d580eade08cfdb72083b248185a9" title="View details for: SIDER LUCK">SIDER LUCK</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ports/163/port_name:MILAZZO/_:3525d580eade08cfdb72083b248185a9" title="View details for: MILAZZO">MILAZZO</a>, <a 
href="/en/ais/details/ships/shipid:288753/imo:9389693/mmsi:249474000/vessel:OOCL%20ISTANBUL/_:3525d580eade08cfdb72083b248185a9" title="View details for: OOCL ISTANBUL">OOCL ISTANBUL</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ports/17436/port_name:AMBARLI/_:3525d580eade08cfdb72083b248185a9" title="View details for: AMBARLI">AMBARLI</a>, <a href="/en/ais/details/ships/shipid:754480/imo:9045613/mmsi:636014098/vessel:TK%20ROTTERDAM/_:3525d580eade08cfdb72083b248185a9" title="View details for: TK ROTTERDAM">TK ROTTERDAM</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ports/3504/port_name:DILISKELESI/_:3525d580eade08cfdb72083b248185a9" title="View details for: DILISKELESI">DILISKELESI</a>, <a href="/en/ais/details/ships/shipid:412277/imo:9039585/mmsi:353430000/vessel:SEA%20AEOLIS/_:3525d580eade08cfdb72083b248185a9" title="View details for: SEA AEOLIS">SEA AEOLIS</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ports/1/port_name:PIRAEUS/_:3525d580eade08cfdb72083b248185a9" title="View details for: PIRAEUS">PIRAEUS</a>, <a href="/en/ais/details/ships/shipid:346713/imo:7614599/mmsi:273327300/vessel:SOLIDAT/_:3525d580eade08cfdb72083b248185a9" title="View details for: SOLIDAT">SOLIDAT</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ports/883/port_name:SEVASTOPOL/_:3525d580eade08cfdb72083b248185a9" title="View details for: SEVASTOPOL">SEVASTOPOL</a>, <a 
href="/en/ais/details/ships/shipid:752974/imo:9195298/mmsi:636011072/vessel:OCEANPRINCESS/_:3525d580eade08cfdb72083b248185a9" title="View details for: OCEANPRINCESS">OCEANPRINCESS</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ports/21780/port_name:EREGLI/_:3525d580eade08cfdb72083b248185a9" title="View details for: EREGLI">EREGLI</a>, <a href="/en/ais/details/ships/shipid:201260/imo:9385075/mmsi:235102768/vessel:EMERALD%20BAY/_:3525d580eade08cfdb72083b248185a9" title="View details for: EMERALD BAY">EMERALD BAY</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ships/shipid:418956/imo:9102746/mmsi:356579000/vessel:MSC%20DON%20GIOVANNI/_:3525d580eade08cfdb72083b248185a9" title="View details for: MSC DON GIOVANNI">MSC DON GIOVANNI</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ports/67/port_name:CONSTANTA/_:3525d580eade08cfdb72083b248185a9" title="View details for: CONSTANTA">CONSTANTA</a>, <a href="/en/ais/details/ships/shipid:748395/imo:9460734/mmsi:622121422/vessel:WADI%20SAFAGA/_:3525d580eade08cfdb72083b248185a9" title="View details for: WADI SAFAGA">WADI SAFAGA</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ports/997/port_name:DAMIETTA/_:3525d580eade08cfdb72083b248185a9" title="View details for: DAMIETTA">DAMIETTA</a>
I want to pull out the strings that start with /en/ais/details/ships/shipid: such as:
<a href="/en/ais/details/ships/shipid:465271/imo:9495595/mmsi:373571000/vessel:SIDER%20LUCK/_:3525d580eade08cfdb72083b248185a9"
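I was able to reproduce these examples (Find specific link w/ beautifulsoup or How to get Beautiful Soup to get link from href and class?), but I don't want to use regular expressions.
So far I have:
for i in ase:  # ase is where the HTML is stored
    print(i.get('href'))  # prints every single href
In short, my question is: how do I keep only the hrefs that contain the string I'm interested in, without using a regular expression?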
Try the following list comprehension:
[h.get('href') for h in ase if 'string' in h.get('href', '')]
This gives you a list containing only the links that contain the substring 'string'.
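As a minimal sketch of the comprehension in context: the markup below is a hypothetical, shortened stand-in for the question's output, `wanted` is an illustrative name, and the stdlib "html.parser" backend is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened sample modelled on the output in the question.
html = """
<a href="#" title="Cancel">Cancel</a>
<a href="/en/ais/details/ships/shipid:465271/vessel:SIDER%20LUCK">SIDER LUCK</a>
<a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK">NOVOROSSIYSK</a>
<a name="no-href-here">anchor without an href</a>
"""

soup = BeautifulSoup(html, "html.parser")
ase = soup.find_all("a")  # plays the same role as `ase` in the question

wanted = "/en/ais/details/ships/shipid:"
# Plain substring test; the '' default keeps tags without href from failing.
ship_links = [h.get("href") for h in ase if wanted in h.get("href", "")]
print(ship_links)
```

Note that `in` matches the substring anywhere in the href; if only a leading match should count, `h.get("href", "").startswith(wanted)` is the stricter test.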
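Update:
As @PadraicCunningham pointed out in the comments, 'string' in h.get('href') (which was part of my original answer) will raise a TypeError if h has no key 'href'. That is unlikely, since h represents an <a> tag, but it is certainly a non-trivial possibility. To allow for it, you can simply pass .get() a default argument '' to be returned instead of None when the key does not exist.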
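To see why the default matters, here is a small dependency-free sketch. A plain dict stands in for the attributes of an <a> tag that has no href; that stand-in is an assumption for illustration, relying on a tag's .get() behaving like dict.get().

```python
# Stand-in for the attributes of an <a> tag that lacks an href.
attrs = {"name": "top"}

raised = False
try:
    # .get("href") returns None, and `in` cannot search inside None.
    "string" in attrs.get("href")
except TypeError:
    raised = True

# With '' as the default, the membership test is well defined and False.
safe_result = "string" in attrs.get("href", "")

print(raised, safe_result)
```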
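Also, I am not claiming that my solution is the best; it may not be particularly efficient or elegant. However, based on my understanding of the OP's question, this solution works, is minimal, and is easy to understand.
I would still suggest using a regular expression, since it is more concise and saves you another loop over the list.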
import re
soup.find_all('a', href=re.compile("/en/ais/details/ships/shipid:"))
You can find a solution similar to this one in the documentation.
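That is not the best approach: it finds all the links and then filters them. Why not get only the links we need directly, without the extra filtering? BeautifulSoup is quite capable of doing that: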
prefix = "/en/ais/details/ships/shipid"
[a["href"] for a in soup("a", href=lambda x: x and x.startswith(prefix))]
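The `x and ...` guard is there because the filter can be handed None when a tag has no href. A sketch of the same idea with a named function, using hypothetical sample markup and an illustrative helper name `is_ship_link`:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one ship link, one port link, one <a> without href.
html = """
<a href="/en/ais/details/ships/shipid:288753/vessel:OOCL%20ISTANBUL">OOCL ISTANBUL</a>
<a href="/en/ais/details/ports/17436/port_name:AMBARLI">AMBARLI</a>
<a name="top">anchor without an href</a>
"""

soup = BeautifulSoup(html, "html.parser")
prefix = "/en/ais/details/ships/shipid"


def is_ship_link(href):
    # href may be None for tags lacking the attribute, so check that first.
    return href is not None and href.startswith(prefix)


# The filtering happens inside find_all; no second pass over the result.
ship_links = [a["href"] for a in soup.find_all("a", href=is_ship_link)]
print(ship_links)
```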
Or, instead of a function, you can pass a regular expression pattern that checks whether the string starts with the desired substring:
pattern = re.compile(r"^/en/ais/details/ships/shipid")
[a["href"] for a in soup("a", href=pattern)]
^ here denotes the beginning of the string.
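The anchor matters because Beautiful Soup tests a compiled pattern with its search() method, which would otherwise match anywhere in the value. A stdlib-only sketch of the difference; the `embedded` URL is a made-up example:

```python
import re

unanchored = re.compile(r"/en/ais/details/ships/shipid")
anchored = re.compile(r"^/en/ais/details/ships/shipid")

ship = "/en/ais/details/ships/shipid:465271/imo:9495595"
# Made-up URL that only contains the path somewhere in the middle.
embedded = "/redirect?to=/en/ais/details/ships/shipid:465271"

# search() scans the whole string unless the pattern is anchored with ^.
results = {
    "unanchored/ship": bool(unanchored.search(ship)),
    "unanchored/embedded": bool(unanchored.search(embedded)),
    "anchored/ship": bool(anchored.search(ship)),
    "anchored/embedded": bool(anchored.search(embedded)),
}
print(results)
```

Only the anchored pattern rejects the embedded case.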
Or we can even use a CSS selector:
[a["href"] for a in soup.select('a[href^="/en/ais/details/ships/shipid"]')]
^= is a "starts-with" selector.
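For comparison, a short sketch of the selector approach on hypothetical sample markup. It also shows *=, the standard CSS "contains" attribute selector, which is not part of the answer above and is included only as a related option.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modelled on the question's output.
html = """
<a href="#" title="Cancel">Cancel</a>
<a href="/en/ais/details/ships/shipid:412277/vessel:SEA%20AEOLIS">SEA AEOLIS</a>
<a href="/en/ais/details/ports/1/port_name:PIRAEUS">PIRAEUS</a>
"""

soup = BeautifulSoup(html, "html.parser")

# ^= keeps only <a> tags whose href starts with the given value.
starts = [a["href"] for a in soup.select('a[href^="/en/ais/details/ships/shipid"]')]

# *= keeps <a> tags whose href contains the value anywhere.
contains = [a["href"] for a in soup.select('a[href*="/ports/"]')]

print(starts)
print(contains)
```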