How to open "partial" links using Python?

Question

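I'm working on a web scraper that opens a web page and prints any link on that page that contains a keyword (I will open those links later for further scraping).

For example, I use the requests module to open "cnn.com" and then try to parse all the href/links on that page. If any of the links contains a specific word (such as "china"), Python should print that link.

I could simply open the homepage with requests, save all the hrefs into a list ('links'), and then use: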

links = [...]

keyword = "china"

for link in links:
    if keyword in link:
        print(link)
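For context, here is a minimal sketch of how such a `links` list might be collected. The post does not say which parser is used, so this assumes the standard library's `html.parser`. The `HrefCollector` class name and the `sample_html` snippet (including the `china-example` path) are made up for illustration and stand in for the HTML of a downloaded page.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the raw href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names are lowercased.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


# Hypothetical stand-in for the HTML returned by the page request.
sample_html = """
<a href="/2019/08/10/asia/china-example/index.html">relative link</a>
<a href="https://www.cnbc.com/2019/08/11/how-recession-affects-tech-industry.html">absolute link</a>
<a name="no-href">anchor without an href</a>
"""

collector = HrefCollector()
collector.feed(sample_html)
links = collector.links

keyword = "china"
matches = [link for link in links if keyword in link]
print(matches)
```

Note that the hrefs come back exactly as written in the page, so relative ones are still relative at this point.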

The problem with this approach is that the links I parse out are not always complete. For example, all links on CNBC pages are structured like this:

href="https://www.cnbc.com/2019/08/11/how-recession-affects-tech-industry.html"

But on CNN's pages, they are written like this (incomplete links; they are missing the part before the first "/"):

href="/2019/08/10/europe/luxembourg-france-amsterdam-tornado-intl/index.html"

This is a problem because I'm writing more scripts to automatically open these links and parse them. But Python can't open

"/2019/08/10/europe/luxembourg-france-amsterdam-tornado-intl/index.html"

because it is not a complete link.

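So, what is a robust solution to this problem (one that also works for other sites, not just CNN)?

Edit: I know the links I used as examples in this post don't contain the word "China", but they are just examples.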

Answer 1

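Try the urljoin function from the urllib.parse module. It takes two arguments: the first is the URL of the page you are currently parsing, which serves as the base for relative links, and the second is the link you found. If the link you found starts with http:// or https://, urljoin returns that link unchanged; otherwise it resolves the link relative to the URL you passed as the first argument.

For example: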

#!/usr/bin/env python3

from urllib.parse import urljoin

# The relative link in the question comes from a CNN page,
# so the CNN URL is used as the base here.
print(
  urljoin(
    "https://www.cnn.com/",
    "/2019/08/10/europe/luxembourg-france-amsterdam-tornado-intl/index.html"
  )
)
# prints "https://www.cnn.com/2019/08/10/europe/luxembourg-france-amsterdam-tornado-intl/index.html"

# An absolute link is returned unchanged, whatever the base is.
print(
  urljoin(
    "https://www.cnn.com/",
    "http://some-other.website/"
  )
)
# prints "http://some-other.website/"
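Combined with the keyword filter from the question, this could look roughly like the sketch below. The function name `resolve_matching_links` and the `china-example` paths are made up for illustration. The keyword check is the same case-sensitive `in` test used in the question, applied to the href as found on the page.

```python
from urllib.parse import urljoin


def resolve_matching_links(page_url, hrefs, keyword):
    """Return absolute URLs for the hrefs that contain the keyword."""
    resolved = []
    for href in hrefs:
        if keyword in href:
            # Relative hrefs are resolved against page_url;
            # absolute hrefs pass through unchanged.
            resolved.append(urljoin(page_url, href))
    return resolved


# Hypothetical hrefs, as they might be scraped from a CNN page.
hrefs = [
    "/2019/08/10/asia/china-example/index.html",
    "https://www.cnbc.com/2019/08/11/china-example.html",
    "/2019/08/10/europe/luxembourg-france-amsterdam-tornado-intl/index.html",
]

result = resolve_matching_links("https://www.cnn.com/", hrefs, "china")
for url in result:
    print(url)
```

The first argument should be the URL of the page the hrefs were actually scraped from. With requests, `response.url` is probably the safest value to pass, since it reflects any redirects that were followed.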

parsing

href

hyperlink

web-scraping

python-3.x