Extracting URLs from a JSON file

Question

I am using Postman to fetch URLs from an API so that I can look at certain titles. The response has been saved as a .json file.

A snippet of my response.json file looks like this:

{
    "apiUrl":"https://api.ft.com/example/83example74-3c9b-11ea-a01a-example547046735",
    "title": {
        "title": "Example title example title example title"
    },
    "lifecycle": {
        "initialPublishDateTime":"2020-01-21T22:54:57Z",
        "lastPublishDateTime":"2020-01-21T23:38:19Z"
    },
    "location":{
        "uri":"https://www.ft.com/exampleurl/83example74-3c9b-11ea-a01a-example547046735"
    },
    "summary": "...",
    # ............(this continues for all different titles I found)
}

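Because I want to read the articles, I would like to generate a list of all the URLs. I am not interested in the apiUrl values, only in the uri values.

My current Python file looks like this: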

with open ("My path to file/response.json") as file:
    for line in file:
        urls = re.findall('https://(?:[-\www.]|(?:%[\da-fA-F]{2}))+', line)
        print(urls)

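This gives me the following output: ['https://api.ft.com', 'https://www.ft.com', 'https://api.ft.com', 'https://www.ft.com',........

However, I want to see the full www.ft.com URL (and not the api.ft.com URLs, since I am not interested in those). For example, I would like my program to extract something like: https://www.ft.com/thisisanexampleurl/83example74-3c9b-11ea-a01a-example547046735

I want the program to do this for the entire response file.

Does anyone know a way to do this?

Any help would be greatly appreciated. Raymond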

Answer 1

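There are many ways to extract the URLs; below is the simplest regular expression. Whitespace and double quotes are excluded from the match so that the closing quote is not captured as part of the URL.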

import re

# Sample text containing both kinds of URL
str_ = 'first url "https://api.ft.com/example/83example74-3c9b-11ea-a01a-example547046735" plus second url "https://www.ft.com/exampleurl/83example74-3c9b-11ea-a01a-example547046735'
# Stop at whitespace or a double quote so the closing quote is not captured
print(re.findall(r'https?://[^\s"]+', str_))
# Output:
# ['https://api.ft.com/example/83example74-3c9b-11ea-a01a-example547046735', 'https://www.ft.com/exampleurl/83example74-3c9b-11ea-a01a-example547046735']
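
To return only the article links, the pattern can be tied to the www.ft.com host. The following is a minimal sketch, assuming the whole file fits in memory. The helper name extract_article_urls and the sample text are made up for illustration; in practice the text would come from reading response.json.

```python
import re

# Hypothetical pattern: matches only URLs on the www.ft.com host and stops
# at whitespace or a double quote, so the JSON closing quote is excluded.
FT_ARTICLE_URL = re.compile(r'https?://www\.ft\.com/[^\s"]+')


def extract_article_urls(text):
    """Return every www.ft.com URL found in text, in order of appearance."""
    return FT_ARTICLE_URL.findall(text)


# Stand-in for the file contents (in practice: text = file.read())
sample = '''
{
    "apiUrl":"https://api.ft.com/example/83example74-3c9b-11ea-a01a-example547046735",
    "location":{
        "uri":"https://www.ft.com/exampleurl/83example74-3c9b-11ea-a01a-example547046735"
    }
}
'''

print(extract_article_urls(sample))
```

Searching the whole text at once, rather than line by line, also works when the JSON is saved on a single line.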

Answer 2

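Assuming the URLs are scattered throughout the JSON object, you can recursively search every nested value under every key to determine whether it is a URL.

Also, if the file is well-formed JSON, parsing it with json.loads gives you an object that is much easier to search than the raw file object.

For example, using the Python validators package: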

import validators

Iterate through the object.

Check each value with -> `validators.url(value)`

If True -> return value
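
The steps above could be sketched roughly as follows. To keep the example free of third-party dependencies, a small urllib.parse check stands in for validators.url; this is a swap made for illustration, not the validators package itself, and it is a looser check. The names looks_like_url and iter_urls and the sample document are hypothetical.

```python
import json
from urllib.parse import urlparse


def looks_like_url(value):
    """Rough stand-in for validators.url: an http(s) scheme plus a host."""
    if not isinstance(value, str):
        return False
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def iter_urls(obj):
    """Recursively yield every URL-like string value in parsed JSON."""
    if isinstance(obj, dict):
        for value in obj.values():
            yield from iter_urls(value)
    elif isinstance(obj, list):
        for item in obj:
            yield from iter_urls(item)
    elif looks_like_url(obj):
        yield obj


# Sample document modelled on the snippet in the question
data = json.loads('''
{
    "apiUrl": "https://api.ft.com/example/83example74-3c9b-11ea-a01a-example547046735",
    "title": {"title": "Example title example title example title"},
    "location": {
        "uri": "https://www.ft.com/exampleurl/83example74-3c9b-11ea-a01a-example547046735"
    },
    "summary": "..."
}
''')

all_urls = list(iter_urls(data))
# Keep only the article links by filtering on the host name
article_urls = [u for u in all_urls if urlparse(u).netloc == "www.ft.com"]
print(article_urls)
```

Filtering on the host after collecting everything keeps the traversal generic, so the same walker can be reused if other hosts become interesting later.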

Answer 3

If you know which keys contain the URLs, you can use the nested_lookup library to retrieve them:

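import json
from nested_lookup import nested_lookup

# Parse the saved response so that data holds nested dicts and lists
with open("My path to file/response.json") as file:
    data = json.load(file)

urls = []
# Drop 'apiUrl' from the tuple to collect only the article links
for key in ('uri', 'apiUrl'):
    urls.extend(nested_lookup(key, data))
print(urls)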

# ['https://www.ft.com/exampleurl/83example74-3c9b-11ea-a01a-example547046735', 'https://api.ft.com/example/83example74-3c9b-11ea-a01a-example547046735']
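
If installing an extra package is not an option, the same key-based lookup can be approximated with a few lines of standard-library Python. This is only a sketch of the idea, not how nested_lookup is implemented; the function name find_values and the one-item sample list are assumptions made for illustration.

```python
def find_values(obj, key):
    """Recursively collect every value stored under key in nested dicts and lists."""
    found = []
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                found.append(v)
            # Keep descending, since the key may also appear deeper down
            found.extend(find_values(v, key))
    elif isinstance(obj, list):
        for item in obj:
            found.extend(find_values(item, key))
    return found


# Made-up sample: a list holding one article shaped like the question's snippet
data = [
    {
        "apiUrl": "https://api.ft.com/example/83example74-3c9b-11ea-a01a-example547046735",
        "location": {
            "uri": "https://www.ft.com/exampleurl/83example74-3c9b-11ea-a01a-example547046735"
        },
    }
]

# Only the 'uri' key is requested, so the apiUrl values are skipped
print(find_values(data, "uri"))
```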

Answer 4

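Thanks, everyone, for your input.

I found another way to solve my problem (I switched to a news API instead. It does essentially the same thing, but rather than looking only at the Financial Times API, I now get more sites and articles). This works better for me.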

Raymond van Zonneveld
