How do I replace all URLs in HTML with their final redirect?
Preferably with BeautifulSoup, since I am already using it for other purposes, but any Python solution is fine.
s = BeautifulSoup(bodyhtml, features="lxml")
items = s.find_all("div", {"class": "text-block"})
# I want to replace all URLs in `items` with their final redirect.
Here is a sample URL:
https://tracking.tldrnewsletter.com/CL0/https:%2F%2Farstechnica.com%2Finformation-technology%2F2020%2F04%2Fmeet-dark_nexus-quite-possibly-the-most-potent-iot-botnet-ever%2F/1/0100017163ab9f84-cfdbd3c3-ef8c-4b34-b2a0-f6f4b8f78359-000000/BEB0JUmMqamX4piPthkn_oJ78cjvd6UocEmGf7iO5Pk=136
Here is items[5] (all items have the same structure):
<div class="text-block"><span style="color: rgb(0, 0, 0);"><a href="https://tracking.tldrnewsletter.com/CL0/https:%2F%2Fwww.polygon.com%2F2020%2F4%2F8%2F21213551%2Fgoogle-stadia-free-pro-subscription/1/010001715e86638d-8bd389c9-f9eb-4b68-ade4-c2d706ea5ecb-000000/J3pqLEKSYUvxNOcq8090EHiTSXXHiZtRNM6JD1aQP8s=136"><span style="font-size: 14px;"><strong>Google Stadia now free to anyone with a Gmail address (2 minute read)</strong></span></a><br/><br/><span style='font-size: 14px; font-family: "Helvetica Neue", Helvetica, Arial, Verdana, sans-serif;'>Google Stadia is now free to anyone with a Gmail address. New users will receive two months of Stadia Pro for free. Existing Stadia Pro users won't be charged for the next two months. Nine games are included with the offer. Access to Stadia previously required purchasing a Google Stadia Premier Edition bundle for 9. Stadia Pro will cost .99 a month after the two-month trial period ends. Users can cancel their subscriptions online at any time.</span><br/></span><br/></div>
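Select the relevant a elements. Assuming every link shares the same prefix, strip that prefix from the href attribute and discard everything from the first / onward (the slashes of the target URL are encoded as %2F, so the first literal / marks its end). Then unescape what is left, like this: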
from bs4 import BeautifulSoup
from urllib.parse import unquote
html = """
<head>
<body>
<p>
<div class="text-block"><span style="color: rgb(0, 0, 0);"><a href="https://tracking.tldrnewsletter.com/CL0/https:%2F%2Fwww.polygon.com%2F2020%2F4%2F8%2F21213551%2Fgoogle-stadia-free-pro-subscription/1/010001715e86638d-8bd389c9-f9eb-4b68-ade4-c2d706ea5ecb-000000/J3pqLEKSYUvxNOcq8090EHiTSXXHiZtRNM6JD1aQP8s=136"><span style="font-size: 14px;"><strong>Google Stadia now free to anyone with a Gmail address (2 minute read)</strong></span></a>
<br/>
<br/><span style='font-size: 14px; font-family: "Helvetica Neue", Helvetica, Arial, Verdana, sans-serif;'>Google Stadia is now free to anyone with a Gmail address. New users will receive two months of Stadia Pro for free. Existing Stadia Pro users won't be charged for the next two months. Nine games are included with the offer. Access to Stadia previously required purchasing a Google Stadia Premier Edition bundle for 9. Stadia Pro will cost .99 a month after the two-month trial period ends. Users can cancel their subscriptions online at any time.</span>
<br/>
</span>
<br/>
</div>
</p>
</body>
</head>
"""
s = BeautifulSoup(html, features="lxml")
for a in s.select('div.text-block a'):
    # Strip the tracking prefix, keep the part before the first "/", then percent-decode it.
    a['href'] = unquote(a['href'].replace("https://tracking.tldrnewsletter.com/CL0/", "").split('/')[0])
print(s)
Output:
<html><head>
</head><body>
<p>
</p><div class="text-block"><span style="color: rgb(0, 0, 0);"><a href="https://www.polygon.com/2020/4/8/21213551/google-stadia-free-pro-subscription"><span style="font-size: 14px;"><strong>Google Stadia now free to anyone with a Gmail address (2 minute read)</strong></span></a>
<br/>
<br/><span style='font-size: 14px; font-family: "Helvetica Neue", Helvetica, Arial, Verdana, sans-serif;'>Google Stadia is now free to anyone with a Gmail address. New users will receive two months of Stadia Pro for free. Existing Stadia Pro users won't be charged for the next two months. Nine games are included with the offer. Access to Stadia previously required purchasing a Google Stadia Premier Edition bundle for 9. Stadia Pro will cost .99 a month after the two-month trial period ends. Users can cancel their subscriptions online at any time.</span>
<br/>
</span>
<br/>
</div>
</body>
</html>
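One caveat with the one-liner above: a link that does not start with the tracking prefix would still be cut at its first /, leaving only "https:". A slightly more defensive variant could look like the sketch below. The helper name unwrap_tracking_url and the constant TRACKING_PREFIX are made up for illustration, and the prefix is assumed from the sample links in the question.

```python
from urllib.parse import unquote

# Assumption: every tracked link starts with this prefix (taken from the sample URLs).
TRACKING_PREFIX = "https://tracking.tldrnewsletter.com/CL0/"


def unwrap_tracking_url(href, prefix=TRACKING_PREFIX):
    """Return the target URL embedded in a tracking link, or href unchanged."""
    if not href.startswith(prefix):
        # Not a tracking link, so leave it alone.
        return href
    # The target's own slashes are encoded as %2F, so the first literal "/"
    # after the prefix marks where the encoded target ends.
    encoded_target = href[len(prefix):].split("/", 1)[0]
    return unquote(encoded_target)
```

In the loop above this would be used as a['href'] = unwrap_tracking_url(a['href']). Note that this only decodes the target embedded in the link; it does not make any HTTP request, so if the target itself redirects somewhere else, that further hop is not resolved.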