How to add "https://www.example.com/" before scraped URLs in Python that don't already have it

Question

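I'm a Python newbie, and I'm trying to scrape a list of URLs from a website and send them to a .CSV file, but I keep getting a bunch of URLs that are only partial. They don't have "https://www.example.com" before the rest of the URL. I've found that I need to add something like "['https://www.example.com{0}'.format(link) if link.startswith('/') else link for link in url_list]" to my code, but where should I add it? And is that what I should be adding? Thanks for your help! Here's my code: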

url_list=soup.find_all('a')
with open('HTMLList.csv','w',newline="") as f:
    writer=csv.writer(f,delimiter=' ',lineterminator='\r')
    for link in url_list:
        url=link.get('href')
        if url:
            writer.writerow([url])
f.close()

If you see anything else that needs to be changed, please let me know. Thanks!

Answer 1

A simple if statement will do this. Just check whether https://www.example.com is present in the URL and, if it is not, prepend it.

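import csv

# soup is the BeautifulSoup object from the question
url_list=soup.find_all('a')
with open('HTMLList.csv','w',newline="") as f:
    writer=csv.writer(f,delimiter=' ',lineterminator='\r')
    for link in url_list:
        url=link.get('href')
        # updated: skip missing hrefs and bare '#' anchors
        if url != '#' and url is not None:
            # added: prepend the site address when the URL lacks it
            if 'https://www.example.com' not in url:
                url = 'https://www.example.com' + url
            writer.writerow([url])
# no f.close() needed: the with block closes the file on exit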
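
One caveat with the substring check: it also prefixes links that point to other sites, and hrefs without a leading slash end up glued straight onto the domain. A possible alternative is the standard library's urllib.parse.urljoin, which resolves an href against a base URL. The sketch below is only an illustration: the helper name absolutize and the sample hrefs are made up, and the base URL is assumed to be the one from the question.

```python
from urllib.parse import urljoin

# Assumed base address, taken from the question.
BASE_URL = "https://www.example.com/"


def absolutize(hrefs, base=BASE_URL):
    """Return absolute URLs, skipping missing hrefs and bare '#' anchors."""
    result = []
    for href in hrefs:
        if not href or href == "#":
            continue
        # urljoin leaves URLs that are already absolute untouched
        # and resolves relative ones against the base.
        result.append(urljoin(base, href))
    return result


# Hypothetical sample hrefs, as link.get('href') might return them.
print(absolutize(["/about", "contact.html", "https://other.org/x", "#", None]))
```

In the loop above, this would amount to replacing the inner if with url = urljoin('https://www.example.com/', url) before writer.writerow([url]).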

python

uri

for-loop

web-scraping