
how to extract url from a string and save to a list

I'm unable to extract the URLs from the string and save them to a list.

I have tried something like this:

url = "https://in.indeed.com/jobs?q=software%20engineer%20&l=Kerala"
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
Links1 = soup.find_all("div",{"class:","pagination"})

url = [Links1.find(('a')['href'] for tag in Links1)]
WEbsite=f'https://in.indeed.com{url[0]}'

But it doesn't return the full list of URLs. I need the URLs so I can navigate to the next page.

You should use find() rather than find_all(); then this modified list comprehension should work:

Links1 = soup.find("div", {"class": "pagination"})
urls = [i['href'] for i in Links1.find_all('a') if 'href' in i.attrs]

Are you after just the "next page" link, or do you want all of them?

So do you want this:

/jobs?q=software+engineer+&l=Kerala&start=10

Or are you after all of these?

/jobs?q=software+engineer+&l=Kerala&start=10
/jobs?q=software+engineer+&l=Kerala&start=20
/jobs?q=software+engineer+&l=Kerala&start=30
/jobs?q=software+engineer+&l=Kerala&start=40
/jobs?q=software+engineer+&l=Kerala&start=50

A couple of issues:

  1. Links1 is a ResultSet, i.e. a list of elements. Calling .find('a') on the list won't work — .find() is a method on individual Tag objects, not on the ResultSet.
  2. Since you want the href attribute, consider using find('a', href=True) so that anchor tags without an href are skipped.
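
The two points above can be demonstrated with an inline HTML snippet instead of a live request (a minimal sketch; the snippet below is made up to mimic Indeed's pagination markup):

```python
from bs4 import BeautifulSoup

# A made-up fragment resembling the pagination div: one <a> with an
# href and one without (e.g. the current-page marker).
html = """
<div class="pagination">
  <a href="/jobs?start=10">2</a>
  <a aria-label="Current">1</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a ResultSet (list-like); calling .find() on it raises.
result_set = soup.find_all("div", {"class": "pagination"})
caught = None
try:
    result_set.find("a")
except AttributeError as err:
    caught = err  # "ResultSet object has no attribute 'find'..."

# find() returns a single Tag, and href=True skips the href-less <a>.
div = soup.find("div", {"class": "pagination"})
hrefs = [a["href"] for a in div.find_all("a", href=True)]
print(hrefs)  # ['/jobs?start=10']
```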

Here is how I would handle it:

import requests
from bs4 import BeautifulSoup

url = "https://in.indeed.com/jobs?q=software%20engineer%20&l=Kerala"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
Links1 = soup.find_all("div",{"class":"pagination"})

url = [tag.find('a',href=True)['href'] for tag in Links1]
website=f'https://in.indeed.com{url[0]}'

Output:

print(website)
https://in.indeed.com/jobs?q=software+engineer+&l=Kerala&start=10
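
Since the goal is just navigating through the pages, another option is to build the page URLs directly instead of scraping them (a sketch; it assumes, based on the links above, that Indeed paginates via the `start` query parameter in steps of 10):

```python
# Construct the first five result-page URLs by incrementing `start`.
base = "https://in.indeed.com/jobs?q=software+engineer+&l=Kerala"
page_urls = [f"{base}&start={offset}" for offset in range(10, 60, 10)]
for u in page_urls:
    print(u)
```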

And to get all of those links:

import requests
from bs4 import BeautifulSoup

url = "https://in.indeed.com/jobs?q=software%20engineer%20&l=Kerala"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
Links1 = soup.find("div",{"class":"pagination"})

urls = [tag['href'] for tag in Links1.find_all('a', href=True)]
websites = [f'https://in.indeed.com{u}' for u in urls]
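
For turning the scraped hrefs into absolute URLs, urllib.parse.urljoin from the standard library is a bit more robust than f-string concatenation, since it handles both relative and already-absolute hrefs (a sketch; the sample paths below are the pagination links from above):

```python
from urllib.parse import urljoin

base = "https://in.indeed.com/jobs?q=software%20engineer%20&l=Kerala"
hrefs = [
    "/jobs?q=software+engineer+&l=Kerala&start=10",
    "/jobs?q=software+engineer+&l=Kerala&start=20",
]
# urljoin resolves each path against the scheme and host of `base`.
websites = [urljoin(base, h) for h in hrefs]
print(websites)
```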