How to extract URLs from a string and save them to a list
I can't save the URLs from the string.
I tried something like this:
url = "https://in.indeed.com/jobs?q=software%20engineer%20&l=Kerala"
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
Links1 = soup.find_all("div",{"class:","pagination"})
url = [Links1.find(('a')['href'] for tag in Links1)]
WEbsite=f'https://in.indeed.com{url[0]}'
But it doesn't return the full list of URLs. I need the URLs so I can navigate to the next page.
You should use find() instead of find_all(); then this modified list of urls should work:
Links1 = soup.find("div", {"class": "pagination"})
urls = [i['href'] for i in Links1.find_all('a') if 'href' in i.attrs]
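To see why this works, note that find() returns a single Tag (which supports .find_all()), while find_all() returns a list of Tags. A minimal sketch with a hard-coded HTML snippet standing in for Indeed's pagination markup (the markup here is an assumption for illustration, not copied from the live page):

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the pagination block on the results page
html = """
<div class="pagination">
  <a href="/jobs?start=10">2</a>
  <a href="/jobs?start=20">3</a>
  <a>anchor without an href</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() gives one Tag, so calling .find_all('a') on it is valid
pagination = soup.find("div", {"class": "pagination"})
urls = [a['href'] for a in pagination.find_all('a') if 'href' in a.attrs]
print(urls)  # ['/jobs?start=10', '/jobs?start=20']
```

The `'href' in a.attrs` guard skips anchors without an href, which would otherwise raise a KeyError.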
Are you after just the "next page" link, or all of the links?
So, do you want just this:
/jobs?q=software+engineer+&l=Kerala&start=10
Or are you after all of these?
/jobs?q=software+engineer+&l=Kerala&start=10
/jobs?q=software+engineer+&l=Kerala&start=20
/jobs?q=software+engineer+&l=Kerala&start=30
/jobs?q=software+engineer+&l=Kerala&start=40
/jobs?q=software+engineer+&l=Kerala&start=10
A couple of issues:
- Links1 is a list of elements, and you then call .find('a') on that list, which won't work.
- Since you want the href attribute, consider using find('a', href=True).
Here's how I'd go about it:
import requests
from bs4 import BeautifulSoup
url = "https://in.indeed.com/jobs?q=software%20engineer%20&l=Kerala"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
Links1 = soup.find_all("div", {"class": "pagination"})
urls = [tag.find('a', href=True)['href'] for tag in Links1]
website = f'https://in.indeed.com{urls[0]}'
Output:
print(website)
https://in.indeed.com/jobs?q=software+engineer+&l=Kerala&start=10
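Since the goal is navigating to the next pages, an alternative to scraping the pagination links each time is generating the page URLs directly from the start offset. A sketch, assuming Indeed pages results 10 at a time; the stopping offset of 50 is an arbitrary assumption, not something read from the site:

```python
# Generate result-page URLs by stepping the start offset 10 at a time
base = "https://in.indeed.com/jobs?q=software+engineer+&l=Kerala"
page_urls = [f"{base}&start={offset}" for offset in range(10, 50, 10)]

for u in page_urls:
    print(u)  # first line ends with start=10, last with start=40
```

In practice you would keep requesting pages until one comes back with no results, rather than hard-coding the upper bound.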
To get all of those links:
import requests
from bs4 import BeautifulSoup
url = "https://in.indeed.com/jobs?q=software%20engineer%20&l=Kerala"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
Links1 = soup.find("div", {"class": "pagination"})
urls = [tag['href'] for tag in Links1.find_all('a', href=True)]
websites = [f'https://in.indeed.com{u}' for u in urls]
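Because the scraped hrefs are site-relative, urljoin from the standard library is a safer way to build absolute URLs than plain string concatenation (it handles a missing or doubled slash at the boundary). A small sketch using paths like the ones shown above:

```python
from urllib.parse import urljoin

base = "https://in.indeed.com"
# Relative hrefs as they come out of the pagination block
urls = [
    "/jobs?q=software+engineer+&l=Kerala&start=10",
    "/jobs?q=software+engineer+&l=Kerala&start=20",
]

# urljoin resolves each relative path against the base URL
websites = [urljoin(base, u) for u in urls]
print(websites[0])  # https://in.indeed.com/jobs?q=software+engineer+&l=Kerala&start=10
```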