Python / BeautifulSoup: returning IDs with Indeed jobs
I have a basic Indeed web scraper set up with BeautifulSoup, and I can return the job title and company of each job from the first page of the Indeed job search URL I'm using:
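import requests
from bs4 import BeautifulSoup

def extract(headers=None):
    # headers: dict of request headers (the values were not shown in the post)
    url = 'https://www.indeed.com/jobs?q=Network%20Architect&start=&vjk=e8bcf3fbe7498a5f'
    # Pass headers by keyword; the second positional argument of requests.get is params
    r = requests.get(url, headers=headers)
    #return r.status_code
    soup = BeautifulSoup(r.content, 'html.parser')
    return soup

def transform(soup):
    # Each element with class "result" is one job posting
    for job in soup.select('.result'):
        title = job.select_one('.jobTitle').get_text(' ')
        company = job.find(class_='companyName').text
        print(f'title: {title}')
        print(f'company: {company}')

c = extract()
transform(c)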
Output:
title: new Network Architect
company: MetroSys
title: new Network Architect
company: Federal Working Group
title: new REMOTE Network Architect - CCIE
company: CyberCoders
title: new Network Architect SME
company: Emergere Technologies
title: Cybersecurity Apprentice
company: IBM
title: Network Engineer (NEW YORK) ONSITE ONLY NEED TO APPLY
company: QnA Tech
title: new Network Architect
company: EdgeCo Holdings
title: new Network Architect
company: JKL Technologies, Inc.
title: Network Architect
company: OTELCO
title: new Network Architect
company: Illinois Municipal Retirement Fund (IMRF)
title: new Network Architect, Google Enterprise Network
company: Google
title: new Network Infrastructure Lead Or Architect- Menlo Park CA -Ful...
company: Xforia Technologies
title: Network Architect
company: Fairfax County Public Schools
title: new Network Engineer
company: Labatt Food Service
title: new Network Architect (5056-3)
company: JND
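It looks like Indeed has a unique ID for each job. I am trying to access this ID along with each job so that I can use it later in a SQL database and avoid adding duplicate jobs. I can access the job IDs with the following code: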
for tag in soup.find_all('a', class_='result'):
    print(tag.get('id'))
Output:
job_a678f3bfc20cb753
job_eef3e4c10d979c1e
job_faedfdbadab2f19b
job_190a6b55b99c78f0
job_32d20498e8fbf692
job_aeaabb9af50f36d6
job_92432325a24212d0
job_819ce9d7ec6e5890
job_d979bf7daac01528
job_0879369d166a9b94
job_2d377bc2e5085ad7
job_bb8e5d0f651c072f
job_dcff58df466f1ecb
job_f70d55871eb1df3f
sj_54a09e5e34e08948
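When I try to add this to my working code, I can access the IDs, but they are either all returned together instead of one at a time alongside the corresponding job, or the full list is returned once per job posting (so instead of 15 in total I get 15x15). I tried it this way: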
def transform(soup):
    for job in soup.select('.result'):
        title = job.select_one('.jobTitle').get_text(' ')
        company = job.find(class_='companyName').text
        tag = soup.find_all('a', class_='result')
        for x in tag:
            print(x.get('id'))
        print(f'title: {title}')
        print(f'company: {company}')
and this way:
def transform(soup):
    for job in soup.select('.result'):
        title = job.select_one('.jobTitle').get_text(' ')
        company = job.find(class_='companyName').text
        tag = soup.find_all('a', class_='result')
        for x in tag:
            print(x.get('id'))
            print(f'title: {title}')
            print(f'company: {company}')
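The second approach comes closest to the result I want, but instead of 1 title, 1 company, and 1 ID per posting, for 15 postings in total, I get every ID returned for each posting, so 15x15.
The desired result is simply to have it returned as: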
title
company
ID
title
company
ID
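The 15x15 effect can be reproduced on a tiny page. The sketch below uses made-up HTML with three postings (the markup, IDs, and the function name `nested_ids` are illustrative assumptions, not Indeed's real page) and shows that calling `soup.find_all` inside the loop searches the whole document again on every pass, regardless of the current `job`:

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for a results page with three postings.
html = """
<a id="job_1" class="result"><span class="jobTitle">Title A</span></a>
<a id="job_2" class="result"><span class="jobTitle">Title B</span></a>
<a id="job_3" class="result"><span class="jobTitle">Title C</span></a>
"""
soup = BeautifulSoup(html, "html.parser")


def nested_ids(soup):
    """Collect IDs the way the nested-loop attempts do."""
    collected = []
    for job in soup.select(".result"):
        # This search starts from `soup`, not from `job`, so it finds
        # every posting on the page on each pass of the outer loop.
        for tag in soup.find_all("a", class_="result"):
            collected.append(tag.get("id"))
    return collected


ids = nested_ids(soup)
print(len(ids))  # 3 postings x 3 matches per pass
```

With 15 postings the same pattern yields 15 x 15 = 225 IDs, which matches the behavior described above.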
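You already have the job element in the loop and are extracting information from it, so why not simply take the id from it as well -> job.get('id')
This should work for you: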
def transform(soup):
    for job in soup.select('.result'):
        title = job.select_one('.jobTitle').get_text(' ')
        company = job.find(class_='companyName').text
        # Read the id from the current job element; named job_id to avoid
        # shadowing the built-in id()
        job_id = job.get('id')
        print(f'title: {title}')
        print(f'company: {company}')
        print(f'id: {job_id}')
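Since the stated goal is to avoid adding duplicate jobs to a SQL database, here is a minimal sketch of one way to do that, assuming SQLite via the standard-library `sqlite3` module and assuming `transform` is changed to collect dicts instead of printing. The table name `jobs`, the function `save_jobs`, and the sample records are all hypothetical. Making the job ID the primary key and using `INSERT OR IGNORE` lets the database skip rows it has already seen:

```python
import sqlite3


def save_jobs(conn, jobs):
    """Insert job records, skipping any whose id is already stored.

    Returns the number of rows actually inserted.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "id TEXT PRIMARY KEY, title TEXT, company TEXT)"
    )
    before = conn.total_changes
    # OR IGNORE silently skips rows that would violate the primary key.
    conn.executemany(
        "INSERT OR IGNORE INTO jobs (id, title, company) "
        "VALUES (:id, :title, :company)",
        jobs,
    )
    conn.commit()
    return conn.total_changes - before


# Hypothetical records shaped like what transform() could return.
sample = [
    {"id": "job_aaa", "title": "Network Architect", "company": "Example Co"},
    {"id": "job_bbb", "title": "Network Engineer", "company": "Sample Inc"},
]

conn = sqlite3.connect(":memory:")
first_run = save_jobs(conn, sample)   # both rows are new
second_run = save_jobs(conn, sample)  # same ids again, nothing is added
print(first_run, second_run)
```

Running the scraper repeatedly then only adds postings whose ID is not yet in the table. Other databases have their own equivalents of `INSERT OR IGNORE`, so the statement would need adapting outside SQLite.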