Python / BeautifulSoup: return IDs with Indeed jobs

Question

I have set up a basic Indeed web scraper using BeautifulSoup, and I can return the job title and company of each job from the first page of the Indeed job search URL I am using:

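import requests
from bs4 import BeautifulSoup

# Placeholder value: the original post does not show the headers it used.
headers = {'User-Agent': 'Mozilla/5.0'}

def extract():
    url = 'https://www.indeed.com/jobs?q=Network%20Architect&start=&vjk=e8bcf3fbe7498a5f'
    # Pass headers by keyword; the second positional argument of requests.get is params.
    r = requests.get(url, headers=headers)
    #return r.status_code
    soup = BeautifulSoup(r.content, 'html.parser')
    return soup

def transform(soup):
    for job in soup.select('.result'):
        title = job.select_one('.jobTitle').get_text(' ')
        company = job.find(class_='companyName').text
        print(f'title: {title}')
        print(f'company: {company}')

c = extract()
transform(c)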

Output:

title: new Network Architect
company: MetroSys
title: new Network Architect
company: Federal Working Group
title: new REMOTE Network Architect - CCIE
company: CyberCoders
title: new Network Architect SME
company: Emergere Technologies
title: Cybersecurity Apprentice
company: IBM
title: Network Engineer (NEW YORK) ONSITE ONLY NEED TO APPLY
company: QnA Tech
title: new Network Architect
company: EdgeCo Holdings
title: new Network Architect
company: JKL Technologies, Inc.
title: Network Architect
company: OTELCO
title: new Network Architect
company: Illinois Municipal Retirement Fund (IMRF)
title: new Network Architect, Google Enterprise Network
company: Google
title: new Network Infrastructure Lead Or Architect- Menlo Park CA -Ful...
company: Xforia Technologies
title: Network Architect
company: Fairfax County Public Schools
title: new Network Engineer
company: Labatt Food Service
title: new Network Architect (5056-3)
company: JND

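It appears that Indeed has a unique ID for each job. I am trying to access this ID along with each job so that I can use it later in a SQL database and avoid adding duplicate jobs. I can access the job IDs with the following code: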

for tag in soup.find_all('a', class_='result'):
    print(tag.get('id'))

Output:

job_a678f3bfc20cb753
job_eef3e4c10d979c1e
job_faedfdbadab2f19b
job_190a6b55b99c78f0
job_32d20498e8fbf692
job_aeaabb9af50f36d6
job_92432325a24212d0
job_819ce9d7ec6e5890
job_d979bf7daac01528
job_0879369d166a9b94
job_2d377bc2e5085ad7
job_bb8e5d0f651c072f
job_dcff58df466f1ecb
job_f70d55871eb1df3f
sj_54a09e5e34e08948

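When I try to implement this in my working code, I can access the IDs, but they are all returned together rather than one at a time alongside the corresponding job, that is, one per job posting (instead of 15 in total I get 15x15). I have tried it this way: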

def transform(soup):
    for job in soup.select('.result'):
        title = job.select_one('.jobTitle').get_text(' ')
        company = job.find(class_='companyName').text 
        tag = soup.find_all('a', class_='result')
        for x in tag:
            print(x.get('id'))
        print(f'title: {title}')
        print(f'company: {company}')

And this way:

def transform(soup):
    for job in soup.select('.result'):
        title = job.select_one('.jobTitle').get_text(' ')
        company = job.find(class_='companyName').text 
        tag = soup.find_all('a', class_='result')
        for x in tag:
            print(x.get('id'))
            print(f'title: {title}')
            print(f'company: {company}')

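The second approach comes closest to the result I want, but instead of getting 1 title, 1 company, and 1 ID for each of the 15 job postings, I get every ID returned for each job posting, so 15x15.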

The desired result is simply to have it returned as:

title
company
ID
title
company
ID

Answer 1

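You already have the job element in hand and are extracting information from it, so why not simply extract the id from it as well? job.get('id') should work for you: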

def transform(soup):
    for job in soup.select('.result'):
        title = job.select_one('.jobTitle').get_text(' ')
        company = job.find(class_='companyName').text 
        job_id = job.get('id')  # read the id from the current element; avoids shadowing the built-in id()
        print(f'title: {title}')
        print(f'company: {company}')
        print(f'id: {job_id}')
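
The 15x15 output in the question comes from calling soup.find_all inside the loop: every pass over a single job searches the whole page again. Reading the id from the current job element avoids that. The following is a minimal sketch of that idea together with the duplicate check the question mentions. It rests on several assumptions: SAMPLE_HTML is made-up markup that only mimics the structure described in the question (the real Indeed page may differ), and the save helper, the jobs table, and the use of sqlite3 with INSERT OR IGNORE are illustrative choices, not something taken from the original post.

```python
import sqlite3
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the structure described in the question;
# the real Indeed page may look different.
SAMPLE_HTML = """
<a class="result" id="job_a678f3bfc20cb753">
  <h2 class="jobTitle"><span>new</span><span>Network Architect</span></h2>
  <span class="companyName">MetroSys</span>
</a>
<a class="result" id="job_eef3e4c10d979c1e">
  <h2 class="jobTitle"><span>Network Architect</span></h2>
  <span class="companyName">Federal Working Group</span>
</a>
"""

def transform(soup):
    """Return one (id, title, company) tuple per job element."""
    rows = []
    for job in soup.select('.result'):
        title = job.select_one('.jobTitle').get_text(' ', strip=True)
        company = job.find(class_='companyName').get_text(strip=True)
        # The id is read from the element currently being processed,
        # not from a fresh search of the whole page.
        rows.append((job.get('id'), title, company))
    return rows

def save(conn, rows):
    """Insert rows; ids that are already stored are skipped (illustrative helper)."""
    conn.execute(
        'CREATE TABLE IF NOT EXISTS jobs '
        '(id TEXT PRIMARY KEY, title TEXT, company TEXT)'
    )
    conn.executemany('INSERT OR IGNORE INTO jobs VALUES (?, ?, ?)', rows)
    conn.commit()

rows = transform(BeautifulSoup(SAMPLE_HTML, 'html.parser'))
conn = sqlite3.connect(':memory:')
save(conn, rows)
save(conn, rows)  # the second call adds nothing because the ids already exist
print(conn.execute('SELECT COUNT(*) FROM jobs').fetchone()[0])
```

Making the id column the primary key lets the database reject repeats itself, so the scraper does not need to query for existing rows before each insert.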

python

beautifulsoup