How to scrape script tags with type application/ld+json and the data-react-helmet attribute using BeautifulSoup?
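I am new to web scraping with Python. I have written code that uses Selenium and BeautifulSoup to extract data from a job portal website. My process is:
- Scrape all job posting links from the job portal
- Loop over each job posting link and scrape its details.
I scrape the details by calling BeautifulSoup's find_all method on script tags with type = 'application/ld+json' and the data-react-helmet attribute, but I get the error "list index out of range". Does anyone know how to fix it?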
    import json

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    # URL_job_list holds the job posting links collected in the first step (not shown).
    job_main_data = pd.DataFrame()
    for i, url in enumerate(URL_job_list):
        headers = {
            'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
            'referrer': 'https://google.com',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'en-US,en;q=0.9',
            'Pragma': 'no-cache',
        }
        response = requests.get(url=url, headers=headers)
        soup = BeautifulSoup(response.text, 'lxml')
        script_tags = soup.find_all('script', attrs={'data-react-helmet': 'true', 'type': 'application/ld+json'})
        # "list index out of range" is raised on the next line when script_tags is empty.
        metadata = script_tags[-1].text
        temp_dict = {}
        try:
            job_info_json = json.loads(metadata, strict=False)
            try:
                jobID = job_info_json['identifier']['value']
                temp_dict['Job ID'] = jobID
                print('Job ID = ' + jobID)
            except AttributeError:
                jobID = ''
            try:
                jobTitle = job_info_json['title']
                temp_dict['Job Title'] = jobTitle
                print('Title = ' + jobTitle)
            except AttributeError:
                jobTitle = ''
            try:
                occupationalCategory = job_info_json['occupationalCategory']
                temp_dict['occupationalCategory'] = occupationalCategory
                print('Occupational Category = ' + occupationalCategory)
            except AttributeError:
                occupationalCategory = ''
            temp_dict['Job Link'] = URL_job_list
            job_main_data = job_main_data.append(temp_dict, ignore_index=True)
        except json.JSONDecodeError:
            print("Empty response")
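The "list index out of range" message can only come from script_tags[-1]: find_all returns an empty list when no tag matches, and indexing an empty list raises IndexError. A missing dictionary key also raises KeyError rather than AttributeError, so the inner except clauses would not catch it. Below is a minimal sketch of guarding both cases using only the standard library. The helper names parse_last_ld_json and extract_job_fields are hypothetical, and the field names are the ones used in the question.

```python
import json


def parse_last_ld_json(script_texts):
    """Parse the last script body as JSON; return None if there is none or it is invalid."""
    if not script_texts:
        # No matching <script> tag was found, so there is nothing to index.
        return None
    try:
        data = json.loads(script_texts[-1], strict=False)
    except json.JSONDecodeError:
        return None
    # Only a JSON object can be queried by field name below.
    return data if isinstance(data, dict) else None


def extract_job_fields(job_info):
    """Pick the fields from the question, using '' for anything missing."""
    identifier = job_info.get('identifier')
    if not isinstance(identifier, dict):
        identifier = {}
    return {
        'Job ID': identifier.get('value', ''),
        'Job Title': job_info.get('title', ''),
        'occupationalCategory': job_info.get('occupationalCategory', ''),
    }


# Hypothetical script bodies: an empty result and one small JSON-LD document.
print(parse_last_ld_json([]))
info = parse_last_ld_json(['{"title": "Data Engineer", "identifier": {"value": "123"}}'])
print(extract_job_fields(info))
```

With BeautifulSoup, the input would be [tag.text for tag in script_tags]. A None result means the static HTML of that page contained no matching script tag.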
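The data is loaded dynamically by JavaScript from an API call that returns JSON, so you can get all the data you want directly from that response. Below is an example of how to extract data from the API using only the requests module: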
    import requests
    import json

    # Two queries against the "job_postings" index; the first returns the job hits.
    payload = {
        "requests": [
            {
                "indexName": "job_postings",
                "params": "query=&hitsPerPage=20&maxValuesPerFacet=1000&page=0&facets=%5B%22*%22%2C%22city.work_country_name%22%2C%22position.name%22%2C%22industries.vertical_name%22%2C%22experience%22%2C%22job_type.name%22%2C%22is_salary_visible%22%2C%22has_equity%22%2C%22currency.currency_code%22%2C%22salary_min%22%2C%22taxonomies.slug%22%5D&tagFilters=&facetFilters=%5B%5B%22city.work_country_name%3AIndonesia%22%5D%5D"
            },
            {
                "indexName": "job_postings",
                "params": "query=&hitsPerPage=1&maxValuesPerFacet=1000&page=0&attributesToRetrieve=%5B%5D&attributesToHighlight=%5B%5D&attributesToSnippet=%5B%5D&tagFilters=&analytics=false&clickAnalytics=false&facets=city.work_country_name"
            }
        ]
    }
    headers = {'content-type': 'application/x-www-form-urlencoded'}
    api_url = "https://219wx3mpv4-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%203.30.0%3BJS%20Helper%202.26.1&x-algolia-application-id=219WX3MPV4&x-algolia-api-key=b528008a75dc1c4402bfe0d8db8b3f8e"

    # The request body is the payload serialized as a JSON string.
    jsonData = requests.post(api_url, data=json.dumps(payload), headers=headers).json()
    # print(jsonData)

    # Print a few fields from each hit of the first query.
    for item in jsonData['results'][0]['hits']:
        title = item['_highlightResult']['title']['value']
        company = item['_highlightResult']['company']['name']['value']
        skill = item['_highlightResult']['job_skills'][0]['name']['value']
        salary_max = item['salary_max']
        salary_min = item['salary_min']
        print(title)
        print(company)
        print(skill)
        print(salary_max)
        print(salary_min)
Output:
Corporate PR
Rocketindo
Sales Strategy & Management
12000000
7000000
Social Media Specialist
Rocketindo
Content Marketing
12000000
7000000
Performance Marketing Analyst (Mama's Choice)
The Parent Inc (theAsianparent)
Marketing Strategy
12000000
5000000
Business Development (Associate Consultant) - CRM
Mekari (PT. Mid Solusi Nusantara)
Business Development & Partnerships
7000000
5000000
Account Payable
Ritase
Corporate Finance
0
0
Data Engineer
Topremit
Databases
0
0
Public Relation KOL
Rocketindo
Business Development & Partnerships
7000000
5000000
Graphic Designer
Rocketindo
Adobe Illustrator
12000000
7000000
Yogyakarta City Coordinator
Deliveree Indonesia
Business Operations
6000000
5250000
Marketing Manager
Deliveree Indonesia
Marketing Strategy
0
0
Graphic Designer
Deliveree Indonesia
Graphic Design
6000000
5250000
Quality Assurance
PT Rekeningku Dotcom Indonesia
Javascript
10000000
4500000
Internship Program
TADA
Attention to Detail
3700000
3000000
Product Management Support
Hangry
Data Warehouse
0
0
Content Writer
Bobobox Indonesia
Copywriting
0
0
UX Researcher
Bobobox Indonesia
UI/UX Design
0
0
UX Copywriter
Bobobox Indonesia
Problem Solving
0
0
Internship HR (Recruitment)
PT Formasi Agung Selaras (Famous Allstars)
Human Resources
1500000
1000000
Fullstack Developer - Banking Industry
SIGMATECH
React.js
12000000
8000000
REACT NATIVE DEVELOPER
BGT Solution
MySQL
16000000
6000000
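The example above fetches only page=0 and indexes job_skills[0] directly, which would raise IndexError for a hit with no skills listed. The following sketch shows two possible follow-ups: building the params string for another page, and flattening a hit into a row while tolerating missing fields. It assumes the API honors the page and hitsPerPage values in params the same way for later pages, and it drops the facets arguments for brevity. build_params, dig, hit_to_row, and sample_hit are hypothetical names and data, not part of the original answer.

```python
import json
from urllib.parse import urlencode


def build_params(page, hits_per_page=20, country='Indonesia'):
    """Build a URL-encoded params string for one page (shortened: no facets)."""
    return urlencode({
        'query': '',
        'hitsPerPage': hits_per_page,
        'page': page,
        'facetFilters': json.dumps([[f'city.work_country_name:{country}']]),
    })


def dig(obj, *path, default=''):
    """Follow keys and indexes in path; return default if any step is missing."""
    for key in path:
        try:
            obj = obj[key]
        except (KeyError, IndexError, TypeError):
            return default
    return obj


def hit_to_row(hit):
    """Flatten one hit into a plain dict with the fields printed in the answer."""
    highlight = hit.get('_highlightResult', {})
    return {
        'title': dig(highlight, 'title', 'value'),
        'company': dig(highlight, 'company', 'name', 'value'),
        'skill': dig(highlight, 'job_skills', 0, 'name', 'value'),
        'salary_min': hit.get('salary_min', 0),
        'salary_max': hit.get('salary_max', 0),
    }


# Hypothetical hit with an empty skills list (assumed shape, not real API output).
sample_hit = {
    '_highlightResult': {
        'title': {'value': 'Data Engineer'},
        'company': {'name': {'value': 'Topremit'}},
        'job_skills': [],
    },
    'salary_max': 0,
    'salary_min': 0,
}

print(build_params(page=1))
print(hit_to_row(sample_hit))
```

The rows returned by hit_to_row are plain dicts, so a list of them can be passed to pd.DataFrame to get the table the question was building.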