How to scrape hidden class data using selenium and beautiful soup
I am trying to scrape content from a JavaScript-enabled web page. I need to extract data from a table on that site, but every row of the table has a button (an arrow) that reveals additional information for that row.
I need to extract that additional description for each row. Inspecting the page shows that the arrow content of every row belongs to the same class, but that class is hidden from the page source; it is only visible in the inspector. That is the data I am trying to scrape from the webpage.
I used selenium and beautiful soup. I can scrape the table data, but not the content behind those arrows: my Python code returns an empty list for the arrow's class, although it works for the class of the normal table data.
from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Firefox()
browser.get('https://projects.sfchronicle.com/2020/layoff-tracker/')
html_source = browser.page_source
soup = BeautifulSoup(html_source, 'html.parser')

# find_all() returns a list of tags, so iterate to read the text
data = soup.find_all('div', class_="sc-fzoLsD jxXBhc rdt_ExpanderRow")
for row in data:
    print(row.text)
The content you are interested in is generated when you click the button, so you will want to find that button first. There are a million ways to do this, but I would suggest the following:
from selenium.webdriver.common.by import By

# find_element returns the first match; use find_elements and loop to expand every row
element = driver.find_element(By.XPATH, '//button')
For your specific case, you could also use:
element = driver.find_element(By.CSS_SELECTOR, 'button[class|="sc"]')
Once you have the button element, you can do:
element.click()
Parsing the page after this should give you the JavaScript-generated content you are looking for.
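As a sketch of that click-then-parse flow, the extraction step can be separated into a small helper that collects the text of every `rdt_ExpanderRow` div. The helper below uses only the standard-library `html.parser` (instead of BeautifulSoup) so it can be shown without a browser; after clicking the arrows with Selenium you would feed it `browser.page_source`:

```python
from html.parser import HTMLParser

class ExpanderText(HTMLParser):
    """Collect the text inside <div> tags whose class contains 'rdt_ExpanderRow'."""
    def __init__(self):
        super().__init__()
        self.depth = 0   # > 0 while we are inside a matching div
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # track nested divs so we close the row at the right place
            self.depth += 1 if tag == 'div' else 0
        elif tag == 'div' and 'rdt_ExpanderRow' in (dict(attrs).get('class') or ''):
            self.depth = 1
            self.rows.append('')

    def handle_endtag(self, tag):
        if self.depth and tag == 'div':
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.rows[-1] += data

# With Selenium you would run: parser.feed(browser.page_source)
parser = ExpanderText()
parser.feed('<div class="sc-x rdt_ExpanderRow"><span>Extra info</span></div>')
print(parser.rows)   # ['Extra info']
```

This is only a sketch: the class name `rdt_ExpanderRow` is taken from the question, and the generated `sc-…` prefixes may change whenever the site rebuilds its styled components.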
To print the hidden data, you can use this example:
import re
import json
import requests
from bs4 import BeautifulSoup

url = 'https://projects.sfchronicle.com/2020/layoff-tracker/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# The data lives in a webpack bundle; find the "commons-" script the page links to
data_url = 'https://projects.sfchronicle.com' + soup.select_one('link[href*="commons-"]')['href']

# Pull the second JSON.parse('...') payload out of the minified source
data = re.findall(r'n\.exports=JSON\.parse\(\'(.*?)\'\)', requests.get(data_url).text)[1]
data = json.loads(data.replace(r"\'", "'"))

# uncomment this to see all data:
# print(json.dumps(data, indent=4))

for d in data[4:]:
    print('{:<50}{:<10}{:<30}{:<30}{:<30}{:<30}{:<30}'.format(*d.values()))
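The regex line is the fragile part of this approach: it assumes the site's bundle still embeds its dataset as `n.exports=JSON.parse('…')`. A minimal offline illustration of that extraction, using a made-up bundle string rather than the real file:

```python
import re
import json

# A made-up snippet imitating the shape of the minified bundle
bundle = r"""n.exports=JSON.parse('[{"Company":"Acme","Layoffs":"12"}]')"""

# Capture the string literal handed to JSON.parse, then decode it;
# the replace() undoes the \' escaping used inside the JS string literal
match = re.search(r"n\.exports=JSON\.parse\('(.*?)'\)", bundle)
records = json.loads(match.group(1).replace(r"\'", "'"))
print(records[0]['Company'])   # Acme
```

If the site renames the bundle or changes how the data is exported, the `link[href*="commons-"]` selector and this pattern both need updating.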
Prints:
Company Layoffs City County Month Industry Company description
Tesla (Temporary layoffs. Factory reopened) 11083 Fremont Alameda County April Industrial Car maker
Bon Appetit Management Co. 3015 San Francisco San Francisco County April Food Food supplier
GSW Arena LLC-Chase Center 1720 San Francisco San Francisco County May Sports Arena vendors
YMCA of Silicon Valley 1657 Santa Clara Santa Clara County May Sports Gym
Nutanix Inc. (Temporary furlough of 2 weeks) 1434 San Jose Santa Clara County April Tech Cloud computing
TeamSanJose 1304 San Jose Santa Clara County April Travel Tourism bureau
San Francisco Giants 1200 San Francisco San Francisco County April Sports Stadium vendors
Lyft 982 San Francisco San Francisco County April Tech Ride hailing
YMCA of San Francisco 959 San Francisco San Francisco County May Sports Gym
Hilton San Francisco Union Square 923 San Francisco San Francisco County April Travel Hotel
Six Flags Discovery Kingdom 911 Vallejo Solano County June Entertainment Amusement park
San Francisco Marriott Marquis 808 San Francisco San Francisco County April Travel Hotel
Aramark 777 Oakland Alameda County April Food Food supplier
The Palace Hotel 774 San Francisco San Francisco County April Travel Hotel
Back of the House Inc 743 San Francisco San Francisco County April Food Restaurant
DPR Construction 715 Redwood City San Mateo County April Real estate Construction
...and so on.