Beautiful Soup: Extracting Information
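I am trying to extract a chemical's name, its occurrences/uses, and the date it was added, using Beautiful Soup.
Here is an example of one chemical from the list:
https://oehha.ca.gov/chemicals/abiraterone-acetate
Can anyone help me? Thanks a lot!
My expected output would be: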
Abiraterone acetate from L253
<h1 class="title" id="page-title"><span class="ca-gov-icon-arrow-down"></span> Abiraterone acetate </h1>
A CYP17 inhibitor indicated in combination with prednisone for the treatment of patients with metastatic castration-resistant prostate cancer from L265
<h3 class="label-above">Occurence(s)/Use(s)</h3><p>A CYP17 inhibitor indicated in combination with prednisone for the treatment of patients with metastatic castration-resistant prostate cancer.</p>
02/02/2016 from L266
<h3 class="label-above">Date Added</h3><span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="2016-02-02T00:00:00-08:00">02/02/2016</span> </div>
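Note that the site is protected by the Incapsula firewall, which is meant to block bots and browser automation.
Using Selenium, you can achieve your goal as follows: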
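from selenium import webdriver
from bs4 import BeautifulSoup

url = 'https://oehha.ca.gov/chemicals/abiraterone-acetate'
browser = webdriver.Firefox()
try:
    # get() returns None, so there is nothing to assign; read the HTML from page_source
    browser.get(url)
    soup = BeautifulSoup(browser.page_source, 'html.parser')
finally:
    # quit() ends the whole driver session, even if loading the page failed
    browser.quit()

# Chemical name: the page's <h1 class="title">
title = soup.find('h1', {'class': 'title'})
print(title.text.strip())

# Uses: the first <p> after the "Occurence(s)/Use(s)" label (spelled as on the site)
details = soup.find(string='Occurence(s)/Use(s)').find_next('p').contents[0]
print(details)

# Date added: the <span class="date-display-single">
date = soup.find('span', {'class': 'date-display-single'})
print(date.text)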
Output:
Abiraterone acetate
A CYP17 inhibitor indicated in combination with prednisone for the treatment of patients with metastatic castration-resistant prostate cancer.
02/02/2016
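The parsing step can also be tried without a browser, which helps when checking selectors. The following is only a sketch: `SAMPLE_HTML` is assembled from the fragments quoted in the question (the wrapping `<div>` tags are an assumption about the real page), and `extract_chemical_info` and `text_after_label` are hypothetical helper names, not part of Beautiful Soup. It returns `None` for any field it cannot find instead of raising an `AttributeError`.

```python
from bs4 import BeautifulSoup

# Fragments copied from the question; the surrounding <div> tags are assumed.
SAMPLE_HTML = """
<h1 class="title" id="page-title"><span class="ca-gov-icon-arrow-down"></span> Abiraterone acetate </h1>
<div><h3 class="label-above">Occurence(s)/Use(s)</h3><p>A CYP17 inhibitor indicated in combination with prednisone for the treatment of patients with metastatic castration-resistant prostate cancer.</p></div>
<div><h3 class="label-above">Date Added</h3><span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="2016-02-02T00:00:00-08:00">02/02/2016</span> </div>
"""


def extract_chemical_info(html):
    """Return name, uses and date added as a dict; missing fields are None."""
    soup = BeautifulSoup(html, "html.parser")

    def text_after_label(label, tag):
        # Find the <h3> whose text is exactly `label`, then the next `tag` after it.
        heading = soup.find("h3", string=label)
        if heading is None:
            return None
        node = heading.find_next(tag)
        return node.get_text(strip=True) if node else None

    title = soup.find("h1", class_="title")
    return {
        "name": title.get_text(strip=True) if title else None,
        # The label is misspelled on the site, so the search string must match it.
        "uses": text_after_label("Occurence(s)/Use(s)", "p"),
        "date_added": text_after_label("Date Added", "span"),
    }


if __name__ == "__main__":
    for field, value in extract_chemical_info(SAMPLE_HTML).items():
        print(f"{field}: {value}")
```

With Selenium, the same function could be called as `extract_chemical_info(browser.page_source)`. Anchoring on the `<h3>` label text rather than on element position is a design choice: it keeps working if the page reorders its fields, but it breaks if the site ever corrects the label's spelling.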