Extracting information with Beautiful Soup

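I am trying to extract a chemical's name, its occurrence(s)/use(s), and the date it was added, using Beautiful Soup. Here is an example of one of the chemicals on the list: https://oehha.ca.gov/chemicals/abiraterone-acetate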

Can someone help me? Thank you very much!

My expected output would be:

Abiraterone acetate (from line 253)
<h1 class="title" id="page-title"><span class="ca-gov-icon-arrow-down"></span> Abiraterone acetate </h1>

A CYP17 inhibitor indicated in combination with prednisone for the treatment of patients with metastatic castration-resistant prostate cancer. (from line 265)
<h3 class="label-above">Occurence(s)/Use(s)</h3><p>A CYP17 inhibitor indicated in combination with prednisone for the treatment of patients with metastatic castration-resistant prostate cancer.</p>

02/02/2016 (from line 266)
<h3 class="label-above">Date Added</h3><span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="2016-02-02T00:00:00-08:00">02/02/2016</span>

Note that this site is protected by an Incapsula firewall, which is meant to block bots and browser automation.

Using Selenium, you can get what you are after like this:

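from selenium import webdriver
from bs4 import BeautifulSoup

url = 'https://oehha.ca.gov/chemicals/abiraterone-acetate'

# Load the page in a real browser and hand the rendered HTML to Beautiful Soup.
browser = webdriver.Firefox()
try:
    browser.get(url)  # get() returns None, so there is nothing to assign
    soup = BeautifulSoup(browser.page_source, 'html.parser')
finally:
    browser.quit()  # end the browser session even if loading fails

# Chemical name: the page's <h1 class="title">
title = soup.find('h1', {'class': 'title'})
print(title.text.strip())
# Uses: first <p> after the "Occurence(s)/Use(s)" label (the site's own spelling)
details = soup.find(string='Occurence(s)/Use(s)').find_next('p').contents[0]
print(details)
# Date added: <span class="date-display-single">
date = soup.find('span', {'class': 'date-display-single'})
print(date.text)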

Output:

Abiraterone acetate
A CYP17 inhibitor indicated in combination with prednisone for the treatment of patients with metastatic castration-resistant prostate cancer.
02/02/2016
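
If you need to do this for more than one chemical, it may help to wrap the three lookups in a function that takes the page HTML and returns a dict. The following is only a sketch: `extract_chemical` and `text_of` are names I made up, the sample markup is just the three snippets from the question, and I am assuming other chemical pages use the same tags and classes. It also returns None for a missing field instead of raising, and turns the date into a `datetime.date`.

```python
from datetime import datetime
from bs4 import BeautifulSoup


def extract_chemical(html):
    """Return name, uses and date added from page HTML (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')

    def text_of(tag):
        # Missing elements become None instead of raising AttributeError.
        return tag.get_text(strip=True) if tag else None

    label = soup.find('h3', string='Occurence(s)/Use(s)')  # the site's own spelling
    date_text = text_of(soup.find('span', class_='date-display-single'))
    return {
        'name': text_of(soup.find('h1', class_='title')),
        'uses': text_of(label.find_next('p')) if label else None,
        'date_added': (datetime.strptime(date_text, '%m/%d/%Y').date()
                       if date_text else None),
    }


# Markup copied from the question, so no browser is needed to try this out.
sample = '''
<h1 class="title" id="page-title"><span class="ca-gov-icon-arrow-down"></span> Abiraterone acetate </h1>
<h3 class="label-above">Occurence(s)/Use(s)</h3><p>A CYP17 inhibitor indicated in combination with prednisone for the treatment of patients with metastatic castration-resistant prostate cancer.</p>
<h3 class="label-above">Date Added</h3><span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="2016-02-02T00:00:00-08:00">02/02/2016</span>
'''
info = extract_chemical(sample)
print(info)
```

With Selenium you would pass `browser.page_source` to the function instead of `sample`.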