Python web scraping with beautifulsoup - can't extract Principal Investigator from Clinicaltrials.gov

Question

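(Disclaimer: I'm new to Python and web scraping, but I'm doing my best to learn.)

I'm trying to extract 3 key data points from studies on clinicaltrials.gov. They have an API, but the API doesn't capture what I need. I want to get (1) a short description of the study, (2) the Principal Investigator (PI), and (3) some keywords associated with the study. I believe my code captures 1 and 3, but not 2. I can't seem to figure out why I'm not getting the Principal Investigator's name. Here are the two sites in my code: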

https://clinicaltrials.gov/ct2/show/NCT03530579
https://clinicaltrials.gov/ct2/show/NCT03436992

Here is my code (I know the PI code is wrong, but I wanted to show that I tried):

import pandas as pd
import requests
from bs4 import BeautifulSoup
import csv   

fields=['PI','Project_Summary', 'Keywords']
with open(r'test.csv', 'a') as f:
     writer = csv.writer(f)
     writer.writerow(fields)

urls = ['https://clinicaltrials.gov/ct2/show/NCT03436992','https://clinicaltrials.gov/ct2/show/NCT03530579']
for url in urls:

     response = requests.get(url)
     soup = BeautifulSoup(response.content, 'html.parser')
     #get_keywords
     for rows in soup.find_all("td"):
          k = rows.get_text()     
          Keywords = k.strip()
     #get Principal Investigator   
     PI = soup.find_all('padding:1ex 1em 0px 0px;white-space:nowrap;')

     #Get description    
     Description = soup.find(class_='ct-body3 tr-indent2').get_text()
     d = {'Summary2':[PI,Description,Keywords]} 

     df = pd.DataFrame(d)
     print (df)
     import csv   
     fields=[PI,Description, Keywords]
     with open(r'test.csv', 'a') as f:
          writer = csv.writer(f)
          writer.writerow(fields)

Answer 1

You may be able to use the following selector:

i.e. PI = soup.select_one('.tr-table_cover [headers=name]').text

import requests
from bs4 import BeautifulSoup  
urls = ['https://clinicaltrials.gov/ct2/show/NCT03530579', 'https://clinicaltrials.gov/ct2/show/NCT03436992','https://clinicaltrials.gov/show/NCT03834376']
with requests.Session() as s:
    for url in urls:
        r = s.get(url)
        soup = BeautifulSoup(r.text, "lxml")
        item = soup.select_one('.tr-table_cover [headers=name]').text if soup.select_one('.tr-table_cover [headers=name]') is not None else 'No PI'
        print(item)

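The . is a class selector and the [] is an attribute selector. The space between them is a descendant combinator, which specifies that the element matched on the right is a descendant (not necessarily a direct child) of the element matched on the left.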
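To see how the selector behaves without hitting the network, here is a small sketch run against a hand-written HTML fragment. The markup, the names in it, and the helper extract_pi are made up for illustration; they only imitate the structure assumed above (a container with class tr-table_cover holding a td with headers="name"), and the real page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the assumed page structure; not copied from the site.
html = """
<div class="tr-table_cover">
  <table>
    <tr>
      <td headers="role">Principal Investigator:</td>
      <td headers="name">Jane Doe, MD</td>
      <td headers="affiliation">Example University</td>
    </tr>
  </table>
</div>
<table><tr><td headers="name">Outside the container</td></tr></table>
"""


def extract_pi(soup, default="No PI"):
    """Return the text of the first [headers=name] cell inside .tr-table_cover."""
    cell = soup.select_one(".tr-table_cover [headers=name]")
    # select_one returns None when nothing matches, so guard before reading text
    return cell.get_text(strip=True) if cell is not None else default


pi = extract_pi(BeautifulSoup(html, "html.parser"))
missing = extract_pi(BeautifulSoup("<p>no table here</p>", "html.parser"))
print(pi)
print(missing)
```

Calling select_one once and checking the result for None avoids running the same selector twice, which is what the one-line conditional in the loop above does.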

Answer 2

I would just use pandas to get the tables. This returns a list of DataFrames. You can then iterate over them to find the PI:

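import pandas as pd

# read_html returns one DataFrame per <table> found on the page
tables = pd.read_html(url)
pi = None
for table in tables:
    try:
        # str() guards against non-string cells such as NaN
        if 'Principal Investigator' in str(table.iloc[0, 0]):
            pi = table.iloc[0, 1]
    except IndexError:
        # table is empty or has fewer than two columns
        continue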

Answer 3

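So there are many ways to walk down the DOM tree, and your way is very "brittle". That is, the selector you chose to start the search from is very specific and tied to CSS styling, which is more likely to change than the overall structure of the document.

If I were you, I would filter the nodes by some criteria and then focus on that specific group as you sift through the noise.

So, looking at the URLs you showed, the data is neatly structured and uses tables. Based on that we can make some assumptions, such as:

It is a table
It will contain the string "principal investigator"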

# get all the tables in the page
tables = soup.find_all('table')
# now filter down to a smaller set of tables that might contain the info
refined_tables = [table for table in tables if 'principal investigator' in str(table).lower()]

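At this point our refined_tables list holds a strong candidate that probably contains the table we want. Ideally the list has size 1, assuming the "principal investigator" string we filtered on does not appear anywhere else inside other tables.

# refined_tables is a list, so search inside its first element
principal_investigator = [ele for ele in refined_tables[0].find_all('td') if 'name' in ele.attrs.get('headers', [])][0].text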

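Here, from looking at the site, what they are doing is using the headers attribute to assign a role to each td tag in a table row.

So, essentially, just think from the top level and narrow things down in simple steps as far as you can, to help you find what you need.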
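The narrowing steps above can be put together in one small function. This is only a sketch under the same assumptions (the PI lives in a table containing the string "principal investigator", in a td whose headers attribute includes name); the HTML fragment, the names in it, and the helper find_pi are hypothetical and stand in for a fetched page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for a downloaded study page.
html = """
<table id="locations">
  <tr><td headers="name">Some Hospital</td></tr>
</table>
<table id="investigators">
  <tr>
    <td headers="role">Principal Investigator:</td>
    <td headers="name">Jane Doe, MD</td>
  </tr>
</table>
"""


def find_pi(soup):
    """Narrow from all tables, to candidate tables, to the 'name' cell."""
    # Step 1: every table on the page
    tables = soup.find_all("table")
    # Step 2: keep only tables whose text mentions the role we are after
    refined = [t for t in tables if "principal investigator" in t.get_text().lower()]
    # Step 3: inside the candidates, pick the first cell tagged headers="name"
    for table in refined:
        for td in table.find_all("td"):
            # Beautiful Soup parses 'headers' as a list of values; default to [] if absent
            if "name" in td.get("headers", []):
                return td.get_text(strip=True)
    return None


result = find_pi(BeautifulSoup(html, "html.parser"))
no_match = find_pi(BeautifulSoup("<table><tr><td>x</td></tr></table>", "html.parser"))
print(result)
print(no_match)
```

Returning None instead of indexing [0] on a possibly empty list means a page without a PI does not raise an IndexError, so a loop over many study URLs can keep going.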
