Parsing Pubmed API xml with lxml then grabbing children into dictionary

Question

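I'm trying to re-learn Python, so my skills are rusty. I'm currently working with the Pubmed API. I'm trying to parse the XML file given here, then run a loop over each child ('/pubmedarticle') and grab a few things (for now just the article title), and enter them into a dictionary under the PubMed ID (pmid) as the key.

I.e. the output should look like this: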

{'29150897': {'title': 'Determining best outcomes from community-acquired pneumonia and how to achieve them.'},
 '29149862': {'title': 'Telemedicine as an effective intervention to improve antibiotic appropriateness prescription and to reduce costs in pediatrics.'}}

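Later I'll add more fields such as authors and journal; for now I just want to figure out how to use the lxml package to get the data I want into a dictionary. I know there are plenty of packages that would do this for me, but that defeats the purpose of learning. I've tried a bunch of different things, and this is what I'm currently attempting: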

import requests
from lxml import etree
article_url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&tool=PMA&id=29150897,29149862"
page = requests.get(article_url)
tree = etree.fromstring(page.content)

dict_out = {}

for x in tree.xpath('//PubmedArticle'):
    pmid = ''.join([x.text for x in x.xpath('//MedlineCitation/PMID[@Version="1"]')])
    title = ''.join([x.text for x in x.xpath('//ArticleTitle')])

    dict_out[pmid] = {'title': title}

print(dict_out)

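I may have a misunderstanding of how to go about this, but if anyone can offer insight or point me in the right direction for resources, it would be greatly appreciated.

Edit: Apologies. I wrote this quicker than I should have. I've fixed all the casing. Also, the result it throws up seems to combine the PMIDs while only giving the first title: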

{'2725403628806902': {'title': 'Handshake Stewardship: A Highly Effective Rounding-based Antimicrobial Optimization Service.Monitoring, documenting and reporting the quality of antibiotic use in the Netherlands: a pilot study to establish a national antimicrobial stewardship registry.'}}

Ta

Answer 1

First of all, XML is case-sensitive, and you are using lowercase tags in your XPath.
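To illustrate the case-sensitivity point, here is a minimal sketch. The one-element document is a made-up stand-in for the real efetch response, not actual PubMed data:

```python
from lxml import etree

# Hypothetical, minimal stand-in for the efetch XML.
root = etree.fromstring(b"<PubmedArticleSet><PubmedArticle/></PubmedArticleSet>")

lower = root.xpath("//pubmedarticle")  # wrong case: matches nothing
exact = root.xpath("//PubmedArticle")  # exact case: matches the element

print(len(lower), len(exact))  # 0 1
```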

Also, I think pmid should be a single number (or a string representing one), which does not seem to be what you get in your case:

In my test,

`pmid = ''.join([x.text for x in x.xpath('//MedlineCitation/PMID[@Version="1"]')])`

produces a string of concatenated numbers, which is not what you are looking for.
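The likely cause is the leading `//`: in XPath it starts the search at the document root, even when the expression is evaluated on a child element, so every iteration of the loop sees every PMID. Prefixing the path with `.` keeps the search inside the current element. A small sketch of the difference, using a trimmed-down sample I made up in place of the real response (the IDs and titles are placeholders):

```python
from lxml import etree

# Hypothetical, trimmed-down sample standing in for the efetch response.
SAMPLE = b"""<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">111</PMID>
      <Article><ArticleTitle>First title.</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">222</PMID>
      <Article><ArticleTitle>Second title.</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

tree = etree.fromstring(SAMPLE)
first = tree.xpath("//PubmedArticle")[0]

# '//' searches from the document root, even when called on a child element.
absolute = "".join(
    n.text for n in first.xpath("//MedlineCitation/PMID[@Version='1']")
)
# './/' restricts the search to descendants of `first`.
relative = "".join(
    n.text for n in first.xpath(".//MedlineCitation/PMID[@Version='1']")
)

print(absolute)  # 111222
print(relative)  # 111
```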

Answer 2

code.py:

#!/usr/bin/env python3

import sys
import requests
from lxml import etree
from pprint import pprint as pp

ARTICLE_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&tool=PMA&id=29150897,29149862"


def main():
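    response = requests.get(ARTICLE_URL)
    tree = etree.fromstring(response.content)
    # Both queries run against the whole document and return nodes in document order
    ids = tree.xpath("//MedlineCitation/PMID[@Version='1']")
    titles = tree.xpath("//Article/ArticleTitle")
    # zip() pairs items by position, so the two lists must line up one-to-one
    if len(ids) != len(titles):
        print("ID count doesn't match Title count...")
        return
    # Build {pmid: {"title": ...}} from the paired nodes
    result = {_id.text: {"title": title.text} for _id, title in zip(ids, titles)}
    pp(result)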


if __name__ == "__main__":
    print("Python {:s} on {:s}\n".format(sys.version, sys.platform))
    main()

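Notes:

For clarity, I restructured the code a bit and renamed some of the variables
ids holds the list of PMID nodes, while titles holds the (corresponding) ArticleTitle nodes (mind the path!)
The way to join them together in the desired format is [Python]: PEP 274 -- Dict Comprehensions, and for iterating over 2 lists at the same time, [Python 3]: zip(*iterables) was used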
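As an alternative to pairing two document-wide lists, the per-article loop from the question can also be made to work with relative paths. The sketch below is only one possible approach, not part of code.py: the function name `articles_to_dict` and the sample XML are hypothetical, and the element layout is assumed to match the paths used above. It reads the title with `itertext()` so that inline markup inside a title does not cut the text short.

```python
from lxml import etree

# Hypothetical sample standing in for the efetch response.
SAMPLE = b"""<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">111</PMID>
      <Article><ArticleTitle>First title.</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">222</PMID>
      <Article><ArticleTitle>Second <i>title</i> with italics.</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def articles_to_dict(xml_bytes):
    """Map each PMID to a dict holding that article's title."""
    root = etree.fromstring(xml_bytes)
    result = {}
    for article in root.iter("PubmedArticle"):
        # Paths are relative to the current article, so nothing leaks across articles.
        pmid = article.findtext("MedlineCitation/PMID")
        title_node = article.find("MedlineCitation/Article/ArticleTitle")
        if pmid is None or title_node is None:
            continue  # skip records missing either field
        # itertext() also collects text nested inside child tags such as <i>.
        title = "".join(title_node.itertext()).strip()
        result[pmid] = {"title": title}
    return result


parsed = articles_to_dict(SAMPLE)
print(parsed)
```

Because each article is handled on its own, adding authors or the journal later should only need more lookups inside the loop rather than more parallel lists to keep in sync.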

Output:

(py35x64_test) c:\Work\Dev\Whosebug\q47433632>"c:\Work\Dev\VEnvs\py35x64_test\Scripts\python.exe" code.py
Python 3.5.4 (v3.5.4:3f56838, Aug  8 2017, 02:17:05) [MSC v.1900 64 bit (AMD64)] on win32

{'29149862': {'title': 'Telemedicine as an effective intervention to improve '
                       'antibiotic appropriateness prescription and to reduce '
                       'costs in pediatrics.'},
 '29150897': {'title': 'Determining best outcomes from community-acquired '
                       'pneumonia and how to achieve them.'}}

Tags: python, xml, parsing, lxml, dictionary