Parsing PubMed API XML with lxml, then grabbing children into a dictionary
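I'm trying to re-learn Python, so my skills are lacking. I'm currently playing with the PubMed API. I'm trying to parse the XML file given here, then run a loop over each child ('/pubmedarticle') and grab a few things, for now just the article title, and enter them into a dictionary under the key of the PubMed ID (pmid).
I.e. the output should look like this: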
{'29150897': {'title': 'Determining best outcomes from community-acquired pneumonia and how to achieve them.'},
 '29149862': {'title': 'Telemedicine as an effective intervention to improve antibiotic appropriateness prescription and to reduce costs in pediatrics.'}}
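Later I'll add more fields such as authors and journal; for now I just want to figure out how to use the lxml package to get the data I want into a dictionary. I know there are plenty of packages that can do this for me, but that defeats the purpose of learning. I've tried a bunch of different things, and this is what I'm currently trying to do: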
import requests
from lxml import etree

article_url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&tool=PMA&id=29150897,29149862"
page = requests.get(article_url)
tree = etree.fromstring(page.content)
dict_out = {}
for x in tree.xpath('//PubmedArticle'):
    pmid = ''.join([x.text for x in x.xpath('//MedlineCitation/PMID[@Version="1"]')])
    title = ''.join([x.text for x in x.xpath('//ArticleTitle')])
    dict_out[pmid] = {'title': title}
print(dict_out)
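I probably have a misunderstanding of how to go about this, but if anyone can offer insight or point me in the right direction for resources, it would be much appreciated.
Edit: My apologies. I wrote this much more quickly than I should have. I've fixed all of the casing. Also, the result it produces seems to combine the PMIDs while only giving the first title: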
{'2725403628806902': {'title': 'Handshake Stewardship: A Highly Effective Rounding-based Antimicrobial Optimization Service.Monitoring, documenting and reporting the quality of antibiotic use in the Netherlands: a pilot study to establish a national antimicrobial stewardship registry.'}}
Ta
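First, XML is case-sensitive, and you were using lowercase tags in the XPath.
Also, I think pmid should be a number (or a string representing one), which doesn't seem to be the case here. In my test,
`pmid = ''.join([x.text for x in x.xpath('//MedlineCitation/PMID[@Version="1"]')])`
produces a string of concatenated numbers, which is not what you're looking for.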
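A likely reason for the concatenation: in XPath, a path that starts with `//` is absolute, so `x.xpath('//ArticleTitle')` searches the whole document even when it is called on a single `PubmedArticle` element. Prefixing the path with `.` makes it relative to the current element. The sketch below illustrates this on a heavily trimmed, made-up stand-in for the efetch response; `SAMPLE_XML` and `titles_by_pmid` are names invented for this example, and the real response contains many more elements.

```python
from lxml import etree

# Hypothetical, heavily trimmed stand-in for an efetch response (assumption).
SAMPLE_XML = b"""<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">29150897</PMID>
      <Article><ArticleTitle>First title.</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">29149862</PMID>
      <Article><ArticleTitle>Second title.</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def titles_by_pmid(xml_bytes):
    """Map each PMID to its title, looking only inside its own article."""
    tree = etree.fromstring(xml_bytes)
    result = {}
    for article in tree.xpath("//PubmedArticle"):
        # The leading "." keeps both lookups inside this one article.
        # string(...) returns the text content of the first matching node,
        # or an empty string when nothing matches.
        pmid = str(article.xpath("string(./MedlineCitation/PMID[@Version='1'])"))
        title = str(article.xpath("string(.//ArticleTitle)"))
        if pmid:
            result[pmid] = {"title": title}
    return result


if __name__ == "__main__":
    print(titles_by_pmid(SAMPLE_XML))
```

Because each article is handled on its own, adding further fields later (authors, journal) should only need more relative lookups inside the same loop.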
code.py:
#!/usr/bin/env python3

import sys
import requests
from lxml import etree
from pprint import pprint as pp

ARTICLE_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&tool=PMA&id=29150897,29149862"


def main():
    response = requests.get(ARTICLE_URL)
    tree = etree.fromstring(response.content)
    # Collect all PMID nodes and all ArticleTitle nodes, in document order
    ids = tree.xpath("//MedlineCitation/PMID[@Version='1']")
    titles = tree.xpath("//Article/ArticleTitle")
    # zip() stops at the shorter list, so refuse to pair mismatched lists
    if len(ids) != len(titles):
        print("ID count doesn't match Title count...")
        return
    result = {_id.text: {"title": title.text} for _id, title in zip(ids, titles)}
    pp(result)


if __name__ == "__main__":
    print("Python {:s} on {:s}\n".format(sys.version, sys.platform))
    main()
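Notes:
- I restructured the code a bit and renamed some variables for clarity
- ids holds the list of PMID nodes, while titles holds the (corresponding) ArticleTitle nodes (mind the path!)
- They are joined together in the desired format with a dict comprehension ([Python]: PEP 274 -- Dict Comprehensions); to iterate over the two lists at the same time, [Python 3]: zip(*iterables) is used
Output: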
(py35x64_test) c:\Work\Dev\Whosebug\q47433632>"c:\Work\Dev\VEnvs\py35x64_test\Scripts\python.exe" code.py
Python 3.5.4 (v3.5.4:3f56838, Aug 8 2017, 02:17:05) [MSC v.1900 64 bit (AMD64)] on win32

{'29149862': {'title': 'Telemedicine as an effective intervention to improve '
                       'antibiotic appropriateness prescription and to reduce '
                       'costs in pediatrics.'},
 '29150897': {'title': 'Determining best outcomes from community-acquired '
                       'pneumonia and how to achieve them.'}}
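The pairing step from the notes can be tried in isolation, without lxml or a network call. This is only a sketch on plain strings: `pair_up` is a helper name made up for this example, and the sample IDs and titles are placeholders. It also shows why the length check matters, since `zip` stops at the shorter input without raising.

```python
def pair_up(ids, titles):
    """Build {pmid: {"title": title}} from two parallel lists."""
    # zip() silently drops the extra items of the longer list,
    # so reject mismatched lists up front.
    if len(ids) != len(titles):
        raise ValueError("ID count doesn't match title count")
    return {pmid: {"title": title} for pmid, title in zip(ids, titles)}


if __name__ == "__main__":
    ids = ["29150897", "29149862"]
    titles = ["First title.", "Second title."]
    print(pair_up(ids, titles))

    # Without the check, a missing title would go unnoticed:
    print(dict(zip(ids, ["First title."])))  # only one pair survives
```

Pairing two document-wide lists relies on every article having exactly one matching PMID and one title; if that assumption might not hold, looking both up per article is the safer route.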