Python BS4 with SDMX

Question

I want to retrieve the data in an SDMX file (such as https://www.bundesbank.de/cae/servlet/StatisticDownload?tsId=BBK01.ST0304&its_fileFormat=sdmx&mode=its). I tried using BeautifulSoup, but it does not seem to see the tags. With the following code

import urllib2
from bs4 import BeautifulSoup 
url = "https://www.bundesbank.de/cae/servlet/StatisticDownload?tsId=BBK01.ST0304&its_fileFormat=sdmx"
html_source = urllib2.urlopen(url).read()
soup = BeautifulSoup(html_source, 'lxml')
ts_series = soup.findAll("bbk:Series")

I get an empty result.

Is BS4 the wrong tool, or (more likely) am I doing something wrong? Thanks in advance.

Answer 1

soup.findAll("bbk:series") will return the results.

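In fact, even though you pass lxml as the parser here, BeautifulSoup still parses the document as HTML. HTML tag names are case-insensitive, so BeautifulSoup lowercases all tags, which is why soup.findAll("bbk:series") works while soup.findAll("bbk:Series") does not. See "Other parser problems" in the official documentation.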
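The lowercasing can be checked on a small inline document. This is only a sketch: the sample markup and its namespace URIs are made up for illustration and are far simpler than the real Bundesbank file; it assumes bs4 and lxml are installed.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified stand-in for the SDMX file (not the real data).
sample = (
    '<message:MessageGroup xmlns:message="http://example.org/message" '
    'xmlns:bbk="http://example.org/bbk">'
    '<bbk:DataSet>'
    '<bbk:Series FREQ="M">'
    '<bbk:Obs TIME_PERIOD="2015-01" OBS_VALUE="1.5"/>'
    '</bbk:Series>'
    '</bbk:DataSet>'
    '</message:MessageGroup>'
)

# 'lxml' selects the HTML parser, so tag names end up lowercased.
soup = BeautifulSoup(sample, "lxml")

mixed_case = soup.find_all("bbk:Series")  # original casing from the file
lower_case = soup.find_all("bbk:series")  # casing as stored after HTML parsing

print(len(mixed_case), len(lower_case))
```

Recent BeautifulSoup versions may also emit a warning that an XML document is being parsed as HTML, which points at the same cause.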

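If you want to parse it as XML, use soup = BeautifulSoup(html_source, 'xml') instead. This also uses lxml, since lxml is the only XML parser BeautifulSoup supports. You can then use ts_series = soup.findAll("Series") to get the results, because BeautifulSoup strips the namespace prefix bbk from the tag name.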
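A minimal sketch of the XML route, again on a made-up inline sample rather than the real download. The attribute names FREQ, TIME_PERIOD and OBS_VALUE are assumptions chosen for illustration; check the actual file for the names it uses.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified stand-in for the SDMX file (not the real data).
sample = (
    '<message:MessageGroup xmlns:message="http://example.org/message" '
    'xmlns:bbk="http://example.org/bbk">'
    '<bbk:DataSet>'
    '<bbk:Series FREQ="M">'
    '<bbk:Obs TIME_PERIOD="2015-01" OBS_VALUE="1.5"/>'
    '<bbk:Obs TIME_PERIOD="2015-02" OBS_VALUE="1.7"/>'
    '</bbk:Series>'
    '</bbk:DataSet>'
    '</message:MessageGroup>'
)

# 'xml' selects lxml's XML parser: case is preserved and the prefix is
# kept separately from the tag name, so the local name "Series" matches.
soup = BeautifulSoup(sample, "xml")
ts_series = soup.find_all("Series")

# Collect (period, value) pairs per series; attribute names are assumed.
observations = [
    (obs["TIME_PERIOD"], float(obs["OBS_VALUE"]))
    for series in ts_series
    for obs in series.find_all("Obs")
]
print(observations)
```

With the real file, html_source from the question would take the place of sample.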

python

xml-parsing

python-2.7

bs4

sdmx