Python: extract data from XML
I am trying to get the values from this web page:
This XML file does not appear to have any style information associated with it. The document tree is shown below.
<ArrayOfVwHistoryDetail xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://tempuri.org/">
  <vwHistoryDetail>
    <idVariable>2561</idVariable>
    <DateTime>2020-12-01T00:00:00</DateTime>
    <idPeriodType>1</idPeriodType>
    <Value>28671555</Value>
    <ValueDetail>4415</ValueDetail>
  </vwHistoryDetail>
  <vwHistoryDetail>
    <idVariable>2561</idVariable>
    <DateTime>2020-12-02T00:00:00</DateTime>
    <idPeriodType>1</idPeriodType>
    <Value>28675970</Value>
    <ValueDetail>4279</ValueDetail>
  </vwHistoryDetail>
  <vwHistoryDetail>
    <idVariable>2561</idVariable>
    <DateTime>2020-12-03T00:00:00</DateTime>
    <idPeriodType>1</idPeriodType>
    <Value>28680249</Value>
    <ValueDetail>3975</ValueDetail>
  </vwHistoryDetail>
  <vwHistoryDetail>
    <idVariable>2561</idVariable>
    <DateTime>2020-12-04T00:00:00</DateTime>
    <idPeriodType>1</idPeriodType>
    <Value>28684224</Value>
    <ValueDetail>4236</ValueDetail>
  </vwHistoryDetail>
</ArrayOfVwHistoryDetail>
I tested with this code:
import xml.etree.ElementTree as ET
from urllib import request
url = "http://SomeSite/WebService.asmx/LoadVariableHistory?username=USERNAME&password=PASSWORD&variableName=CBT2_G_PRM_FB2&startDateTime=2020-12-01&endDateTime=2020-12-02&sampling=3"
print ("Obter: ", url)
html = request.urlopen(url)
data = html.read()
print("Obtido: ",len(data),"caracteres")
tree = ET.fromstring(data)
results = tree.findall('Value')
for i in results:
    print(i)
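For security reasons, I have hidden the full URL.
What am I doing wrong that I get no values back? I need to get this part working so that I can build a dictionary of DateTime: Value.
Thanks in advance.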
tree = ET.fromstring(data)
ns = {'doc': 'http://tempuri.org/'}  # prefix for the document's default namespace
for detail in tree.findall('doc:vwHistoryDetail', ns):
    v = detail.find('doc:Value', ns).text
    print(v)
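You are better off iterating over each parent element and extracting its children rather than grabbing the child elements directly, because Value may be a tag that is reused in different parts of the document.
See below: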
import xml.etree.ElementTree as ET
import re
#
xml = '''<ArrayOfVwHistoryDetail xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns="http://tempuri.org/">
  <vwHistoryDetail>
    <idVariable>2561</idVariable>
    <DateTime>2020-12-01T00:00:00</DateTime>
    <idPeriodType>1</idPeriodType>
    <Value>28671555</Value>
    <ValueDetail>4415</ValueDetail>
  </vwHistoryDetail>
  <vwHistoryDetail>
    <idVariable>2561</idVariable>
    <DateTime>2020-12-02T00:00:00</DateTime>
    <idPeriodType>1</idPeriodType>
    <Value>28675970</Value>
    <ValueDetail>4279</ValueDetail>
  </vwHistoryDetail>
  <vwHistoryDetail>
    <idVariable>2561</idVariable>
    <DateTime>2020-12-03T00:00:00</DateTime>
    <idPeriodType>1</idPeriodType>
    <Value>28680249</Value>
    <ValueDetail>3975</ValueDetail>
  </vwHistoryDetail>
  <vwHistoryDetail>
    <idVariable>2561</idVariable>
    <DateTime>2020-12-04T00:00:00</DateTime>
    <idPeriodType>1</idPeriodType>
    <Value>28684224</Value>
    <ValueDetail>4236</ValueDetail>
  </vwHistoryDetail>
</ArrayOfVwHistoryDetail>'''
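# Strip the default namespace declaration so that plain tag names match.
# The pattern expects whitespace before xmlns="..." in the source text.
xml = re.sub(' xmlns="[^"]+"', '', xml, count=1)
root = ET.fromstring(xml)
# Map each record's DateTime to its Value
data = {v.find('DateTime').text: v.find('Value').text for v in root.findall('.//vwHistoryDetail')}
print(data)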
Output
{'2020-12-01T00:00:00': '28671555', '2020-12-02T00:00:00': '28675970', '2020-12-03T00:00:00': '28680249', '2020-12-04T00:00:00': '28684224'}
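As an alternative to editing the XML text with a regular expression, the `{*}` namespace wildcard in ElementTree path expressions (available since Python 3.8) can match a tag in any namespace. The sketch below is only an illustration on a trimmed two-record copy of the XML above; the name `sample` is a placeholder introduced here.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the XML from the question (two records only).
sample = '''<ArrayOfVwHistoryDetail xmlns="http://tempuri.org/">
  <vwHistoryDetail>
    <DateTime>2020-12-01T00:00:00</DateTime>
    <Value>28671555</Value>
  </vwHistoryDetail>
  <vwHistoryDetail>
    <DateTime>2020-12-02T00:00:00</DateTime>
    <Value>28675970</Value>
  </vwHistoryDetail>
</ArrayOfVwHistoryDetail>'''

root = ET.fromstring(sample)

# '{*}name' matches the tag 'name' in any namespace (Python 3.8+),
# so the source text does not need to be modified before parsing.
data = {
    hd.find('{*}DateTime').text: hd.find('{*}Value').text
    for hd in root.findall('.//{*}vwHistoryDetail')
}
print(data)
# {'2020-12-01T00:00:00': '28671555', '2020-12-02T00:00:00': '28675970'}
```

The wildcard is convenient for a quick script, but it also matches same-named tags from other namespaces, so an explicit prefix mapping is the stricter choice when that matters.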
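Several issues arise in your current implementation:
- Your XML contains a default namespace, xmlns="http://tempuri.org/", so you need to define a prefix in order to match node names; findall accepts a namespaces argument.
- Your path expression assumes Value is a child of the root. You need the double-slash path .// because Value is a descendant of the root.
- You need to extract the text of the iterator variable. Otherwise you get <Element ... > objects, which are usually of no use for end-use needs.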
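To make the namespace point concrete, here is a minimal sketch on a trimmed one-record copy of the XML from the question. The names `sample`, `bare`, `prefixed`, and `ns` are placeholders introduced for this illustration. It shows how ElementTree stores namespaced tags and why a bare 'Value' matches nothing even with the .// path.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's XML: one record is enough to show the effect.
sample = '''<ArrayOfVwHistoryDetail xmlns="http://tempuri.org/">
  <vwHistoryDetail>
    <DateTime>2020-12-01T00:00:00</DateTime>
    <Value>28671555</Value>
  </vwHistoryDetail>
</ArrayOfVwHistoryDetail>'''

root = ET.fromstring(sample)

# ElementTree expands the default namespace into every tag name.
print(root.tag)                      # {http://tempuri.org/}ArrayOfVwHistoryDetail

# A bare name therefore matches nothing, even with the './/' descendant path.
bare = root.findall('.//Value')
print(len(bare))                     # 0

# Mapping a prefix of our choosing to the namespace URI makes the match work.
ns = {'doc': 'http://tempuri.org/'}
prefixed = root.findall('.//doc:Value', namespaces=ns)
print([e.text for e in prefixed])    # ['28671555']
```

The prefix itself ('doc' here) is arbitrary; only the URI it maps to has to agree with the xmlns value in the document.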
Consider this adjustment:
tree = ET.fromstring(data)
nmsp = {'doc': 'http://tempuri.org/'} # NAMESPACE PREFIX ASSIGNMENT
results = tree.findall('.//doc:Value', namespaces = nmsp) # NAMESPACE PREFIX USE WITH './/' PATH
for i in results:
    print(i.text) # RETRIEVE TEXT VALUE
# 28671555
# 28675970
# 28680249
# 28684224
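Better yet, return Value and its siblings as a list of dictionaries using a list/dict comprehension (where split removes the default namespace from the dictionary keys):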
data_list_of_dicts = [{i.tag.split('}')[-1]: i.text for i in hd}
                      for hd in tree.findall('.//doc:vwHistoryDetail', namespaces = nmsp)]
print(data_list_of_dicts)
# [{'idVariable': '2561', 'DateTime': '2020-12-01T00:00:00', 'idPeriodType': '1', 'Value': '28671555', 'ValueDetail': '4415'},
# {'idVariable': '2561', 'DateTime': '2020-12-02T00:00:00', 'idPeriodType': '1', 'Value': '28675970', 'ValueDetail': '4279'},
# {'idVariable': '2561', 'DateTime': '2020-12-03T00:00:00', 'idPeriodType': '1', 'Value': '28680249', 'ValueDetail': '3975'},
# {'idVariable': '2561', 'DateTime': '2020-12-04T00:00:00', 'idPeriodType': '1', 'Value': '28684224', 'ValueDetail': '4236'}]
For a dictionary that maps each time to its value:
time_value_dict = {hd.find('doc:DateTime', namespaces=nmsp).text:
                       hd.find('doc:Value', namespaces=nmsp).text
                   for hd in tree.findall('.//doc:vwHistoryDetail', namespaces=nmsp)}
print(time_value_dict)
# {'2020-12-01T00:00:00': '28671555',
# '2020-12-02T00:00:00': '28675970',
# '2020-12-03T00:00:00': '28680249',
# '2020-12-04T00:00:00': '28684224'}
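If the dictionary is going to be used for calculations, it may help to convert the strings to typed values. The following is a sketch under two assumptions drawn only from the sample data: that every DateTime is an ISO 8601 timestamp and every Value is an integer. The names `sample` and `typed` are placeholders introduced for this illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Trimmed copy of the XML from the question (two records only).
sample = '''<ArrayOfVwHistoryDetail xmlns="http://tempuri.org/">
  <vwHistoryDetail>
    <DateTime>2020-12-01T00:00:00</DateTime>
    <Value>28671555</Value>
  </vwHistoryDetail>
  <vwHistoryDetail>
    <DateTime>2020-12-02T00:00:00</DateTime>
    <Value>28675970</Value>
  </vwHistoryDetail>
</ArrayOfVwHistoryDetail>'''

ns = {'doc': 'http://tempuri.org/'}
root = ET.fromstring(sample)

# Assumption: timestamps parse with datetime.fromisoformat and values are integers.
typed = {
    datetime.fromisoformat(hd.find('doc:DateTime', ns).text): int(hd.find('doc:Value', ns).text)
    for hd in root.findall('doc:vwHistoryDetail', ns)
}

for when, value in typed.items():
    print(when.date(), value)
# 2020-12-01 28671555
# 2020-12-02 28675970
```

With datetime keys the entries can be sorted or filtered by date directly; if the service ever returns non-integer values, float or Decimal would be the safer conversion.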