Having some challenges parsing XML with Python and xml.etree.ElementTree
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<feed xml:base="https://share.corp.com/sites/CPIBudget/_vti_bin/ListData.svc/" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata" xmlns="http://www.w3.org/2005/Atom">
<title type="text">Tbl_Projects_Tableau</title>
<id>https://share.corp.com/sites/CPIBudget/_vti_bin/ListData.svc/Tbl_Projects_Tableau/</id>
<updated>2018-07-25T21:27:59Z</updated>
<link rel="self" title="Tbl_Projects_Tableau" href="Tbl_Projects_Tableau" />
<entry m:etag="W/&quot;8&quot;">
<id>https://share.corp.com/sites/CPIBudget/_vti_bin/ListData.svc/Tbl_Projects_Tableau(1)</id>
<title type="text"></title>
<updated>2018-06-14T17:15:27Z</updated>
<author>
<name />
</author>
<link rel="edit" title="Tbl_Projects_TableauItem" href="Tbl_Projects_Tableau(1)" />
<link rel="http://schemas.microsoft.com/ado/2007/08/dataservices/related/FBN_ID" type="application/atom+xml;type=entry" title="FBN_ID" href="Tbl_Projects_Tableau(1)/FBN_ID" />
<link rel="http://schemas.microsoft.com/ado/2007/08/dataservices/related/CreatedBy" type="application/atom+xml;type=entry" title="CreatedBy" href="Tbl_Projects_Tableau(1)/CreatedBy" />
<link rel="http://schemas.microsoft.com/ado/2007/08/dataservices/related/ModifiedBy" type="application/atom+xml;type=entry" title="ModifiedBy" href="Tbl_Projects_Tableau(1)/ModifiedBy" />
<category term="Microsoft.SharePoint.DataService.Tbl_Projects_TableauItem" scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" />
<content type="application/xml">
<m:properties>
<d:FBN_IDId m:type="Edm.Int32">6</d:FBN_IDId>
<d:Title m:null="true" />
<d:PROJECT_NAME>Project Swoop</d:PROJECT_NAME>
<d:Cluster>ABC</d:Cluster>
<d:PROJECT_SITE>ABC9</d:PROJECT_SITE>
<d:PROJECT_ORIGINALAMT m:type="Edm.Double">500000</d:PROJECT_ORIGINALAMT>
<d:PROJECT_ORG>Nookie</d:PROJECT_ORG>
<d:PROJECT_GROUP>Smooth</d:PROJECT_GROUP>
<d:c__OldID m:type="Edm.Double">1</d:c__OldID>
<d:ContentTypeID>0x0100FD279BEBCF3C4F45BB75D6147D315C09</d:ContentTypeID>
<d:Id m:type="Edm.Int32">1</d:Id>
<d:ContentType>Item</d:ContentType>
<d:Modified m:type="Edm.DateTime">2018-06-14T17:15:27</d:Modified>
<d:Created m:type="Edm.DateTime">2018-06-14T16:58:50</d:Created>
<d:CreatedById m:type="Edm.Int32">2</d:CreatedById>
<d:ModifiedById m:type="Edm.Int32">2</d:ModifiedById>
<d:Owshiddenversion m:type="Edm.Int32">8</d:Owshiddenversion>
<d:Version>1.0</d:Version>
<d:Path>/sites/SmoothBudget/Lists/Projects_Tableau1</d:Path>
</m:properties>
</content>
</entry>
</feed>
Here is a sample of the XML I am trying to parse into a CSV.
This is my code so far:
import config
import csv
import pymysql
import requests
import xml.etree.ElementTree as ET
from requests_ntlm import HttpNtlmAuth

ssoUsername = config.username
ssoPassword = config.password

f = open(path+csvFile,'w',newline='')
csvwriter = csv.writer(f)
column_headers = ['FBN','Project_Name','Cluster','Site','OP2_USD','Type','Group']
csvwriter.writerow(column_headers)
rows = []

# The backslash in the domain prefix has to be escaped inside the string literal.
r2 = requests.get(project_url, auth=HttpNtlmAuth('ANT\\'+ssoUsername,ssoPassword), verify=False)
projectData = r2.content
etree2 = ET.fromstring(projectData)
#print(etree2.findall('./*/*/*/*'))
for element in etree2.findall("./*/*/*/*") :
    print(element.find('{http://schemas.microsoft.com/ado/2007/08/dataservices}FBN_IDId'))
    fbnKey2 = element.find('{http://schemas.microsoft.com/ado/2007/08/dataservices}FBN_IDId')
    FBN = fbnMap.get(fbnKey2)
At this point I cannot get the .text of the "{http://schemas.microsoft.com/ado/2007/08/dataservices}FBN_IDId" element. No matter what XPath I try, it always gives me a "'NoneType' object has no attribute 'text'" error.
Here is the result of print(etree2.findall('./*/*/*/*')):
[<Element '{http://schemas.microsoft.com/ado/2007/08/dataservices}ContentTypeID' at 0x000001C4B7F31BD8>, <Element '{http://schemas.microsoft.com/ado/2007/08/dataservices}Id' at 0x000001C4B7F31C28>, <Element '{http://schemas.microsoft.com/ado/2007/08/dataservices}ContentType' at 0x000001C4B7F31C78>, <Element '{http://schemas.microsoft.com/ado/2007/08/dataservices}Modified' at 0x000001C4B7F31CC8>, <Element '{http://schemas.microsoft.com/ado/2007/08/dataservices}Created' at 0x000001C4B7F31D18>, <Element '{http://schemas.microsoft.com/ado/2007/08/dataservices}CreatedById' at 0x000001C4B7F31D68>, <Element '{http://schemas.microsoft.com/ado/2007/08/dataservices}ModifiedById' at 0x000001C4B7F31DB8>, <Element '{http://schemas.microsoft.com/ado/2007/08/dataservices}Owshiddenversion' at 0x000001C4B7F31E08>, <Element '{http://schemas.microsoft.com/ado/2007/08/dataservices}Version' at 0x000001C4B7F31E58>, <Element '{http://schemas.microsoft.com/ado/2007/08/dataservices}Path' at 0x000001C4B7F31EA8>, <Element '{http://schemas.microsoft.com/ado/2007/08/dataservices}FBN_IDId' at 0x000001C4B7F3ABD8>, <Element '{http://schemas.microsoft.com/ado/2007/08/dataservices}Title' at 0x000001C4B7F3AB88>, <Element '{http://schemas.microsoft.com/ado/2007/08/dataservices}PROJECT_NAME' at 0x000001C4B7F3AB38>, <Element '{http://schemas.microsoft.com/ado/2007/08/dataservices}Cluster' at 0x000001C4B7F3AA98>, <Element '{http://schemas.microsoft.com/ado/2007/08/dataservices}PROJECT_SITE' at 0x000001C4B7F3AA48>, <Element '{http://schemas.microsoft.com/ado/2007/08/dataservices}PROJECT_ORIGINALAMT' at 0x000001C4B7F3A9F8>, <Element '{http://schemas.microsoft.com/ado/2007/08/dataservices}PROJECT_ORG' at 0x000001C4B7F3A908>, <Element '{http://schemas.microsoft.com/ado/2007/08/dataservices}PROJECT_GROUP' at 0x000001C4B7F3A868>, <Element '{http://schemas.microsoft.com/ado/2007/08/dataservices}c__OldID' at 0x000001C4B7F3A8B8>, <Element '{http://schemas.microsoft.com/ado/2007/08/dataservices}ContentTypeID' at 
0x000001C4B7F3A778>, <Element '{http://schemas.microsoft.com/ado/2007/08/dataservices}Id' at 0x000001C4B7F3A728>, <Element '{http://schemas.microsoft.com/ado/2007/08/dataservices}ContentType' at 0x000001C4B7F3A6D8>, <Element '{http://schemas.microsoft.com/ado/2007/08/dataservices}Modified' at 0x000001C4B7F3A138>, <Element '{http://schemas.microsoft.com/ado/2007/08/dataservices}Created' at 0x000001C4B7F3A048>, <Element '{http://schemas.microsoft.com/ado/2007/08/dataservices}CreatedById' at 0x000001C4B7F3A638>, <Element '{http://schemas.microsoft.com/ado/2007/08/dataservices}ModifiedById' at 0x000001C4B7F3A5E8>, <Element '{http://schemas.microsoft.com/ado/2007/08/dataservices}Owshiddenversion' at 0x000001C4B7F3A548>, <Element '{http://schemas.microsoft.com/ado/2007/08/dataservices}Version' at 0x000001C4B7F3A598>, <Element '{http://schemas.microsoft.com/ado/2007/08/dataservices}Path' at 0x000001C4B7F3A4F8>]
So it looks like I should be able to get FBN_IDId, but the best I can do is get
None
<Element '{http://schemas.microsoft.com/ado/2007/08/dataservices}FBN_IDId' at 0x000001C4B7F3ABD8>
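which leads to the NoneType error. The only way I have gotten it to work is:
for element in etree2.findall(".//{http://schemas.microsoft.com/ado/2007/08/dataservices}FBN_IDId") :
    fbnKey2 = element.text
    FBN = fbnMap.get(fbnKey2)
But if I do it that way, I have to do the same for every element I need, then figure out how to combine them all into one row, then loop to add all the rows, which seems wrong.
Any suggestions?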
Maybe you could use the full path:
/feed/entry/content/m:properties/d:FBN_IDId
or:
/feed/entry/content/{http://schemas.microsoft.com/ado/2007/08/dataservices/metadata}properties/{http://schemas.microsoft.com/ado/2007/08/dataservices}FBN_IDId
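One caveat when trying this with xml.etree.ElementTree: the object returned by ET.fromstring is already the <feed> element, so the path has to be written relative to it, and prefixes such as m: and d: only resolve if a namespace map is passed in. Here is a minimal sketch of that idea. The NS dict, the "atom" prefix, and the trimmed SAMPLE string are names made up for this example, not part of the original code.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map. The prefixes are arbitrary labels chosen for this sketch;
# only the URIs have to match the document.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "m": "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata",
    "d": "http://schemas.microsoft.com/ado/2007/08/dataservices",
}

# Trimmed-down stand-in for the feed shown in the question.
SAMPLE = """<feed xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
      xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"
      xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <content type="application/xml">
      <m:properties>
        <d:FBN_IDId m:type="Edm.Int32">6</d:FBN_IDId>
      </m:properties>
    </content>
  </entry>
</feed>"""

root = ET.fromstring(SAMPLE)  # root is the <feed> element itself

# The path starts at <entry> because it is evaluated relative to <feed>.
# Unprefixed Atom elements need the "atom" prefix, since they live in the
# document's default namespace.
path = "atom:entry/atom:content/m:properties/d:FBN_IDId"
fbn_ids = [el.text for el in root.findall(path, NS)]
print(fbn_ids)  # ['6']
```

The same map can be reused for every other d:* field, which avoids spelling out the {uri} form each time.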
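In your loop
for element in etree2.findall("./*/*/*/*") :
you are iterating over all elements four levels down. For each of those elements, you then search its children for an element named FBN_IDId. That would find any such element five levels down.
However, there are no such elements at that depth. They exist only four levels down, so find() returns None.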
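The depth mismatch can be reproduced on a cut-down feed. This is only a sketch: SAMPLE is a shortened stand-in for the real response, and the D constant is a helper name introduced here.

```python
import xml.etree.ElementTree as ET

# Namespace of the d:* elements in Clark notation, as ElementTree reports it.
D = "{http://schemas.microsoft.com/ado/2007/08/dataservices}"

# Shortened stand-in for the feed: same nesting, fewer fields.
SAMPLE = """<feed xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
      xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"
      xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <author>
      <name />
    </author>
    <content type="application/xml">
      <m:properties>
        <d:FBN_IDId m:type="Edm.Int32">6</d:FBN_IDId>
        <d:PROJECT_NAME>Project Swoop</d:PROJECT_NAME>
      </m:properties>
    </content>
  </entry>
</feed>"""

root = ET.fromstring(SAMPLE)

# Four child steps below <feed>: entry, content, m:properties, then the d:* leaves.
level4 = root.findall("./*/*/*/*")
tags = [el.tag.replace(D, "d:") for el in level4]
print(tags)  # ['d:FBN_IDId', 'd:PROJECT_NAME']

# find() searches the direct children of each element. The d:* leaves have
# no children, so every lookup comes back as None.
results = [el.find(D + "FBN_IDId") for el in level4]
print(results)  # [None, None]
```

In other words, the loop variable is already the FBN_IDId element on one of the iterations; asking it for a child of the same name is what produces None.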
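Perhaps you meant to iterate over all elements three levels down and look among their children for the one named FBN_IDId? That can be done by removing one of the /* steps from the loop:
for element in etree2.findall("./*/*/*") :
But this loop will also visit the empty <name /> element.
It is probably better to write
for element in etree2.findall(".//{http://schemas.microsoft.com/ado/2007/08/dataservices/metadata}properties") :
which loops over all <m:properties> elements at any depth.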
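Looping over <m:properties> also takes care of the "one row per item" concern, because each <m:properties> element holds exactly the fields of one entry. Below is a rough sketch of how that could look. The COLUMNS mapping from CSV header to element name is a guess made for illustration (the question does not say which field feeds which column), feed_to_rows is a hypothetical helper, the raw FBN_IDId is written where the original code would look it up in fbnMap, and the CSV goes to an in-memory buffer instead of a file.

```python
import csv
import io
import xml.etree.ElementTree as ET

NS = {
    "m": "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata",
    "d": "http://schemas.microsoft.com/ado/2007/08/dataservices",
}

# CSV header -> d:* element name. This pairing is an assumption for the sketch.
COLUMNS = {
    "FBN": "FBN_IDId",
    "Project_Name": "PROJECT_NAME",
    "Cluster": "Cluster",
    "Site": "PROJECT_SITE",
    "Group": "PROJECT_GROUP",
}

# Shortened stand-in for the feed in the question.
SAMPLE = """<feed xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
      xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"
      xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <content type="application/xml">
      <m:properties>
        <d:FBN_IDId m:type="Edm.Int32">6</d:FBN_IDId>
        <d:Title m:null="true" />
        <d:PROJECT_NAME>Project Swoop</d:PROJECT_NAME>
        <d:Cluster>ABC</d:Cluster>
        <d:PROJECT_SITE>ABC9</d:PROJECT_SITE>
        <d:PROJECT_GROUP>Smooth</d:PROJECT_GROUP>
      </m:properties>
    </content>
  </entry>
</feed>"""


def feed_to_rows(xml_text):
    """Return one list of field values per <m:properties> element."""
    root = ET.fromstring(xml_text)
    rows = []
    for props in root.findall(".//m:properties", NS):
        # findtext() returns the default when the child element is missing,
        # so an absent field becomes an empty cell instead of an error.
        row = [
            props.findtext("d:" + name, default="", namespaces=NS)
            for name in COLUMNS.values()
        ]
        rows.append(row)
    return rows


rows = feed_to_rows(SAMPLE)

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(list(COLUMNS))  # header row
writer.writerows(rows)          # one row per entry
print(buffer.getvalue())
```

With the real response, the SAMPLE string would be replaced by r2.content and the buffer by the opened CSV file.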