Reading a spreadsheet-like .xml with ElementTree
I am reading an xml file with ElementTree, but there is one cell whose data I cannot read.
I modified my file to make a reproducible example, shown below:
from xml.etree import ElementTree
import io
xmlf = """<?xml version="1.0"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook ss:ResourcesPackageName="" ss:ResourcesPackageVersion="" xmlns="urn:schemas-microsoft-com:office:spreadsheet"
xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
xmlns:html="http://www.w3.org/TR/REC-html40">
<Worksheet ss:Name="DigitalOutput" ss:IsDeviceType="true">
<Row ss:AutoFitHeight="0">
<Cell><Data ss:Type="String">A</Data><NamedCell ss:Name="_FilterDatabase"/></Cell>
<Cell><Data ss:Type="String">B</Data><NamedCell ss:Name="_FilterDatabase"/></Cell>
<Cell><Data ss:Type="String">C</Data><NamedCell ss:Name="_FilterDatabase"/></Cell>
<Cell ss:Index="7"><ss:Data ss:Type="String"
xmlns="http://www.w3.org/TR/REC-html40"><Font html:Color="#000000">CAN'T READ </Font><Font>THIS</Font></ss:Data><NamedCell
ss:Name="_FilterDatabase"/></Cell>
<Cell ss:Index="10"><Data ss:Type="String">D</Data><NamedCell
ss:Name="_FilterDatabase"/></Cell>
</Row>
</Worksheet>
</Workbook>"""
ss = "urn:schemas-microsoft-com:office:spreadsheet"
worksheet_label = '{%s}Worksheet' % ss
row_label = '{%s}Row' % ss
cell_label = '{%s}Cell' % ss
data_label = '{%s}Data' % ss
tree = ElementTree.parse(io.StringIO(xmlf))
root = tree.getroot()
for ws in root.findall(worksheet_label):
    for table in ws.findall(row_label):
        for c in table.findall(cell_label):
            data = c.find(data_label)
            print(data.text)
The output is:
A
B
C
None
D
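So the fourth cell is not read. Can you help me fix this?
The text content of the fourth cell belongs to two Font child elements, which are bound to a different namespace (http://www.w3.org/TR/REC-html40) than Data. Demonstration: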
for e in root.iter():
    text = e.text.strip() if e.text else None
    if text:
        print(e, text)
Output:
<Element {urn:schemas-microsoft-com:office:spreadsheet}Data at 0x7f8013d01dc8> A
<Element {urn:schemas-microsoft-com:office:spreadsheet}Data at 0x7f8013d01dc8> B
<Element {urn:schemas-microsoft-com:office:spreadsheet}Data at 0x7f8013d01dc8> C
<Element {http://www.w3.org/TR/REC-html40}Font at 0x7f8013d01e08> CAN'T READ
<Element {http://www.w3.org/TR/REC-html40}Font at 0x7f8013d01e48> THIS
<Element {urn:schemas-microsoft-com:office:spreadsheet}Data at 0x7f8013d01e48> D
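Because the text sits in descendants of Data rather than in Data itself, one possible way to collect it without naming the Font elements is Element.itertext(), which yields the text of an element and of everything nested inside it, in document order. The following is a minimal sketch on a cut-down copy of the workbook; the names sample, SS and cell_text are illustrative and do not come from the question.

```python
from xml.etree import ElementTree

# Cut-down version of the question's workbook (illustrative only).
sample = """<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:html="http://www.w3.org/TR/REC-html40">
<Worksheet><Row>
<Cell><Data ss:Type="String">A</Data></Cell>
<Cell><ss:Data ss:Type="String" xmlns="http://www.w3.org/TR/REC-html40"><Font html:Color="#000000">CAN'T READ </Font><Font>THIS</Font></ss:Data></Cell>
</Row></Worksheet>
</Workbook>"""

SS = "urn:schemas-microsoft-com:office:spreadsheet"


def cell_text(data):
    """Join the text of a Data element and all of its descendants."""
    return "".join(data.itertext())


root = ElementTree.fromstring(sample)
# Both Data elements live in the spreadsheet namespace, so one tag matches them all.
texts = [cell_text(d) for d in root.iter("{%s}Data" % SS)]
print(texts)
```

This treats plain cells and cells with nested formatting the same way, at the cost of discarding the formatting information carried by the Font elements.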
Question: Reading a spreadsheet like .xml with ElementTree
Documentation: The lxml.etree Tutorial - Namespaces
Define the namespaces to use:
ns = {'ss':"urn:schemas-microsoft-com:office:spreadsheet",
'html':"http://www.w3.org/TR/REC-html40"
}
Pass the namespaces to find(...) / findall(...):
tree = ElementTree.parse(io.StringIO(xmlf))
root = tree.getroot()
for ws in root.findall('ss:Worksheet', ns):
    for table in ws.findall('ss:Row', ns):
        for c in table.findall('ss:Cell', ns):
            data = c.find('ss:Data', ns)
            if data.text is None:
                text = []
                data = data.findall('html:Font', ns)
                for element in data:
                    text.append(element.text)
                data_text = ''.join(text)
                print(data_text)
            else:
                print(data.text)
Output:
A
B
C
CAN'T READ THIS
D
Tested with Python: 3.5
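For readers comparing this with the question's '{%s}Worksheet' % ss labels: as far as the standard library's ElementTree is concerned, a prefixed name looked up through the ns mapping and the same name written with the URI in braces should select the same elements. The sketch below checks that on a reduced sample; the sample string and the variable names by_prefix, by_uri and same are illustrative assumptions, not part of the answer.

```python
from xml.etree import ElementTree

# Reduced sample with a single cell (illustrative only).
sample = """<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
<Worksheet ss:Name="DigitalOutput"><Row><Cell><Data ss:Type="String">A</Data></Cell></Row></Worksheet>
</Workbook>"""

ns = {'ss': "urn:schemas-microsoft-com:office:spreadsheet"}
root = ElementTree.fromstring(sample)

# Prefix form: 'ss:' is resolved through the ns mapping.
by_prefix = root.findall('ss:Worksheet/ss:Row/ss:Cell/ss:Data', ns)

# Brace form: the namespace URI is spelled out in front of every tag.
q = '{%s}' % ns['ss']
by_uri = root.findall(q + 'Worksheet/' + q + 'Row/' + q + 'Cell/' + q + 'Data')

# Both searches return the very same Element objects.
same = len(by_prefix) == len(by_uri) and all(a is b for a, b in zip(by_prefix, by_uri))
print(same, by_prefix[0].text)
```

The prefixes in the mapping are local to the Python code; they only need to map to the right URIs and do not have to match the prefixes used inside the file.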