Is there a way to create an XML element tree?

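I'm currently writing some XSDs and DTDs to validate some XML files. I'm writing them by hand because I've had very bad experiences with XSD generators (for example, Oxygen).

However, I already have a sample XML file that I need to do this for, and it is really large; for example, one element has 4312 child elements.

Because of my bad experience with XSD generators, I would like to create a kind of XML tree that contains only the unique tags and attributes, so that I don't have to wade through repeated elements while reading the XML to write the XSD.

What I mean is, I have this XML (provided by W3):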

<?xml version="1.0" encoding="UTF-8"?>
<breakfast_menu>
<food some_attribute="1.0">
    <name>Belgian Waffles</name>
    <price>.95</price>
    <description>
   Two of our famous Belgian Waffles with plenty of real maple syrup
   </description>
    <calories>650</calories>
</food>
<food>
    <name>Strawberry Belgian Waffles</name>
    <price>.95</price>
    <description>
    Light Belgian waffles covered with strawberries and whipped cream
    </description>
    <calories>900</calories>
</food>
<food>
    <name>Berry-Berry Belgian Waffles</name>
    <price>.95</price>
    <description>
    Belgian waffles covered with assorted fresh berries and whipped cream
    </description>
    <calories>900</calories>
</food>
<food>
    <name>French Toast</name>
    <price>.50</price>
    <description>
    Thick slices made from our homemade sourdough bread
    </description>
    <calories>600</calories>
    <some_complex_type_element_1>
      <some_simple_type_element_1>Text.</some_simple_type_element_1>
    </some_complex_type_element_1>
</food>
<food>
    <name>Homestyle Breakfast</name>
    <price>.95</price>
    <description>
    Two eggs, bacon or sausage, toast, and our ever-popular hash browns
    </description>
    <calories>950</calories>
    <some_simple_type_element_2>Text.</some_simple_type_element_2>
</food>
</breakfast_menu>

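As you can see, there are 4 unique elements under the root.

These are:

What I want to achieve is some kind of tree representation of this XML that contains only the unique elements and no text.

So, based on my example (I don't care about the content inside the tags), there are 4 distinct elements under the root, and I would like to get another XML representation, or even an ASCII representation, of the document's structure, for example: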

<?xml version="1.0" encoding="UTF-8"?>
<breakfast_menu>
<food some_attribute="">
    <name></name>
    <price></price>
    <description></description>
    <calories></calories>
</food>
<food>
    <name></name>
    <price></price>
    <description></description>
    <calories></calories>
</food>
<food>
    <name></name>
    <price></price>
    <description></description>
    <calories></calories>
    <some_complex_type_element_1>
      <some_simple_type_element_1></some_simple_type_element_1>
    </some_complex_type_element_1>
</food>
<food>
    <name></name>
    <price></price>
    <description></description>
    <calories></calories>
    <some_simple_type_element_2></some_simple_type_element_2>
</food>
</breakfast_menu>

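Note that there are only tags, no actual values, and only unique tags. I also want to keep the attributes, but I don't care about their values, only that they exist.

The second option is some ASCII, for example: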

breakfast_menu
├── food some_attribute
│   ├── name
│   ├── price
│   ├── description
│   └── calories
├── food
│   ├── name
│   ├── price
│   ├── description
│   └── calories
├── food
│   ├── name
│   ├── price
│   ├── description
│   ├── calories
│   └── some_complex_type_element_1
│       └── some_simple_type_element_1
└── food
    ├── name
    ├── price
    ├── description
    ├── calories
    └── some_simple_type_element_2

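Do you know of any software that can generate something like this (preferably on a Mac)?

Or would this be feasible with Python and ElementTree?

I just need to generate something like this and am looking for the simplest solution. If you have a better idea (maybe there is a better approach), I'm open to any suggestion, so please let me know.

Thanks

Edit

With Power Query you can generate an "OK" representation of your XML; from my testing, it sort of works.

You can generate an XML structure like the one shown below; however, this is not the best solution, and it is not ideal for attributes either.

You can reproduce this result with steps similar to these:

But this is not the cleanest solution, and I'm still looking for ideas. Thanks!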

See if this meets your needs.

from simplified_scrapy import SimplifiedDoc, utils

xml = '''
<?xml version="1.0" encoding="UTF-8"?>
<breakfast_menu>
    <food some_attribute="1.0">
        <name>Belgian Waffles</name>
        <price>.95</price>
        <description>
    Two of our famous Belgian Waffles with plenty of real maple syrup
    </description>
        <calories>650</calories>
    </food>
    <food>
        <name>Strawberry Belgian Waffles</name>
        <price>.95</price>
        <description>
        Light Belgian waffles covered with strawberries and whipped cream
        </description>
        <calories>900</calories>
    </food>
    <food>
        <name>Berry-Berry Belgian Waffles</name>
        <price>.95</price>
        <description>
        Belgian waffles covered with assorted fresh berries and whipped cream
        </description>
        <calories>900</calories>
    </food>
    <food>
        <name>French Toast</name>
        <price>.50</price>
        <description>
        Thick slices made from our homemade sourdough bread
        </description>
        <calories>600</calories>
        <some_complex_type_element_1>
        <some_simple_type_element_1>Text.</some_simple_type_element_1>
        </some_complex_type_element_1>
    </food>
    <food>
        <name>Homestyle Breakfast</name>
        <price>.95</price>
        <description>
        Two eggs, bacon or sausage, toast, and our ever-popular hash browns
        </description>
        <calories>950</calories>
        <some_simple_type_element_2>Text.</some_simple_type_element_2>
    </food>
</breakfast_menu>
'''

def loop(node):
    # Collect every attribute name so its value can be blanked
    para = {}
    for k in node:
        if k=='tag' or k=='html': continue # Skip the node's own 'tag' and 'html' keys
        para[k] = ''
    if para: node.setAttrs(para) # Keep attribute names, blank their values
    children = node.children
    if children:
        for c in children:
            loop(c)
    else:
        if node.text:
            node.setContent('') # Remove the text value

doc = SimplifiedDoc(xml)
# Strip text values and blank attribute values
loop(doc.breakfast_menu)

# After stripping, duplicates have identical outerHtml; keep only the first of each
dicNode = {}
for node in doc.breakfast_menu.children:
    key = node.outerHtml
    if dicNode.get(key):
        node.remove() # Delete duplicate
    else:
        dicNode[key] = True

print(doc.html)

Result:

<?xml version="1.0" encoding="UTF-8"?>
<breakfast_menu>
    <food some_attribute="">
        <name></name>
        <price></price>
        <description></description>
        <calories></calories>
    </food>
    <food>
        <name></name>
        <price></price>
        <description></description>
        <calories></calories>
    </food>
    <food>
        <name></name>
        <price></price>
        <description></description>
        <calories></calories>
        <some_complex_type_element_1>
        <some_simple_type_element_1></some_simple_type_element_1>
        </some_complex_type_element_1>
    </food>
    <food>
        <name></name>
        <price></price>
        <description></description>
        <calories></calories>
        <some_simple_type_element_2></some_simple_type_element_2>
    </food>
</breakfast_menu>
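
Since the question also asks about Python and ElementTree, here is a minimal sketch of the same idea using only the standard library. It is not the method used above: `skeleton` is a hypothetical helper name, the sample is a trimmed version of the menu (`extra` and `inner` are placeholder tags), and unlike the code above it removes duplicate structures at every level rather than only directly under the root.

```python
import xml.etree.ElementTree as ET


def skeleton(elem):
    """Copy elem without text, with blanked attribute values,
    keeping only the first child of each distinct structure."""
    # Sort attribute names so attribute order does not affect the comparison key
    out = ET.Element(elem.tag, {name: "" for name in sorted(elem.attrib)})
    seen = set()
    for child in elem:
        shape = skeleton(child)
        # The serialized skeleton serves as the structural fingerprint
        key = ET.tostring(shape, encoding="unicode")
        if key not in seen:
            seen.add(key)
            out.append(shape)
    return out


# Trimmed sample (assumption: stands in for the real, much larger file)
sample = """<breakfast_menu>
  <food some_attribute="1.0"><name>Belgian Waffles</name><calories>650</calories></food>
  <food><name>Strawberry Belgian Waffles</name><calories>900</calories></food>
  <food><name>Berry-Berry Belgian Waffles</name><calories>900</calories></food>
  <food><name>French Toast</name><calories>600</calories><extra><inner>Text.</inner></extra></food>
</breakfast_menu>"""

root = skeleton(ET.fromstring(sample))
if hasattr(ET, "indent"):  # ET.indent exists in Python 3.9+
    ET.indent(root)
print(ET.tostring(root, encoding="unicode"))
```

For a real file, `ET.parse(path).getroot()` can replace `ET.fromstring(sample)`. This loads the whole document into memory, so it suits files that fit comfortably in RAM.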

For large files, try the following approach.

from simplified_scrapy import SimplifiedDoc, utils
from simplified_scrapy.core.regex_helper import replaceReg

filePath = 'test.xml'
doc = SimplifiedDoc()
doc.loadFile(filePath, lineByline=True)

utils.appendFile('dest.xml','<?xml version="1.0" encoding="UTF-8"?><breakfast_menu>')
dicNode = {}
for node in doc.getIterable('food'):
    key = node.outerHtml
    key = replaceReg(key, '>[^>]*?<', '><')  # Drop the text between tags
    key = replaceReg(key, '"[^"]*?"', '""')  # Blank the attribute values

    if not dicNode.get(key):  # First time this structure is seen
        dicNode[key] = True
        utils.appendFile('dest.xml', key)


utils.appendFile('dest.xml', '</breakfast_menu>')