How to add tagged text to an element from a simple string?
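Using Python lxml, I want to generate an etree.Element whose content is taken from a string. I have two cases:
- It is a simple string (e.g. "Hello world!").
- It is a tagged string, but for Python it is still just a string, and I do not know in advance that it contains markup (e.g. "Hello <value-of select=\"world\"/>!").
How can I handle the second case?
Here is a naive approach that does not work: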
>>> from lxml import etree
>>> string = "Hello <value-of select=\"world\"/>!"
>>> xml = etree.Element('root')
>>> xml.text = string
>>> etree.tostring(xml)
b'<root>Hello &lt;value-of select="world"/&gt;!</root>'
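I am well aware that, if I know the structure of my string, I have to use the tail attribute of etree.Element, as described in the lxml tutorial. So here is a way that works but does not generalize: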
>>> from lxml import etree
>>> xml2 = etree.Element('root')
>>> xml2.text = "Hello "
>>> valueof = etree.SubElement(xml2, 'value-of')
>>> valueof.set('select', 'world')
>>> valueof.tail = '!'
>>> etree.tostring(xml2)
b'<root>Hello <value-of select="world"/>!</root>'
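But how can I do this automatically, without knowing the exact string in advance?
I do not know how to parse the string in order to split it into its parts. Or maybe I should try a different approach.
I tried this: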
>>> from lxml import etree
>>> from io import StringIO
>>> string="Hello <value-of select=\"world\"/>!"
>>> tree = etree.parse(StringIO(string))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "src/lxml/lxml.etree.pyx", line 3427, in lxml.etree.parse (src/lxml/lxml.etree.c:81117)
File "src/lxml/parser.pxi", line 1828, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:118072)
File "src/lxml/parser.pxi", line 1848, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:118341)
File "src/lxml/parser.pxi", line 1729, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:116899)
File "src/lxml/parser.pxi", line 1063, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:110886)
File "src/lxml/parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:105109)
File "src/lxml/parser.pxi", line 706, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:106817)
File "src/lxml/parser.pxi", line 635, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:105671)
File "<string>", line 1
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1
It fails because etree.parse expects well-formed XML and my string has no root element. So I tried this, hoping it would be less strict:
>>> tree = etree.parse(StringIO(string), etree.XMLParser(recover=True))
>>> etree.tostring(tree)
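But the output is empty, so it seems I cannot parse my string in order to add the resulting tree to an existing one... which is what I need to do, because I build my XML from scratch.
Back to my question: how can I handle the two cases described above?
Just wrap the string (simple or tagged) in a root element so that it becomes well-formed XML.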
from lxml import etree
simple = "Hello world!"
tagged = "Hello <value-of select=\"world\"/>!"
xml1 = "<root>" + simple + "</root>"
xml2 = "<root>" + tagged + "</root>"
# fromstring() returns an Element object
elem1 = etree.fromstring(xml1)
elem2 = etree.fromstring(xml2)
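Since the question mentions building the tree from scratch, the same wrapping trick can be used to fill an element that already exists. What follows is only a sketch under a few assumptions: `set_content` and the `<wrapper>` tag are hypothetical names made up for this example, and treating a string that fails to parse (for instance one containing a bare `&`) as literal text is a design choice of the sketch, not something the question asked for. The import falls back to the standard library's ElementTree, which shares this part of the API, so the snippet also runs where lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # Fallback for environments without lxml; the calls used below
    # exist in both libraries.
    import xml.etree.ElementTree as etree


def set_content(element, content):
    """Replace the content of `element` with `content` (hypothetical helper).

    `content` may be plain text or text mixed with markup.
    """
    # Drop whatever the element contained before.
    for child in list(element):
        element.remove(child)

    try:
        # Wrap the fragment so the parser sees a single root element.
        wrapper = etree.fromstring("<wrapper>" + content + "</wrapper>")
    except etree.ParseError:
        # Assumption of this sketch: a string that is not well-formed
        # is kept as literal text and escaped on serialization.
        element.text = content
        return element

    # Leading text becomes .text; each child carries its own .tail.
    element.text = wrapper.text
    for child in list(wrapper):
        element.append(child)
    return element


root = etree.Element("root")
set_content(root, "Hello <value-of select=\"world\"/>!")
print(etree.tostring(root))
```

With the tagged string, `root.text` ends up as "Hello ", the only child is the `value-of` element, and its `tail` is "!". With a plain string there are no children and the whole string lands in `.text`. Note that silently falling back to plain text can hide genuinely broken markup, so re-raising the error may be the better choice in some applications.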