lxml object identifiers appear to be reused while objects are alive

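I'm using Python 3.6.8 and lxml-4.3.4 on Ubuntu.

What I'm after is breaking large XML content into fragment files to make it easier to work with, while keeping the source filename and line number of each parsed element so that I can form useful parse-time error messages. The errors I'll raise are specific to my application and apply even when the XML is well-formed.

Here are some sample XML fragment files: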

one.xml:

<?xml version="1.0" encoding="UTF-8" ?>
<data>
  <one>1</one>
  <one>11</one>
  <one>111</one>
  <one>1111</one>
</data>

two.xml:

<?xml version="1.0" encoding="UTF-8" ?>
<data>
  <two>2</two>
  <two>22</two>
  <two>222</two>
  <two>2222</two>
  <two>22222</two>
  <two>222222</two>
</data>

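My plan is to use lxml to parse each file and then simply stitch the element trees together under a single root. The rest of my program can then work with the complete tree.

If an element's content is invalid for my application, I want to report the fragment file and line number it came from. lxml already tracks the line number, but not the source file, so I want to track that myself. Note that I decided not to try to extend lxml's classes; instead I use a map from element object identifiers to fragment files, which I hoped would keep working even if lxml refactors its source code.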

from lxml import etree

# Too much data for one source file, so let's define
# fragment files, each of which looks like a stand
# alone XML file w/ header and root <data>...</data>
# to make syntax highlighters happy.
xmlFragmentFiles = ['one.xml', 'two.xml']

# lxml tracks line number for parsed elements, but not
# source filename. Rather than try to extend the deep
# inner classes of the module, let's try keeping a map
# from parsed elements to fragment file they just came
# from.
element2fragment = {}
def AddFragmentFileToETree(element, fragmentFile):
  # The entry we're just about to add.
  print('%s:%s' % (id(element), fragmentFile))
  element2fragment[id(element)] = fragmentFile
  for child in element:
    AddFragmentFileToETree(child, fragmentFile)

# Fabricate a root that we'll stitch each fragment's
# children onto as we parse them.
root = etree.fromstring('<data></data>')
AddFragmentFileToETree(root, 'Programmatic Root')

for filename in xmlFragmentFiles:
  # It doesn't seem to matter whether we create a new
  # parser per fragment, or reuse a single parser.
  parser = etree.XMLParser(remove_comments=True)
  subroot = etree.parse(filename, parser).getroot()
  for child in subroot:
    root.append(child)
    AddFragmentFileToETree(child, filename)

# Clearly the final desired tree is here, and presumably
# all the subelements we care about are reachable from
# the programmatic root meaning the objects are still
# live, so why did any object identifier get reused?
print(etree.tostring(
  root, encoding=str, pretty_print=True))

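When I run this, I can see the whole desired tree pretty-printed, containing every distinct element from the fragment files. But looking at the map entries we inserted, we can clearly see that object identifiers are being reused!?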

140611035114248:Programmatic Root
140611035114056:one.xml <-- see here
140611035114376:one.xml
140611035114440:one.xml
140611035114056:one.xml <-- and here
140611035114312:two.xml
140611035114120:two.xml
140611035114056:two.xml <-- and here
140611035114312:two.xml
140611035114120:two.xml
140611035114056:two.xml <-- and again
<data><one>1</one>
  <one>11</one>
  <one>111</one>
  <one>1111</one>
<two>2</two>
  <two>22</two>   <-- yet all distinct elements still exist
  <two>222</two>
  <two>2222</two>
  <two>22222</two>
  <two>222222</two>
</data>

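Any suggestions about what is happening with these objects? Maybe I should move away from lxml, given that it is a C library? I only switched to lxml to get line-number tracking.

I decided to look into extending/customizing the parser... and found the answer to this original question.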

https://lxml.de/element_classes.html

They warn that the Python Element proxies are stateless:

Element instances are created and garbage collected at need, so there is normally no way to predict when and how often a proxy is created for them.
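
The effect is not specific to lxml: Python only guarantees that id() values are unique among objects that are alive at the same time. Here is a minimal pure-Python sketch of that, where the Proxy class and iter_proxies helper are hypothetical stand-ins for on-demand element proxies and are not part of lxml:

```python
class Proxy:
    """Hypothetical stand-in for an on-demand proxy: a throwaway wrapper
    created around data that actually lives somewhere else."""

    def __init__(self, value):
        self.value = value


def iter_proxies(values):
    # A fresh wrapper is created for every item on every traversal.
    for value in values:
        yield Proxy(value)


values = ['1', '11', '111', '1111']

# No references kept: each wrapper becomes garbage once the loop moves on,
# so the interpreter is free to hand out the same id again later.
transient_ids = [id(p) for p in iter_proxies(values)]

# References kept: all wrappers are alive at once, so their ids must differ.
kept = list(iter_proxies(values))
kept_ids = [id(p) for p in kept]

print(len(set(transient_ids)), 'distinct ids without live references')
print(len(set(kept_ids)), 'distinct ids with live references')
```

How many distinct ids the first case produces depends on the interpreter's allocator, so only the second count is guaranteed to equal the number of items.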

They go on to say that if you really need them to carry state, you must keep a live reference to each one:

proxy_cache = list(root.iter())

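This worked for me. I had assumed that a live reference to the root would be enough, since elements hold live references to their subelements, but the proxies are evidently created on demand from the real tree maintained in C.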
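
For reference, here is a rough sketch of how the stitching loop might look with the proxies kept alive. It is not a definitive implementation: the fragments are inlined as byte strings (an assumption made so the snippet runs without files on disk), and the `fragments` and `proxy_cache` names are only illustrative.

```python
from lxml import etree

# Assumption for illustration: fragment contents are inlined here instead
# of being read from one.xml and two.xml on disk.
fragments = {
    'one.xml': b'<data>\n  <one>1</one>\n  <one>11</one>\n</data>',
    'two.xml': b'<data>\n  <two>2</two>\n  <two>22</two>\n  <two>222</two>\n</data>',
}

root = etree.fromstring('<data></data>')

# Holding a reference to every proxy keeps it alive, so its id() cannot be
# handed out to another element while the map is in use.
proxy_cache = [root]
element2fragment = {id(root): 'Programmatic Root'}

for filename, xml in fragments.items():
    subroot = etree.fromstring(xml)
    # Copy the child list first, because append() moves each child out of
    # subroot and into root.
    for child in list(subroot):
        root.append(child)
        for element in child.iter():
            proxy_cache.append(element)
            element2fragment[id(element)] = filename

# Every element reachable from root now resolves to its fragment file.
for element in root.iter():
    print('%s:%s <%s>' % (
        element2fragment[id(element)], element.sourceline, element.tag))
```

The map is only valid for as long as proxy_cache is alive, so the two should be stored and passed around together.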