Parallelize XML Reading gone wrong

Question

I work with large XML files (~2 GB), and until now the reading was done this way:

private void readParameters(XmlReader m, Measurement meas)
{
    while (m.ReadToFollowing("PAR"))
    {
        XmlReader par = m.ReadSubtree();
        readParameter(par, meas);
        par.Close();
        ((IDisposable)par).Dispose();
    }
}

It works well, but it is too slow. So I put my science to work and tried to read in parallel:

private void readParameters(XmlReader m, Measurement meas)
{
    List<XmlReader> readers = new List<XmlReader>();
    while (m.ReadToFollowing("PAR"))
    {
        readers.Add(m.ReadSubtree());
    }

    Parallel.ForEach(readers, reader =>
        {
            readParameter(reader, meas);
            reader.Close();
            ((IDisposable)reader).Dispose();
        }
    );
}

But it reads the same node on every iteration of the foreach. How can I fix this? Is this even a good way to parallelize the reading?

Answer 1

Because, as written in the remarks for ReadSubtree:

ReadSubtree can be called only on element nodes. When the entire sub-tree has been read, calls to the Read method returns false. When the new XmlReader has been closed, the original XmlReader will be positioned on the EndElement node of the sub-tree. Thus, if you called the ReadSubtree method on the start tag of the book element, after the sub-tree has been read and the new XmlReader has been closed, the original XmlReader is positioned on the end tag of the book element. You should not perform any operations on the original XmlReader until the new XmlReader has been closed. This action is not supported and can result in unpredictable behavior.

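Clearly this method isn't thread safe. You can't "put aside" some ReadSubtree() readers and then try to use them later: each one has to be read and closed before the original XmlReader moves on.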
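The constraint in that remark can be sketched as follows: each subtree reader is consumed and closed before the parent reader advances. This is only a minimal illustration; the sample XML string and the class name are assumptions made up for the example, not part of the question.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

static class SubtreeDemo
{
    // Hypothetical sample input standing in for the real file.
    const string Xml = "<ROOT><PAR id=\"1\">a</PAR><PAR id=\"2\">b</PAR><PAR id=\"3\">c</PAR></ROOT>";

    static void Main()
    {
        var ids = new List<string>();
        using (XmlReader m = XmlReader.Create(new StringReader(Xml)))
        {
            while (m.ReadToFollowing("PAR"))
            {
                // The subtree reader is read to its end and disposed
                // before m is touched again.
                using (XmlReader sub = m.ReadSubtree())
                {
                    XElement el = XElement.Load(sub);
                    ids.Add((string)el.Attribute("id"));
                }
            }
        }
        Console.WriteLine(string.Join(",", ids));
    }
}
```

Once the data lives in an XElement it no longer depends on the reader, so it can safely be handed to another thread.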

In general, considering that XmlReader

represents a reader that provides fast, noncached, forward-only access to XML data.

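clearly you can't do whatever you want with it. Because the Stream used by the XmlReader may be forward-only, cloning the reader would require either that the Stream be "forked" (one copy per XmlReader clone), which Stream doesn't guarantee is possible, or that the XmlReader cache the nodes, which XmlReader guarantees it won't do.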

As suggested by @mike z, you could do:

List<XElement> elements = new List<XElement>();

while (m.ReadToFollowing("PAR"))
{
    elements.Add(XElement.Load(m.ReadSubtree()));
}

Parallel.ForEach(elements, el =>
{
    // Process each fully loaded PAR element here
});

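But I'm not sure this will change anything, apart from your memory usage (note that more than 2 GB of memory will be gone :-) ), because now the whole XML parsing is done on the "main" thread and all the PAR elements are loaded into XElement objects.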
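If memory is the main worry with the list approach, one possible variation (not from the original answer) is a bounded producer/consumer queue: a single thread owns the XmlReader and parses, while workers process elements as they arrive. The sketch below uses BlockingCollection<T> with an arbitrary capacity of 2; the sample XML, the class name and the summing "work" are assumptions for illustration only.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using System.Xml;
using System.Xml.Linq;

static class BoundedPipelineSketch
{
    // Hypothetical sample input standing in for the real 2 GB file.
    const string Xml = "<ROOT><PAR>1</PAR><PAR>2</PAR><PAR>3</PAR><PAR>4</PAR></ROOT>";

    static void Main()
    {
        // At most 2 parsed elements wait in the queue at any time (arbitrary bound).
        using (var queue = new BlockingCollection<XElement>(boundedCapacity: 2))
        {
            int sum = 0;

            // Consumers: process elements in parallel as they become available.
            Task[] workers = Enumerable.Range(0, Environment.ProcessorCount)
                .Select(_ => Task.Run(() =>
                {
                    foreach (XElement el in queue.GetConsumingEnumerable())
                    {
                        // Placeholder for the real work (readParameter in the question).
                        Interlocked.Add(ref sum, (int)el);
                    }
                }))
                .ToArray();

            try
            {
                // Producer: only this thread ever touches the XmlReader.
                using (XmlReader m = XmlReader.Create(new StringReader(Xml)))
                {
                    while (m.ReadToFollowing("PAR"))
                    {
                        using (XmlReader sub = m.ReadSubtree())
                        {
                            queue.Add(XElement.Load(sub)); // blocks while the queue is full
                        }
                    }
                }
            }
            finally
            {
                queue.CompleteAdding(); // lets the consumers' loops finish
            }

            Task.WaitAll(workers);
            Console.WriteLine(sum);
        }
    }
}
```

The parsing itself is still sequential here, so this only helps if the per-element work is the expensive part.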

Or you could try:

public sealed class MyClass : IEnumerable<XElement>, IDisposable
{
    public readonly XmlReader Reader;

    public MyClass(XmlReader reader)
    {
        Reader = reader;
    }

    // The class is sealed, so a plain Dispose() is enough (no Dispose(bool) pattern needed)
    public void Dispose()
    {
        Reader.Dispose();
    }

    public IEnumerator<XElement> GetEnumerator()
    {
        while (Reader.ReadToFollowing("PAR"))
        {
            yield return XElement.Load(Reader.ReadSubtree());
        }
    }

    System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}

private static void readParameters(XmlReader m, Measurement meas)
{
    var enu = new MyClass(m);

    Parallel.ForEach(enu, el =>
    {
        // el is a fully loaded PAR XElement; you do the work here
    });
}

Now Parallel.ForEach is fed by the enumerator of MyClass (forgive me for the name :-) ), which loads the subtrees lazily.
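One thing to keep in mind with a lazy source: by default Parallel.ForEach pulls items from an IEnumerable in chunks and buffers them. If that matters for memory, a partitioner created with EnumerablePartitionerOptions.NoBuffering takes items one at a time. The following is a rough sketch under that assumption; ReadPars is a hypothetical stand-in for MyClass, and the sample XML is made up.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;
using System.Xml;
using System.Xml.Linq;

static class NoBufferingSketch
{
    // Hypothetical sample input standing in for the real file.
    const string Xml = "<ROOT><PAR>1</PAR><PAR>2</PAR><PAR>3</PAR></ROOT>";

    // Lazily yields one PAR element at a time, like MyClass.GetEnumerator().
    static IEnumerable<XElement> ReadPars(XmlReader reader)
    {
        while (reader.ReadToFollowing("PAR"))
        {
            XElement el;
            using (XmlReader sub = reader.ReadSubtree())
            {
                el = XElement.Load(sub); // fully materialized before the reader moves on
            }
            yield return el;
        }
    }

    static void Main()
    {
        int count = 0;
        using (XmlReader m = XmlReader.Create(new StringReader(Xml)))
        {
            // NoBuffering: items are taken from the source one at a time
            // instead of being handed out in buffered chunks.
            var source = Partitioner.Create(ReadPars(m), EnumerablePartitionerOptions.NoBuffering);

            Parallel.ForEach(source, el =>
            {
                // Placeholder for the real per-element work.
                Interlocked.Increment(ref count);
            });
        }
        Console.WriteLine(count);
    }
}
```

Either way the XML is still parsed one element at a time; only the work done on each XElement runs in parallel.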

c#

xml

parallel-processing