Parallelizing XML reading gone wrong
I work with large XML files (~2 GB), and until now the reading was done like this:
private void readParameters(XmlReader m, Measurement meas)
{
    while (m.ReadToFollowing("PAR"))
    {
        XmlReader par = m.ReadSubtree();
        readParameter(par, meas);
        par.Close();
        ((IDisposable)par).Dispose();
    }
}
It works fine, but it is too slow. So I tried to be clever and read in parallel:
private void readParameters(XmlReader m, Measurement meas)
{
    List<XmlReader> readers = new List<XmlReader>();
    while (m.ReadToFollowing("PAR"))
    {
        readers.Add(m.ReadSubtree());
    }
    Parallel.ForEach(readers, reader =>
    {
        readParameter(reader, meas);
        reader.Close();
        ((IDisposable)reader).Dispose();
    });
}
But it reads the same node on every iteration of the foreach. How can I fix this? Is this even a good way to parallelize the reading?
Because, as written in the remarks for ReadSubtree:
ReadSubtree can be called only on element nodes. When the entire sub-tree has been read, calls to the Read method returns false. When the new XmlReader has been closed, the original XmlReader will be positioned on the EndElement node of the sub-tree. Thus, if you called the ReadSubtree method on the start tag of the book element, after the sub-tree has been read and the new XmlReader has been closed, the original XmlReader is positioned on the end tag of the book element.
You should not perform any operations on the original XmlReader until the new XmlReader has been closed. This action is not supported and can result in unpredictable behavior.
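Clearly this method isn't thread safe. You can't "put aside" some ReadSubtree() readers and then try to use them later: each one is only a view over the original XmlReader, which has already moved on.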
In general, considering that XmlReader
represents a reader that provides fast, noncached, forward-only access to XML data.
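clearly you can't do whatever you want with it. In general, because the Stream used by the XmlReader could be forward-only, cloning the reader would require either that the Stream be "forked" (one clone for each "copy" of the XmlReader), which a Stream does not guarantee is possible, or that the XmlReader cache the nodes, which XmlReader guarantees it will not do.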
As suggested by @mike z, you could do:
List<XElement> elements = new List<XElement>();
while (m.ReadToFollowing("PAR"))
{
    // Materialize each PAR subtree as a standalone XElement
    elements.Add(XElement.Load(m.ReadSubtree()));
}
Parallel.ForEach(elements, el =>
{
    // Process el here
});
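But I'm not sure this will change anything, apart from your memory usage (note that more than 2 GB of memory will be gone :-) ), because now the whole XML parsing is done on the "main" thread, and all the PAR elements are read into XElement objects.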
Or you could try:
public sealed class MyClass : IEnumerable<XElement>, IDisposable
{
    public readonly XmlReader Reader;

    public MyClass(XmlReader reader)
    {
        Reader = reader;
    }

    // The class is sealed, so a plain Dispose() is enough
    public void Dispose()
    {
        Reader.Dispose();
    }

    public IEnumerator<XElement> GetEnumerator()
    {
        while (Reader.ReadToFollowing("PAR"))
        {
            // Close the subtree reader before the original reader is advanced again
            using (XmlReader subtree = Reader.ReadSubtree())
            {
                yield return XElement.Load(subtree);
            }
        }
    }

    System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}
private static void readParameters(XmlReader m, Measurement meas)
{
    var enu = new MyClass(m);
    Parallel.ForEach(enu, el =>
    {
        // el is the XElement for one PAR node; you do the work here
    });
}
Now Parallel.ForEach is fed by the enumerator MyClass (sorry for the name :-) ), which lazily loads the subtrees.
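To make the lazy-loading idea concrete, here is a minimal self-contained sketch of the same pattern on a tiny in-memory document. The names LazyParDemo, ReadPars and CollectNames, the "name" attribute and the sample XML are all made up for illustration; they are not from the question. It assumes the per-element work only touches thread-safe state (here a ConcurrentBag), which your Measurement object would also need to guarantee.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using System.Xml;
using System.Xml.Linq;

public static class LazyParDemo
{
    // Yields each PAR element as a standalone XElement.
    // The XmlReader is only advanced from inside this iterator,
    // and Parallel.ForEach pulls from a shared enumerator one item at a time.
    public static IEnumerable<XElement> ReadPars(XmlReader reader)
    {
        while (reader.ReadToFollowing("PAR"))
        {
            // The subtree reader is closed before the original reader moves on.
            using (XmlReader subtree = reader.ReadSubtree())
            {
                yield return XElement.Load(subtree);
            }
        }
    }

    // Hypothetical "work": collect the name attribute of every PAR element.
    public static List<string> CollectNames(string xml)
    {
        // The loop body runs on several threads, so the results go into
        // a thread-safe collection.
        var names = new ConcurrentBag<string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            Parallel.ForEach(ReadPars(reader), el =>
            {
                names.Add((string)el.Attribute("name"));
            });
        }
        // Completion order is not deterministic, so sort before returning.
        return names.OrderBy(n => n).ToList();
    }

    public static void Main()
    {
        const string xml =
            "<ROOT><PAR name=\"a\"/><PAR name=\"b\"/><PAR name=\"c\"/></ROOT>";
        Console.WriteLine(string.Join(",", CollectNames(xml)));
    }
}
```

Only the work inside the lambda runs in parallel here; the XML parsing itself is still sequential. So this helps only if processing each PAR element costs more than parsing it.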