How to get all indexed terms from a large Lucene.Net index?
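How can all indexed (but not stored) terms be retrieved from a large index in Lucene.Net?
I am doing this because I am moving from Lucene.Net to the latest Apache Lucene version, and the index format has changed several times between those versions. I am migrating the data by reading the terms and re-indexing them into an index in the new format. I am aware of the Lucene codecs package, but it does not provide backward compatibility far enough back to cover the format used by Lucene.Net.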
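There are similar questions, such as "Find list of terms indexed by Lucene".
However, the problem with that approach is that IndexReader.Terms reads every term from the index, which results in an OutOfMemoryException for large indexes.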
How can I get all terms from a large index in a sane way, without the risk of running out of memory?
Sample code (an OutOfMemoryException is thrown when calling reader.Terms(orderBy)):
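// Requires: using System.Collections.Generic; using Lucene.Net.Index; using Lucene.Net.Store;
var results = new List<string>();
// Seek to the first term of "MyField" (terms are ordered by field, then by text).
var orderBy = new Term("MyField", string.Empty);
using (var reader = IndexReader.Open(FSDirectory.Open(_indexPath), true))
using (var termEnum = reader.Terms(orderBy))
{
    for (var term = termEnum.Term; term != null; termEnum.Next(), term = termEnum.Term)
    {
        // Stop as soon as the enumeration moves on to another field.
        if (term.Field != "MyField")
        {
            break;
        }
        // Every term text is kept in memory here.
        results.Add(term.Text);
    }
}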
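Looking at the code, the only reason you would run out of memory in this case appears to be that you are writing all of the terms to a List<string>. To avoid running out of memory, you should save the strings to disk instead.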
// Requires: using System.IO; using Lucene.Net.Index; using Lucene.Net.Store;
var orderBy = new Term("MyField", string.Empty);
using (var reader = IndexReader.Open(FSDirectory.Open(_indexPath), true))
using (var termEnum = reader.Terms(orderBy))
using (var stream = new FileStream("TheFile.txt", FileMode.Create, FileAccess.Write))
using (var writer = new StreamWriter(stream))
{
    for (var term = termEnum.Term; term != null; termEnum.Next(), term = termEnum.Term)
    {
        // Stop as soon as the enumeration moves on to another field.
        if (term.Field != "MyField")
        {
            break;
        }
        // Write each term straight to the file instead of collecting it in memory.
        writer.WriteLine(term.Text);
    }
}
While this may answer your question, the fact that you are trying to pull more terms out of the index than you have memory for is a sign that you are asking the wrong question. I suggest you ask another question that exemplifies the actual task you are trying to do - most likely there is a better (more efficient) way to do it than to read all of this raw data from the index.
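If the terms are going to be processed right away rather than dumped to a file, the same loop can be wrapped in an iterator so that only one term is held at a time. The following is a minimal sketch, assuming the Lucene.Net 3.0.x API used in the snippets above; the class name TermStreaming and the method name EnumerateTerms are hypothetical and not part of Lucene.Net.

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Index;
using Lucene.Net.Store;

public static class TermStreaming
{
    // Hypothetical helper (not part of Lucene.Net): lazily yields the text of
    // every term in one field, so the caller never holds the full list.
    public static IEnumerable<string> EnumerateTerms(string indexPath, string field)
    {
        using (var directory = FSDirectory.Open(new DirectoryInfo(indexPath)))
        using (var reader = IndexReader.Open(directory, true)) // true = read-only
        using (var termEnum = reader.Terms(new Term(field, string.Empty)))
        {
            // The enumeration starts at the first term >= (field, "").
            // Stop when it is exhausted or moves on to another field.
            for (var term = termEnum.Term;
                 term != null && term.Field == field;
                 term = termEnum.Next() ? termEnum.Term : null)
            {
                yield return term.Text;
            }
        }
    }
}
```

A caller would consume it with foreach (var text in TermStreaming.EnumerateTerms(indexPath, "MyField")) { ... } and handle each term inside the loop body. The reader stays open until the enumeration finishes or is disposed, so calling ToList() on the result would bring the original memory problem back.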