How to get all indexed terms from a large Lucene.Net index?

Question

How can I retrieve all indexed (but not stored) terms from a large index in Lucene.Net?

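The reason I am doing this is that I am moving from Lucene.Net to the latest Apache Lucene version, and the index format has changed several times between these versions. I am migrating the data by reading the terms and then re-indexing them into an index in the new format. I am aware of the Lucene codec packages, but they do not provide backward compatibility far enough back for the format used by Lucene.Net.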

There are similar questions, such as "Find list of terms indexed by Lucene".

However, the problem with the approach described there is that IndexReader.Terms reads every term from the index, which causes an OutOfMemoryException on large indexes.

How can I get all terms from a large index in a sane way, without the risk of running out of memory?

Sample code (throws an OutOfMemoryException when calling reader.Terms(orderBy)):

var results = new List<string>();
var orderBy = new Term("MyField", string.Empty);
using (var reader = IndexReader.Open(FSDirectory.Open(_indexPath), true))
using (var termEnum = reader.Terms(orderBy))
{
    for (var term = termEnum.Term; term != null; termEnum.Next(), term = termEnum.Term)
    {
        if (term.Field != "MyField")
        {
            break;
        }
        results.Add(term.Text);
    }
}

Answer 1

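Looking at the code, the only reason you could be running out of memory here appears to be that you are writing all of the terms to a List<string>. To avoid running out of memory, you should save the strings to disk instead.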

var orderBy = new Term("MyField", string.Empty);
using (var reader = IndexReader.Open(FSDirectory.Open(_indexPath), true))
using (var termEnum = reader.Terms(orderBy))
using (var stream = new FileStream("TheFile.txt", FileMode.Create, FileAccess.Write))
using (var writer = new StreamWriter(stream))
{
    for (var term = termEnum.Term; term != null; termEnum.Next(), term = termEnum.Term)
    {
        if (term.Field != "MyField")
        {
            break;
        }
        writer.WriteLine(term.Text);
    }
}
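If the terms are going to be fed straight into another process (such as the re-indexing step mentioned in the question), an intermediate file may not be needed at all. The following is a minimal sketch, not a tested implementation: it wraps the same TermEnum loop in a C# iterator so that only one term is held at a time. The class name TermStreaming and the method name EnumerateTerms are hypothetical; the Lucene.Net calls are the same ones used in the code above.

```csharp
using System.Collections.Generic;
using Lucene.Net.Index;

public static class TermStreaming
{
    // Hypothetical helper: lazily yields the text of each term in one field.
    // Nothing is buffered; a term is produced only when the caller asks for it.
    public static IEnumerable<string> EnumerateTerms(IndexReader reader, string field)
    {
        // Position the enumerator at the first term of the field,
        // exactly as the original code does with an empty term text.
        using (var termEnum = reader.Terms(new Term(field, string.Empty)))
        {
            for (var term = termEnum.Term; term != null; termEnum.Next(), term = termEnum.Term)
            {
                // Terms are ordered by field, so a different field name
                // means this field's terms are exhausted.
                if (term.Field != field)
                {
                    yield break;
                }
                yield return term.Text;
            }
        }
    }
}
```

A caller would open the reader as before and consume the sequence with foreach, for example `foreach (var text in TermStreaming.EnumerateTerms(reader, "MyField")) { ... }`. Because the iterator is evaluated lazily, the reader has to stay open until the loop finishes, and calling ToList() on the result would bring back the original memory problem.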

While this may answer your question, the fact that you are trying to pull more terms out of the index than you have memory for is a sign that you are asking the wrong question. I suggest you ask another question that exemplifies the actual task you are trying to do - most likely there is a better (more efficient) way to do it than to read all of this raw data from the index.

c#

lucene

lucene.net