Getting a MaxBytesLengthExceededException for a TextField

I am getting an org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException while trying to insert a list of 9 elements into Solr, where one of the elements is 242905 bytes long.

The error:

{"error":{
    "metadata":[
      "error-class","org.apache.solr.common.SolrException",
      "root-error-class","org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException",
      "error-class","org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException",
      "root-error-class","org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException"],
    "msg":"Async exception during distributed update: Error from server at http://solr-host:8983/solr/search_collection_xx: Bad Request \n\n request: http://solr-host:8983/solr/search_collection_xx \n\n Remote error message: Exception writing document id <document_id> to the index; possible analysis error: Document contains at least one immense term in field=\"text_field_name\" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[115, 97, 115, 109, 101, 45, 100, 97, 109, 101, 46, 99, 111, 109, 47, 108, 121, 99, 107, 97, 47, 37, 50, 50, 37, 50, 48, 109, 101, 116]...', original message: bytes can be at most 32766 in length; got 242905. Perhaps the document has an indexed string field (solr.StrField) which is too large",
    "code":400}
}

The relevant parts of the Solr schema:

    <dynamicField name="text_field_*"  indexed="true" stored="true" multiValued="true" type="case_insensitive_text" />

    <fieldType name="case_insensitive_text" class="solr.TextField" multiValued="false">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

In the official documentation it is stated that StrField has a hard limit of slightly less than 32k, which also shows up in the error. But here we are using TextField, and according to the answer linked above, 242905 bytes should not be a problem (given that the field is not multi-valued).

So I would like to know what is wrong with inserting this amount of data, and whether there is a way to avoid the given exception.

You are running into this issue because you are using KeywordTokenizerFactory, which keeps the whole text as-is (no tokenization). In your case this produces one immense term, which cannot be stored.

Each individual term in Solr/Lucene is still limited to 32k, which is why you get this exception.

The relevant piece of code in Lucene:

        if (len2 > BYTE_BLOCK_SIZE) {
          throw new MaxBytesLengthExceededException("bytes can be at most "
              + (BYTE_BLOCK_SIZE - 2) + " in length; got " + bytes.length);
        }
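Since the limit applies to the UTF-8 encoding of a single term, a client-side guard can catch oversized values before they reach Solr. This is a minimal sketch; the `TermLengthCheck` class name is hypothetical, and 32766 mirrors the `BYTE_BLOCK_SIZE - 2` from the snippet above:

```java
import java.nio.charset.StandardCharsets;

public class TermLengthCheck {
    // Lucene's per-term cap: BYTE_BLOCK_SIZE (32768) minus 2 length-prefix bytes.
    static final int MAX_TERM_BYTES = 32766;

    /** Returns true if the value fits in a single indexed term. */
    static boolean fitsAsSingleTerm(String value) {
        // The limit is on UTF-8 bytes, not on the character count.
        return value.getBytes(StandardCharsets.UTF_8).length <= MAX_TERM_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(fitsAsSingleTerm("a short value"));   // fits
        System.out.println(fitsAsSingleTerm("x".repeat(242905))); // the failing size from the error
    }
}
```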

To avoid the limit, you should select a tokenizer that splits your text into multiple terms. Among the most useful ones are WhitespaceTokenizerFactory and StandardTokenizerFactory, and of course Solr/Lucene offers many more.
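For example, the fieldType from the question could be rewritten with StandardTokenizerFactory. This is a sketch only; note that moving away from KeywordTokenizerFactory changes matching semantics from whole-value to per-token, so queries against this field will behave differently:

    <fieldType name="case_insensitive_text" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

If whole-value matching must be preserved, another option worth investigating is keeping the KeywordTokenizerFactory and adding a token filter that truncates or drops over-long tokens, at the cost of not indexing the tail of such values.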