Indexing PDF/Word documents using the ingest-attachment plugin in Elasticsearch via Java code

I am trying to index my Word/PDF documents. To do that, I wrote a Java utility that encodes my files to base64, and I am now trying to index them in Elasticsearch.

The Java code below successfully encodes my file to base64, but I am not sure how to then index it in Elasticsearch.

public static void main(String args[]) throws IOException {
    String filePath = "D:\\1SearchEngine\\testing.pdf";
    String encodedfile = null;
    RestHighLevelClient restHighLevelClient = null;
    File file = new File(filePath);
    try {
        FileInputStream fileInputStreamReader = new FileInputStream(file);
        byte[] bytes = new byte[(int) file.length()];
        fileInputStreamReader.read(bytes);
        encodedfile = new String(Base64.getEncoder().encodeToString(bytes));
        //System.out.println(encodedfile);
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }

    try {
        if (restHighLevelClient != null) {
            restHighLevelClient.close();
        }
    } catch (final Exception e) {
        System.out.println("Error closing ElasticSearch client: ");
    }

    try {
        restHighLevelClient = new RestHighLevelClient(RestClient.builder(new HttpHost("localhost", 9200, "http"),
                new HttpHost("localhost", 9201, "http")));
    } catch (Exception e) {
        System.out.println(e.getMessage());
    }

    IndexRequest request = new IndexRequest( "attach_local", "doc", "103");   
    Map<String, Object> jsonMap = new HashMap<>();
    jsonMap.put("resume", "Karthikeyan");
    jsonMap.put("postDate", new Date());
    jsonMap.put("resume", encodedfile);
    try {
        IndexResponse response = restHighLevelClient.index(request);
    } catch(ElasticsearchException e) {
        if (e.status() == RestStatus.CONFLICT) {

        }
    }
}
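
The encoding step in the code above does not depend on Elasticsearch, so it can be checked on its own. Below is a minimal sketch, assuming a hypothetical helper class `Base64FileEncoder` (not part of the original code) and using a temporary file in place of the real PDF. It reads the file with `Files.readAllBytes`, because a single `InputStream.read(byte[])` call is not guaranteed to fill the buffer.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

final class Base64FileEncoder {

    // Hypothetical helper: reads the whole file and returns its base64 form.
    static String encodeFile(Path path) throws IOException {
        // Files.readAllBytes reads to end of file and closes the file afterwards,
        // unlike a single InputStream.read(byte[]) call, which may return early.
        byte[] bytes = Files.readAllBytes(path);
        return Base64.getEncoder().encodeToString(bytes);
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for the real PDF/Word document.
        Path sample = Files.createTempFile("sample", ".txt");
        try {
            Files.write(sample, "hello".getBytes(StandardCharsets.UTF_8));
            System.out.println(encodeFile(sample)); // prints aGVsbG8=
        } finally {
            Files.delete(sample);
        }
    }
}
```

The returned string is the value that would be stored under the "resume" key of the document.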

I am using Elasticsearch version 6.2.3, and I have installed version 6.3.0 of the ingest-attachment plugin.

I am using the following dependency for the Elasticsearch client:

<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>elasticsearch-rest-high-level-client</artifactId>
    <version>6.1.2</version>
</dependency>

Here are my mapping and pipeline details:

PUT attach_local
{
  "mappings" : {
    "doc" : {
      "properties" : {
        "attachment" : {
          "properties" : {
            "content" : {
              "type" : "binary"
            },
            "content_length" : {
              "type" : "long"
            },
            "content_type" : {
              "type" : "text"
            },
            "language" : {
              "type" : "text"
            }
          }
        },
        "resume" : {
          "type" : "text"
        }
      }
    }
  }
}

PUT _ingest/pipeline/attach_local
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "resume"
      }
    }
  ]
}

Now, when indexing the document from Java, I get the following error:
Exception in thread "main" org.elasticsearch.action.ActionRequestValidationException: Validation Failed: 1: source is missing;2: content type is missing;
    at org.elasticsearch.action.ValidateActions.addValidationError(ValidateActions.java:26)
    at org.elasticsearch.action.index.IndexRequest.validate(IndexRequest.java:153)
    at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:436)
    at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:429)
    at org.elasticsearch.client.RestHighLevelClient.index(RestHighLevelClient.java:312)
    at com.es.utility.DocumentIndex.main(DocumentIndex.java:82)

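I finally found the solution for indexing PDF/Word documents in Elasticsearch through the Java API. The request in the question never received a document body, which is what "source is missing" refers to. The fix is to attach the map with .source(jsonMap) and to name the ingest pipeline with .setPipeline("attach_local"), so that the attachment processor runs on the "resume" field.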

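import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

import org.apache.http.HttpHost;
import org.elasticsearch.ElasticsearchException;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.rest.RestStatus;

public class DocumentIndex {
    public static void main(String[] args) throws IOException {
        // Both backslashes must be escaped; a bare "\t" would be read as a tab.
        String filePath = "D:\\1SearchEngine\\testing.pdf";

        // Read the whole file and encode it as base64 for the attachment processor.
        byte[] bytes = Files.readAllBytes(Paths.get(filePath));
        String encodedfile = Base64.getEncoder().encodeToString(bytes);

        RestHighLevelClient restHighLevelClient = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http"),
                        new HttpHost("localhost", 9201, "http")));
        try {
            Map<String, Object> jsonMap = new HashMap<>();
            jsonMap.put("Name", "Karthikeyan");
            jsonMap.put("postDate", new Date());
            jsonMap.put("resume", encodedfile);

            // source() supplies the document body; setPipeline() routes it
            // through the "attach_local" ingest pipeline.
            IndexRequest request = new IndexRequest("attach_local", "doc", "104")
                    .source(jsonMap)
                    .setPipeline("attach_local");

            IndexResponse response = restHighLevelClient.index(request);
            System.out.println(response.getResult());
        } catch (ElasticsearchException e) {
            if (e.status() == RestStatus.CONFLICT) {
                // A version conflict was reported for this document.
                System.out.println("Version conflict: " + e.getMessage());
            } else {
                throw e;
            }
        } finally {
            // Release the client's underlying HTTP resources.
            restHighLevelClient.close();
        }
    }
}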

Mapping and pipeline details:

PUT attach_local
{
  "mappings" : {
    "doc" : {
      "properties" : {
        "attachment" : {
          "properties" : {
            "content" : {
              "type" : "binary"
            },
            "content_length" : {
              "type" : "long"
            },
            "content_type" : {
              "type" : "text"
            },
            "language" : {
              "type" : "text"
            }
          }
        },
        "resume" : {
          "type" : "text"
        }
      }
    }
  }
}


PUT _ingest/pipeline/attach_local
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "resume"
      }
    }
  ]
}
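
To check the pipeline independently of the high-level client, the same kind of document can also be sent over plain HTTP, since the index API accepts a `pipeline` query parameter. The following is only a sketch under stated assumptions: the class `PipelineIndexRequest` and its methods are hypothetical names, the host is assumed to be `http://localhost:9200`, and the body carries a short placeholder base64 string rather than a real PDF.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

final class PipelineIndexRequest {

    // Builds the index URL; the "pipeline" query parameter selects the ingest pipeline.
    static String indexUrl(String host, String index, String type, String id, String pipeline) {
        return host + "/" + index + "/" + type + "/" + id + "?pipeline=" + pipeline;
    }

    // Builds a minimal JSON body. Base64 text contains only A-Z, a-z, 0-9,
    // '+', '/' and '=', so it needs no JSON escaping.
    static String body(String base64) {
        return "{\"resume\":\"" + base64 + "\"}";
    }

    // Sends the document with PUT and returns the HTTP status code.
    static int send(String url, String json) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("PUT");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        }
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed host and placeholder content; replace with real values.
        String url = indexUrl("http://localhost:9200", "attach_local", "doc", "104", "attach_local");
        int status = send(url, body("aGVsbG8="));
        System.out.println(url + " -> HTTP " + status);
    }
}
```

Setting the Content-Type header explicitly matters here, because the request body is JSON and the server needs to be told so.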