How does incremental indexing happen for new data in a datasource?

Question

My blob storage has blobs like new/1.json and new/2.json.

I have an index named new-index, an indexer named new-indexer, and a datasource named new-datasource. My datasource body looks like this:

{
    "name" : "new-datasource",
    "type" : "azureblob",
    "credentials" : { "connectionString" : "MyStorageConnStrning" },
    "container" : { "name" : "mycontaner", "query" : "new" }
}

"query" : "new" means that when the indexer runs, it fetches all blobs from the virtual directory new in blob storage.

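An indexer run has a start time and an end time, and I know that the indexer performs incremental indexing based on the lastModified property of each blob (document).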

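The question is: if a new blob such as new/3.json is created in the virtual directory new between the start time and the end time of an indexer run, will that blob also be indexed by the same run, or does another run need to occur for it to be indexed?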

Answer 1

will this blob also get indexed by this indexer run or does another run needs to occur for it to get indexed.

In short, yes. Because of the dataChangeDetectionPolicy, it will be indexed by this indexer.

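When you use an Azure Blob data source, Azure Search automatically applies a high watermark change detection policy based on each blob's last-modified timestamp. With a high watermark, incremental change detection picks up only the rows that contain new or revised content.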
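To make the high watermark idea concrete, here is a minimal Python sketch. It is a simplified model, not the service's actual implementation: the `Blob` class, the `run_indexer` function, and the strict "newer than the stored mark" comparison are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass
class Blob:
    # Hypothetical stand-in for a blob and its last-modified timestamp.
    name: str
    last_modified: datetime


def run_indexer(
    blobs: List[Blob], high_water_mark: Optional[datetime]
) -> Tuple[List[str], Optional[datetime]]:
    """Index only blobs modified after the stored mark; return names and the new mark."""
    changed = [
        b for b in blobs
        if high_water_mark is None or b.last_modified > high_water_mark
    ]
    changed.sort(key=lambda b: b.last_modified)
    # Advance the mark to the newest timestamp seen in this run.
    new_mark = changed[-1].last_modified if changed else high_water_mark
    return [b.name for b in changed], new_mark


t0 = datetime(2024, 1, 1)
blobs = [Blob("new/1.json", t0), Blob("new/2.json", t0 + timedelta(minutes=1))]

first_run, mark = run_indexer(blobs, None)

# A new blob appears after the first run has finished.
blobs.append(Blob("new/3.json", t0 + timedelta(minutes=5)))
second_run, mark = run_indexer(blobs, mark)

print(first_run)   # ['new/1.json', 'new/2.json']
print(second_run)  # ['new/3.json']
```

In this model the second run touches only new/3.json, because the other two blobs are not newer than the stored mark.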

For more details, you can refer to this article.

Answer 2

Question is, between the start time and the end time of indexer run if a new blob is created like new/3.json in Virtual Directory new, will this blob also get indexed by this indexer run or does another run needs to occur for it to get indexed.

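The answer is a bit more complicated than what Joey said. Because the indexer indexes blobs by enumerating them in pages, even a new blob with a newer timestamp may or may not be picked up by the current run, depending on which page it falls on.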

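The only guarantees the indexer provides are:

The indexer will index all blobs with a LastModified timestamp earlier than the indexer start time, and it will do so within the same run.

Because of the data change detection policy, incremental changes will eventually be indexed. This means they may or may not be indexed in the same run.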

Making any assumptions beyond the high watermark is not recommended; exactly when a new blob gets indexed is technically undefined behavior.
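The paging effect described above can be illustrated with a small Python simulation. This is only a sketch under stated assumptions: the `run_with_paging` function, the name-ordered enumeration, and the continuation cursor are hypothetical and are not taken from the service. It shows why a blob created mid-run can be caught or missed depending on where it falls relative to the pages already read.

```python
from typing import List


def run_with_paging(
    initial_names: List[str],
    page_size: int,
    arrival_after_page: int,
    new_name: str,
) -> List[str]:
    """Enumerate blob names page by page; new_name is created mid-run.

    new_name is added to the store once `arrival_after_page` pages have been read.
    """
    store = sorted(initial_names)
    indexed: List[str] = []
    cursor = ""  # last name already returned (assumed continuation marker)
    pages_read = 0
    while True:
        if pages_read == arrival_after_page and new_name not in store:
            store.append(new_name)
            store.sort()
        # Next page: names that sort after the cursor.
        page = [n for n in store if n > cursor][:page_size]
        if not page:
            break
        indexed.extend(page)
        cursor = page[-1]
        pages_read += 1
    return indexed


existing = ["new/1.json", "new/2.json", "new/4.json", "new/5.json"]

# The new blob sorts after the pages already read, so this run picks it up.
caught = run_with_paging(existing, 2, 1, "new/3.json")

# The new blob sorts before the cursor, so this run misses it.
missed = run_with_paging(existing, 2, 1, "new/0.json")

print("new/3.json" in caught)  # True
print("new/0.json" in missed)  # False
```

In the second case the blob is not lost in this model: its timestamp is newer than the high watermark, so a later run would index it, which matches the "eventually indexed" guarantee.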

See this article for more details. Hope this helps.


azure

azure-cognitive-search