Elasticsearch "get by index" returns the document, while "match_all" returns no results

Question

I am trying to mock Elasticsearch data for unit tests that run on a hosted CI.

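I have prepared some fixtures that load successfully with bulk(), but for some unknown reason I cannot match anything, even though test_index appears to contain the data (I can get() items by their IDs).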

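fixtures.json is a subset of ES documents that I fetched from the actual production index. With the real-world index, everything works as expected and all tests pass.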

An artificial example of the strange behavior follows:

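import json
from unittest import TestCase  # assumed; the original does not show which TestCase is used

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk


class MyTestCase(TestCase):
    es = Elasticsearch()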

    @classmethod
    def setUpClass(cls):
        super().setUpClass()
        cls.es.indices.create('test_index', SOME_SCHEMA)

        with open('fixtures.json') as fixtures:
            bulk(cls.es, json.load(fixtures))

    @classmethod
    def tearDownClass(cls):
        super().tearDownClass()
        cls.es.indices.delete('test_index')

    def test_something(self):
        # check all documents are there:
        with open('fixtures.json') as fixtures:
            for f in json.load(fixtures):
                print(self.es.get(index='test_index', id=f['_id']))
                # yes they are!

        # BUT:
        match_all = {"query": {"match_all": {}}}
        print('hits:', self.es.search(index='test_index', body=match_all)['hits']['hits'])
        # prints `hits: []`, as if the index were empty

        print('count:', self.es.count(index='test_index', body=match_all)['count'])
        # prints `count: 0`

Answer 1

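While I completely understand your pain (everything works except the tests), the answer is actually quite simple: compared to your manual experiments, the tests run too fast.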

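Elasticsearch is a near real-time search engine, which means that, with the default settings, there is a delay of up to 1 second between indexing a document and that document becoming searchable.
There is also an unpredictable delay (depending on actual overhead) between creating an index and the index being ready.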

So the fix is to time.sleep() and give ES some room to work the magic it needs before it can return results. I would do it like this:

@classmethod
def setUpClass(cls):
    super().setUpClass()
    cls.es.indices.create('test_index', SOME_SCHEMA)

    with open('fixtures.json') as fixtures:
        bulk(cls.es, json.load(fixtures))

    cls.wait_until_index_ready()

@classmethod
def wait_until_index_ready(cls, timeout=10):
    for sec in range(timeout):
        time.sleep(1)
        if cls.es.cluster.health().get('status') in ('green', 'yellow'):
            break
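
As a variation on the waiting approach above, here is a minimal sketch that polls the document count instead of cluster health, so the wait ends as soon as the fixtures are actually searchable. The function name `wait_until_searchable` and its `expected_count`, `timeout`, and `interval` parameters are assumptions made for illustration; `es` is expected to be an Elasticsearch client exposing `count()`.

```python
import time


def wait_until_searchable(es, index, expected_count, timeout=10.0, interval=0.2):
    """Poll es.count() until `index` reports at least `expected_count` documents.

    Returns True once the documents are visible to search, or False if
    `timeout` seconds elapse first. The name and parameters are hypothetical.
    """
    deadline = time.monotonic() + timeout
    while True:
        # count() only sees documents that a refresh has made searchable.
        if es.count(index=index)["count"] >= expected_count:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In setUpClass this could be called right after bulk(), passing the number of fixture documents as expected_count.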

Answer 2

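While @jsmesami's answer is quite correct, there is a possibly cleaner way to do this. If you look closely, the problem arises because ES has not yet refreshed the index, so the newly indexed documents are not searchable. The API actually exposes functions for exactly this purpose. Try something like: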

cls.es.indices.flush(wait_if_ongoing=True)
cls.es.indices.refresh(index='*')

To be more specific, you can pass index='test_index' to both functions. I think this is a cleaner and more targeted approach than using sleep(..).
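
As a minimal sketch of the refresh-based approach, the helper below loads the fixtures and then refreshes the index, so the documents should be searchable as soon as it returns. The name `load_fixtures` and the choice to pass the `bulk` helper in as a parameter are assumptions made for illustration (the latter keeps the sketch free of a hard import); `es` is expected to be an Elasticsearch client. Note that refresh is the operation that makes newly indexed documents visible to search, whereas flush is concerned with persisting data to disk.

```python
def load_fixtures(es, bulk, actions, index="test_index"):
    """Bulk-load `actions`, then refresh `index` so the documents are searchable.

    `es` is an Elasticsearch client and `bulk` is elasticsearch.helpers.bulk;
    both are passed in (an assumption of this sketch) to keep it import-free.
    """
    result = bulk(es, actions)
    # Force a refresh now instead of waiting for the periodic one.
    es.indices.refresh(index=index)
    return result
```

In setUpClass this might be called as load_fixtures(cls.es, bulk, json.load(fixtures)). The bulk helper also forwards extra keyword arguments to the bulk API, so bulk(cls.es, actions, refresh=True) may achieve the same effect in a single call.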

unit-testing

python-3.x

elasticsearch

elasticsearch-py