Reading Elastic cluster data into a Python data frame

Question

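I'm new to Elasticsearch, so please excuse me if this is a very simple question.

At my workplace, ELK is already set up properly.

Because the data volume is very large, we only keep 14 days of data. My question is how to read the data in Python and then store my analysis in some NoSQL database.

For now, my main goal is to read the raw data from the Elastic cluster into Python, as a data frame or in any other format.

I want to fetch it for different time intervals, e.g. 1 day, 1 week, 1 month, etc.

I have been struggling with this for the past week.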

Answer 1

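It depends on how you want to read the data from Elasticsearch: incrementally, i.e. reading the new data that arrives each day, or in bulk. For the former, a simple range query is enough. For the latter, you need to pull documents in bulk from Python. Note that Elasticsearch's bulk API is for indexing (writing) documents, so bulk reads are normally done with the scroll API, for example through `elasticsearch.helpers.scan`.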

Schematic code for reading data in bulk: https://gist.github.com/dpkshrma/04be6092eda6ae108bfc1ed820621130
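A minimal sketch of a bulk read, assuming the official `elasticsearch` Python client and `pandas` are installed. The names `hits_to_rows` and `read_index` are hypothetical helpers introduced here for illustration; `helpers.scan` is the client's wrapper around the scroll API.

```python
from typing import Any, Dict, Iterable, Iterator


def hits_to_rows(hits: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Turn raw Elasticsearch hits into row dicts, keeping the document id."""
    for hit in hits:
        row = dict(hit.get("_source", {}))
        row["_id"] = hit.get("_id")
        yield row


def read_index(es_url: str, index: str, query: Dict[str, Any]):
    """Stream every matching document and return a pandas DataFrame."""
    # Imported lazily so hits_to_rows can be used without these packages.
    import pandas as pd
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch(es_url)
    # helpers.scan pages through the results with the scroll API
    # and yields one hit at a time.
    hits = helpers.scan(es, index=index, query=query)
    return pd.DataFrame(list(hits_to_rows(hits)))
```

Usage would look something like `df = read_index("http://localhost:9200", "indexname", {"query": {"match_all": {}}})`. For very large indices, it may be better to process the rows in chunks than to build one big list.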

How to use the ES bulk API:

How to use Bulk API to store the keywords in ES by using Python

https://elasticsearch-py.readthedocs.io/en/master/helpers.html#elasticsearch.helpers.bulk

How to use a range query for incremental inserts:

https://martinapugliese.github.io/python-for-(some)-elasticsearch-queries/
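To make the range-query idea concrete, here is a small sketch that builds search bodies for the windows mentioned in the question using Elasticsearch date math. The field name `@timestamp` is an assumption (it is the usual Logstash default, but your index may use another name), and `WINDOWS` and `range_query` are hypothetical names.

```python
from typing import Any, Dict

# Date-math expressions for the windows mentioned in the question.
WINDOWS = {"1d": "now-1d", "1w": "now-7d", "1M": "now-1M"}


def range_query(window: str, field: str = "@timestamp") -> Dict[str, Any]:
    """Build a search body matching documents from the last `window`.

    `field` is assumed to be a date field; adjust it to your mapping.
    """
    return {"query": {"range": {field: {"gte": WINDOWS[window], "lt": "now"}}}}
```

The result can be passed to the client, e.g. `es.search(index="indexname", body=range_query("1d"))`. Since the cluster only keeps 14 days of data, the one-month window can return at most those 14 days.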

Since you want to work with the data at different time intervals, you will also need a date histogram aggregation:

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-datehistogram-aggregation.html
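As an illustration, a date histogram request and the parsing of its buckets could be sketched as follows. This assumes Elasticsearch 7.2 or later (older versions use `interval` instead of `calendar_interval`) and a date field called `@timestamp`. The aggregation name `per_interval` and both function names are made up for this example.

```python
from typing import Any, Dict, List, Tuple


def date_histogram_body(interval: str = "day",
                        field: str = "@timestamp") -> Dict[str, Any]:
    """Build a search body that counts documents per calendar interval."""
    return {
        "size": 0,  # return only the aggregation, no individual hits
        "aggs": {
            "per_interval": {
                "date_histogram": {"field": field, "calendar_interval": interval}
            }
        },
    }


def buckets_to_counts(response: Dict[str, Any]) -> List[Tuple[str, int]]:
    """Extract (bucket start, document count) pairs from a search response."""
    buckets = response["aggregations"]["per_interval"]["buckets"]
    return [(b["key_as_string"], b["doc_count"]) for b in buckets]
```

Typical values for `interval` are `"day"`, `"week"` and `"month"`. The body is sent with `es.search(index="indexname", body=date_histogram_body("week"))`.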

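Once you run the Elasticsearch query, the results are held in a Python variable. You can then use a Python client library for your NoSQL database, such as PyMongo for MongoDB, to insert the Elasticsearch data into it.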
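A rough sketch of that last step with PyMongo, assuming a MongoDB server is reachable. The connection URI, database name and collection name are placeholders, and `to_mongo_docs` and `save_hits` are hypothetical helpers. Reusing the Elasticsearch `_id` as the MongoDB `_id` with upserts is one possible design choice that keeps repeated runs from creating duplicates.

```python
from typing import Any, Dict, Iterable, List


def to_mongo_docs(hits: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Copy each hit's _source and reuse the Elasticsearch _id as the MongoDB _id."""
    return [{**hit["_source"], "_id": hit["_id"]} for hit in hits]


def save_hits(hits: Iterable[Dict[str, Any]],
              mongo_uri: str = "mongodb://localhost:27017",  # placeholder URI
              db_name: str = "analysis",                     # placeholder name
              coll_name: str = "es_docs") -> int:            # placeholder name
    """Upsert the hits into MongoDB and return how many documents were sent."""
    docs = to_mongo_docs(hits)
    if not docs:
        return 0  # bulk_write rejects an empty list of operations
    # Imported lazily so to_mongo_docs can be used without pymongo.
    from pymongo import MongoClient, ReplaceOne

    collection = MongoClient(mongo_uri)[db_name][coll_name]
    # Upsert by _id so re-running the job does not raise duplicate-key errors.
    ops = [ReplaceOne({"_id": d["_id"]}, d, upsert=True) for d in docs]
    collection.bulk_write(ops)
    return len(docs)
```

For example, `save_hits(result_dict["hits"]["hits"])` would store the hits of a search response.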

Answer 2

You can use the code below, which relies on the third-party pandasticsearch package:

# Create a DataFrame object
from pandasticsearch import DataFrame
df = DataFrame.from_es(url='http://localhost:9200', index='indexname')

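To get the schema of the index:

df.print_schema()

After that, you can perform the usual dataframe operations on df.

If you want to parse the result yourself, do the following: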

from elasticsearch import Elasticsearch
es = Elasticsearch('http://localhost:9200')
result_dict = es.search(index="indexname", body={"query": {"match_all": {}}})

Then, finally, load everything into your final dataframe:

from pandasticsearch import Select
pandas_df = Select.from_dict(result_dict).to_pandas()
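If you would rather not depend on pandasticsearch for this step, the same conversion can be sketched with plain Python and pandas. `flatten`, `response_to_rows` and `response_to_dataframe` are hypothetical names. Keep in mind that `es.search` returns only the first 10 hits unless you pass a larger `size`.

```python
from typing import Any, Dict, List


def flatten(doc: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into one level with dotted keys, e.g. 'host.name'."""
    flat: Dict[str, Any] = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def response_to_rows(result_dict: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Extract one flat row per hit from a search response dict."""
    return [flatten(hit["_source"]) for hit in result_dict["hits"]["hits"]]


def response_to_dataframe(result_dict: Dict[str, Any]):
    """Build a pandas DataFrame from a search response dict."""
    import pandas as pd  # imported lazily; only needed for this step

    return pd.DataFrame(response_to_rows(result_dict))
```

`pandas.json_normalize` applied to the list of `_source` dicts gives a similar result, if you prefer to let pandas do the flattening.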

Hope this helps.
