How to stream only part of a file with Apache Spark

Question

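I am trying to use Spark Streaming and Spark SQL with the Python API.

I have a file that is continuously edited: every N seconds, some rows are appended to it at random.

The file can be JSON, XML, CSV, or TXT, or even a SQL table; I am completely free to choose whichever solution suits my case best.

I have a fixed number of fields, about 4 or 5. Take this table as an example: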

+-------+------+-------+--------------------+ 
| event |  id  | alert |      datetime      |
+-------+------+-------+--------------------+
| reg   |  1   | def1  | 06.06.17-17.24.30  |
+-------+------+-------+--------------------+
| alt   |  2   | def2  | 06.06.17-17.25.11  |
+-------+------+-------+--------------------+
| mot   |  3   | def5  | 06.06.17-17.26.01  |
+-------+------+-------+--------------------+
| mot   |  4   | def5  | 06.06.17-17.26.01  |
+-------+------+-------+--------------------+

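I want to stream only the new rows with Spark Streaming. So if I add 2 new rows, the next time I want to stream only those two rows, not the whole file (which has already been streamed).

In addition, every time new rows are found, I want to filter the whole of that same file or run a Spark SQL query over it. For example, I want to select the event "mot" only if it appears twice within 10 minutes, and this query must be redone every time the file changes and new data arrives.

Can Spark Streaming and Spark SQL handle these cases? And if so, how?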

Answer 1

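This is not supported by Spark's file sources. The Structured Streaming documentation describes the file source as follows: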

Reads files written in a directory as a stream of data. Supported file formats are text, csv, json, orc, parquet. See the docs of the DataStreamReader interface for a more up-to-date list, and supported options for each file format. Note that the files must be atomically placed in the given directory, which in most file systems, can be achieved by file move operations
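One way to work within that constraint, sketched here under assumptions: instead of appending to a single file, the producer writes each batch of new rows to its own file in a staging directory and then moves that file into the watched directory. The helper `publish_rows`, the directory layout, and the CSV format are hypothetical choices for illustration. Note that `os.replace` is atomic only when both paths are on the same file system.

```python
import csv
import os
import uuid


def publish_rows(rows, watched_dir, staging_dir):
    """Write `rows` to a new CSV file, then atomically move it into `watched_dir`.

    Hypothetical helper: the names and layout are assumptions, not a Spark API.
    """
    os.makedirs(watched_dir, exist_ok=True)
    os.makedirs(staging_dir, exist_ok=True)

    # A unique name per batch, so earlier batches are never overwritten.
    name = f"events-{uuid.uuid4().hex}.csv"
    tmp_path = os.path.join(staging_dir, name)
    final_path = os.path.join(watched_dir, name)

    # Write outside the watched directory so a reader never sees a partial file.
    with open(tmp_path, "w", newline="") as f:
        csv.writer(f).writerows(rows)

    # Atomic rename, provided both directories are on the same file system.
    os.replace(tmp_path, final_path)
    return final_path
```

Each call yields one new, complete file in the watched directory, so a directory-based stream only ever has to pick up files it has not seen before.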

The same applies to legacy streaming (note that this is from the 2.2 documentation, but the implementation has not changed):

The files must be created in the dataDirectory by atomically moving or renaming them into the data directory.

python

apache-spark

spark-streaming

apache-spark-sql

pyspark