Spark Streaming: Text data source supports only a single column

Question

I am consuming data from Kafka and then streaming it to HDFS.

The data stored in the Kafka topic trial looks like this:

hadoop
hive
hive
kafka
hive

However, when I submit my code, it returns:

Exception in thread "main"

org.apache.spark.sql.streaming.StreamingQueryException: Text data source supports only a single column, and you have 7 columns.;
=== Streaming Query ===
Identifier: [id = 2f3c7433-f511-49e6-bdcf-4275b1f1229a, runId = 9c0f7a35-118a-469c-990f-af00f55d95fb]
Current Committed Offsets: {KafkaSource[Subscribe[trial]]: {"trial":{"2":13,"1":13,"3":12,"0":13}}}
Current Available Offsets: {KafkaSource[Subscribe[trial]]: {"trial":{"2":13,"1":13,"3":12,"0":14}}}

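My question is: as shown above, the data stored in Kafka has only one column, so why does the program say there are 7 columns?

Any help is appreciated.

My spark-streaming code: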

def main(args: Array[String]): Unit = {
  val spark = SparkSession
    .builder.master("local[4]")
    .appName("SpeedTester")
    .config("spark.driver.memory", "3g")
    .getOrCreate()

  val ds = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "192.168.95.20:9092")
    .option("subscribe", "trial")
    .option("startingOffsets", "earliest")
    .load()
    .writeStream
    .format("text")
    .option("path", "hdfs://192.168.95.21:8022/tmp/streaming/fixed")
    .option("checkpointLocation", "/tmp/checkpoint")
    .start()
    .awaitTermination()
}

Answer 1

This is explained in the Structured Streaming + Kafka Integration Guide:

Each row in the source has the following schema:

Column         Type
key            binary
value          binary
topic          string
partition      int
offset         long
timestamp      long
timestampType  int
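One way to check this schema yourself is to print it from the streaming DataFrame before attaching any sink. This is only a minimal sketch: it assumes the broker address and topic from the question, an existing SparkSession named spark, and the spark-sql-kafka-0-10 package on the classpath.

```scala
// Sketch only: broker address and topic are taken from the question.
val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "192.168.95.20:9092")
  .option("subscribe", "trial")
  .load()

// Prints the column names and types; this does not start a streaming query.
kafkaDf.printSchema()

// Number of columns the text sink would be asked to write.
println(kafkaDf.columns.length)
```

The exact type names printed may vary with the Spark version, so compare the output against the guide for the version you run.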

That is exactly seven columns. If you only want to write the payload (the value column), select it and cast it to a string:

spark.readStream
   ...
  .load()
  .selectExpr("CAST(value as string)")
  .writeStream
  ...
  .awaitTermination()
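Applied to the program in the question, the fix might look like the sketch below. The addresses, topic, and paths are copied from the question; the object name SpeedTester and the intermediate names values and query are assumptions made for illustration, and the spark-sql-kafka-0-10 package is assumed to be on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper object; the question only shows the main method.
object SpeedTester {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .master("local[4]")
      .appName("SpeedTester")
      .config("spark.driver.memory", "3g")
      .getOrCreate()

    // The Kafka source yields seven columns; keep only the payload,
    // cast from binary to string, so the text sink gets a single column.
    val values = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "192.168.95.20:9092")
      .option("subscribe", "trial")
      .option("startingOffsets", "earliest")
      .load()
      .selectExpr("CAST(value AS STRING) AS value")

    // Write the single string column as text files on HDFS.
    val query = values.writeStream
      .format("text")
      .option("path", "hdfs://192.168.95.21:8022/tmp/streaming/fixed")
      .option("checkpointLocation", "/tmp/checkpoint")
      .start()

    // Block until the streaming query stops or fails.
    query.awaitTermination()
  }
}
```

Keeping the query handle in its own value, rather than assigning the result of awaitTermination() to ds as in the question, makes it clearer which object is the DataFrame and which is the running query.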


hadoop

apache-spark

spark-streaming