Spark Structural Streaming 与 Confluent Cloud Kafka 连接问题
Spark Structural Streaming with Confluent Cloud Kafka connectivity issue
我正在 PySpark 中编写一个 Spark 结构化流应用程序以从 Confluent Cloud 中的 Kafka 读取数据。 spark readstream()
函数的文档太浅,并且没有在可选参数部分特别是在 auth 机制部分指定太多。我不确定哪个参数出错并导致连接崩溃。任何有 Spark 经验的人都可以帮助我开始这种联系吗?
必需参数
> Consumer({'bootstrap.servers':
> 'cluster.gcp.confluent.cloud:9092',
> 'sasl.username':'xxx',
> 'sasl.password': 'xxx',
> 'sasl.mechanisms': 'PLAIN',
> 'security.protocol': 'SASL_SSL',
> 'group.id': 'python_example_group_1',
> 'auto.offset.reset': 'earliest' })
这是我的 pyspark 代码:
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "cluster.gcp.confluent.cloud:9092") \
.option("subscribe", "test-topic") \
.option("kafka.sasl.mechanisms", "PLAIN")\
.option("kafka.security.protocol", "SASL_SSL")\
.option("kafka.sasl.username","xxx")\
.option("kafka.sasl.password", "xxx")\
.option("startingOffsets", "latest")\
.option("kafka.group.id", "python_example_group_1")\
.load()
display(df)
但是,我一直收到错误消息:
kafkashaded.org.apache.kafka.common.KafkaException: Failed to
construct kafka consumer
DataBrick Notebook- 用于测试
文档
此错误表明 JAAS 配置对您的 Kafka 消费者不可见。要解决此问题,请根据以下步骤包含 JASS:
Step01:为以下 JAAS 文件创建一个文件:/home/jass/path
KafkaClient {
com.sun.security.auth.module.Krb5LoginModule required
useTicketCache=true
renewTicket=true
serviceName="kafka";
};
Step02:根据下面的conf参数在spark-submit中调用那个JASS文件路径。
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=/home/jass/path"
完整的 spark 提交命令:
/usr/hdp/2.6.1.0-129/spark2/bin/spark-submit --packages com.databricks:spark-avro_2.11:3.2.0,org.apache.spark:spark-avro_2.11:2.4.0,org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 --conf spark.ui.port=4055 --files /home/jass/path,/home/bdpda/bdpda.headless.keytab --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=/home/jass/path" --conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=/home/jass/path" pysparkstructurestreaming.py
Pyspark 结构化流示例代码:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.streaming import StreamingContext
import time
# Spark Streaming context :
spark = SparkSession.builder.appName('PythonStreamingDirectKafkaWordCount').getOrCreate()
sc = spark.sparkContext
ssc = StreamingContext(sc, 20)
# Kafka Topic Details :
KAFKA_TOPIC_NAME_CONS = "topic_name"
KAFKA_OUTPUT_TOPIC_NAME_CONS = "topic_to_hdfs"
KAFKA_BOOTSTRAP_SERVERS_CONS = 'kafka_server:9093'
# Creating readstream DataFrame :
df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS_CONS) \
.option("subscribe", KAFKA_TOPIC_NAME_CONS) \
.option("startingOffsets", "earliest") \
.option("kafka.security.protocol","SASL_SSL")\
.option("kafka.client.id" ,"Clinet_id")\
.option("kafka.sasl.kerberos.service.name","kafka")\
.option("kafka.ssl.truststore.location", "/home/path/kafka_trust.jks") \
.option("kafka.ssl.truststore.password", "password_rd") \
.option("kafka.sasl.kerberos.keytab","/home/path.keytab") \
.option("kafka.sasl.kerberos.principal","path") \
.load()
df1 = df.selectExpr( "CAST(value AS STRING)")
# Creating Writestream DataFrame :
df1.writeStream \
.option("path","target_directory") \
.format("csv") \
.option("checkpointLocation","chkpint_directory") \
.outputMode("append") \
.start()
ssc.awaitTermination()
我们需要指定kafka.sasl.jaas.config
为Confluent Kafka SASL-SSL auth方法添加用户名和密码。它的参数看起来有点奇怪,但它是有效的。
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "pkc-43n10.us-central1.gcp.confluent.cloud:9092") \
.option("subscribe", "wallet_txn_log") \
.option("startingOffsets", "earliest") \
.option("kafka.security.protocol","SASL_SSL") \
.option("kafka.sasl.mechanism", "PLAIN") \
.option("kafka.sasl.jaas.config", """kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username="xxx" password="xxx";""").load()
display(df)
我正在 PySpark 中编写一个 Spark 结构化流应用程序以从 Confluent Cloud 中的 Kafka 读取数据。 spark readstream()
函数的文档太浅,并且没有在可选参数部分特别是在 auth 机制部分指定太多。我不确定哪个参数出错并导致连接崩溃。任何有 Spark 经验的人都可以帮助我开始这种联系吗?
必需参数
> Consumer({'bootstrap.servers': > 'cluster.gcp.confluent.cloud:9092', > 'sasl.username':'xxx', > 'sasl.password': 'xxx', > 'sasl.mechanisms': 'PLAIN', > 'security.protocol': 'SASL_SSL', > 'group.id': 'python_example_group_1', > 'auto.offset.reset': 'earliest' })
这是我的 pyspark 代码:
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "cluster.gcp.confluent.cloud:9092") \
.option("subscribe", "test-topic") \
.option("kafka.sasl.mechanisms", "PLAIN")\
.option("kafka.security.protocol", "SASL_SSL")\
.option("kafka.sasl.username","xxx")\
.option("kafka.sasl.password", "xxx")\
.option("startingOffsets", "latest")\
.option("kafka.group.id", "python_example_group_1")\
.load()
display(df)
但是,我一直收到错误消息:
kafkashaded.org.apache.kafka.common.KafkaException: Failed to construct kafka consumer
DataBrick Notebook- 用于测试
文档
此错误表明 JAAS 配置对您的 Kafka 消费者不可见。要解决此问题,请根据以下步骤包含 JASS:
Step01:为以下 JAAS 文件创建一个文件:/home/jass/path
KafkaClient {
com.sun.security.auth.module.Krb5LoginModule required
useTicketCache=true
renewTicket=true
serviceName="kafka";
};
Step02:根据下面的conf参数在spark-submit中调用那个JASS文件路径。
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=/home/jass/path"
完整的 spark 提交命令:
/usr/hdp/2.6.1.0-129/spark2/bin/spark-submit --packages com.databricks:spark-avro_2.11:3.2.0,org.apache.spark:spark-avro_2.11:2.4.0,org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 --conf spark.ui.port=4055 --files /home/jass/path,/home/bdpda/bdpda.headless.keytab --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=/home/jass/path" --conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=/home/jass/path" pysparkstructurestreaming.py
Pyspark 结构化流示例代码:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.streaming import StreamingContext
import time
# Spark Streaming context :
spark = SparkSession.builder.appName('PythonStreamingDirectKafkaWordCount').getOrCreate()
sc = spark.sparkContext
ssc = StreamingContext(sc, 20)
# Kafka Topic Details :
KAFKA_TOPIC_NAME_CONS = "topic_name"
KAFKA_OUTPUT_TOPIC_NAME_CONS = "topic_to_hdfs"
KAFKA_BOOTSTRAP_SERVERS_CONS = 'kafka_server:9093'
# Creating readstream DataFrame :
df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS_CONS) \
.option("subscribe", KAFKA_TOPIC_NAME_CONS) \
.option("startingOffsets", "earliest") \
.option("kafka.security.protocol","SASL_SSL")\
.option("kafka.client.id" ,"Clinet_id")\
.option("kafka.sasl.kerberos.service.name","kafka")\
.option("kafka.ssl.truststore.location", "/home/path/kafka_trust.jks") \
.option("kafka.ssl.truststore.password", "password_rd") \
.option("kafka.sasl.kerberos.keytab","/home/path.keytab") \
.option("kafka.sasl.kerberos.principal","path") \
.load()
df1 = df.selectExpr( "CAST(value AS STRING)")
# Creating Writestream DataFrame :
df1.writeStream \
.option("path","target_directory") \
.format("csv") \
.option("checkpointLocation","chkpint_directory") \
.outputMode("append") \
.start()
ssc.awaitTermination()
我们需要指定kafka.sasl.jaas.config
为Confluent Kafka SASL-SSL auth方法添加用户名和密码。它的参数看起来有点奇怪,但它是有效的。
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "pkc-43n10.us-central1.gcp.confluent.cloud:9092") \
.option("subscribe", "wallet_txn_log") \
.option("startingOffsets", "earliest") \
.option("kafka.security.protocol","SASL_SSL") \
.option("kafka.sasl.mechanism", "PLAIN") \
.option("kafka.sasl.jaas.config", """kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username="xxx" password="xxx";""").load()
display(df)