PySpark DataFrame error caused by java.lang.ClassNotFoundException: org.postgresql.Driver

Question

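I want to read data from PostgreSQL over JDBC and store it in a PySpark DataFrame. When I try to preview the data with methods such as df.show() or df.take(), they return an error: java.lang.ClassNotFoundException: org.postgresql.Driver. However, df.printSchema() returns the table's schema correctly. Here is my code: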

from pyspark.sql import SparkSession

spark = (SparkSession
         .builder
         .master("spark://spark-master:7077")
         .appName("read-postgres-jdbc")
         .config("spark.driver.extraClassPath", "/opt/workspace/postgresql-42.2.18.jar")
         .config("spark.executor.memory", "1g")
         .getOrCreate())
sc = spark.sparkContext

df = (
    spark.read
    .format("jdbc")
    .option("driver", "org.postgresql.Driver")
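    .option("url", "jdbc:postgresql://postgres/postgres")
    .option("table", "public.\"ASSET_DATA\"")  # "table" is not a documented JDBC option; "dbtable" below is the one Spark reads
    .option("dbtable", _select_sql)  # _select_sql is not defined in this snippet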
    .option("user", "airflow")
    .option("password", "airflow")
    .load()
)

df.show(1)

Error log:

Py4JJavaError: An error occurred while calling o44.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 172.21.0.6, executor 1): java.lang.ClassNotFoundException: org.postgresql.Driver

Caused by: java.lang.ClassNotFoundException: org.postgresql.Driver

Edit (July 24, 2021): The script is executed in JupyterLab, in a Docker container separate from the Standalone Spark cluster.

Answer 1

You are not using the right option. If you read the doc, you will see:

Extra classpath entries to prepend to the classpath of the driver. Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-class-path command line option or in your default properties file.

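This option applies to the driver. That is why fetching the schema works: it is an operation performed on the driver side. But when you run a Spark action, it is executed by the workers (the executors). They also need the .jar to access Postgres.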

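If your Postgres driver ("/opt/workspace/postgresql-42.2.18.jar") does not need any dependencies, you can add it to the workers with spark.jars. I know that MySQL, for example, does not need dependencies, but I have never tried Postgres. If it does need dependencies, it is better to pull the package directly from Maven with the spark.jars.packages option. (See the documentation link for help.)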
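As a rough sketch of the two options mentioned above, the session from the question could be built with spark.jars or spark.jars.packages instead of spark.driver.extraClassPath. The helper names jdbc_driver_conf and build_session are hypothetical, introduced only for illustration. The Maven coordinate org.postgresql:postgresql:42.2.18 is assumed to match the jar in the question, and spark.jars.packages assumes the machine running the notebook can reach a Maven repository.

```python
from typing import Dict, Optional


def jdbc_driver_conf(jar_path: Optional[str] = None,
                     maven_coordinate: Optional[str] = None) -> Dict[str, str]:
    """Return the Spark setting that makes a JDBC driver available to executors.

    Hypothetical helper: pass exactly one of jar_path (a jar readable from the
    machine that creates the session) or maven_coordinate (group:artifact:version).
    """
    if (jar_path is None) == (maven_coordinate is None):
        raise ValueError("pass exactly one of jar_path or maven_coordinate")
    if jar_path is not None:
        # spark.jars: comma-separated jars added to driver and executor classpaths
        return {"spark.jars": jar_path}
    # spark.jars.packages: Maven coordinates resolved together with their dependencies
    return {"spark.jars.packages": maven_coordinate}


def build_session(conf: Dict[str, str]):
    """Create the SparkSession from the question with the extra settings applied."""
    # Imported lazily so that the helper above can be used without pyspark installed.
    from pyspark.sql import SparkSession

    builder = (SparkSession.builder
               .master("spark://spark-master:7077")
               .appName("read-postgres-jdbc")
               .config("spark.executor.memory", "1g"))
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Usage (needs pyspark and a reachable cluster, so it is left commented out):
# spark = build_session(jdbc_driver_conf(jar_path="/opt/workspace/postgresql-42.2.18.jar"))
# spark = build_session(jdbc_driver_conf(maven_coordinate="org.postgresql:postgresql:42.2.18"))
```

These settings have to be in place before the first SparkSession is created in the notebook. If a session already exists, getOrCreate() returns it and the jar settings will most likely not take effect until the kernel is restarted.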

postgresql

jdbc

apache-spark

pyspark