Unsupported Array error when reading JDBC source in (Py)Spark?
I am trying to read a PostgreSQL DB into DataFrames. Below is my code:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Connect to DB") \
    .getOrCreate()

jdbcUrl = "jdbc:postgresql://XXXXXX"
connectionProperties = {
    "user": " ",
    "password": " ",
    "driver": "org.postgresql.Driver"
}

query = "(SELECT table_name FROM information_schema.tables) XXX"
df = spark.read.jdbc(url=jdbcUrl, table=query, properties=connectionProperties)
table_name_list = df.select("table_name").rdd.flatMap(lambda x: x).collect()

for table_name in table_name_list:
    df2 = spark.read.jdbc(url=jdbcUrl, table=table_name, properties=connectionProperties)
The error I get when df2 is generated inside the loop:
java.sql.SQLException: Unsupported type ARRAY
If I hardcode the table name value, I do not get the same error:
df2 = spark.read.jdbc(jdbcUrl,"conditions",properties=connectionProperties)
I checked the type of table_name and it is String. Is this the correct approach?
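One way to narrow this down (a diagnostic sketch, not part of the original attempt) is to wrap each read so the offending table name gets printed:

for table_name in table_name_list:
    try:
        df2 = spark.read.jdbc(url=jdbcUrl, table=table_name, properties=connectionProperties)
    except Exception as e:  # surfaces which table triggers "Unsupported type ARRAY"
        print(f"read failed for {table_name}: {e}")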
I am guessing you don't want the table names that belong to Postgres's internal workings, such as pg_type, pg_policies, etc., whose schema is pg_catalog. Those trigger the error
py4j.protocol.Py4JJavaError: An error occurred while calling o34.jdbc.
: java.sql.SQLException: Unsupported type ARRAY
when you try to read them as
spark.read.jdbc(url=jdbcUrl, table='pg_type', properties=connectionProperties)
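To see up front which relations actually contain ARRAY columns, you could query the column catalog first. A minimal sketch (the subquery alias cols is arbitrary and not from the original post):

array_cols_query = """
    (SELECT table_schema, table_name, column_name
     FROM information_schema.columns
     WHERE data_type = 'ARRAY') cols
"""
array_cols_df = spark.read.jdbc(url=jdbcUrl, table=array_cols_query, properties=connectionProperties)
array_cols_df.show(truncate=False)

Because the filter is pushed down inside the subquery, Spark only ever sees the three text columns, so this read itself does not hit the ARRAY limitation.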
And there are tables such as applicable_roles, view_table_usage, etc., whose schema is information_schema; these result in
py4j.protocol.Py4JJavaError: An error occurred while calling o34.jdbc.
: org.postgresql.util.PSQLException: ERROR: relation "view_table_usage" does not exist
when you try to read them as
spark.read.jdbc(url=jdbcUrl, table='view_table_usage', properties=connectionProperties)
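That particular error occurs because the bare name view_table_usage is not on the connection's default search_path; schema-qualifying the relation resolves the lookup, as in the sketch below (though such catalog views can still run into the ARRAY limitation):

spark.read.jdbc(url=jdbcUrl, table='information_schema.view_table_usage', properties=connectionProperties)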
Tables whose schema is public can be read into DataFrames with the JDBC call shown above.
So you need to filter out those table names and then apply your logic:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Connect to DB") \
    .getOrCreate()

jdbcUrl = "jdbc:postgresql://hostname:port/"
connectionProperties = {
    "user": " ",
    "password": " ",
    "driver": "org.postgresql.Driver"
}

# Read the table catalog, then keep only user tables by excluding
# the pg_catalog and information_schema schemas.
query = "information_schema.tables"
df = spark.read.jdbc(url=jdbcUrl, table=query, properties=connectionProperties)
table_name_list = (df
    .filter((df["table_schema"] != 'pg_catalog') & (df["table_schema"] != 'information_schema'))
    .select("table_name")
    .rdd.flatMap(lambda x: x)
    .collect())

for table_name in table_name_list:
    df2 = spark.read.jdbc(url=jdbcUrl, table=table_name, properties=connectionProperties)
This should work.
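One further refinement, not in the original answer: if user tables live in schemas other than public, a bare table_name may again fail to resolve, so you can keep table_schema as well and read schema-qualified names (assuming the same jdbcUrl and connectionProperties):

# Sketch: build "schema.table" names so relations outside "public" resolve too.
rows = (df
    .filter(~df["table_schema"].isin("pg_catalog", "information_schema"))
    .select("table_schema", "table_name")
    .collect())

dataframes = {}
for row in rows:
    qualified = f'{row["table_schema"]}.{row["table_name"]}'
    dataframes[qualified] = spark.read.jdbc(
        url=jdbcUrl, table=qualified, properties=connectionProperties)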