How to check if numerical value of a column contains alphabets via SQL query
I have a CSV file in AWS S3 that is being loaded into AWS Glue, which is used to apply transformations to the source data files from S3. It provides a PySpark scripting environment. The data looks something like this:
"ID","CNTRY_CD","SUB_ID","PRIME_KEY","DATE"
"123","IND","25635525","11243749772","2017-10-17"
"123","IND","25632349","112322abcd","2017-10-17"
"123","IND","25635234","11243kjsd434","2017-10-17"
"123","IND","25639822","1124374343","2017-10-17"
The expected result should look like this:
"123","IND","25632349","112322abcd","2017-10-17"
"123","IND","25635234","11243kjsd434","2017-10-17"
The field I am dealing with here, named 'PRIME_KEY', is an integer-typed field that may contain alphabetic characters, which corrupts the data format.
The requirement now is to find out, via a SQL query, whether this integer-typed primary-key column contains any alphanumeric characters rather than only digits. So far I have tried several regex variations for this, such as the one below, without success:
SELECT *
FROM table_name
WHERE column_name IS NOT NULL AND
      CAST(column_name AS VARCHAR(100)) LIKE '%[0-9a-z0-9]%'
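For reference, Spark SQL's LIKE does not interpret `[...]` as a character class (that is T-SQL syntax); matching a pattern requires a regular expression via RLIKE. A minimal standalone Python sketch of the intended match (the function name `is_malformed` is illustrative, not from the original script):

```python
import re

# Equivalent of the RLIKE filter used in the script below:
# flag values where digits and lowercase letters are mixed.
MIXED = re.compile(r"[a-z]+[0-9]+|[0-9]+[a-z]+")

def is_malformed(value: str) -> bool:
    """Return True if the value contains letters mixed in with digits."""
    return MIXED.search(value) is not None
```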
Source script:
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# s3 output directory
output_dir = "s3://aws-glue-scripts../.."
# Data Catalog: database and table name
db_name = "sampledb"
glue_tbl_name = "sampleTable"
datasource = glueContext.create_dynamic_frame.from_catalog(database = db_name, table_name = glue_tbl_name)
datasource_df = datasource.toDF()
datasource_df.registerTempTable("sample_tbl")
invalid_primarykey_values_df = spark.sql("SELECT * FROM sample_tbl WHERE CAST(PRIME_KEY AS STRING) RLIKE '([a-z]+[0-9]+)|([0-9]+[a-z]+)'")
invalid_primarykey_values_df.show()
The output of this script is as follows:
+---+--------+--------+------------------+----------+
|ID |CNTRY_CD|SUB_ID  |PRIME_KEY         |DATE      |
+---+--------+--------+------------------+----------+
|123|IND     |25635525|[11243749772,null]|2017-10-17|
|123|IND     |25632349|[null,112322ab..  |2017-10-17|
|123|IND     |25635234|[null,11243kjsd.. |2017-10-17|
|123|IND     |25639822|[1124374343,null] |2017-10-17|
+---+--------+--------+------------------+----------+
I have highlighted the values of the field I am dealing with. They look somewhat different from the source data.
Any help would be greatly appreciated. Thanks.
You can use RLIKE:
SELECT *
FROM table_name
WHERE CAST(PRIME_KEY AS STRING) RLIKE '([0-9]+[a-z]+)'
For a more generic alphanumeric filter match:
WHERE CAST(PRIME_KEY AS STRING) RLIKE '([a-z]+[0-9]+)|([0-9]+[a-z]+)'
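As an aside (not part of the original answer), if the key should contain nothing but digits, a simpler and stricter pattern is `[^0-9]`, which also catches uppercase letters and symbols that the mixed-letters-and-digits pattern above would miss. In Spark SQL this would be `WHERE CAST(PRIME_KEY AS STRING) RLIKE '[^0-9]'`; the logic in plain Python:

```python
import re

# Any character that is not a digit makes the key invalid.
# Broader than '[a-z]+[0-9]+|[0-9]+[a-z]+': also flags
# uppercase letters, spaces, and punctuation.
NON_DIGIT = re.compile(r"[^0-9]")

def is_invalid_key(value: str) -> bool:
    """Return True if the value contains any non-digit character."""
    return NON_DIGIT.search(value) is not None
```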
Edit: based on the comments.
Necessary imports and UDFs:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
.config(conf)
.getOrCreate
import org.apache.spark.sql.functions._
val extract_pkey = udf((x: String) => x.replaceAll("null|\\]|\\[|,", "").trim)
import spark.implicits._
Set up sample data for testing and cleaning with the UDF:
val df = Seq(
("123", "IND", "25635525", "[11243749772,null]", "2017-10-17"),
("123", "IND", "25632349", "[null,112322abcd]", "2017-10-17"),
("123", "IND", "25635234", "[null,11243kjsd434]", "2017-10-17"),
("123", "IND", "25639822", "[1124374343,null]", "2017-10-17")
).toDF("ID", "CNTRY_CD", "SUB_ID", "PRIME_KEY", "DATE")
.withColumn("PRIME_KEY", extract_pkey($"PRIME_KEY"))
df.registerTempTable("tbl")
spark.sql("SELECT * FROM tbl WHERE PRIME_KEY RLIKE '([a-z]+[0-9]+)|([0-9]+[a-z]+)'")
.show(false)
+---+--------+--------+------------+----------+
|ID |CNTRY_CD|SUB_ID |PRIME_KEY |DATE |
+---+--------+--------+------------+----------+
|123|IND |25632349|112322abcd |2017-10-17|
|123|IND |25635234|11243kjsd434|2017-10-17|
+---+--------+--------+------------+----------+
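Since the question itself uses PySpark, the same cleanup could be done without a Scala UDF; in Spark, `pyspark.sql.functions.regexp_replace` with the same pattern would apply. A hedged sketch of the equivalent string logic in plain Python (the helper name `extract_pkey` mirrors the Scala UDF above):

```python
import re

def extract_pkey(raw: str) -> str:
    """Strip 'null', square brackets, and commas from a rendered
    choice value like '[null,112322abcd]', leaving the bare key."""
    return re.sub(r"null|\]|\[|,", "", raw).strip()
```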