Finding unique image names from an ImageName column using PySpark or SQL
I have a dataset like this:
key|StateName_13|lon|lat|col5_13|col6_13|col7_13|ImageName|elevation_13|Counter_13
P00005K9XESU|FL|-80.854196|26.712385|128402000128038||183.30198669433594|USGS_NED_13_n27w081_IMG.img|3.7742109298706055|1
P00005KC31Y7|FL|-80.854196|26.712385|128402000128038||174.34959411621094|USGS_NED_13_n27w082_IMG.img|3.553356885910034|1
P00005KC320M|FL|-80.846966|26.713182|128402000100953||520.3673706054688|USGS_NED_13_n27w081_IMG.img|2.2236201763153076|1
P00005KC320M|FL|-80.84617434521485|26.713200344482424|128402000100953||520.3673706054688|USGS_NED_13_n27w081_IMG.img|2.7960102558135986|2
P00005KC320M|FL|-80.84538|26.713219|128402000100953||520.3673706054688|USGS_NED_13_n27w081_IMG.img|1.7564013004302979|3
P00005KC31Y6|FL|-80.854155|26.712083|128402000128038||169.80172729492188|USGS_NED_13_n27w081_IMG.img|3.2237753868103027|1
P00005KATEL2|FL|-80.861664|26.703649|128402000122910||38.789894104003906|USGS_NED_13_n27w081_IMG.img|3.235154628753662|1
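In this dataset, I want to find the longitude/latitude pairs that occur more than once, together with the image names that correspond to them.
The output should look like this: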
lon|lat|ImageName
-80.854196|26.712385|USGS_NED_13_n27w081_IMG.img,USGS_NED_13_n27w082_IMG.img
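This is because rows 1 and 2 have the same lon and lat values but different image names.
Any PySpark code or SQL query will do.
Following @giser_yugang's comment, we can do something like this: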
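from pyspark.sql import functions as F

# Group rows by coordinate and collect the distinct image names per group,
# then keep only the coordinates that map to more than one image name.
df = df.groupby(
    'lon',
    'lat'
).agg(
    F.collect_set('ImageName').alias("ImageNames")
).where(
    F.size("ImageNames") > 1
)
df.show(truncate=False)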
+----------+---------+----------------------------------------------------------+
|lon |lat |ImageNames |
+----------+---------+----------------------------------------------------------+
|-80.854196|26.712385|[USGS_NED_13_n27w081_IMG.img, USGS_NED_13_n27w082_IMG.img]|
+----------+---------+----------------------------------------------------------+
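To make the grouping logic concrete without a Spark session, here is a minimal plain-Python sketch of what `collect_set` followed by the `size(...) > 1` filter computes. The `rows` list is a hand-copied subset of the question's data and `names_by_coordinate` is a hypothetical helper name, not part of PySpark.

```python
from collections import defaultdict

# Hand-copied (lon, lat, ImageName) triples from the question's sample data.
rows = [
    ("-80.854196", "26.712385", "USGS_NED_13_n27w081_IMG.img"),
    ("-80.854196", "26.712385", "USGS_NED_13_n27w082_IMG.img"),
    ("-80.846966", "26.713182", "USGS_NED_13_n27w081_IMG.img"),
    ("-80.854155", "26.712083", "USGS_NED_13_n27w081_IMG.img"),
]

def names_by_coordinate(rows):
    """Return {(lon, lat): sorted distinct names} for coordinates with more than one name."""
    groups = defaultdict(set)  # a set per coordinate plays the role of collect_set
    for lon, lat, name in rows:
        groups[(lon, lat)].add(name)
    # Equivalent of .where(F.size("ImageNames") > 1)
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}

duplicates = names_by_coordinate(rows)
print(duplicates)
```

Note that a coordinate appearing on several rows with the same image name is not reported, because the set collapses identical names.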
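If you need to write the result to CSV, a format that does not support ArrayType, you can use concat_ws to join the array into a single string: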
# Join the array elements into one comma-separated string column.
df = df.withColumn(
    "ImageNames",
    F.concat_ws(
        ", ",
        "ImageNames"
    )
)
df.show(truncate=False)
+----------+---------+--------------------------------------------------------+
|lon |lat |ImageNames |
+----------+---------+--------------------------------------------------------+
|-80.854196|26.712385|USGS_NED_13_n27w081_IMG.img, USGS_NED_13_n27w082_IMG.img|
+----------+---------+--------------------------------------------------------+
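Since the question also allows SQL, the same result can be expressed as a GROUP BY with a HAVING clause. The sketch below runs the query against an in-memory SQLite database from Python so it can be tried without Spark; the table name `points` and the sample rows are assumptions for illustration. In Spark SQL the same shape would presumably use `concat_ws(',', collect_set(ImageName))` in the SELECT list and `HAVING size(collect_set(ImageName)) > 1`, but that variant is not tested here.

```python
import sqlite3

# In-memory database with a hypothetical table holding the relevant columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (lon TEXT, lat TEXT, ImageName TEXT)")
conn.executemany(
    "INSERT INTO points VALUES (?, ?, ?)",
    [
        ("-80.854196", "26.712385", "USGS_NED_13_n27w081_IMG.img"),
        ("-80.854196", "26.712385", "USGS_NED_13_n27w082_IMG.img"),
        ("-80.846966", "26.713182", "USGS_NED_13_n27w081_IMG.img"),
        ("-80.854155", "26.712083", "USGS_NED_13_n27w081_IMG.img"),
    ],
)

# Keep only coordinates with more than one distinct image name and
# join those names into a single comma-separated string.
query = """
SELECT lon, lat, GROUP_CONCAT(DISTINCT ImageName) AS ImageNames
FROM points
GROUP BY lon, lat
HAVING COUNT(DISTINCT ImageName) > 1
"""
sql_rows = conn.execute(query).fetchall()
for lon, lat, image_names in sql_rows:
    print(f"{lon}|{lat}|{image_names}")
conn.close()
```

The order of names inside the joined string is not guaranteed by SQLite, so sort the split values if a stable order matters.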