pyspark - merge 2 columns of sets
I have a Spark dataframe with two columns that were formed by the function collect_set. I want to merge these two columns of sets into one column of sets. How should I do this? They are both sets of strings.

For example, I have two columns formed by calling collect_set:

Fruits              | Meat
[Apple,Orange,Pear] | [Beef, Chicken, Pork]

How can I turn it into:

Food
[Apple, Orange, Pear, Beef, Chicken, Pork]

Thanks a lot in advance!
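For context, a minimal sketch of how such a dataframe could be produced in the first place (the orders data, the id key, and the lower-case column names are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_set

spark = SparkSession.builder.getOrCreate()

# hypothetical raw data: one fruit and one meat per row, keyed by id
orders = spark.createDataFrame(
    [(1, "Apple", "Beef"), (1, "Orange", "Chicken"), (1, "Pear", "Pork")],
    ("id", "fruit", "meat"))

# collect_set aggregates each column into a deduplicated array per group
df = orders.groupBy("id").agg(
    collect_set("fruit").alias("Fruits"),
    collect_set("meat").alias("Meat"))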
Given that you have a dataframe as
+---------------------+---------------------+
|Fruits |Meat |
+---------------------+---------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|
+---------------------+---------------------+
you can write a udf function to merge the sets of the two columns into one.
import scala.collection.mutable
import org.apache.spark.sql.functions._

// concatenate the two WrappedArray columns into a single array
def mergeCols = udf((fruits: mutable.WrappedArray[String], meat: mutable.WrappedArray[String]) => fruits ++ meat)
Then call the udf function as
df.withColumn("Food", mergeCols(col("Fruits"), col("Meat"))).show(false)
and you should get your desired final dataframe
+---------------------+---------------------+------------------------------------------+
|Fruits |Meat |Food |
+---------------------+---------------------+------------------------------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|[Pear, Orange, Apple, Chicken, Pork, Beef]|
+---------------------+---------------------+------------------------------------------+
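As a side note, on Spark 2.4 or later you don't need a udf at all: the built-in array_union function merges the two arrays and drops duplicates across them, which matches the set semantics of collect_set. It exists in both the Scala and Python APIs; since the question is about pyspark, a sketch in Python:

from pyspark.sql.functions import array_union, col

# array_union returns the union of both arrays with duplicates removed
df.withColumn("Food", array_union(col("Fruits"), col("Meat"))).show(truncate=False)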
Assuming df has
+--------------------+--------------------+
| Fruits| Meat|
+--------------------+--------------------+
|[Pear, Orange, Ap...|[Chicken, Pork, B...|
+--------------------+--------------------+
then
import itertools

# chain each row's Fruits and Meat arrays into one flat list
df.rdd.map(lambda x: [item for item in itertools.chain(x.Fruits, x.Meat)]).collect()
combines Fruits & Meat into one collection per row, i.e.
[[u'Pear', u'Orange', u'Apple', u'Chicken', u'Pork', u'Beef']]
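If you'd rather keep the merged result distributed as a DataFrame column instead of collecting it to the driver, a minimal sketch along the same lines (the Food column name is just an assumption):

import itertools

# build (Fruits, Meat, merged) tuples and turn them back into a DataFrame
merged = df.rdd.map(
    lambda x: (x.Fruits, x.Meat, list(itertools.chain(x.Fruits, x.Meat))))
merged.toDF(["Fruits", "Meat", "Food"]).show(truncate=False)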
Hope this helps!
I also needed to solve this in Python, so here is Ramesh's solution ported to Python:
df = spark.createDataFrame([(['Pear','Orange','Apple'], ['Chicken','Pork','Beef'])],
                           ("Fruits", "Meat"))
df.show(1, False)

from pyspark.sql.functions import udf, col

mergeCols = udf(lambda fruits, meat: fruits + meat)
df.withColumn("Food", mergeCols(col("Fruits"), col("Meat"))).show(1, False)
Output:
+---------------------+---------------------+
|Fruits |Meat |
+---------------------+---------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|
+---------------------+---------------------+
+---------------------+---------------------+------------------------------------------+
|Fruits |Meat |Food |
+---------------------+---------------------+------------------------------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|[Pear, Orange, Apple, Chicken, Pork, Beef]|
+---------------------+---------------------+------------------------------------------+
Kudos to Ramesh!
Edit: note that you may have to specify the column type manually (I'm not sure why it only worked for me without an explicit type specification in some cases; in other cases I was getting a string-typed column). For what it's worth, a pyspark udf defaults to a StringType return type when none is given, which would explain the string column.
from pyspark.sql.types import ArrayType, StringType

# declare the return type explicitly so the Food column stays array<string>
mergeCols = udf(lambda fruits, meat: fruits + meat, ArrayType(StringType()))
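To verify the fix, a quick check (a sketch reusing the df and mergeCols defined above):

df.withColumn("Food", mergeCols(col("Fruits"), col("Meat"))).printSchema()
# with the explicit ArrayType(StringType()), Food is reported as array<string>
# rather than the plain string you get from an untyped udf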