Count occurrences of a list of substrings in a pyspark df column
I want to count the occurrences of a list of substrings and create a column based on a column in my pyspark df that contains a long string.
Input:
ID History
1 USA|UK|IND|DEN|MAL|SWE|AUS
2 USA|UK|PAK|NOR
3 NOR|NZE
4 IND|PAK|NOR
lst=['USA','IND','DEN']
Output :
ID History Count
1 USA|UK|IND|DEN|MAL|SWE|AUS 3
2 USA|UK|PAK|NOR 1
3 NOR|NZE 0
4 IND|PAK|NOR 1
# Importing requisite packages and creating a DataFrame
from pyspark.sql.functions import split, col, size, regexp_replace
values = [(1,'USA|UK|IND|DEN|MAL|SWE|AUS'),(2,'USA|UK|PAK|NOR'),(3,'NOR|NZE'),(4,'IND|PAK|NOR')]
df = sqlContext.createDataFrame(values,['ID','History'])
df.show(truncate=False)
+---+--------------------------+
|ID |History |
+---+--------------------------+
|1 |USA|UK|IND|DEN|MAL|SWE|AUS|
|2 |USA|UK|PAK|NOR |
|3 |NOR|NZE |
|4 |IND|PAK|NOR |
+---+--------------------------+
The idea is to split the string on these three delimiters, lst=['USA','IND','DEN'], and then count the number of substrings produced.
For example, the string USA|UK|IND|DEN|MAL|SWE|AUS gets split into four pieces: an empty string, |UK|, |, and |MAL|SWE|AUS. Since 4 substrings were created and 3 delimiters were matched, 4 - 1 = 3 gives the count of these strings appearing in the column string.
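The same split-and-subtract arithmetic can be checked in plain Python before touching Spark (a minimal sketch, not part of the original answer; re.sub and the alternation pattern are just an illustration):
import re
s = 'USA|UK|IND|DEN|MAL|SWE|AUS'
marked = re.sub('USA|IND|DEN', '%', s)   # '%|UK|%|%|MAL|SWE|AUS'
pieces = marked.split('%')               # ['', '|UK|', '|', '|MAL|SWE|AUS'] -> 4 pieces
print(len(pieces) - 1)                   # 4 - 1 = 3 matches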
I am not sure whether Spark supports multi-character delimiters, so as a first step we replace any of these 3 substrings from the list ['USA','IND','DEN'] with a flag/dummy value %. You could use something else as well. The code below does this replacement -
df = df.withColumn('History_X', col('History'))
lst = ['USA','IND','DEN']
for i in lst:
    df = df.withColumn('History_X', regexp_replace(col('History_X'), i, '%'))
df.show(truncate=False)
+---+--------------------------+--------------------+
|ID |History |History_X |
+---+--------------------------+--------------------+
|1 |USA|UK|IND|DEN|MAL|SWE|AUS|%|UK|%|%|MAL|SWE|AUS|
|2 |USA|UK|PAK|NOR |%|UK|PAK|NOR |
|3 |NOR|NZE |NOR|NZE |
|4 |IND|PAK|NOR |%|PAK|NOR |
+---+--------------------------+--------------------+
Finally, we count the number of substrings by splitting History_X with % as the delimiter, then counting the number of substrings created with the size function, and finally subtracting 1 from it.
df = df.withColumn('Count', size(split(col('History_X'), "%")) - 1).drop('History_X')
df.show(truncate=False)
+---+--------------------------+-----+
|ID |History |Count|
+---+--------------------------+-----+
|1 |USA|UK|IND|DEN|MAL|SWE|AUS|3 |
|2 |USA|UK|PAK|NOR |1 |
|3 |NOR|NZE |0 |
|4 |IND|PAK|NOR |1 |
+---+--------------------------+-----+
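Since regexp_replace treats each list entry as a regular expression, a code such as IND would also be replaced inside a longer token like INDO. A word-boundary variant (a sketch under that assumption, not part of the original answer, reusing the df created above) guards against this:
from pyspark.sql.functions import col, regexp_replace, size, split

lst = ['USA', 'IND', 'DEN']
df2 = df.withColumn('History_X', col('History'))
for i in lst:
    # \b ... \b only matches whole codes sitting between | delimiters or string boundaries
    df2 = df2.withColumn('History_X', regexp_replace(col('History_X'), r'\b' + i + r'\b', '%'))
df2 = df2.withColumn('Count', size(split(col('History_X'), '%')) - 1).drop('History_X')
df2.show(truncate=False)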
If you are using Spark 2.4+, you can try the Spark SQL higher-order function filter():
from pyspark.sql import functions as F
>>> df.show(5,0)
+---+--------------------------+
|ID |History |
+---+--------------------------+
|1 |USA|UK|IND|DEN|MAL|SWE|AUS|
|2 |USA|UK|PAK|NOR |
|3 |NOR|NZE |
|4 |IND|PAK|NOR |
+---+--------------------------+
df_new = df.withColumn('data', F.split('History', r'\|')) \
    .withColumn('cnt', F.expr('size(filter(data, x -> x in ("USA", "IND", "DEN")))'))
>>> df_new.show(5,0)
+---+--------------------------+----------------------------------+---+
|ID |History |data |cnt|
+---+--------------------------+----------------------------------+---+
|1 |USA|UK|IND|DEN|MAL|SWE|AUS|[USA, UK, IND, DEN, MAL, SWE, AUS]|3 |
|2 |USA|UK|PAK|NOR |[USA, UK, PAK, NOR] |1 |
|3 |NOR|NZE |[NOR, NZE] |0 |
|4 |IND|PAK|NOR |[IND, PAK, NOR] |1 |
+---+--------------------------+----------------------------------+---+
Here we first split the field History into an array column named data, and then use the filter function:
filter(data, x -> x in ("USA", "IND", "DEN"))
to retrieve only the array elements that satisfy the condition IN ("USA", "IND", "DEN"); after that, we count the resulting array with the size() function.
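The intermediate data column is only there for readability; assuming Spark 2.4+, the split can be done inside the same SQL expression (a sketch, not part of the original answer):
from pyspark.sql import functions as F

# split and filter in a single expression; '\\|' escapes the pipe for the SQL-side regex
df_new = df.withColumn(
    'Count',
    F.expr(r"size(filter(split(History, '\\|'), x -> x in ('USA', 'IND', 'DEN')))")
)
df_new.show(truncate=False)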
UPDATE: added another method using array_contains() which works for older versions of Spark:
lst = ["USA", "IND", "DEN"]
df_new = df.withColumn('data', F.split('History', r'\|')) \
    .withColumn('Count', sum([F.when(F.array_contains('data', e), 1).otherwise(0) for e in lst]))
Note: duplicate entries in the array will be skipped; this method only counts the unique country codes.
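If duplicates should be counted as well on an older Spark version, one option is to reuse the split-and-subtract idea per list element (a sketch, not part of the original answer):
from pyspark.sql import functions as F

lst = ["USA", "IND", "DEN"]
# size(split(History, e)) - 1 is the number of times e occurs in History;
# summing over the list counts every occurrence, duplicates included
df_dup = df.withColumn('Count', sum([F.size(F.split('History', e)) - 1 for e in lst]))
df_dup.show(truncate=False)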