LEFT and RIGHT function in PySpark SQL

I am new to PySpark. I read a CSV file with pandas and created a temporary table using the registerTempTable function.

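import pandas as pd
from pyspark.sql import SQLContext

# `sc` is assumed to be an already-created SparkContext.
sqlc = SQLContext(sc)

# Raw string so the backslash in the Windows path is taken literally.
aa1 = pd.read_csv(r"D:\mck1.csv")

# Convert the pandas DataFrame into a Spark DataFrame.
aa2 = sqlc.createDataFrame(aa1)

aa2.show()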

+--------+-------+----------+------------+---------+------------+-------------------+
|    City|     id|First_Name|Phone_Number|new_date|new      code|           New_date|
+--------+-------+----------+------------+---------+------------+-------------------+
|KOLKATTA|9000007|       AAA|  1111119411| 20080714|          13|2016-08-16 00:00:00|
|KOLKATTA|9000007|       BBB|  1111119421| 20080714|          13|2016-08-06 00:00:00|
|KOLKATTA|9000007|       CCC|  1111119461| 20080714|          13|2016-08-13 00:00:00|
|KOLKATTA|9000007|       DDD|  1111119471| 20080714|          13|2016-08-27 00:00:00|
|KOLKATTA|9000007|       EEE|  1111119491| 20080714|          13|2016-08-15 00:00:00|
|KOLKATTA|9111147|       FFF|  1111119401| 20080714|          13|2016-08-24 00:00:00|
|KOLKATTA|9585458|   FORMULA|  1111110112| 19990930|          13|2016-08-16 00:00:00|
|KOLKATTA|9569878|   APPLEII|  1111110132| 19990930|          13|2016-08-06 00:00:00|

# registerTempTable returns None, so there is nothing useful to assign.
aa2.registerTempTable("mytable1")

sqlc.sql(""" select right(phone_number,4) from mytable1 """).show()

Now I am trying to pull the last four characters from the right of the phone number using right(phone_number,4), and I get the following error:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-18-07f08e3d0a8f> in <module>()
----> 1 sqlc.sql(""" select right(Phone_number,4) from mytable1 """).show()

C:\spark-1.4.1-bin-hadoop2.6\python\pyspark\sql\context.pyc in sql(self, sqlQuery)
    500         [Row(f1=1, f2=u'row1'), Row(f1=2, f2=u'row2'), Row(f1=3, f2=u'row3')]
    501         """
--> 502         return DataFrame(self._ssql_ctx.sql(sqlQuery), self)
    503 
    504     @since(1.0)

C:\spark-1.4.1-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py in __call__(self, *args)
    536         answer = self.gateway_client.send_command(command)
    537         return_value = get_return_value(answer, self.gateway_client,
--> 538                 self.target_id, self.name)
    539 
    540         for temp_arg in temp_args:

C:\spark-1.4.1-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
    298                 raise Py4JJavaError(
    299                     'An error occurred while calling {0}{1}{2}.\n'.
--> 300                     format(target_id, '.', name), value)
    301             else:
    302                 raise Py4JError(

Py4JJavaError: An error occurred while calling o55.sql.
: java.lang.RuntimeException: [1.9] failure: ``union'' expected but `right' found

 select right(Phone_number,4) from mytable1 
        ^
    at scala.sys.package$.error(package.scala:27)
    at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)
    at org.apache.spark.sql.catalyst.DefaultParserDialect.parse(ParserDialect.scala:67)
    at org.apache.spark.sql.SQLContext$$anonfun.apply(SQLContext.scala:145)

Why doesn't PySpark support the RIGHT and LEFT functions? How can I take the rightmost four characters of a column?

Have a look at the documentation. Have you tried the substring function?

pyspark.sql.functions.substring(str, pos, len)

Edit

Based on your comment, you can get the last four characters like this:

from pyspark.sql.functions import substring

df = sqlContext.createDataFrame([('abcdefg',)], ['s',])
df.select(substring(df.s, -4, 4).alias('s')).collect()
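To see why a negative start position yields the trailing characters, here is a minimal plain-Python sketch of how I understand substring's position handling: positions are 1-based, and a negative position counts back from the end of the string. The names sql_substring, left and right are hypothetical helpers for illustration, not Spark APIs, and edge cases in real Spark may differ from this model.

```python
def sql_substring(s, pos, length):
    """Plain-Python model of SQL-style substring(str, pos, len).

    Assumption: pos is 1-based, pos == 0 is treated like 1, and a
    negative pos counts back from the end of the string.
    """
    n = len(s)
    if pos > 0:
        start = pos - 1      # 1-based to 0-based
    elif pos < 0:
        start = n + pos      # count back from the end
    else:
        start = 0
    end = start + length
    start = max(start, 0)    # clamp when pos reaches before the string
    if end <= start:
        return ""
    return s[start:end]


def left(s, n):
    # LEFT(s, n) expressed through substring: start at position 1.
    return sql_substring(s, 1, n)


def right(s, n):
    # RIGHT(s, n) expressed through substring: start n characters from the end.
    return sql_substring(s, -n, n)


print(right("abcdefg", 4))      # last four characters
print(left("abcdefg", 3))       # first three characters
print(right("1111119411", 4))   # last four digits of a phone number string
```

Under this model, substring(df.s, -4, 4) plays the role of RIGHT(s, 4) and substring(df.s, 1, 4) plays the role of LEFT(s, 4).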

Try it with rpad:

sqlc.sql(""" select rpad(phone_number, 4, phone_number) from mytable1 """).show()
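One caveat about this suggestion, as far as I understand Spark's rpad: when the requested length is smaller than the string, the value is shortened to that length and the leading characters are kept. So rpad(phone_number, 4, ...) would behave like LEFT(phone_number, 4) rather than RIGHT. The plain-Python model below sketches that behaviour so it can be checked without a cluster. The name sql_rpad is a hypothetical helper, not a Spark API.

```python
def sql_rpad(s, length, pad):
    """Plain-Python model of SQL-style rpad(str, len, pad).

    Assumption: a string longer than `length` is cut down to its first
    `length` characters; otherwise `pad` is repeated on the right until
    the result is exactly `length` characters long.
    """
    if length <= len(s) or not pad:
        return s[:length]
    needed = length - len(s)
    # Repeat the pad enough times, then trim to the exact missing width.
    return s + (pad * needed)[:needed]


phone = "1111119411"
print(sql_rpad(phone, 4, phone))   # leading four characters, not the trailing ones
print(sql_rpad("ab", 5, "xy"))     # padded on the right up to five characters
```

If the goal is the trailing four characters, substring with a negative start position is the closer match.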

I know this is an old question, but this can also be done directly from the aa2 PySpark DataFrame using the expr function:

from pyspark.sql.functions import expr

aa2.select(expr('RIGHT(phone_number, 4)')).show()

|right(phone_number, 4)|
|----------------------|
|                  9411|
|                  9421|
|                  9461|
|                  9471|
|                  9491|
|                  9401|
|                  0112|
|                  0132|