LEFT and RIGHT function in PySpark SQL
I am new to PySpark. I loaded a CSV file using pandas and created a temporary table using the registerTempTable function.
from pyspark.sql import SQLContext
from pyspark.sql import Row
import pandas as pd
sqlc = SQLContext(sc)
aa1 = pd.read_csv(r"D:\mck1.csv")  # raw string so the backslash is not treated as an escape
aa2 = sqlc.createDataFrame(aa1)
aa2.show()
+--------+-------+----------+------------+---------+------------+-------------------+
|    City|     id|First_Name|Phone_Number| new_date|    new code|           New_date|
+--------+-------+----------+------------+---------+------------+-------------------+
|KOLKATTA|9000007|       AAA|  1111119411| 20080714|          13|2016-08-16 00:00:00|
|KOLKATTA|9000007|       BBB|  1111119421| 20080714|          13|2016-08-06 00:00:00|
|KOLKATTA|9000007|       CCC|  1111119461| 20080714|          13|2016-08-13 00:00:00|
|KOLKATTA|9000007|       DDD|  1111119471| 20080714|          13|2016-08-27 00:00:00|
|KOLKATTA|9000007|       EEE|  1111119491| 20080714|          13|2016-08-15 00:00:00|
|KOLKATTA|9111147|       FFF|  1111119401| 20080714|          13|2016-08-24 00:00:00|
|KOLKATTA|9585458|   FORMULA|  1111110112| 19990930|          13|2016-08-16 00:00:00|
|KOLKATTA|9569878|   APPLEII|  1111110132| 19990930|          13|2016-08-06 00:00:00|
aa3 = aa2.registerTempTable("mytable1")
sqlc.sql(""" select right(phone_number,4) from mytable1 """).show()
Now I am trying to pull the last four characters of the phone number using right(phone_number,4), and I get the following error:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-18-07f08e3d0a8f> in <module>()
----> 1 sqlc.sql(""" select right(Phone_number,4) from mytable1 """).show()
C:\spark-1.4.1-bin-hadoop2.6\python\pyspark\sql\context.pyc in sql(self, sqlQuery)
500 [Row(f1=1, f2=u'row1'), Row(f1=2, f2=u'row2'), Row(f1=3, f2=u'row3')]
501 """
--> 502 return DataFrame(self._ssql_ctx.sql(sqlQuery), self)
503
504 @since(1.0)
C:\spark-1.4.1-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py in __call__(self, *args)
536 answer = self.gateway_client.send_command(command)
537 return_value = get_return_value(answer, self.gateway_client,
--> 538 self.target_id, self.name)
539
540 for temp_arg in temp_args:
C:\spark-1.4.1-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
298 raise Py4JJavaError(
299 'An error occurred while calling {0}{1}{2}.\n'.
--> 300 format(target_id, '.', name), value)
301 else:
302 raise Py4JError(
Py4JJavaError: An error occurred while calling o55.sql.
: java.lang.RuntimeException: [1.9] failure: ``union'' expected but `right' found
select right(Phone_number,4) from mytable1
^
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)
at org.apache.spark.sql.catalyst.DefaultParserDialect.parse(ParserDialect.scala:67)
at org.apache.spark.sql.SQLContext$$anonfun.apply(SQLContext.scala:145)
Why doesn't PySpark support the RIGHT and LEFT functions?
How can I take the rightmost four characters of a column?
Have a look at the documentation. Have you tried the substring function?
pyspark.sql.functions.substring(str, pos, len)
Edit
Based on your comment, you can get the last four characters like this:
from pyspark.sql.functions import substring
df = sqlContext.createDataFrame([('abcdefg',)], ['s',])
df.select(substring(df.s, -4, 4).alias('s')).collect()
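Try rpad. Note that rpad shortens a string that is longer than the requested length, so this returns the first four characters (the equivalent of LEFT), not the last four: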
sqlc.sql(""" select rpad(phone_number, 4, phone_number) from mytable1 """).show()
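A quick way to see what that rpad call returns is a plain-Python model of the semantics. This is a sketch, not Spark code: sql_rpad and sql_right are hypothetical helpers written for illustration, and the model assumes a non-empty pad string. It follows the documented rule that rpad shortens the input to len characters when the input is already longer, which is why the call above yields the leading four digits rather than the trailing ones.

```python
def sql_rpad(s, length, pad):
    """Hypothetical pure-Python model of SQL rpad (assumes pad is non-empty)."""
    if length <= len(s):
        # Input already long enough: rpad truncates, keeping the LEFT part.
        return s[:length]
    needed = length - len(s)
    # Repeat the pad string and cut it to exactly the number of missing characters.
    padding = (pad * (needed // len(pad) + 1))[:needed]
    return s + padding


def sql_right(s, n):
    """Hypothetical pure-Python model of SQL RIGHT(s, n)."""
    return s[-n:] if n > 0 else ""


phone = "1111119411"
print(sql_rpad(phone, 4, phone))  # leading four characters
print(sql_right(phone, 4))        # trailing four characters
```

So rpad is only a stand-in for LEFT here; for the last four characters, substring with a negative position is the better fit.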
I know this is an old question, but this can also be done directly on the "aa2" PySpark DataFrame using the "expr" function:
from pyspark.sql.functions import expr
aa2.select(expr('RIGHT(phone_number, 4)')).show()
|right(phone_number, 4)|
|----------------------|
|                  9411|
|                  9421|
|                  9461|
|                  9471|
|                  9491|
|                  9401|
|                  0112|
|                  0132|