pyspark.sql.utils.ParseException: "\nmismatched input" in PySpark SQL

I need to transpose a table using pivot. I have tried this in PySpark SQL, but I am running into a problem.

Store      Week     xCount
-------    ----     ------
102        1        96
101        1        138
105        1        37
109        1        59
101        2        282
102        2        212
105        2        78
109        2        97
105        3        60
102        3        123
101        3        220
109        3        87

I want it to come out as a pivoted table, like this:

Store        1          2          3
-----      ---        ---        ---
101        138        282        220
102         96        212        123
105         37         78         60
109         59         97         87

The code I am using is as follows:

from pyspark import SparkContext, SparkConf
from pyspark.sql import Row
from pyspark.sql.context import SQLContext

conf = SparkConf().setAppName("FileSystem").setMaster("local")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# read comma-separated lines and convert each one into a Row
a = sc.textFile("C:/Users/ashok.darsi/Desktop/Data1.txt")
b = a.map(lambda x: x.split(",")).map(lambda x: Row(store=int(x[0]), week=int(x[1]), xcount=int(x[2])))
c = sqlContext.createDataFrame(b)

c.registerTempTable("d")

# this is the line that raises the ParseException below
e = sqlContext.sql("select * from d pivot(max(xcount)) for week in ([1],[2],[3])) as pivt")

e.show()

The error I am getting:

select * from d pivot(max(xcount)) for week in ([1],[2],[3])) as pivt
---------------------^^^

pyspark.sql.utils.ParseException: "\nmismatched input '(' expecting {<EOF>, ',', 'WHERE', 'GROUP', 'ORDER', 'HAVING', 'LIMIT', 'JOIN', 'CROSS', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'ANTI'}(line 1, pos 21)\n\n== SQL ==\nselect * from d pivot(max(xcount)) for week in ([1],[2],[3])) as pivt\n---------------------^^^\n"

I don't think this PIVOT syntax is supported by HiveQL/Spark SQL in the version you are running; a native PIVOT clause was only added to Spark SQL in Spark 2.4.
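If you are on Spark 2.4 or later, the immediate cause of the parse error is the parenthesization: the FOR ... IN part must sit inside PIVOT (...), and the bracket form [1] is not valid there. A minimal sketch of the corrected query, assuming Spark 2.4+ and the temp table d from your code:

# PIVOT clause requires Spark 2.4+; FOR ... IN goes inside PIVOT (...)
e = sqlContext.sql("SELECT * FROM d PIVOT (max(xcount) FOR week IN (1, 2, 3))")
e.show()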

On earlier versions, you can use the DataFrame-specific pivot function instead (df below is the DataFrame c from your code):

df.groupBy("store").pivot("week").max("xcount").show()

+-----+---+---+---+
|store|  1|  2|  3|
+-----+---+---+---+
|  101|138|282|220|
|  102| 96|212|123|
|  105| 37| 78| 60|
|  109| 59| 97| 87|
+-----+---+---+---+
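As a side note, if you pass the distinct pivot values explicitly, Spark skips the extra job it would otherwise run to collect them; a small sketch with the same DataFrame:

# supplying the values up front avoids an extra pass to discover them
df.groupBy("store").pivot("week", [1, 2, 3]).max("xcount").show()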

With Spark SQL, you have to specify each week value manually and then take the max of them:

select 
  store,
  max(case when week = 1 then xcount else null end) as `1`,
  max(case when week = 2 then xcount else null end) as `2`,
  max(case when week = 3 then xcount else null end) as `3`
from
  d
group by 
  store
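For completeness, a sketch of running that query through the SQLContext from the question, against the temp table d registered earlier:

# same manual CASE WHEN pivot, executed from PySpark
e = sqlContext.sql("""
    select
      store,
      max(case when week = 1 then xcount else null end) as `1`,
      max(case when week = 2 then xcount else null end) as `2`,
      max(case when week = 3 then xcount else null end) as `3`
    from d
    group by store
""")
e.show()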