Proc rank alternative in PySpark

Question

I want to convert the following SAS code to PySpark:

SAS:

    proc rank data = INP(where = (col= 1)) 
              out = RESULT groups = 3 descending ;
       var Col1
           Col2
           Col3
           Col4;
       ranks R_Col1 F_Col2 M_Col3 O_Col4 ;
    run ;

I am trying to achieve the above with the following PySpark code, but I get the error "'DataFrame' object has no attribute 'apply'".

PySpark:

    def grouping(data):
        dec=pd.qcut(data['Col1','Col2','Col3','Col4'],3,labels=False)
        data['ranks']=dec
        return data
    RESULT =INP.apply(grouping)

Any help on this is much appreciated!

Thanks

Answer 1

I tried the following solution:

    # Assumes the DataFrame INP has already been registered as a temporary table named INP.
    RESULT = sqlContext.sql(
    """
    SELECT  *,
         ntile(3) OVER (order by Col1 desc) AS R_Col1,
         ntile(3) OVER (order by Col2 desc) AS F_Col2,
         ntile(3) OVER (order by Col3 desc) AS M_Col3,
         ntile(3) OVER (order by Col4 desc) AS O_Col4
    FROM INP
    WHERE col=1
    """
    )
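
One caveat worth checking before relying on this as a drop-in replacement: ntile(3) numbers its buckets 1 to 3, while PROC RANK with groups = 3 numbers them 0 to 2, and the two can also split rows differently when the row count is not a multiple of 3. The sketch below is plain Python (no Spark needed) and only simulates the two bucketing rules so the difference is visible. It assumes there are no tied values and that PROC RANK follows the FLOOR(rank * k / (n + 1)) formula given in the SAS documentation for GROUPS=. The function names are hypothetical and not part of any library.

```python
# Pure-Python sketch comparing two bucketing rules on row positions.
# Function names are hypothetical; tied values are assumed absent.


def ntile_buckets(n_rows, k):
    """Bucket number (1..k) for each ordered row, as SQL NTILE(k) assigns them:
    the first (n_rows % k) buckets each receive one extra row."""
    base, extra = divmod(n_rows, k)
    buckets = []
    for bucket in range(1, k + 1):
        size = base + (1 if bucket <= extra else 0)
        buckets.extend([bucket] * size)
    return buckets


def sas_groups(n_rows, k):
    """Group number (0..k-1) for each rank 1..n_rows, using
    FLOOR(rank * k / (n + 1)), the formula assumed here for GROUPS=k."""
    return [(rank * k) // (n_rows + 1) for rank in range(1, n_rows + 1)]


if __name__ == "__main__":
    n, k = 10, 3
    print(ntile_buckets(n, k))  # [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
    print(sas_groups(n, k))     # [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
```

With 10 rows, the fourth row lands in the first bucket under ntile but in the second group under the SAS formula, so subtracting 1 from the ntile result matches SAS only when the bucket sizes happen to line up (for example with 9 rows). If exact parity with the SAS output matters, it may be safer to compare both results on a sample of the real data.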
