Behavior of a 'Column' object inside a function in Spark

I am writing code to replace characters matching the pattern [^\w | ] with ''. The problem is that when I apply my function 'removePunctuation' to the DataFrame 'sentenceDF', I get the error "'Column' object is not callable".

    from pyspark.sql.functions import regexp_replace, trim, col, lower

    def removePunctuation(column):
        cleanString = column
        cleanString = cleanString.select(regexp_replace(sentenceDF['sentence'],'[^\w | ]','').alias('sentence'))
        cleanString = cleanString.select(regexp_replace(cleanString['sentence'],'_','').alias('sentence'))
        cleanString = cleanString.select(lower(cleanString['sentence']))

        return cleanString



    sentenceDF = sqlContext.createDataFrame([('Hi, you!',),
                                             (' No under_score!',),
                                             (' *      Remove punctuation then spaces  * ',)], ['sentence'])

    result = sentenceDF.select(removePunctuation(col('sentence')))
    result.show()

Traceback:

    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-50-aa978fac8bae> in <module>()
         15 (' *      Remove punctuation then spaces  * ',)], ['sentence'])
         16
    ---> 17 result = sentenceDF.select(removePunctuation(col('sentence')))
         18 result.show()

    <ipython-input-50-aa978fac8bae> in removePunctuation(column)
          4 def removePunctuation(column):
          5     cleanString = column
    ----> 6     cleanString = cleanString.select(regexp_replace(sentenceDF['sentence'],'[^\w | ]','').alias('sentence'))
          7     cleanString = cleanString.select(regexp_replace(cleanString['sentence'],'_','').alias('sentence'))
          8     cleanString = cleanString.select(lower(cleanString['sentence']))

    TypeError: 'Column' object is not callable

    Command took 0.09 seconds -- by andres.velez.e@gmail.com at 10/30/2016, 2:48:17 PM on My Cluster (6 GB)

Just run this and you will get the same error:

    col('sentence').select()
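The message says "not callable" rather than "has no attribute 'select'" because, as far as I know, a PySpark Column treats an unknown attribute name as a nested-field lookup and hands back another Column, and it is that returned Column that then gets called. A minimal sketch of the mechanism in plain Python, where FakeColumn and try_select are hypothetical stand-ins and not PySpark's actual code:

```python
class FakeColumn:
    """Hypothetical stand-in for pyspark.sql.Column; not PySpark's real code."""

    def __init__(self, name):
        self.name = name

    def __getattr__(self, item):
        # Runs only when normal attribute lookup fails. The unknown name is
        # treated as a nested field, so another column-like object comes back.
        if item.startswith("__"):
            raise AttributeError(item)
        return FakeColumn(f"{self.name}.{item}")


def try_select(column):
    """Return the error message raised by calling column.select(), if any."""
    try:
        column.select()
    except TypeError as exc:
        return str(exc)
    return None


message = try_select(FakeColumn("sentence"))
print(message)  # 'FakeColumn' object is not callable
```

So select has to be called on the DataFrame, with Column expressions passed in as arguments.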

Suggestion: always try writing the code out step by step before refactoring it into a function.

Anyway, I think this is what you want:

    from pyspark.sql.functions import col, lower, regexp_replace, trim

    def removePunctuation(df, column):
        # Trim and lowercase first, then drop non-word characters, whitespace and underscores.
        cleanString = df.select(trim(lower(col(column))).alias(column))
        cleanString = cleanString.select(regexp_replace(column, r'[^\w]|\s+|_', '').alias(column))
        return cleanString

    result = removePunctuation(sentenceDF, 'sentence')
    result.show()

    +--------------------+
    |            sentence|
    +--------------------+
    |               hiyou|
    |        nounderscore|
    |removepunctuation...|
    +--------------------+
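To sanity-check the pattern without starting Spark, the same trim, lowercase and replace steps can be run through Python's re module. This is only a rough sketch: remove_punctuation is a hypothetical helper, and it assumes Python's re and the Java regex engine used by Spark agree on this pattern for plain ASCII input like the three sample sentences.

```python
import re

# Same pattern as in the Spark answer. [^\w] already matches whitespace,
# so the \s+ branch is redundant but harmless.
PATTERN = re.compile(r"[^\w]|\s+|_")


def remove_punctuation(text):
    """Trim, lowercase, then drop every non-word character and underscore."""
    return PATTERN.sub("", text.strip().lower())


sentences = [
    "Hi, you!",
    " No under_score!",
    " *      Remove punctuation then spaces  * ",
]
cleaned = [remove_punctuation(s) for s in sentences]
print(cleaned)
# ['hiyou', 'nounderscore', 'removepunctuationthenspaces']
```

The last row is the full string that show() truncates to 'removepunctuation...' in the table above. Note that the pattern removes the spaces between words as well, not just the punctuation.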