Apply function to Spark RDD

Question

I am trying to do some analysis on tweets. I want to apply .lower() to the text of every tweet. I used the following code:

    actual_tweets = actual_tweets.map(lambda line: line["text"].lower() and line["quoted_status"]["text"].lower() if 'quoted_status' in line else line["text"].lower()).collect()

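The problem is that, because I am using map, this line converts the text attribute to lowercase and returns only that attribute, discarding all the others, which is not what I want. Is there a Spark transformation that can help me achieve this?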

Answer 1

For example, you can return a tuple (input, transformed_input):

    def transform(line):
        if 'quoted_status' in line:
            return (
                # Is `and` what you really want here? `a and b` evaluates to b
                # whenever a is non-empty, so only the quoted text is kept.
                line, line["text"].lower() and line["quoted_status"]["text"].lower()
            )
        else:
            return line, line["text"].lower()

    actual_tweets.map(transform)
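
If the goal is to keep every other attribute of the tweet, another option is to return a modified copy of the record instead of a tuple. This is a minimal sketch, assuming each record is a Python dict as in the question. The function name `lowercase_text_fields` and the sample records are hypothetical and only there for illustration, and the built-in `map` stands in for the RDD call so the snippet runs without a Spark cluster.

```python
def lowercase_text_fields(tweet):
    """Return a copy of the tweet dict with its text fields lowercased."""
    result = dict(tweet)  # shallow copy so the input record is not mutated
    result["text"] = tweet["text"].lower()
    if "quoted_status" in tweet:
        # Copy the nested dict too before changing its text.
        quoted = dict(tweet["quoted_status"])
        quoted["text"] = quoted["text"].lower()
        result["quoted_status"] = quoted
    return result


# Hypothetical sample records standing in for the tweets in the RDD.
sample_tweets = [
    {"id": 1, "text": "Hello World"},
    {"id": 2, "text": "RT This", "quoted_status": {"id": 3, "text": "Original TWEET"}},
]

# Plain-Python stand-in for the Spark call; with an RDD this would be
#   actual_tweets.map(lowercase_text_fields)
lowered = list(map(lowercase_text_fields, sample_tweets))
```

Each output record keeps all of its original keys, and only the two text values change.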

python

apache-spark

pyspark