Convert a Pipeline RDD into a Spark dataframe
Starting from this:
items.take(2)
[['home', 'alone', 'apparently'], ['st','louis','plant','close','die','old','age','workers','making','cars','since','onset','mass','automotive','production','1920s']]
type(items)
pyspark.rdd.PipelinedRDD
I want to convert this into a Spark dataframe with one column and one row per list of words.
You can use toDF to create the dataframe, but remember to wrap each list in an outer list first, so that Spark understands each of your rows holds a single column.
df = items.map(lambda x: [x]).toDF(['words'])
df.show(truncate=False)
+------------------------------------------------------------------------------------------------------------------+
|words |
+------------------------------------------------------------------------------------------------------------------+
|[home, alone, apparently] |
|[st, louis, plant, close, die, old, age, workers, making, cars, since, onset, mass, automotive, production, 1920s]|
+------------------------------------------------------------------------------------------------------------------+
df.printSchema()
root
|-- words: array (nullable = true)
| |-- element: string (containsNull = true)