List to DataFrame in pyspark
Can someone tell me how to convert a list containing strings into a DataFrame in pyspark? I'm using Python 3.6 and Spark 2.2.1. I'm just getting started with the Spark environment, and my data looks like this:
my_data =[['apple','ball','ballon'],['cat','camel','james'],['none','focus','cake']]
Now I'd like to create a DataFrame that looks like this:
---------------------------------
|ID | words                     |
---------------------------------
| 1 | ['apple','ball','ballon'] |
| 2 | ['cat','camel','james']   |
---------------------------------
I'd also like to add an ID column, which is not part of the data itself.
You can convert that list into a list of Row objects, then use spark.createDataFrame to infer the schema from your data:
from pyspark.sql import Row
R = Row('ID', 'words')
# use enumerate to add the ID column
spark.createDataFrame([R(i, x) for i, x in enumerate(my_data)]).show()
+---+--------------------+
| ID| words|
+---+--------------------+
| 0|[apple, ball, bal...|
| 1| [cat, camel, james]|
| 2| [none, focus, cake]|
+---+--------------------+
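If you'd rather not rely on schema inference, you can also pass an explicit schema. A minimal sketch, assuming ID should be a long and words an array of strings:

from pyspark.sql.types import StructType, StructField, LongType, ArrayType, StringType

# Explicit schema: no type inference needed, types are deterministic.
schema = StructType([
    StructField("ID", LongType(), False),
    StructField("words", ArrayType(StringType()), True),
])
spark.createDataFrame([(i, x) for i, x in enumerate(my_data)], schema).show()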
Try this:
data_array = []
for i in range(len(my_data)):
    data_array.append((i, my_data[i]))
df = spark.createDataFrame(data=data_array, schema=["ID", "words"])
df.show()
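The same list can be built more compactly with enumerate, which yields (index, element) pairs directly; an equivalent sketch:

# enumerate replaces the manual index loop.
data_array = list(enumerate(my_data))
df = spark.createDataFrame(data=data_array, schema=["ID", "words"])
df.show()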
Try this; it's the simplest way:
from datetime import datetime, timezone
from pyspark.sql import Row

utc = datetime.now(timezone.utc)  # placeholder timestamp for this example
x = Row(utc_timestamp=utc, routine='routine name', message='your message')
data = [x]
df = sqlContext.createDataFrame(data)
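Applied to the question's my_data, that Row pattern would look roughly like this (a sketch; in Spark 2.x spark.createDataFrame works the same way):

from pyspark.sql import Row

# One Row per inner list, numbered with enumerate.
rows = [Row(ID=i, words=w) for i, w in enumerate(my_data)]
spark.createDataFrame(rows).show()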
A simple approach:
my_data =[['apple','ball','ballon'],['cat','camel','james'],['none','focus','cake']]
spark.sparkContext.parallelize(my_data).zipWithIndex() \
    .toDF(["words", "id"]).show(truncate=False)
+---------------------+---+
|words                |id |
+---------------------+---+
|[apple, ball, ballon]|0  |
|[cat, camel, james]  |1  |
|[none, focus, cake]  |2  |
+---------------------+---+
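If the IDs only need to be unique rather than consecutive, pyspark.sql.functions.monotonically_increasing_id avoids the RDD round trip; a minimal sketch:

from pyspark.sql.functions import monotonically_increasing_id

# Note: the generated IDs are unique and increasing, but not guaranteed consecutive.
df = spark.createDataFrame([(w,) for w in my_data], ["words"])
df.withColumn("id", monotonically_increasing_id()).show(truncate=False)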