PySpark DataFrame from Python Dictionary without Pandas
I am trying to convert the following Python dict into a PySpark DataFrame, but I am not getting the expected output.
dict_lst = {'letters': ['a', 'b', 'c'],
'numbers': [10, 20, 30]}
df_dict = sc.parallelize([dict_lst]).toDF() # Result not as expected
df_dict.show()
Is there a way to do this without using Pandas?
Try this:
dict_lst = [{'letters': 'a', 'numbers': 10},
{'letters': 'b', 'numbers': 20},
{'letters': 'c', 'numbers': 30}]
df_dict = sc.parallelize(dict_lst).toDF() # Result as expected
Output:
>>> df_dict.show()
+-------+-------+
|letters|numbers|
+-------+-------+
|      a|     10|
|      b|     20|
|      c|     30|
+-------+-------+
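If the data already exists in the column-oriented shape from the question, it can be reshaped into the list of dicts used above without Pandas. The following is a minimal sketch in plain Python; `to_records` is a hypothetical helper name, and it assumes every list has the same length.

```python
# Column-oriented input, as in the question.
dict_lst = {'letters': ['a', 'b', 'c'],
            'numbers': [10, 20, 30]}


def to_records(columns):
    """Turn a dict of equal-length lists into a list of per-row dicts."""
    names = list(columns)
    # zip(*values) walks the lists in lockstep, yielding one tuple per row.
    return [dict(zip(names, row)) for row in zip(*columns.values())]


records = to_records(dict_lst)
print(records)

# With a live SparkContext `sc`, the result could then be used as above:
# df_dict = sc.parallelize(records).toDF()
```

Each element of `records` describes one row, which is the shape `toDF()` received in the working example.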
The most efficient approach is to use Pandas:
import pandas as pd
spark.createDataFrame(pd.DataFrame(dict_lst))
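Your dict_lst is not really in the format you want for creating a DataFrame. It would be better to have a list of dicts instead of a dict of lists.
This code creates a DataFrame from your dict of lists: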
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
dict_lst = {'letters': ['a', 'b', 'c'],
'numbers': [10, 20, 30]}
values_lst = dict_lst.values()
nb_rows = [len(lst) for lst in values_lst]
assert min(nb_rows) == max(nb_rows)  # We must have the same number of elements for each key
row_lst = []
columns = dict_lst.keys()
for i in range(nb_rows[0]):
    row_values = [lst[i] for lst in values_lst]
    row_dict = {column: value for column, value in zip(columns, row_values)}
    row = Row(**row_dict)
    row_lst.append(row)
df = sqlContext.createDataFrame(row_lst)
Quote:
I find it's useful to think of the argument to createDataFrame() as a list of tuples where each entry in the list corresponds to a row in the DataFrame and each element of the tuple corresponds to a column.
So the easiest thing to do is convert your dictionary to this format. You can do this easily with zip():
column_names, data = zip(*dict_lst.items())
spark.createDataFrame(zip(*data), column_names).show()
#+-------+-------+
#|letters|numbers|
#+-------+-------+
#|      a|     10|
#|      b|     20|
#|      c|     30|
#+-------+-------+
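To make the two zip() calls easier to follow, here is a small sketch in plain Python that prints the intermediate values. No Spark session is needed for it; the variable `rows` is introduced only for illustration.

```python
dict_lst = {'letters': ['a', 'b', 'c'],
            'numbers': [10, 20, 30]}

# First zip: split the (key, list) pairs into a tuple of keys
# and a tuple of column lists.
column_names, data = zip(*dict_lst.items())
print(column_names)  # ('letters', 'numbers')
print(data)          # (['a', 'b', 'c'], [10, 20, 30])

# Second zip: transpose the column lists into one tuple per row,
# which matches the "list of tuples" view of createDataFrame().
rows = list(zip(*data))
print(rows)          # [('a', 10), ('b', 20), ('c', 30)]
```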
The above assumes that all of the lists are the same length. If that is not the case, you would have to use itertools.izip_longest (Python 2) or itertools.zip_longest (Python 3).
from itertools import zip_longest # use this for python3
#from itertools import izip_longest as zip_longest # use this for python2
dict_lst = {'letters': ['a', 'b', 'c'],
'numbers': [10, 20, 30, 40]}
column_names, data = zip(*dict_lst.items())
spark.createDataFrame(zip_longest(*data), column_names).show()
#+-------+-------+
#|letters|numbers|
#+-------+-------+
#|      a|     10|
#|      b|     20|
#|      c|     30|
#|   null|     40|
#+-------+-------+
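The difference between the two functions can be checked without Spark. This sketch, written for Python 3, compares plain zip(), which silently stops at the shortest list, with zip_longest(), which pads the missing entries with None (shown as null in the output above). The names `truncated` and `padded` are only illustrative.

```python
from itertools import zip_longest  # Python 3 name

dict_lst = {'letters': ['a', 'b', 'c'],
            'numbers': [10, 20, 30, 40]}
column_names, data = zip(*dict_lst.items())

# zip() stops at the shortest list, so the value 40 is dropped silently.
truncated = list(zip(*data))
print(truncated)  # [('a', 10), ('b', 20), ('c', 30)]

# zip_longest() keeps going and fills the gaps with None.
padded = list(zip_longest(*data))
print(padded)     # [('a', 10), ('b', 20), ('c', 30), (None, 40)]
```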
Using pault's answer above, I imposed a specific schema on my DataFrame as follows:
import pyspark
from pyspark.sql import SparkSession, functions
spark = SparkSession.builder.appName('dictToDF').getOrCreate()
Get the data:
dict_lst = {'letters': ['a', 'b', 'c'],'numbers': [10, 20, 30]}
data = dict_lst.values()
Create the schema:
from pyspark.sql.types import *
myschema = StructType([
    StructField("letters", StringType(), True),
    StructField("numbers", IntegerType(), True),
])
Create the df from the dictionary, using the schema:
df=spark.createDataFrame(zip(*data), schema = myschema)
df.show()
+-------+-------+
|letters|numbers|
+-------+-------+
|      a|     10|
|      b|     20|
|      c|     30|
+-------+-------+
Show the df schema:
df.printSchema()
root
 |-- letters: string (nullable = true)
 |-- numbers: integer (nullable = true)
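One thing to watch with this approach: `dict_lst.values()` yields the columns in the dictionary's own key order, so the rows only line up with the schema if that order matches the order of the schema fields. A possible safeguard, sketched here in plain Python, is to pick the columns by field name. The list `field_names` is a stand-in for what `myschema.fieldNames()` would return, and the dictionary keys are deliberately reordered to show the effect.

```python
# Keys deliberately in a different order than the schema fields.
dict_lst = {'numbers': [10, 20, 30],
            'letters': ['a', 'b', 'c']}

# Stand-in for myschema.fieldNames(); hard-coded so the sketch
# runs without a Spark installation.
field_names = ['letters', 'numbers']

# Select each column by name, then transpose the columns into row tuples.
rows = list(zip(*(dict_lst[name] for name in field_names)))
print(rows)  # [('a', 10), ('b', 20), ('c', 30)]

# With a live session, the rows could then be passed along with the schema:
# df = spark.createDataFrame(rows, schema=myschema)
```

A key that is missing from the dictionary raises a KeyError here, which surfaces a mismatch between data and schema early.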
You can also use a Python list to quickly prototype a DataFrame. The idea is based on Databricks's tutorial.
df = spark.createDataFrame(
[(1, "a"),
(1, "a"),
(1, "b")],
("id", "value"))
df.show()
+---+-----+
| id|value|
+---+-----+
|  1|    a|
|  1|    a|
|  1|    b|
+---+-----+