Creating Numpy Matrix from pyspark dataframe
I have a pyspark dataframe child with the following columns:
lat1 lon1
80 70
65 75
I am trying to convert it to a numpy matrix using IndexedRowMatrix, as follows:
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix
mat = IndexedRowMatrix(child.select('lat','lon').rdd.map(lambda row: IndexedRow(row[0], Vectors.dense(row[1:]))))
But it throws me an error. I want to avoid converting to a pandas dataframe to get the matrix.
Error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 33.0 failed 4 times, most recent failure: Lost task 0.3 in stage 33.0 (TID 733, ebdp-avdc-d281p.sys.comcast.net, executor 16): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/data/02/yarn/nm/usercache/mbansa001c/appcache/application_1506130884691_56333/container_e48_1506130884691_56333_01_000017/pyspark.zip/pyspark/worker.py", line 174, in main
process()
You want to avoid pandas, yet you are trying to convert to an RDD, which is severely suboptimal here...
In any case, provided that you can collect the selected columns of your child dataframe (a reasonable assumption, given that you intend to put them into a Numpy array anyway), this can be done with plain Numpy:
import numpy as np
np.array(child.select('lat1', 'lon1').collect())
# array([[80, 70],
# [65, 75]])
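If you genuinely need a distributed matrix rather than a local Numpy array, the IndexedRowMatrix route can also be made to work. A minimal sketch, under two assumptions: the real column names are lat1/lon1 (matching the sample data, not the lat/lon in your snippet), and the row index comes from zipWithIndex rather than from the data itself; note that your snippet passes row[0] (the latitude value) as the index and never imports Vectors, either of which could plausibly trigger the worker failure above:
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

# hypothetical reproduction of the example dataframe
child = spark.createDataFrame([(80, 70), (65, 75)], ['lat1', 'lon1'])

# zipWithIndex pairs each Row with a sequential index; the index becomes
# the IndexedRow index, and the Row values become a dense vector
mat = IndexedRowMatrix(
    child.select('lat1', 'lon1').rdd
         .zipWithIndex()
         .map(lambda x: IndexedRow(x[1], Vectors.dense(list(x[0]))))
)
mat.numRows()  # 2
Bear in mind that collect() in the snippet above pulls all selected rows to the driver, so it only works when they fit in driver memory; the distributed version avoids that, but keeps the data in Spark rather than in Numpy.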