Couldn't load a pyspark DataFrame into a decision tree algorithm. It says it can't work with a pyspark DataFrame

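I am working on IBM's data platform. I was able to load the data into a pyspark DataFrame and create a Spark SQL table from it. After splitting the dataset, I feed it into a classification algorithm. It raises an error along the lines of "Spark SQL data can't be loaded" and says it needs ndarrays.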

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import numpy as np

X_train,y_train,X_test,y_test = train_test_split(x,y,test_size = 0.1,random_state = 42)
RM = RandomForestRegressor()
RM.fit(X_train.reshape(1,-1),y_train)

Error:

TypeError: Expected sequence or array-like, got <class 'pyspark.sql.dataframe.DataFrame'>

After this error, I did something like this:

x = spark.sql('select Id,YearBuilt,MoSold,YrSold,Fireplaces FROM Train').toPandas()
y = spark.sql('Select SalePrice FROM Train where SalePrice is not null').toPandas()

Error:

AttributeError                            Traceback (most recent call last)
in ()
      5 X_train,y_train,X_test,y_test = train_test_split(x,y,test_size = 0.1,random_state = 42)
      6 RM = RandomForestRegressor()
----> 7 RM.fit(X_train.reshape(1,-1),y_train)

/opt/ibm/conda/miniconda3.6/lib/python3.6/site-packages/pandas/core/generic.py in __getattr__(self, name)
   5065             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5066                 return self[name]
-> 5067             return object.__getattribute__(self, name)
   5068
   5069     def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'reshape'

As the sklearn documentation says:

"""
    X : array-like or sparse matrix, shape = [n_samples, n_features]
"""
regr = RandomForestRegressor()
regr.fit(X, y)

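So, first, you are passing a DataFrame as the X argument rather than an array. The pyspark DataFrame in your first attempt is not array-like, which is what the TypeError reports. The pandas.DataFrame you get from toPandas() can be turned into a NumPy array with its .values attribute.

Second, reshape() is not a method of DataFrame objects; it belongs to NumPy arrays.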

import numpy as np
x = np.array([[2,3,4], [5,6,7]])  # ndarray of shape (2, 3)
np.reshape(x, (3, -1))  # returns a new array of shape (3, 2)
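Putting the two points together, here is a minimal sketch of the whole flow. It makes some assumptions: the small synthetic `train` frame is a hypothetical stand-in for what `spark.sql(...).toPandas()` would return (the column names come from your query, the values are made up), and the names `feature_cols` and `model` are only for illustration. It also unpacks `train_test_split` in the order it returns its results (X_train, X_test, y_train, y_test), which differs from the order in your snippet, and it takes the features and the target from the same frame so that filtering out null SalePrice rows cannot leave x and y with different lengths.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for spark.sql('select ... FROM Train').toPandas();
# the values below are synthetic and only illustrate the shapes involved.
rng = np.random.RandomState(42)
n = 50
train = pd.DataFrame({
    "Id": np.arange(1, n + 1),
    "YearBuilt": rng.randint(1900, 2010, size=n),
    "MoSold": rng.randint(1, 13, size=n),
    "YrSold": rng.randint(2006, 2011, size=n),
    "Fireplaces": rng.randint(0, 4, size=n),
    "SalePrice": rng.randint(50000, 500000, size=n).astype(float),
})

# Drop rows with a missing target so features and target stay aligned.
train = train.dropna(subset=["SalePrice"])

feature_cols = ["Id", "YearBuilt", "MoSold", "YrSold", "Fireplaces"]
X = train[feature_cols].values   # ndarray, shape (n_samples, n_features)
y = train["SalePrice"].values    # ndarray, shape (n_samples,)

# train_test_split returns X_train, X_test, y_train, y_test, in that order.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42
)

model = RandomForestRegressor(n_estimators=10, random_state=42)
model.fit(X_train, y_train)      # X is already 2-D, so no reshape is needed
predictions = model.predict(X_test)
```

Note that reshape(1, -1) would turn the whole training set into a single row, which is not what fit() expects here; the array is already in the [n_samples, n_features] layout quoted from the documentation.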

Hope this helps.