Spark SQL - Select yields AttributeError: 'module' object has no attribute 'api'

Question

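Good day,

I'm using a basic installation of an Azure HDInsight cluster configured for Spark. I'm working in a Jupyter Notebook with PySpark.

While working through the provided 00 - [READ ME FIRST] PySpark Kernel Features.ipynb file, I ran into the following error/bug when executing a Spark SQL 'SELECT':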

AttributeError: 'module' object has no attribute 'api'

Code executed:

%%sql -o query1
SELECT clientid, querytime, deviceplatform, querydwelltime
FROM hivesampletable
WHERE state = 'Washington' AND devicemake = 'Microsoft'

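I get the same error when I use SELECT in other code. Since the code shown here comes from the provided baseline 'tutorial', I would expect it not to be a coding error. I hit the same error in both the PySpark and PySpark3 kernels.

Does anyone have experience/advice/suggestions to share?

Traceback: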

AttributeError                            Traceback (most recent call last)
/usr/bin/anaconda/lib/python2.7/site-packages/IPython/core/formatters.pyc in __call__(self, obj)
    902                 pass
    903             else:
--> 904                 printer(obj)
    905                 return True
    906             # Finally look for special method names

/usr/bin/anaconda/lib/python2.7/site-packages/autovizwidget/widget/utils.pyc in display_dataframe(df)
    114
    115 def display_dataframe(df):
--> 116     selected_x = select_x(df)
    117     selected_y = select_y(df, selected_x)
    118     encoding = Encoding(chart_type=Encoding.chart_type_table, x=selected_x, y=selected_y,

/usr/bin/anaconda/lib/python2.7/site-packages/autovizwidget/widget/utils.pyc in select_x(data, order)
     70         _validate_custom_order(order)
     71
---> 72     d = _classify_data_by_type(data, order)
     73
     74     chosen_x = None

/usr/bin/anaconda/lib/python2.7/site-packages/autovizwidget/widget/utils.pyc in _classify_data_by_type(data, order, skip)
     48     for column_name in data:
     49         if column_name not in skip:
---> 50             typ = infer_vegalite_type(data[column_name])
     51             d[typ].append(column_name)
     52

/usr/bin/anaconda/lib/python2.7/site-packages/autovizwidget/widget/utils.pyc in infer_vegalite_type(data)
     14     """
     15
---> 16     typ = pd.api.types.infer_dtype(data)
     17
     18     if typ in ['floating', 'mixed-integer-float', 'integer',

AttributeError: 'module' object has no attribute 'api'

Answer 1

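The notebook is using pandas version 0.17.1, but autovizwidget depends on a later version of pandas that has the 'api' module. I have been told this will be resolved in a later release of the HDInsight configuration.

SSH into the cluster and run the following: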

sudo -HE /usr/bin/anaconda/bin/conda install pandas
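To confirm whether the upgrade took effect, a small check along these lines may help. It is only a sketch: the helper names has_infer_dtype and check_pandas are made up for illustration, and it tests for the exact attribute chain that fails in the traceback (pd.api.types.infer_dtype) rather than comparing version numbers. In a sparkmagic notebook it would probably need to run in a %%local cell, since autovizwidget runs in the Jupyter-side Python rather than on the Spark executors.

```python
import importlib


def has_infer_dtype(module):
    """Return True if `module` exposes api.types.infer_dtype as a callable."""
    api = getattr(module, "api", None)
    api_types = getattr(api, "types", None)
    return callable(getattr(api_types, "infer_dtype", None))


def check_pandas():
    """Return (version, location, ok) for the pandas this interpreter imports."""
    try:
        pd = importlib.import_module("pandas")
    except ImportError:
        return None, None, False
    return pd.__version__, pd.__file__, has_infer_dtype(pd)


version, location, ok = check_pandas()
if version is None:
    print("pandas is not importable from this interpreter")
else:
    # The location shows which site-packages directory the import came from.
    print("pandas {} loaded from {}".format(version, location))
    print("pd.api.types.infer_dtype available: {}".format(ok))
```

If the last line still reports False after upgrading, the printed location should show which pandas installation the kernel is actually picking up.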

Answer 2

I ran into the same problem while following these instructions: https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-load-data-run-query. I just switched to the PySpark3 kernel and everything worked fine.

Answer 3

Had the same problem. I used:

pip install pandas --upgrade --user

from the terminal available in the Jupyter notebook.

azure

pyspark-sql

azure-hdinsight