Exceptions when reading tutorial CSV file in the Cloudera VM

Question

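I am trying to work through a Spark tutorial that comes with the Cloudera VM. Even though I am using the correct line-ending encoding, I cannot execute the scripts because I get a lot of errors. The tutorial is part of the Coursera Introduction to Big Data Analytics course. The assignment can be found here.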

This is what I did. Install the IPython shell (if not yet done):

sudo easy_install ipython==1.2.1

Open/start the shell (with either version 1.2.0 or 1.4.0 of the spark-csv package):

PYSPARK_DRIVER_PYTHON=ipython pyspark --packages com.databricks:spark-csv_2.10:1.2.0

Set the line endings to Windows style. This is because the file uses Windows line endings and the course says to do so. If you don't do this, you get other errors.

sc._jsc.hadoopConfiguration().set('textinputformat.record.delimiter','\r\n')
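Before changing the record delimiter, it may be worth checking which line endings the file actually uses. Below is a minimal plain-Python sketch for that; the function names `detect_line_ending` and `detect_file_line_ending` are made up for illustration and are not part of Spark or the course material. It only takes a majority vote over a sample of the file, so treat the result as a hint rather than a guarantee.

```python
def detect_line_ending(sample):
    """Guess the dominant line ending in a bytes sample.

    Returns '\r\n' if CRLF breaks outnumber bare LF breaks, else '\n'.
    """
    crlf = sample.count(b"\r\n")
    # Every CRLF also contains an LF, so subtract to get the bare LF count.
    lf_only = sample.count(b"\n") - crlf
    return "\r\n" if crlf > lf_only else "\n"


def detect_file_line_ending(path, sample_size=65536):
    """Read the first sample_size bytes of path and guess its line ending."""
    with open(path, "rb") as fh:
        return detect_line_ending(fh.read(sample_size))
```

In the shell this could be called as `detect_file_line_ending('/usr/lib/hue/apps/search/examples/collections/solr_configs_yelp_demo/index_data.csv')`, and the returned string passed as the delimiter value.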

Trying to load the CSV file:

yelp_df = sqlCtx.load(source='com.databricks.spark.csv',header = 'true',inferSchema = 'true',path = 'file:///usr/lib/hue/apps/search/examples/collections/solr_configs_yelp_demo/index_data.csv')

But I get a very long list of errors, which starts like this:

Py4JJavaError: An error occurred while calling o23.load.: java.lang.RuntimeException: 
Unable to instantiate 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient at 
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:472)

The full error message can be seen here. And this is /etc/hive/conf/hive-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <!-- Hive Configuration can either be stored in this file or in the hadoop configuration files  -->
  <!-- that are implied by Hadoop setup variables.                                                -->
  <!-- Aside from Hadoop setup variables - this file is provided as a convenience so that Hive    -->
  <!-- users do not have to edit hadoop configuration files (that may be managed as a centralized -->
  <!-- resource).                                                                                 -->

  <!-- Hive Execution Parameters -->

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://127.0.0.1/metastore?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>cloudera</value>
  </property>

  <property>
    <name>hive.hwi.war.file</name>
    <value>/usr/lib/hive/lib/hive-hwi-0.8.1-cdh4.0.0.jar</value>
    <description>This is the WAR file with the jsp content for Hive Web Interface</description>
  </property>

  <property>
    <name>datanucleus.fixedDatastore</name>
    <value>true</value>
  </property>

  <property>
    <name>datanucleus.autoCreateSchema</name>
    <value>false</value>
  </property>

  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://127.0.0.1:9083</value>
    <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
  </property>
</configuration>

Any help or ideas on how to solve this? I guess it is a pretty common error, but I have not found a solution yet.

One more thing: is there a way to dump such long error messages into a separate log file?

Answer 1

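There seem to be two problems. First, the hive-metastore was offline in some instances. Second, the schema could not be inferred. Therefore I created a schema manually and passed it as an argument when loading the CSV file. In any case, I would love to know whether this can also work with inferSchema = 'true'.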

Here is my version with the manually defined schema. First, make sure the Hive metastore is running:

sudo service hive-metastore restart
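To confirm that something is listening on the metastore port after the restart, a quick TCP check can help. The sketch below is plain Python; the helper name `metastore_reachable` is hypothetical, and the default host and port are taken from the `hive.metastore.uris` value in the hive-site.xml shown in the question. A successful connection only shows that the port is open, not that the metastore itself is healthy.

```python
import socket


def metastore_reachable(host="127.0.0.1", port=9083, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except (socket.error, socket.timeout):
        # Connection refused, unreachable host, or timeout.
        return False
    conn.close()
    return True
```

If `metastore_reachable()` returns False right after the restart, the service may still be starting or may have failed to start.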

Then, have a look at the first part of the CSV file to understand its structure. I used this command line:

head /usr/lib/hue/apps/search/examples/collections/solr_configs_yelp_demo/index_data.csv
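The same inspection can be done from Python, which makes it easier to see the column names as a list. This is only a sketch: `peek_csv` is a made-up helper name, it assumes Python 3 (on the Python 2 that ships with the VM the `open` call would need adjusting), and it assumes the file can be decoded as UTF-8, replacing any bytes that cannot.

```python
import csv
import itertools


def peek_csv(path, n_rows=3):
    """Return (header, rows): the column names and the first n_rows data rows."""
    # newline="" lets the csv module handle \r\n and \n line endings itself.
    with open(path, "r", newline="", encoding="utf-8", errors="replace") as fh:
        reader = csv.reader(fh)
        header = next(reader)
        rows = list(itertools.islice(reader, n_rows))
    return header, rows
```

The returned header can then be compared against the field names used in the schema definition.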

Now, open the Python shell. See the original post for how to do that. Then define the schema:

from pyspark.sql.types import *
schema = StructType([
    StructField("business_id", StringType(), True),
    StructField("cool", IntegerType(), True),
    StructField("date", StringType(), True),
    StructField("funny", IntegerType(), True),
    StructField("id", StringType(), True),
    StructField("stars", IntegerType(), True),
    StructField("text", StringType(), True),
    StructField("type", StringType(), True),
    StructField("useful", IntegerType(), True),
    StructField("user_id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("full_address", StringType(), True),
    StructField("latitude", DoubleType(), True),
    StructField("longitude", DoubleType(), True),
    StructField("neighborhood", StringType(), True),
    StructField("open", StringType(), True),
    StructField("review_count", IntegerType(), True),
    StructField("state", StringType(), True)])

Then load the CSV file, specifying the schema. Note that there is no need to set Windows line endings:

yelp_df = sqlCtx.load(source='com.databricks.spark.csv',
                      header='true',
                      schema=schema,
                      path='file:///usr/lib/hue/apps/search/examples/collections/solr_configs_yelp_demo/index_data.csv')

Then check the result by executing any method on the dataset. I tried getting the count, which worked perfectly:

yelp_df.count()

Thanks to the help of @yaron, we could figure out how to load the CSV with inferSchema. First, you must set up the hive-metastore correctly:

sudo cp /etc/hive/conf.dist/hive-site.xml /usr/lib/spark/conf/

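Then, start the Python shell and do NOT change the line endings to Windows encoding. Keep in mind that the change is persistent (session invariant). So, if you changed it to Windows style before, you need to reset it to '\n'. Then load the CSV file with inferSchema set to true: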

yelp_df = sqlCtx.load(source='com.databricks.spark.csv',
                      header='true',
                      inferSchema='true',
                      path='file:///usr/lib/hue/apps/search/examples/collections/solr_configs_yelp_demo/index_data.csv')
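On the side question from the original post about dumping long error messages into a separate file: one possible approach is to wrap the failing call and write the traceback to a log file. The sketch below is plain Python with a made-up helper name, `run_and_log`. It assumes the string form of the exception carries the Java stack trace, as the error shown in the question suggests; I have not verified this for every PySpark version.

```python
import traceback


def run_and_log(action, log_path):
    """Call action(); on failure append the full traceback to log_path.

    Returns the result of action(), or None if it raised an exception.
    """
    try:
        return action()
    except Exception:
        with open(log_path, "a") as fh:
            fh.write(traceback.format_exc())
            fh.write("\n")
        # Keep the console output short; the details are in the log file.
        print("Call failed, details written to %s" % log_path)
        return None
```

In the shell this could be used as `yelp_df = run_and_log(lambda: sqlCtx.load(...), 'load_error.log')`, with the load arguments filled in as above.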

Answer 2

Summary of the discussion: executing the following command solved the problem:

sudo cp /etc/hive/conf.dist/hive-site.xml /usr/lib/spark/conf/

For more information, see https://www.coursera.org/learn/bigdata-analytics/supplement/tyH3p/setup-pyspark-for-dataframes.
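To double-check that the copied file contains the expected metastore settings, the properties can be read with the standard library. This is a rough sketch; `read_hadoop_properties` is a hypothetical helper name, and it assumes the usual Hadoop layout of `<property>` elements with `<name>` and `<value>` children, as in the hive-site.xml shown in the question.

```python
import xml.etree.ElementTree as ET


def read_hadoop_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style config."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            # A missing <value> element is treated as an empty string.
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props
```

For example, reading /usr/lib/spark/conf/hive-site.xml into a string and looking up `hive.metastore.uris` in the result should give the thrift address Spark will try to reach.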
