How to load a CSV file with records spanning multiple lines?
I am using Spark 2.3.0.
For an Apache Spark project, I am working with this data set. When I try to read the CSV with Spark, the rows in the Spark DataFrame do not correspond to the correct rows in the CSV file (see the sample CSV below). My code looks like this:
answer_df = sparkSession.read.csv('./stacksample/Answers_sample.csv', header=True, inferSchema=True, multiLine=True);
answer_df.show(2)
Output
+--------------------+-------------+--------------------+--------+-----+--------------------+
| Id| OwnerUserId| CreationDate|ParentId|Score| Body|
+--------------------+-------------+--------------------+--------+-----+--------------------+
| 92| 61|2008-08-01T14:45:37Z| 90| 13|"<p><a href=""htt...|
|<p>A very good re...| though.</p>"| null| null| null| null|
+--------------------+-------------+--------------------+--------+-----+--------------------+
only showing top 2 rows
However, when I use pandas, it works like a charm.
import pandas as pd

df = pd.read_csv('./stacksample/Answers_sample.csv')
df.head(3)
Output
Index Id OwnerUserId CreationDate ParentId Score Body
0 92 61 2008-08-01T14:45:37Z 90 13 <p><a href="http://svnbook.red-bean.com/">Vers...
1 124 26 2008-08-01T16:09:47Z 80 12 <p>I wound up using this. It is a kind of a ha...
My observation:
Apache Spark treats every physical line in the CSV file as a record of the DataFrame (which is understandable), whereas pandas somehow (I am not sure based on which parameters) figures out where a record actually ends.
Question
I would like to know how to instruct Spark to load the DataFrame correctly.
The data to be loaded is shown below; the lines starting with 92 and 124 should form two records.
Id,OwnerUserId,CreationDate,ParentId,Score,Body
92,61,2008-08-01T14:45:37Z,90,13,"<p><a href=""http://svnbook.red-bean.com/"">Version Control with Subversion</a></p>
<p>A very good resource for source control in general. Not really TortoiseSVN specific, though.</p>"
124,26,2008-08-01T16:09:47Z,80,12,"<p>I wound up using this. It is a kind of a hack, but it actually works pretty well. The only thing is you have to be very careful with your semicolons. : D</p>
<pre><code>var strSql:String = stream.readUTFBytes(stream.bytesAvailable);
var i:Number = 0;
var strSqlSplit:Array = strSql.split("";"");
for (i = 0; i < strSqlSplit.length; i++){
NonQuery(strSqlSplit[i].toString());
}
</code></pre>
"
After struggling for a few hours, I figured out the solution.
Analysis:
In the data dump provided by Stack Overflow, a quote (") is escaped by another quote ("). Since Spark uses backslash (\) as the default escape character and I did not pass a different one, it ended up producing meaningless output.
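Escaping a quote by doubling it is in fact the standard CSV convention (RFC 4180), and a parser that follows it also keeps newlines that appear inside a quoted field. A minimal sketch with Python's built-in csv module (assuming the sample shown in the question is saved as Answers_sample.csv):

import csv

with open('./stacksample/Answers_sample.csv', newline='') as f:
    # The default dialect treats "" inside a quoted field as a literal "
    # and keeps embedded newlines, so the file parses into a header row
    # plus the two records starting with 92 and 124.
    rows = list(csv.reader(f))

print(len(rows))               # 3 (header + 2 records)
print(rows[1][0], rows[2][0])  # 92 124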
Updated code
answer_df = sparkSession.read.\
csv('./stacksample/Answers_sample.csv',
inferSchema=True, header=True, multiLine=True, escape='"');
answer_df.show(2)
Note the use of the escape parameter in csv().
Output
+---+-----------+-------------------+--------+-----+--------------------+
| Id|OwnerUserId| CreationDate|ParentId|Score| Body|
+---+-----------+-------------------+--------+-----+--------------------+
| 92| 61|2008-08-01 20:15:37| 90| 13|<p><a href="http:...|
|124| 26|2008-08-01 21:39:47| 80| 12|<p>I wound up usi...|
+---+-----------+-------------------+--------+-----+--------------------+
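As a quick sanity check (just a sketch against the two-record sample above), the row count and the embedded newlines in the Body column confirm that the multi-line records were stitched back together:

assert answer_df.count() == 2

# The Body of the record with Id 124 should still contain the newlines
# from the <pre><code> block, i.e. one logical record spans several physical lines.
body = answer_df.filter(answer_df.Id == 124).first()['Body']
print(body.count('\n'))   # greater than 0 for a correctly parsed multi-line field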
Hope this helps others and saves them some time.
I think you should use option("escape", "\"") since it looks like " is used as the so-called quote escape character.
val q = spark.read
.option("multiLine", true)
.option("header", true)
.option("escape", "\"")
.csv("input.csv")
scala> q.show
+---+-----------+--------------------+--------+-----+--------------------+
| Id|OwnerUserId| CreationDate|ParentId|Score| Body|
+---+-----------+--------------------+--------+-----+--------------------+
| 92| 61|2008-08-01T14:45:37Z| 90| 13|<p><a href="http:...|
|124| 26|2008-08-01T16:09:47Z| 80| 12|<p>I wound up usi...|
+---+-----------+--------------------+--------+-----+--------------------+
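For completeness, the equivalent options in PySpark (a sketch, assuming a SparkSession named spark and the same input.csv as above):

q = spark.read \
    .option("multiLine", True) \
    .option("header", True) \
    .option("escape", '"') \
    .csv("input.csv")

q.show(2)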