How to import csv files with massive column count into Apache Spark 2.0

Question

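I'm running into a problem importing multiple small csv files, each with over 250000 columns of float64, into Apache Spark 2.0 running as a Google Dataproc cluster. There are a handful of string columns, but I'm only really interested in 1 of them, as the class label.

When I run the following in pyspark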

csvdata = spark.read.csv("gs://[bucket]/csv/*.csv", header=True,mode="DROPMALFORMED")

I get the following error:

File "/usr/lib/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o53.csv.
: com.univocity.parsers.common.TextParsingException: java.lang.ArrayIndexOutOfBoundsException - 20480
Hint: Number of columns processed may have exceeded limit of 20480 columns. Use settings.setMaxColumns(int) to define the maximum number of columns your input can have
Ensure your configuration is correct, with delimiters, quotes and escape sequences that match the input format you are trying to parse
Parser Configuration: CsvParserSettings:

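Where/how do I set the maximum number of columns for the parser so that I can use the data for machine learning?
Is there a better way to ingest the data for use with Apache mllib?

One approach is to define a class for the DataFrame to use, but is it possible to define such a large class without having to create 210,000 entries?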

Answer 1

Use the maxColumns option:

spark.read.option("maxColumns", n).csv(...)

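where n is the number of columns in your files. It has to be at least as large as the widest file, since the default limit is 20480, as the error message in the question shows.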
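Putting this together with the read call from the question, a minimal sketch might look like the following. The helper names `count_columns` and `read_wide_csv` are made up for illustration, and `count_columns` assumes you have a local copy of one of the files to inspect (it does not read from `gs://` paths). The `header` and `mode` options are carried over from the question.

```python
import csv


def count_columns(path, delimiter=","):
    """Return the number of fields in the first (header) row of a local CSV file.

    Hypothetical helper: only useful for finding a suitable value of n.
    """
    with open(path, newline="") as f:
        return len(next(csv.reader(f, delimiter=delimiter)))


def read_wide_csv(spark, path, n_columns):
    """Read CSV files whose column count exceeds the parser's default limit.

    `spark` is an existing SparkSession; `n_columns` must be at least the
    number of columns in the widest input file.
    """
    return (
        spark.read
        .option("header", True)
        .option("mode", "DROPMALFORMED")
        .option("maxColumns", n_columns)  # raises the 20480-column default
        .csv(path)
    )
```

In the pyspark shell, usage would be along the lines of `csvdata = read_wide_csv(spark, "gs://[bucket]/csv/*.csv", n)`, with n taken from `count_columns` on a sample file or from what you already know about the data.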

csv

apache-spark

pyspark

apache-spark-mllib

google-cloud-dataproc