Null values from a csv on Scala and Apache Spark
I am using Apache Spark 2.3.0. When I load a csv file and then call df.show, the table is displayed with all null values, and I would like to know why, because everything looks fine in the csv.
val schema = StructType(Array(StructField("Rank",StringType,true),StructField("Grade", StringType, true),StructField("Channelname",StringType,true),StructField("Video Uploads",IntegerType,true), StructField("Suscribers",IntegerType,true),StructField("Videoviews",IntegerType,true)))
val df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").schema(schema).load("data.csv")
The csv file (data.csv) looks like this:
Rank,Grade,Channelname,VideoUploads,Subscribers,Videoviews
1st,A++ ,Zee TV,82757,18752951,20869786591
2nd,A++ ,T-Series,12661,61196302,47548839843
3rd,A++ ,Cocomelon - Nursery Rhymes,373,19238251,9793305082
4th,A++ ,SET India,27323,31180559,22675948293
5th,A++ ,WWE,36756,32852346,26273668433
6th,A++ ,Movieclips,30243,17149705,16618094724
7th,A++ ,netd müzik,8500,11373567,23898730764
8th,A++ ,ABS-CBN Entertainment,100147,12149206,17202609850
9th,A++ ,Ryan ToysReview,1140,16082927,24518098041
10th,A++ ,Zee Marathi,74607,2841811,2591830307
11th,A+ ,5-Minute Crafts,2085,33492951,8587520379
12th,A+ ,Canal KondZilla,822,39409726,19291034467
13th,A+ ,Like Nastya Vlog,150,7662886,2540099931
14th,A+ ,Ozuna,50,18824912,8727783225
15th,A+ ,Wave Music,16119,15899764,10989179147
16th,A+ ,Ch3Thailand,49239,11569723,9388600275
17th,A+ ,WORLDSTARHIPHOP,4778,15830098,11102158475
18th,A+ ,Vlad and Nikita,53,-- ,1428274554
So if we load the file without a schema, we see the following:
scala> val df = spark.read.format("com.databricks.spark.csv").option("header","true").load("data.csv")
df: org.apache.spark.sql.DataFrame = [Rank: string, Grade: string ... 4 more fields]
scala> df.show
+----+-----+--------------------+------------+-----------+-----------+
|Rank|Grade| Channelname|VideoUploads|Subscribers| Videoviews|
+----+-----+--------------------+------------+-----------+-----------+
| 1st| A++ | Zee TV| 82757| 18752951|20869786591|
| 2nd| A++ | T-Series| 12661| 61196302|47548839843|
| 3rd| A++ |Cocomelon - Nurse...| 373| 19238251| 9793305082|
| 4th| A++ | SET India| 27323| 31180559|22675948293|
| 5th| A++ | WWE| 36756| 32852346|26273668433|
| 6th| A++ | Movieclips| 30243| 17149705|16618094724|
| 7th| A++ | netd müzik| 8500| 11373567|23898730764|
| 8th| A++ |ABS-CBN Entertain...| 100147| 12149206|17202609850|
| 9th| A++ | Ryan ToysReview| 1140| 16082927|24518098041|
|10th| A++ | Zee Marathi| 74607| 2841811| 2591830307|
|11th| A+ | 5-Minute Crafts| 2085| 33492951| 8587520379|
|12th| A+ | Canal KondZilla| 822| 39409726|19291034467|
|13th| A+ | Like Nastya Vlog| 150| 7662886| 2540099931|
|14th| A+ | Ozuna| 50| 18824912| 8727783225|
|15th| A+ | Wave Music| 16119| 15899764|10989179147|
|16th| A+ | Ch3Thailand| 49239| 11569723| 9388600275|
|17th| A+ | WORLDSTARHIPHOP| 4778| 15830098|11102158475|
|18th| A+ | Vlad and Nikita| 53| -- | 1428274554|
+----+-----+--------------------+------------+-----------+-----------+
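As a side note: instead of writing a schema by hand, you can ask Spark to infer one by sampling the file via the csv reader's inferSchema option (a sketch, not part of the original question). With this data it should pick a long for Videoviews and a string for Subscribers, because of the "--" entry:

scala> // let Spark sample the file and infer the column types itself
scala> val dfInferred = spark.read.format("com.databricks.spark.csv").option("header","true").option("inferSchema","true").load("data.csv")
scala> dfInferred.printSchema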
If we apply your schema, we see:
scala> val schema = StructType(Array(StructField("Rank",StringType,true),StructField("Grade", StringType, true),StructField("Channelname",StringType,true),StructField("Video Uploads",IntegerType,true), StructField("Suscribers",IntegerType,true),StructField("Videoviews",IntegerType,true)))
scala> val df = spark.read.format("com.databricks.spark.csv").option("header","true").schema(schema).load("data.csv")
df: org.apache.spark.sql.DataFrame = [Rank: string, Grade: string ... 4 more fields]
scala> df.show
+----+-----+-----------+-------------+----------+----------+
|Rank|Grade|Channelname|Video Uploads|Suscribers|Videoviews|
+----+-----+-----------+-------------+----------+----------+
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
+----+-----+-----------+-------------+----------+----------+
Now, if we look at your data, we see that Subscribers contains a non-integer value ("--") and that Videoviews contains values that exceed the maximum Int value (2,147,483,647).
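A quick REPL check confirms the overflow; Int.MaxValue is the largest value an IntegerType column can hold:

scala> Int.MaxValue
res0: Int = 2147483647

scala> 20869786591L > Int.MaxValue
res1: Boolean = true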
So if we change the schema to conform to the data:
scala> val schema = StructType(Array(StructField("Rank",StringType,true),StructField("Grade", StringType, true),StructField("Channelname",StringType,true),StructField("Video Uploads",IntegerType,true), StructField("Suscribers",StringType,true),StructField("Videoviews",LongType,true)))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(Rank,StringType,true), StructField(Grade,StringType,true), StructField(Channelname,StringType,true), StructField(Video Uploads,IntegerType,true), StructField(Suscribers,StringType,true), StructField(Videoviews,LongType,true))
scala> val df = spark.read.format("com.databricks.spark.csv").option("header","true").schema(schema).load("data.csv")
df: org.apache.spark.sql.DataFrame = [Rank: string, Grade: string ... 4 more fields]
scala> df.show
+----+-----+--------------------+-------------+----------+-----------+
|Rank|Grade| Channelname|Video Uploads|Suscribers| Videoviews|
+----+-----+--------------------+-------------+----------+-----------+
| 1st| A++ | Zee TV| 82757| 18752951|20869786591|
| 2nd| A++ | T-Series| 12661| 61196302|47548839843|
| 3rd| A++ |Cocomelon - Nurse...| 373| 19238251| 9793305082|
| 4th| A++ | SET India| 27323| 31180559|22675948293|
| 5th| A++ | WWE| 36756| 32852346|26273668433|
| 6th| A++ | Movieclips| 30243| 17149705|16618094724|
| 7th| A++ | netd müzik| 8500| 11373567|23898730764|
| 8th| A++ |ABS-CBN Entertain...| 100147| 12149206|17202609850|
| 9th| A++ | Ryan ToysReview| 1140| 16082927|24518098041|
|10th| A++ | Zee Marathi| 74607| 2841811| 2591830307|
|11th| A+ | 5-Minute Crafts| 2085| 33492951| 8587520379|
|12th| A+ | Canal KondZilla| 822| 39409726|19291034467|
|13th| A+ | Like Nastya Vlog| 150| 7662886| 2540099931|
|14th| A+ | Ozuna| 50| 18824912| 8727783225|
|15th| A+ | Wave Music| 16119| 15899764|10989179147|
|16th| A+ | Ch3Thailand| 49239| 11569723| 9388600275|
|17th| A+ | WORLDSTARHIPHOP| 4778| 15830098|11102158475|
|18th| A+ | Vlad and Nikita| 53| -- | 1428274554|
+----+-----+--------------------+-------------+----------+-----------+
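Note that Suscribers is still a string here. If you need it as a number, a minimal sketch is to cast the column after loading; a value such as "--" that cannot be parsed becomes null in that single column instead of nulling the whole row:

scala> import org.apache.spark.sql.functions.col
scala> // cast after loading: unparseable strings become null per cell
scala> val dfNumeric = df.withColumn("Suscribers", col("Suscribers").cast("long"))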
The reason for the null values is that the csv API's default "mode" is PERMISSIVE:
mode (default PERMISSIVE): allows a mode for dealing with corrupt records during parsing. It supports the following case-insensitive modes.
- PERMISSIVE : sets other fields to null when it meets a corrupted record, and puts the malformed string into a field configured by columnNameOfCorruptRecord. To keep corrupt records, an user can set a string type field named columnNameOfCorruptRecord in an user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When a length of parsed CSV tokens is shorter than an expected length of a schema, it sets null for extra fields.
- DROPMALFORMED : ignores the whole corrupted records.
- FAILFAST : throws an exception when it meets corrupted records
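Finally, if you would rather be told about the mismatch than silently get null rows, a minimal sketch is to keep your original schema and switch the mode; FAILFAST throws as soon as a malformed record is hit (at action time, i.e. on show, not on load):

scala> // fail loudly on the first record that does not fit the schema
scala> val dfStrict = spark.read.format("com.databricks.spark.csv").option("header","true").option("mode","FAILFAST").schema(schema).load("data.csv")
scala> dfStrict.show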