Extract data which is a substring in a file using Apache Spark
I am working on a Spark case study. I have CSV files in HDFS and I am processing the data with Spark. The data in one of the columns is merged.
For example, the title column holds values like:
"EMS: BACK PAINS/INJURY", where EMS identifies the emergency service and the part after the colon is the type of emergency. When loading the CSV into a DataFrame, I only want the data before the colon (EMS in this example). Here is my code snippet, but it loads the full title column. Can you help me with how to substring it?
Code:
import org.apache.spark.sql.types._

val schema = StructType(Array(
  StructField("latitude", DoubleType, true),
  StructField("longitude", DoubleType, true),
  StructField("desc", StringType, true),
  StructField("zip", StringType, true),
  StructField("title", StringType, true),
  StructField("timeStamp", StringType, true),
  StructField("twp", StringType, true),
  StructField("addr", StringType, true),
  StructField("e", IntegerType, true)))
val df = spark.read.option("header","true").schema(schema).csv("hdfs://filepath/filename.csv")
Sample data:
lat|lng|desc|zip|title|timeStamp|twp|addr|e
40.2978759|-75.5812935|REINDEER CT & DEAD END; NEW HANOVER; Station 332; 2015-12-10 @ 17:10:52;|19525|EMS: BACK PAINS/INJURY|12/10/2015 17:40|NEW HANOVER|REINDEER CT & DEAD END|1
40.2580614|-75.2646799|BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP; Station 345; 2015-12-10 @ 17:29:21;|19446|EMS: DIABETIC EMERGENCY|12/10/2015 17:40|HATFIELD TOWNSHIP|BRIAR PATH & WHITEMARSH LN|1
40.1211818|-75.3519752|HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-Station:STA27;|19401|Fire: GAS-ODOR/LEAK|12/10/2015 17:40|NORRISTOWN|HAWS AVE|1
Load your CSV with the delimiter option set to |:
import org.apache.spark.sql.functions.split
import spark.implicits._

val data = spark.read
  .option("delimiter", "|")
  .option("header", true)
  .schema(schema)
  .csv(path)
  // split the title column on ":" and keep only the part before it
  .withColumn("title", split($"title", ":")(0))

data.show(false)
Output:
+----------+-----------+-----------------------------------------------------------------------------------+-----+-----+----------------+-----------------+--------------------------+---+
|latitude |longitude |desc |zip |title|timeStamp |twp |addr |e |
+----------+-----------+-----------------------------------------------------------------------------------+-----+-----+----------------+-----------------+--------------------------+---+
|40.2978759|-75.5812935|REINDEER CT & DEAD END; NEW HANOVER; Station 332; 2015-12-10 @ 17:10:52; |19525|EMS |12/10/2015 17:40|NEW HANOVER |REINDEER CT & DEAD END |1 |
|40.2580614|-75.2646799|BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP; Station 345; 2015-12-10 @ 17:29:21;|19446|EMS |12/10/2015 17:40|HATFIELD TOWNSHIP|BRIAR PATH & WHITEMARSH LN|1 |
|40.1211818|-75.3519752|HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-Station:STA27; |19401|Fire |12/10/2015 17:40|NORRISTOWN |HAWS AVE |1 |
+----------+-----------+-----------------------------------------------------------------------------------+-----+-----+----------------+-----------------+--------------------------+---+
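Note that split returns an array column and (0) picks its first element, and that the second argument to split is a regular expression (a plain colon is safe here, but a delimiter such as | or . would need escaping). If you prefer to avoid the intermediate array, here is a minimal alternative sketch using substring_index, which keeps everything before the first colon; the trim call and the data2 name are my own additions, in case the values ever carry stray whitespace:

import org.apache.spark.sql.functions.{substring_index, trim}

val data2 = spark.read
  .option("delimiter", "|")
  .option("header", true)
  .schema(schema)
  .csv(path)
  // substring_index(col, ":", 1) returns everything before the first ":"
  .withColumn("title", trim(substring_index($"title", ":", 1)))

data2.show(false)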
Hope this helps!