Convert Python to Scala

I am new to Scala; until now I have been using Python.

I want to convert a program from Python to Scala, but I am having trouble with the following two lines (which create a SQL DataFrame).

Python code:

fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)

data = dataset.map(lambda (filepath, text): (filepath.split("/")[-1],text, filepath.split("/")[-2]))
df = sqlContext.createDataFrame(data, schema)

Here is what I did:

Scala code:

val category = dataset.map { case (filepath, text) => filepath.split("/")(6) }

val id = dataset.map { case (filepath, text) => filepath.split("/")(7) }

val text = dataset.map { case (filepath, text) => text }

val schema = StructType(Seq(
  StructField(id.toString(), StringType, true), 
  StructField(category.toString(), StringType, true), 
  StructField(text.toString(), StringType, true)
))

And now I'm stuck!

For what it's worth, I converted your code literally, and the following compiles on my machine with Spark 2.3.2:

import org.apache.spark.sql._
import org.apache.spark.sql.types._
import spark.implicits._

// Introduced to make code clearer
case class FileRecord(name: String, text: String)

// Whatever dataset you have (a single-record dataset is hard-coded here; replace it with your data)
val dataSet = Seq(FileRecord("/a/b/c/d/e/f/g/h/i", "example contents")).toDS()

// Whatever you need, with path segment indices 6 and 7 hard-coded (you might want to change this).
// You may be able to do the following three map operations more efficiently;
// see the single-pass sketch after them.
val category = dataSet.map { case FileRecord(filepath, text) => filepath.split("/")(6) }

val id = dataSet.map { case FileRecord(filepath, text) => filepath.split("/")(7) }

val text = dataSet.map { case FileRecord(filepath, text) => text }
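
// As hinted above, the three maps each make a separate pass over dataSet.
// A single-pass sketch instead (the tuple encoder comes from spark.implicits._;
// the (id, category, text) ordering is my assumption):
val parts = dataSet.map { case FileRecord(filepath, text) =>
  val segments = filepath.split("/")
  (segments(7), segments(6), text)
}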

val schema = StructType(Seq(
  StructField(id.toString(), StringType, true),
  StructField(category.toString(), StringType, true),
  StructField(text.toString(), StringType, true)
))
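
Note, though, that the literal conversion carries over a subtle issue: calling toString() on a Dataset (as in id.toString()) yields a description of the Dataset, something like [value: string], rather than a column name. If the intent matches the original Python, i.e. named string columns passed to createDataFrame, a sketch closer to that would be the following (the column names id, category and text are my assumption, as is going through the underlying RDD to build Rows):

// Schema with explicit column names, mirroring StructType(fields) in the Python version
val namedSchema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("category", StringType, nullable = true),
  StructField("text", StringType, nullable = true)
))

// Mirror of the Python sqlContext.createDataFrame(data, schema) call:
// build Rows via the underlying RDD, then apply the schema.
val rowRdd = dataSet.rdd.map { case FileRecord(filepath, text) =>
  val segments = filepath.split("/")
  Row(segments(7), segments(6), text)
}

val df = spark.createDataFrame(rowRdd, namedSchema)
df.printSchema()

With the hard-coded sample record above, this produces one row with id = "g", category = "f" and text = "example contents".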