After joining two DataFrames, the result does not contain the correct values

Question

I have two DataFrames. The first is books1, with this schema:

root
|-- asin: string (nullable = true)
|-- helpful: array (nullable = true)
|    |-- element: long (containsNull = true)
|-- overall: double (nullable = true)
|-- reviewText: string (nullable = true)
|-- reviewTime: string (nullable = true)
|-- reviewerID: string (nullable = true)
|-- reviewerName: string (nullable = true)
|-- summary: string (nullable = true)
|-- unixReviewTime: long (nullable = true)

The other is label, with this schema:

root
 |-- value: integer (nullable = false)

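books1 and label contain:

But now when I join them with the join command,

var bookdf = books1.join(label)

the output is not correct.

The value field should contain 2, 6, 0, but it contains only 2. Why does this happen? Both DataFrames have the same number of rows.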

Answer 1

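You cannot meaningfully join two DataFrames without providing a join expression. Spark does not match rows by their position, so a join without a condition does not pair the first row of books1 with the first row of label.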
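To illustrate what a join without a condition amounts to, here is a minimal sketch that uses an explicit crossJoin on two tiny DataFrames. The object name, the helper `crossed`, and the sample data are made up for illustration; whether a bare `join` is rejected or evaluated as a Cartesian product depends on the Spark version and configuration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CrossJoinSketch {
  /** Pairs every row of `left` with every row of `right` (Cartesian product). */
  def crossed(left: DataFrame, right: DataFrame): DataFrame =
    left.crossJoin(right)

  def main(args: Array[String]): Unit = {
    // Local session for the sketch only; a real job would reuse its own session.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cross-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-ins for books1 and label.
    val books = Seq("b0", "b1", "b2").toDF("asin")
    val labels = Seq(2, 6, 0).toDF("value")

    val result = crossed(books, labels)
    // 3 * 3 = 9 rows: each book is combined with each label.
    println(result.count())

    spark.stop()
  }
}
```

Because show() prints only the first 20 rows by default, a large Cartesian product can look as if one column holds a single repeated value, which may be what happened with value in the question.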

If both DataFrames have the same number of rows, you can add a new column to each one that holds the row number and use it as the join key:

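import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val newBookDF = spark.sqlContext.createDataFrame(
  // Append each row's position as an extra value
  books1.rdd.zipWithIndex.map {
    case (row, index) => Row.fromSeq(row.toSeq :+ index)
  },
  // Extend the original schema with the index column
  StructType(books1.schema.fields :+ StructField("index", LongType, false))
)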

Do the same for the label DataFrame:

val newLabelDF = spark.sqlContext.createDataFrame(
  label.rdd.zipWithIndex.map {
    case (row, index) => Row.fromSeq(row.toSeq :+ index)
  },
  // Create schema for index column
  StructType(label.schema.fields :+ StructField("index", LongType, false))
)

Now you can join the two resulting DataFrames on the index column, for example:

newBookDF.join(newLabelDF, Seq("index")).drop("index")

This gives you the expected result.
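The steps in this answer can also be wrapped in a small helper so the indexing logic is written only once. This is a sketch under assumptions, not a definitive implementation: `withIndex`, `joinByPosition`, the object name, and the sample data are hypothetical names introduced here, and the approach only makes sense when the row order of both DataFrames is stable and meaningful.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

object PositionalJoinSketch {
  /** Returns `df` with an extra Long column holding each row's position. */
  def withIndex(df: DataFrame, name: String = "index"): DataFrame = {
    val indexed = df.rdd.zipWithIndex.map {
      case (row, i) => Row.fromSeq(row.toSeq :+ i)
    }
    val schema = StructType(
      df.schema.fields :+ StructField(name, LongType, nullable = false)
    )
    df.sparkSession.createDataFrame(indexed, schema)
  }

  /** Joins two DataFrames row by row, then drops the helper column. */
  def joinByPosition(left: DataFrame, right: DataFrame): DataFrame =
    withIndex(left).join(withIndex(right), Seq("index")).drop("index")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("positional-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-ins for books1 and label.
    val books = Seq("A1", "A2", "A3").toDF("asin")
    val labels = Seq(2, 6, 0).toDF("value")

    // Each asin is paired with the label at the same position.
    joinByPosition(books, labels).show()

    spark.stop()
  }
}
```

The joined result has one row per input row, but the order in which show() prints those rows is not guaranteed once the index column is dropped.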

apache-spark

apache-spark-sql

spark-dataframe