scala/spark -- How to create a string from a specified arbitrary byte array?

Question

When I access HBase from Spark, I need to specify the correct scan range in an HBaseConfiguration, which I then use to create an RDD, like this:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, inputTable)
// Both range bounds must be passed as Strings
conf.set(TableInputFormat.SCAN_ROW_START, start_row_string)
conf.set(TableInputFormat.SCAN_ROW_STOP, end_row_string)
val hBaseRDD = sc.newAPIHadoopRDD(
    conf,
    classOf[TableInputFormat],
    classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
    classOf[org.apache.hadoop.hbase.client.Result]
)

After that I can work with hBaseRDD however I like. The problem is that start_row_string in the code above must be a String. In HBase, my row key is a byte array that starts with an Int. That is:

 val row_key = byte array of Int ++ arbitrary byte array
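
A minimal sketch of how such a row key could be assembled, using only the JDK. The object name RowKeys and the method build are hypothetical, and the 4-byte big-endian layout of the Int prefix is an assumption (it matches the sample bytes shown further down in the question):

```scala
import java.nio.ByteBuffer

object RowKeys {
  // Hypothetical helper: a 4-byte Int prefix followed by arbitrary bytes.
  // ByteBuffer writes in big-endian order by default (assumed layout).
  def build(prefix: Int, suffix: Array[Byte]): Array[Byte] =
    ByteBuffer.allocate(4).putInt(prefix).array() ++ suffix
}
```

For example, RowKeys.build(4019, Array[Byte](1, 2)) starts with the bytes 0, 0, 15, -77, followed by 1, 2.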

After building the row_key byte array, I converted it to a String and passed it to the HBaseConfiguration shown above. That turned out to be wrong:

val row_key_string = new String(row_key)

This is because row_key_string.getBytes is not equal to row_key, so HBase does not get the correct start row key and returns the wrong data. For example:

val arr = Array[Byte](0, 0, 15, -77) // big-endian bytes of the Int 4019
val str = new String(arr)            // decoded with the default charset (UTF-8 here)
str.getBytes                         // returns Array(0, 0, 15, -17, -65, -67)
new String(arr, "UTF-16BE").getBytes("UTF-16BE") // returns Array(0, 0, 15, -77)

Calling getBytes("UTF-16BE") returns the correct answer, but since getBytes is called by Spark, I cannot specify the charset it uses.
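
The mismatch can be reproduced without Spark or HBase. The sketch below (the names CharsetRoundTrip and roundTrip are hypothetical) decodes and re-encodes the same bytes with one charset. UTF-8 replaces the invalid byte with U+FFFD, while a single-byte charset such as ISO-8859-1 maps every byte value to a character and back. That only helps when both the decoding and the encoding side use the same charset, which is not under the caller's control here:

```scala
import java.nio.charset.{Charset, StandardCharsets}

object CharsetRoundTrip {
  val raw: Array[Byte] = Array[Byte](0, 0, 15, -77)

  // Decode the bytes into a String and encode them back with the same charset.
  def roundTrip(bytes: Array[Byte], cs: Charset): Array[Byte] =
    new String(bytes, cs).getBytes(cs)

  // Lossy: the lone byte 0xB3 is not valid UTF-8 and becomes U+FFFD (EF BF BD).
  val viaUtf8: Array[Byte] = roundTrip(raw, StandardCharsets.UTF_8)

  // Lossless: ISO-8859-1 maps each of the 256 byte values to one character.
  val viaLatin1: Array[Byte] = roundTrip(raw, StandardCharsets.ISO_8859_1)
}
```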

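If I cannot solve this problem, I will have to give up newAPIHadoopRDD. I could create a connection in each executor and use the Scan class provided by the HBase client, which accepts a byte array for the start row key. But that is ugly.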

Answer 1

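I solved my problem by using TableInputFormat.SCAN, which takes a Base64 string. Converting an arbitrary byte array to a String is wrong, because the result of the conversion is not under your control.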
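
One way this could look, as a sketch rather than the exact code used: build a Scan with byte-array bounds and serialize it into the configuration. The object name ScanConfig and the method withByteRange are hypothetical. The sketch assumes an HBase version in which TableMapReduceUtil.convertScanToString is public and Scan offers withStartRow/withStopRow (older releases use setStartRow/setStopRow instead):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil}

object ScanConfig {
  // Hypothetical helper: configures a scan over [startRow, stopRow) from raw bytes.
  def withByteRange(table: String,
                    startRow: Array[Byte],
                    stopRow: Array[Byte]): Configuration = {
    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, table)
    // The row keys stay byte arrays; only the serialized Scan becomes a String.
    val scan = new Scan().withStartRow(startRow).withStopRow(stopRow)
    conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))
    conf
  }
}
```

The returned configuration can then be passed to sc.newAPIHadoopRDD in the same way as in the question, without setting SCAN_ROW_START or SCAN_ROW_STOP.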


java

hadoop

hbase

scala

apache-spark