使用spark read from hbase时序列化的一些问题

Question

我想实现一个 class 有一个通过 spark 从 hbase 读取的函数，像这样：

public abstract class QueryNode implements Serializable{
  private static final long serialVersionUID = -2961214832101500548L;
  private int id;
  private int parent;
  protected static Configuration hbaseConf;
  protected static Scan scan;
  protected static JavaSparkContext sc;
  public abstract RDDResult query();
  public int getParent() {
      return parent;
  }

  public void setParent(int parent) {
      this.parent = parent;
  }

  public int getId() {
      return id;
  }

  public void setId(int id) {
      this.id = id;
  }
  public void setScanToConf() {
     try {
          ClientProtos.Scan proto = ProtobufUtil.toScan(scan);
          String scanToString = Base64.encodeBytes(proto.toByteArray());
          hbaseConf.set(TableInputFormat.SCAN, scanToString);
      } catch (IOException e) {
        e.printStackTrace();
      }
  }}

这是父 class，我有一些子 classes 实现了从 hbase 读取的方法 query()，但是如果我设置 Configuration，Scan 和 JavaSparkContext 不是静态的，我会得到一些错误 ：这些 class 没有序列化 。

为什么这些 classes 必须是静态的？我还有其他方法可以解决这个问题吗？谢谢。

Answer 1

您可以尝试为这些字段设置transient以避免像

这样的序列化异常

Caused by: java.io.NotSerializableException: org.apache.spark.streaming.api.java.JavaStreamingContext

所以你对 java 说你只是不想序列化这些字段：

  protected transient Configuration hbaseConf;
  protected transient Scan scan;
  protected transient JavaSparkContext sc;

您是否在 main 或任何静态方法中初始化 JavaSparkContext、Configuration 和 Scan？使用静态，您的字段将在所有实例中共享。但是否应使用 static 取决于您的用例。

但是使用 transient 方式它比 static 更好，因为 JavaSparkCOntext 的序列化没有意义 因为这是在驱动程序上创建的.

--评论讨论后编辑:

java doc for newAPIHadoopRDD

public <K,V,F extends org.apache.hadoop.mapreduce.InputFormat<K,V>> JavaPairRDD<K,V> newAPIHadoopRDD(org.apache.hadoop.conf.Configuration conf,
                                                                                            Class<F> fClass,
                                                                                            Class<K> kClass,
                                                                                            Class<V> vClass)

conf - Configuration for setting up the dataset. Note: This will be put into a Broadcast. Therefore if you plan to reuse this conf to create multiple RDDs, you need to make sure you won't modify the conf. A safe approach is always creating a new conf for a new RDD.

广播：

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.

所以基本上我认为 static 是可以的（你只创建一次 hbaceConf），但如果你想避免 static，你可以按照 javadoc 中的建议始终为新的 RDD 创建新的 conf。

使用spark read from hbase时序列化的一些问题

Some problems about serialization when i use spark read from hbase

hbase

apache-spark