Diagnosing errors in a Dataproc create cluster operation (Java library)
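When I try to create a cluster with Google Dataproc, the create call initially appears to return successfully, but a subsequent "get" on the cluster shows that it went from the "CREATING" state to "ERROR" almost immediately. Unfortunately, calling the diagnose operation doesn't seem to help.
Here is what I'm doing (I've taken some liberties with the code, using hard-coded strings in place of values obtained through the API or from configuration properties):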
String projectId = "wide-isotope-147019";
// "us-central1-f" is a Compute Engine zone, so the variable is named accordingly.
String zone = "us-central1-f";

GceClusterConfig computeEngineConfig = new GceClusterConfig();
computeEngineConfig.setZoneUri(
    String.format(ZONE_URI_FORMAT, projectId, zone));

List<String> tagList = new ArrayList<>();
tagList.add("ClusterName: mrfoo");
computeEngineConfig.setTags(tagList);

String machineType = String.format(MACHINE_TYPE_URI_FORMAT,
    projectId, zone, "n1-standard-1");
InstanceGroupConfig masterConfig = new InstanceGroupConfig();
masterConfig.setMachineTypeUri(machineType)
    .setNumInstances(1);

InstanceGroupConfig workerConfig = new InstanceGroupConfig();
workerConfig.setMachineTypeUri(machineType)
    .setNumInstances(1);

ClusterConfig clusterConfig = new ClusterConfig();
// Attach the Compute Engine settings built above (presumably omitted from the original listing).
clusterConfig.setGceClusterConfig(computeEngineConfig);
clusterConfig.setMasterConfig(masterConfig);
clusterConfig.setWorkerConfig(workerConfig);

List<NodeInitializationAction> installActions = new ArrayList<>();
// no init actions yet. want to get basics working first.
clusterConfig.setInitializationActions(installActions);

Cluster cluster = new Cluster();
cluster.setProjectId(projectId);
cluster.setConfig(clusterConfig);
cluster.setClusterName("mrfoo");
Dataproc.Projects.Regions.Clusters.Create createOp = null;
Operation result = null;
try {
    // Build the create request against the "global" region.
    createOp = dataproc.projects().regions().clusters()
        .create(projectId, "global", cluster);
    createOp.setBearerToken(...);
} catch (IOException ex) {
    // handle ...
}
try {
    // Returns a long-running Operation, not the finished cluster.
    result = createOp.execute();
} catch (IOException ex) {
    // handle.
}
return result;
The code above produces a "reasonable" result with no errors. However, when I later perform a get operation:
Dataproc.Projects.Regions.Clusters.Get getOp = null;
Cluster result = null;
try {
    getOp = dataproc.projects().regions().clusters()
        .get("wide-isotope-147019", "global", "mrfoo");
    getOp.setBearerToken(...);
} catch (IOException ioEx) {
    // ...
}
try {
    result = getOp.execute();
} catch (IOException ioEx) {
    // ...
}
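This call doesn't raise an error either, but it reports the following cluster state (apologies for the long dump; see the end, where the status history shows CREATING but the current state is ERROR):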
{"clusterName":"mrfoo","clusterUuid":"<id string>","config":
{"configBucket":"dataproc-<idstring>",
"gceClusterConfig":"projectId":"wide-isotope-147019",
<lots of stuff deleted>
"status":{"state":"ERROR",
"stateStartTime":"2016-12-13T00:27:11.143Z"},
"statusHistory":[
{"state":"CREATING",
"stateStartTime":"2016-12-13T00:27:09.947Z"}]}
The general pattern for creating a Dataproc cluster is:
Operation op = createCluster(...);
// Poll until the long-running operation reports that it is done.
while (!op.getDone()) {
    sleep(10);
    op = getOperation(op.getName());
}
if (op.hasError()) {
    // Display op.getError();
}
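The pattern above can be sketched as a small, library-independent helper. This is only an illustration under stated assumptions: `Op`, `SimpleOp`, and `waitUntilDone` are hypothetical names rather than part of the Dataproc client, and the fetch function stands in for whatever call retrieves the operation by name. With the generated Java client that would most likely be `dataproc.projects().regions().operations().get(name).execute()`, with `getDone()` and `getError()` read from the returned `Operation`, but treat that mapping as an assumption to verify against your library version.

```java
import java.util.function.Function;

class OperationPoller {
    /** Hypothetical minimal view of a long-running operation. */
    interface Op {
        String name();
        boolean done();
        String error(); // null when the operation has no error
    }

    /** Simple immutable implementation, used here for illustration only. */
    static final class SimpleOp implements Op {
        private final String name;
        private final boolean done;
        private final String error;

        SimpleOp(String name, boolean done, String error) {
            this.name = name;
            this.done = done;
            this.error = error;
        }

        public String name() { return name; }
        public boolean done() { return done; }
        public String error() { return error; }
    }

    /**
     * Re-fetches the operation by name until it reports done, sleeping
     * pollMillis between attempts and giving up after maxPolls fetches.
     */
    static Op waitUntilDone(Op initial, Function<String, Op> fetcher,
                            long pollMillis, int maxPolls) {
        Op op = initial;
        for (int i = 0; !op.done() && i < maxPolls; i++) {
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while polling " + op.name(), e);
            }
            op = fetcher.apply(op.name());
        }
        if (!op.done()) {
            throw new IllegalStateException(
                "Operation " + op.name() + " not done after " + maxPolls + " polls");
        }
        return op;
    }

    public static void main(String[] args) {
        Op pending = new SimpleOp("op-1", false, null);
        // Fake fetcher: the operation finishes with an error on the first poll.
        Op finished = waitUntilDone(pending,
            name -> new SimpleOp(name, true, "invalid instance tag"), 10L, 5);
        if (finished.error() != null) {
            System.out.println("Operation failed: " + finished.error());
        }
    }
}
```

The point of the design is that the error lives on the finished operation, so the create call returning without an exception says nothing about whether the cluster was actually created.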
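From looking at the logs, I can tell that in this particular case the problem is that Compute Engine rejected the instance tags that were passed, because they don't match Compute Engine's regular expression for valid tags: '(?:[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?)'. I've filed a bug so that Dataproc validates instance tags earlier and raises an error as soon as you try to create the cluster, rather than setting the error on the operation.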
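Until that validation exists on the server side, the tags can be checked on the client before the create call. The following is a minimal sketch that uses the regular expression quoted above; the class and method names (`TagValidator`, `isValidTag`) are hypothetical, and the replacement tag `clustername-mrfoo` is just one example of a value that satisfies the pattern.

```java
import java.util.regex.Pattern;

class TagValidator {
    // Pattern quoted above for valid Compute Engine instance tags.
    private static final Pattern VALID_TAG =
        Pattern.compile("(?:[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?)");

    /** Returns true only if the whole tag matches the pattern. */
    static boolean isValidTag(String tag) {
        return tag != null && VALID_TAG.matcher(tag).matches();
    }

    public static void main(String[] args) {
        String[] tags = {"ClusterName: mrfoo", "clustername-mrfoo"};
        for (String tag : tags) {
            System.out.println(tag + " -> " + (isValidTag(tag) ? "valid" : "rejected"));
        }
        // prints:
        // ClusterName: mrfoo -> rejected
        // clustername-mrfoo -> valid
    }
}
```

The tag from the question fails on three counts: it contains uppercase letters, a colon, and a space, none of which the pattern allows.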