YARN automatically kills all jobs after exactly 1 hour, with no error

Question

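Our YARN cluster kills every running job after exactly 1 hour, whether it is a Spark job or a Sqoop (MapReduce) job.

I am looking for suggestions about potential causes.

We are using the HDP 2.5.x Hadoop distribution on a 4-node cluster.

This is how I run the Sqoop job: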

nohup sqoop-import -D mapred.task.timeout=0 --direct --connect jdbc:oracle:thin:@HOST:Port:DB --username USERNAME --password PASS --target-dir /prod/directory  --table TABLE_NAME --verbose -m 25 --split-by TABLE_NAME.COLUMN --as-parquetfile --fields-terminated-by "\t" > temp.log 2>&1 &

The full log follows:

16/11/26 01:40:49 INFO mapreduce.Job:  map 42% reduce 0%
16/11/26 01:41:44 INFO mapreduce.Job:  map 0% reduce 0%
16/11/26 01:41:44 INFO mapreduce.Job: Job job_1480141487938_0001 failed with state KILLED due to: Application killed by user.
16/11/26 01:41:44 INFO mapreduce.Job: Counters: 0
16/11/26 01:41:44 WARN mapreduce.Counters: Group FileSystemCounters is deprecated. Use org.apache.hadoop.mapreduce.FileSystemCounter instead
16/11/26 01:41:44 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 3,628.6498 seconds (0 bytes/sec)
16/11/26 01:41:44 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
16/11/26 01:41:44 INFO mapreduce.ImportJobBase: Retrieved 0 records.
16/11/26 01:41:44 DEBUG util.ClassLoaderStack: Restoring classloader: sun.misc.Launcher$AppClassLoader@131276c2
16/11/26 01:41:44 ERROR tool.ImportTool: Error during import: Import job failed!

YARN application log:

yarn logs -applicationId application_1480141487938_0001|grep -B2 -A10 "ERROR "
16/11/26 03:05:39 INFO impl.TimelineClientImpl: Timeline service address: http://HostName:8188/ws/v1/timeline/
16/11/26 03:05:39 INFO client.RMProxy: Connecting to ResourceManager at HostName/HostIp:8050
16/11/26 03:05:39 INFO client.AHSProxy: Connecting to Application History server at HostName/HostIp:10200
16/11/26 03:05:40 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
16/11/26 03:05:40 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
2016-11-26 00:41:33,284 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: getResources() for application_1480141487938_0001: ask=1 release= 2 newContainers=0 finishedContainers=2 resourcelimit=<memory:20480, vCores:1> knownNMs=4
2016-11-26 00:41:33,285 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_e09_1480141487938_0001_01_000028
2016-11-26 00:41:33,285 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Container complete event for unknown container id container_e09_1480141487938_0001_01_000028
2016-11-26 00:41:33,285 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_e09_1480141487938_0001_01_000029
2016-11-26 00:41:33,285 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Container complete event for unknown container id container_e09_1480141487938_0001_01_000029
2016-11-26 00:41:33,686 INFO [Socket Reader #1 for port 41553] SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for job_1480141487938_0001 (auth:SIMPLE)
2016-11-26 00:41:33,697 INFO [IPC Server handler 6 on 41553] org.apache.hadoop.mapred.TaskAttemptListenerImpl: JVM with ID : jvm_1480141487938_0001_m_9895604650011 asked for a task
2016-11-26 00:41:33,698 INFO [IPC Server handler 6 on 41553] org.apache.hadoop.mapred.TaskAttemptListenerImpl: JVM with ID: jvm_1480141487938_0001_m_9895604650011 given task: attempt_1480141487938_0001_m_000024_0
2016-11-26 00:41:37,542 INFO [IPC Server handler 19 on 41553] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1480141487938_0001_m_000000_0 is : 0.0
2016-11-26 00:41:38,793 INFO [IPC Server handler 22 on 41553] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1480141487938_0001_m_000001_0 is : 0.0
2016-11-26 00:41:38,811 INFO [IPC Server handler 23 on 41553] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1480141487938_0001_m_000006_0 is : 0.0
2016-11-26 00:41:38,939 INFO [IPC Server handler 28 on 41553] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1480141487938_0001_m_000007_0 is : 0.0
2016-11-26 00:41:40,568 INFO [IPC Server handler 22 on 41553] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1480141487938_0001_m_000000_0 is : 0.0
2016-11-26 00:41:41,812 INFO [IPC Server handler 24 on 41553] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1480141487938_0001_m_000001_0 is : 0.0
2016-11-26 00:41:41,832 INFO [IPC Server handler 25 on 41553] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1480141487938_0001_m_000006_0 is : 0.0

RM audit log:

2016-11-26 01:41:43,359 INFO resourcemanager.RMAuditLogger: USER=yarn   IP=HostIp   OPERATION=Kill Application Request  TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1480141487938_0001    CALLERCONTEXT=CLI
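The audit line above shows that the kill arrived as an explicit request from user yarn with caller context CLI, not as a task failure. To check whether every kill arrives the same way, the kill requests can be pulled out of the audit log. This is a minimal sketch: the field names come from the line above, while the function name `kill_requests` and the log path in the usage comment are assumptions.

```shell
#!/usr/bin/env bash
# Print "timestamp user appid caller" for each kill request in an RM audit log.
# Reads the files given as arguments, or stdin when none are given.
# Lines that lack one of the fields are printed unchanged.
kill_requests() {
  grep -h 'OPERATION=Kill Application Request' "$@" |
    sed -E 's/^([0-9-]+ [0-9:,]+) .*USER=([^[:space:]]+).*APPID=([^[:space:]]+).*CALLERCONTEXT=([^[:space:]]+).*/\1 \2 \3 \4/'
}

# Usage (path is an assumption, adjust to your install):
#   kill_requests /var/log/hadoop-yarn/yarn/rm-audit.log
```

If every killed application shows the same user and caller, the kill is probably coming from one client-side process rather than from the jobs themselves.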

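I have changed every value I could find in Ambari from 3600 to a larger value, restarted the cluster, and rerun the script. The jobs are still killed after 1 hour, for both Sqoop and Spark.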
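Values changed in Ambari only help if they actually reach the configuration files on every node and client. One way to double-check is to search the deployed XML files for any property still set to one hour, in seconds or milliseconds. This is a rough sketch: the function name `find_one_hour_settings` and the `/etc/hadoop/conf` path in the usage comment are assumptions, and it only catches values written on a single `<value>` line.

```shell
#!/usr/bin/env bash
# List properties in *.xml config files whose value is one hour,
# written as 3600 (seconds) or 3600000 (milliseconds).
# -B1 also prints the line before each match, which is usually the <name> line.
find_one_hour_settings() {
  local conf_dir=$1
  grep -rn -B1 -E '<value>(3600|3600000)</value>' --include='*.xml' "$conf_dir"
}

# Usage (path is an assumption; run it on each node and on the client):
#   find_one_hour_settings /etc/hadoop/conf
```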

Edit:

yarn logs -show_application_log_info -applicationId application_1480141487938_0001

This shows only container IDs 1 through 27. So where can I find the logs/errors for containers 28 and 29?

Answer 1

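We were never able to isolate the problem completely, only that it was network-related. It turned out that even though I had increased every parameter I could find from 3600 to something larger, some kind of heartbeat on the client/node side was still set to 3600 seconds and was not being updated.

So, basically, after almost an hour the heartbeat tried to communicate and failed, and the AM killed the whole job.

Because the Hadoop, Hortonworks, and Cloudera documentation really lacks a specification of the ports and protocols that need to (or should) be enabled for each version, we eventually had to turn off iptables to resolve the issue.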
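Before disabling the firewall entirely, it may help to test from each node whether specific ports are reachable. Below is a small sketch using bash's `/dev/tcp` redirection and the coreutils `timeout` command. The function name `port_open` is an assumption, and the ports in the usage comment are only the ones that appear in the logs in the question (8050, 8188, 10200). The AM also listens on a dynamically chosen port (41553 in the log above), so a fixed list like this cannot prove the firewall rules are complete.

```shell
#!/usr/bin/env bash
# Return 0 if a TCP connection to host:port succeeds within 3 seconds,
# non-zero otherwise (refused, filtered, or timed out).
port_open() {
  local host=$1 port=$2
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Usage (HostName is the placeholder from the logs in the question):
#   for p in 8050 8188 10200; do
#     port_open HostName "$p" && echo "$p reachable" || echo "$p blocked"
#   done
```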
