Spark cluster: incomprehensible output
I am running a third-party tool implemented on Spark, in cluster mode.
When executed on a single machine it produces understandable output while it runs, but in cluster mode, after a few minutes all I can observe is output like this:
...
INFO scheduler.TaskSetManager: Starting task 95.0 in stage 1.0 (TID 199, 10.0.0.13, executor 5, partition 95, ANY, 5585 bytes)
INFO scheduler.TaskSetManager: Finished task 87.0 in stage 1.0 (TID 191) in 442674 ms on 10.0.0.13 (executor 5) (80/104)
INFO scheduler.TaskSetManager: Starting task 96.0 in stage 1.0 (TID 200, 10.0.0.13, executor 4, partition 96, ANY, 5585 bytes)
INFO scheduler.TaskSetManager: Finished task 88.0 in stage 1.0 (TID 192) in 427022 ms on 10.0.0.13 (executor 4) (81/104)
INFO scheduler.TaskSetManager: Starting task 97.0 in stage 1.0 (TID 201, 10.0.0.13, executor 6, partition 97, ANY, 5586 bytes)
INFO scheduler.TaskSetManager: Finished task 89.0 in stage 1.0 (TID 193) in 434826 ms on 10.0.0.13 (executor 6) (82/104)
INFO scheduler.TaskSetManager: Starting task 98.0 in stage 1.0 (TID 202, 10.0.0.13, executor 5, partition 98, ANY, 5586 bytes)
INFO scheduler.TaskSetManager: Finished task 90.0 in stage 1.0 (TID 194) in 428479 ms on 10.0.0.13 (executor 5) (83/104)
INFO scheduler.TaskSetManager: Starting task 99.0 in stage 1.0 (TID 203, 10.0.0.13, executor 4, partition 99, ANY, 5586 bytes)
INFO scheduler.TaskSetManager: Finished task 92.0 in stage 1.0 (TID 196) in 421363 ms on 10.0.0.13 (executor 4) (84/104)
INFO scheduler.TaskSetManager: Starting task 100.0 in stage 1.0 (TID 204, 10.0.0.13, executor 6, partition 100, ANY, 5585 bytes)
INFO scheduler.TaskSetManager: Finished task 91.0 in stage 1.0 (TID 195) in 436868 ms on 10.0.0.13 (executor 6) (85/104)
INFO scheduler.TaskSetManager: Starting task 101.0 in stage 1.0 (TID 205, 10.0.0.13, executor 7, partition 101, ANY, 5585 bytes)
INFO scheduler.TaskSetManager: Finished task 93.0 in stage 1.0 (TID 197) in 423796 ms on 10.0.0.13 (executor 7) (86/104)
INFO scheduler.TaskSetManager: Starting task 102.0 in stage 1.0 (TID 206, 10.0.0.13, executor 5, partition 102, ANY, 5585 bytes)
INFO scheduler.TaskSetManager: Finished task 95.0 in stage 1.0 (TID 199) in 431473 ms on 10.0.0.13 (executor 5) (87/104)
INFO scheduler.TaskSetManager: Starting task 103.0 in stage 1.0 (TID 207, 10.0.0.13, executor 7, partition 103, ANY, 5335 bytes)
INFO scheduler.TaskSetManager: Finished task 94.0 in stage 1.0 (TID 198) in 448226 ms on 10.0.0.13 (executor 7) (88/104)
INFO scheduler.TaskSetManager: Finished task 96.0 in stage 1.0 (TID 200) in 435101 ms on 10.0.0.13 (executor 4) (89/104)
INFO scheduler.TaskSetManager: Finished task 97.0 in stage 1.0 (TID 201) in 423836 ms on 10.0.0.13 (executor 6) (90/104)
INFO scheduler.TaskSetManager: Finished task 98.0 in stage 1.0 (TID 202) in 415700 ms on 10.0.0.13 (executor 5) (91/104)
INFO scheduler.TaskSetManager: Finished task 99.0 in stage 1.0 (TID 203) in 410550 ms on 10.0.0.13 (executor 4) (92/104)
INFO scheduler.TaskSetManager: Finished task 100.0 in stage 1.0 (TID 204) in 420337 ms on 10.0.0.13 (executor 6) (93/104)
INFO scheduler.TaskSetManager: Finished task 103.0 in stage 1.0 (TID 207) in 318385 ms on 10.0.0.13 (executor 7) (94/104)
INFO scheduler.TaskSetManager: Finished task 101.0 in stage 1.0 (TID 205) in 421965 ms on 10.0.0.13 (executor 7) (95/104)
INFO scheduler.TaskSetManager: Finished task 102.0 in stage 1.0 (TID 206) in 425816 ms on 10.0.0.13 (executor 5) (96/104)
...
This doesn't tell me much. Is there a way to see the output I normally observe in local execution?
Moreover, after a few tens of minutes the CPU load on both machines drops to almost 0%, even though just minutes earlier they were nearly 100% busy. Could it be that too few resources were allocated at spark-submit
time? I don't know; since this output gives no clues, what can I do to investigate or obtain more useful information?
For example, I tried connecting to http://localhost:4040 as suggested here, but got no response.
I found the normal stack trace I usually observe in local mode. Given how Spark is designed, it is actually natural that the stack trace shows up on the Spark Worker.
To reach it, I pointed a browser at http://localhost:8081 (since I was connected to the virtual machine over SSH, I used lynx on the node hosting the Spark Worker during job execution): clicking "stderr" shows the desired stack trace. Alternatively, on the Worker node's file system, a file like /spark/work/app-20180216182621-0001/2/stderr holds the stack-trace output.
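If you prefer the file-system route over the web UI, something like the following can locate the newest application's executor logs on the worker node. This is only a sketch: it assumes a standalone deployment with the work directory at /spark/work (the default is $SPARK_HOME/work), and the app ID layout shown above.

```shell
# Sketch: print the tail of every executor's stderr for the most recent app.
# Assumption: work dir is /spark/work (default would be $SPARK_HOME/work).
WORK_DIR=/spark/work
# Most recently created application directory, if any.
APP_DIR=$(ls -td "$WORK_DIR"/app-* 2>/dev/null | head -n 1)
if [ -n "$APP_DIR" ]; then
  # One numbered subdirectory per executor, each with stdout/stderr files.
  for d in "$APP_DIR"/*/; do
    echo "== ${d}stderr"
    tail -n 50 "${d}stderr"
  done
else
  echo "no application directories under $WORK_DIR"
fi
```

Running this while the job executes (or right after a failure) surfaces the same stack trace the worker UI links to under "stderr".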
I previously couldn't reach http://localhost:8081 and http://localhost:4040 because I had set up Spark with Docker and hadn't published those ports in the docker-compose file. But that is unrelated to this question.
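For anyone hitting the same wall: with docker-compose the UI ports must be published explicitly. A minimal sketch (service names are illustrative, not from my actual file; 4040 is served by the driver, assumed here to run in the master's container):

```yaml
services:
  spark-master:
    # ...
    ports:
      - "4040:4040"   # application UI, only up while a job is running
  spark-worker:
    # ...
    ports:
      - "8081:8081"   # worker web UI, with links to each executor's stderr
```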