Access WebHDFS on Hortonworks Hadoop (AWS EC2)

I am having trouble accessing WebHDFS on an Amazon EC2 machine. I have installed Hortonworks HDP 2.3, by the way.

From my local machine, I can retrieve the file status in a browser (Chrome) with the following HTTP request:

http://<serverip>:50070/webhdfs/v1/user/admin/file.csv?op=GETFILESTATUS

This works fine, but if I try to open the file with ?op=OPEN, it redirects me to the machine's private DNS, which I cannot reach:

http://<privatedns>:50075/webhdfs/v1/user/admin/file.csv?op=OPEN&namenoderpcaddress=<privatedns>:8020&offset=0
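A quick way to see what is happening is to issue the OPEN request with curl and inspect the response headers instead of following the redirect (a sketch; <serverip> is the NameNode's public address, as above):

curl -i "http://<serverip>:50070/webhdfs/v1/user/admin/file.csv?op=OPEN"
# Expected: HTTP/1.1 307 TEMPORARY_REDIRECT with a Location header pointing at
# http://<privatedns>:50075/... -- the browser follows that Location header,
# which is why it ends up on the unreachable private DNS.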

I also tried to access WebHDFS from the AWS machine itself, with the following command:

[ec2-user@<ip> conf]$ curl -i http://localhost:50070/webhdfs/v1/user/admin/file.csv?op=GETFILESTATUS
curl: (7) couldn't connect to host
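One way to check whether the NameNode web server is simply not listening on the loopback interface (a sketch, using standard Linux tools and the stock hdfs getconf command):

# Which address is port 50070 actually bound to?
sudo netstat -tlnp | grep 50070
# What does HDFS itself think its web address is?
hdfs getconf -confKey dfs.namenode.http-address
# If it is bound to the private hostname rather than 0.0.0.0, this should
# succeed where localhost fails:
curl -i "http://$(hostname -f):50070/webhdfs/v1/user/admin/file.csv?op=GETFILESTATUS"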

Does anyone know why I cannot connect to localhost, or why OPEN does not work from my local machine? Unfortunately, I could not find any tutorial on configuring WebHDFS for Amazon machines.

Thanks in advance.

What is happening is that the NameNode redirects you to a DataNode. It seems you installed a single-node cluster, but conceptually the NameNode and the DataNodes are distinct, and in your configuration the DataNode lives/listens on the private side of your EC2 VPC.
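You can confirm this from the cluster itself; hdfs getconf is a stock HDFS command, and dfs.datanode.http.address is the property behind the DataNode's HTTP endpoint (a sketch, assuming a standard HDP layout):

# Print the DataNode HTTP endpoint configuration.
hdfs getconf -confKey dfs.datanode.http.address
# Typically 0.0.0.0:50075; the hostname in the redirect's Location header is
# the DataNode's registered hostname, which here is the private EC2 DNS name.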

You could reconfigure the cluster to expose the DataNodes on a public IP/DNS (see HDFS Support for Multihomed Networks), but I would not go that way. I think the proper solution is to add a Knox gateway, which is a specialized component for accessing a private cluster through a public API. Specifically, you will have to configure the DataNode URLs; see Chapter 5, Mapping the Internal Nodes to External URLs. The example there seems to match your case (a request sketch follows the quoted list):

For example, when uploading a file with WebHDFS service:

  • The external client sends a request to the gateway WebHDFS service.

  • The gateway proxies the request to WebHDFS using the service URL.

  • WebHDFS determines which DataNodes to create the file on and returns the path for the upload as a Location header in an HTTP redirect, which contains the datanode host information.

  • The gateway augments the routing policy based on the datanode hostname in the redirect by mapping it to the externally resolvable hostname.

  • The external client continues to upload the file through the gateway.

  • The gateway proxies the request to the datanode by using the augmented routing policy.

  • The datanode returns the status of the upload and the gateway again translates the information without exposing any internal cluster details.
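Once the gateway is in place, the external client only ever talks to Knox on its public address. A hypothetical request (assuming the default topology name, Knox's standard port 8443, and the demo admin:admin credentials, none of which are given in the question) would look like:

# All traffic flows through the gateway; the cluster's private hostnames
# never need to be resolvable by the client. -k skips TLS verification for
# a self-signed gateway certificate.
curl -i -k -u admin:admin "https://<knoxhost>:8443/gateway/default/webhdfs/v1/user/admin/file.csv?op=OPEN"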