Copying data from a Cloudera cluster to a Google Cloud HDFS cluster using distcp
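I am using the Cloudera QuickStart VM. I started playing with Google Cloud Platform yesterday. I am trying to copy data from Cloudera HDFS to:
1. Google Cloud Storage (gs://bucket_name/)
2. a Google Cloud HDFS cluster (using hdfs://google_cluster_namenode:8020/)
I set up service account authentication and configured my Cloudera core-site.xml by following the instructions in this post.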
hadoop fs -cp hdfs://quickstart.cloudera:8020/path_to_copy/ gs://bucket_name/
works fine. However, I cannot copy to Google Cloud Storage using distcp; I get the error below. I know this is not a URI issue. Is there anything else I am missing?
Error: java.io.IOException: File copy failed: hdfs://quickstart.cloudera:8020/path_to_copy/file --> gs://bucket_name/file
at org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:284)
at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:252)
at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:50)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.io.IOException: Couldn't run retriable-command: Copying hdfs://quickstart.cloudera:8020/path_to_copy/file to gs://bucket_name/file
at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:101)
at org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:280)
... 10 more
Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: gs://bucket_name.distcp.tmp.attempt_1461777569169_0002_m_000001_2
at org.apache.hadoop.fs.Path.initialize(Path.java:206)
at org.apache.hadoop.fs.Path.<init>(Path.java:116)
at org.apache.hadoop.fs.Path.<init>(Path.java:94)
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.getTmpFile(RetriableFileCopyCommand.java:233)
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doCopy(RetriableFileCopyCommand.java:107)
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doExecute(RetriableFileCopyCommand.java:100)
at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:87)
... 11 more
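One detail in the trace above may be worth checking: the temporary path `gs://bucket_name.distcp.tmp.attempt_...` has no `/` between the bucket name and `.distcp.tmp`, which suggests the failure is tied to using the bucket root as the destination. A minimal sketch of a possible workaround, assuming that copying into a subdirectory of the bucket avoids the malformed temp path; `distcp_target` is a hypothetical directory name, not something from the original setup:

```shell
# Sketch only: target a subdirectory of the bucket rather than the bucket root.
# "distcp_target" is a made-up name used for illustration.
SRC="hdfs://quickstart.cloudera:8020/path_to_copy/"
DST="gs://bucket_name/distcp_target/"

# Print the command by default; set RUN=1 to actually launch distcp.
if [ "${RUN:-0}" = "1" ]; then
  hadoop distcp "$SRC" "$DST"
else
  echo "hadoop distcp $SRC $DST"
fi
```

If the copy succeeds with a subdirectory target, that would point to the bucket-root destination rather than authentication as the cause.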
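- I cannot get distcp to connect to the Google Cloud HDFS namenode; I get "Retrying connect to server". I could not find any documentation on configuring connectivity between a Cloudera HDFS cluster and a Google Cloud HDFS cluster. I assumed service account auth should also work with Google HDFS. Is there reference documentation I can use to set up copying between the clusters? Am I missing any other authentication settings?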
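It turns out I had to modify the firewall rules to allow TCP/HTTP traffic from the IP where I run distcp. Check the network firewall on your GCP Compute Engine instances.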
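As a rough illustration of that firewall change, the sketch below builds a `gcloud compute firewall-rules create` command. Every value is an assumption to be replaced: the rule name, the `default` network, and the source IP (taken from a documentation-only range) are placeholders. Port 8020 is the namenode port from the question, while 50010 is the Hadoop 2.x default DataNode transfer port and may differ on your cluster.

```shell
# Sketch only: all values below are placeholders, not taken from the original setup.
RULE_NAME="allow-distcp-from-cloudera"   # hypothetical rule name
NETWORK="default"                        # assumed VPC network of the GCP cluster
SOURCE_IP="203.0.113.10/32"              # public IP of the machine running distcp (example address)
PORTS="tcp:8020,tcp:50010"               # NameNode RPC port plus assumed DataNode transfer port

# Print the command by default; set RUN=1 to actually create the rule.
if [ "${RUN:-0}" = "1" ]; then
  gcloud compute firewall-rules create "$RULE_NAME" \
    --network="$NETWORK" --direction=INGRESS \
    --allow="$PORTS" --source-ranges="$SOURCE_IP"
else
  echo "gcloud compute firewall-rules create $RULE_NAME --network=$NETWORK --direction=INGRESS --allow=$PORTS --source-ranges=$SOURCE_IP"
fi
```

Restricting `--source-ranges` to the single distcp host keeps the HDFS ports closed to the rest of the internet.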