Hadoop concatenate files recursively maintaining directory structure

Question

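I want to put the files from two directories under a single directory while keeping the directory structure.

I have directory 1 and directory 2, each with about 80 subdirectories, structured as shown below.

Directory 1 on HDFS: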

/user/hadoop/1/abc/file11
/user/hadoop/1/def/file12
/user/hadoop/1/ghi/file13
/user/hadoop/1/jkl/file14
/user/hadoop/1/mno/file15

Directory 2 on HDFS:

/user/hadoop/2/abc/file26
/user/hadoop/2/ghi/file27
/user/hadoop/2/mno/file28

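I want file11 from directory 1 and file26 from directory 2 to end up under the same directory, likewise file13 from directory 1 and file27 from directory 2, and so on. The target directory is directory 1.

Each file added from directory 2 to directory 1 must go into the subdirectory whose path matches its own.

Expected output: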

/user/hadoop/1/abc/file11 , /user/hadoop/1/abc/file26
/user/hadoop/1/def/file12
/user/hadoop/1/ghi/file13 , /user/hadoop/1/ghi/file27
/user/hadoop/1/jkl/file14
/user/hadoop/1/mno/file15 , /user/hadoop/1/mno/file28

Any help is appreciated.

Answer 1

Use the org.apache.hadoop.fs.FileUtil API.

You can obtain the FileSystem with the call below:

 final FileSystem fs = FileSystem.get(conf);

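Copy

public static boolean copy(FileSystem srcFS, Path[] srcs, FileSystem dstFS, Path dst, boolean deleteSource, boolean overwrite, Configuration conf) throws IOException

This method copies files between FileSystems.

FileUtil.replaceFile(File src, File target) may also be an option, but note that it takes java.io.File arguments, so it operates on local files rather than HDFS paths.

See the documentation for this method: "Move the src file to the name specified by target."

In either case, you need to find the subdirectories the two trees have in common by comparing the part of each path after the numbered directory (for example, /user/hadoop/2/abc/ against /user/hadoop/1/abc/). If they match, copy the source to the target, or develop the logic to suit your requirements (I will leave that part to you :-)).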
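
The path comparison described above can be sketched in plain Java. This is only an illustration under assumptions: the class MergePathMapper and the method mapToTarget are hypothetical names, not part of Hadoop, and the sketch only computes where a file from directory 2 would land under directory 1. The actual copy would still be done separately, for example with FileUtil.copy.

```java
// Hypothetical helper, not a Hadoop API: pure string logic for choosing targets.
final class MergePathMapper {

    /**
     * Re-roots srcPath from srcRoot onto dstRoot, keeping the relative part
     * (subdirectory and file name). Returns null when srcPath is not under srcRoot.
     */
    static String mapToTarget(String srcPath, String srcRoot, String dstRoot) {
        // Normalise both roots to end with exactly one trailing slash.
        String prefix = srcRoot.endsWith("/") ? srcRoot : srcRoot + "/";
        String base = dstRoot.endsWith("/") ? dstRoot : dstRoot + "/";
        if (!srcPath.startsWith(prefix)) {
            return null; // not a file from the source tree
        }
        String relative = srcPath.substring(prefix.length()); // e.g. "abc/file26"
        return base + relative;
    }

    public static void main(String[] args) {
        String[] sources = {
            "/user/hadoop/2/abc/file26",
            "/user/hadoop/2/ghi/file27",
            "/user/hadoop/2/mno/file28"
        };
        for (String src : sources) {
            String dst = mapToTarget(src, "/user/hadoop/2", "/user/hadoop/1");
            System.out.println(src + " -> " + dst);
        }
    }
}
```

Because the prefix check includes the trailing slash, a path such as /user/hadoop/20/abc/file is not mistaken for a file under /user/hadoop/2.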

After copying to the desired destination, you can list the files from within your process using the following sample method:

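import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Logs the status of every entry directly under the given directory.
 *
 * @param destination the directory to list
 * @param fs the file system that contains the directory
 * @throws FileNotFoundException if the directory does not exist
 * @throws IOException if the listing fails
 */
public static void listFileStats(final String destination, final FileSystem fs) throws FileNotFoundException, IOException {
    final FileStatus[] statuses = fs.listStatus(new Path(destination));
    for (final FileStatus status : statuses) {
        // LOG is assumed to be an SLF4J logger declared in the enclosing class; any other logger works.
        LOG.info("--  status {}    ", status.toString());
    }
}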

Answer 2

I generate a unique file name for each file under directory 2 and add it to the correct subdirectory under directory 1. Here is the script:

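# Iterate over every file path under directory 2 (assumes paths contain no whitespace).
for file in $(hadoop fs -ls '/user/hadoop/2/*' | grep -o -e "/user/hadoop/2/.*"); do
    subDir=$(echo "$file" | cut -d '/' -f 5)      # e.g. abc
    fileName=$(echo "$file" | cut -d '/' -f 6)    # e.g. file26
    uuid=$(uuidgen)
    newFileName="${fileName}_${uuid}"             # unique suffix avoids name clashes in directory 1

    hadoop fs -cp "$file" "/user/hadoop/1/$subDir/$newFileName"
done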
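
For anyone driving the copy from Java rather than the shell, the naming scheme used by the script could be sketched roughly as follows. The class UniqueTargetNames and the method uniqueTarget are hypothetical names introduced for illustration, and the sketch assumes every source path ends in subdirectory/fileName, as in the question. It only builds the target string and does not touch HDFS.

```java
import java.util.UUID;

// Hypothetical helper mirroring the script's naming scheme; not a Hadoop API.
final class UniqueTargetNames {

    /**
     * Builds dstRoot/subDir/fileName_uuid from a source path whose last two
     * components are the subdirectory and the file name.
     */
    static String uniqueTarget(String srcPath, String dstRoot) {
        String[] parts = srcPath.split("/");
        if (parts.length < 2) {
            throw new IllegalArgumentException("expected .../subDir/fileName: " + srcPath);
        }
        String subDir = parts[parts.length - 2];   // e.g. abc
        String fileName = parts[parts.length - 1]; // e.g. file26
        // A random UUID plays the role of uuidgen in the script.
        return dstRoot + "/" + subDir + "/" + fileName + "_" + UUID.randomUUID();
    }

    public static void main(String[] args) {
        System.out.println(uniqueTarget("/user/hadoop/2/abc/file26", "/user/hadoop/1"));
    }
}
```

The random suffix means a file from directory 2 cannot overwrite a file of the same name already present in directory 1, at the cost of changing the original file name.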
