How to list HDFS directory contents using webhdfs?

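Is it possible to check the contents of an HDFS directory using webhdfs?

This would work the way hdfs dfs -ls normally does, but through webhdfs instead.

How can I list a webhdfs directory using Python 2.6?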

You can use the LISTSTATUS operation. It is documented under "List a Directory", and the following can be found in the WebHDFS REST API documentation:

With curl, it looks like this:

curl -i  "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=LISTSTATUS"
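The same request can also be issued from Python without any third-party library. The following is a minimal sketch, not a definitive client: the function names liststatus_url and list_names are hypothetical, it assumes an unsecured cluster (no authentication parameters) and paths that need no URL escaping, and the import fallback is there only so it also runs on Python 2.

```python
import json

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2


def liststatus_url(host, port, path):
    """Build the LISTSTATUS URL for an absolute HDFS path such as '/user/hdfs'.

    Hypothetical helper: the path is not URL-escaped, so it is assumed to
    contain no spaces or other special characters.
    """
    return "http://%s:%s/webhdfs/v1/%s?op=LISTSTATUS" % (host, port, path.lstrip("/"))


def list_names(host, port, path):
    """Fetch the listing and return the pathSuffix of every entry."""
    response = urlopen(liststatus_url(host, port, path))
    try:
        body = json.loads(response.read().decode("utf-8"))
    finally:
        response.close()
    return [item["pathSuffix"] for item in body["FileStatuses"]["FileStatus"]]
```

Calling list_names("host", 50070, "/dir/dir") against a reachable NameNode would return the entry names as a list of strings; the host and port here are placeholders for your own configuration.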

The response is a FileStatuses JSON object, described by this JSON schema:

{
  "name"      : "FileStatuses",
  "properties":
  {
    "FileStatuses":
    {
      "type"      : "object",
      "properties":
      {
        "FileStatus":
        {
          "description": "An array of FileStatus",
          "type"       : "array",
          "items"      : fileStatusProperties
        }
      }
    }
  }
}

fileStatusProperties (used for the items field) has this JSON schema:

var fileStatusProperties =
{
  "type"      : "object",
  "properties":
  {
    "accessTime":
    {
      "description": "The access time.",
      "type"       : "integer",
      "required"   : true
    },
    "blockSize":
    {
      "description": "The block size of a file.",
      "type"       : "integer",
      "required"   : true
    },
    "group":
    {
      "description": "The group owner.",
      "type"       : "string",
      "required"   : true
    },
    "length":
    {
      "description": "The number of bytes in a file.",
      "type"       : "integer",
      "required"   : true
    },
    "modificationTime":
    {
      "description": "The modification time.",
      "type"       : "integer",
      "required"   : true
    },
    "owner":
    {
      "description": "The user who is the owner.",
      "type"       : "string",
      "required"   : true
    },
    "pathSuffix":
    {
      "description": "The path suffix.",
      "type"       : "string",
      "required"   : true
    },
    "permission":
    {
      "description": "The permission represented as a octal string.",
      "type"       : "string",
      "required"   : true
    },
    "replication":
    {
      "description": "The number of replication of a file.",
      "type"       : "integer",
      "required"   : true
    },
    "type":
    {
      "description": "The type of the path object.",
      "enum"       : ["FILE", "DIRECTORY"],
      "required"   : true
    }
  }
};
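To make the schema concrete, here is a sketch that parses a response of that shape and formats each entry roughly the way hdfs dfs -ls would. The sample values are invented for illustration, mode_string and ls_line are hypothetical helper names, and only the last three octal digits of the permission are interpreted (special bits such as the sticky bit are ignored).

```python
import json

# Hypothetical sample response; all values are made up for illustration.
SAMPLE = '''
{"FileStatuses": {"FileStatus": [
  {"accessTime": 0, "blockSize": 0, "group": "supergroup", "length": 0,
   "modificationTime": 1320173277227, "owner": "hdfs", "pathSuffix": "logs",
   "permission": "755", "replication": 0, "type": "DIRECTORY"},
  {"accessTime": 1320171722771, "blockSize": 33554432, "group": "supergroup",
   "length": 24930, "modificationTime": 1320171722771, "owner": "hdfs",
   "pathSuffix": "a.patch", "permission": "644", "replication": 1, "type": "FILE"}
]}}
'''


def mode_string(entry):
    """Turn the octal permission string into an ls-style string such as 'drwxr-xr-x'."""
    prefix = "d" if entry["type"] == "DIRECTORY" else "-"
    bits = ""
    for digit in entry["permission"][-3:]:  # ignore any leading special-bits digit
        value = int(digit, 8)
        bits += "r" if value & 4 else "-"
        bits += "w" if value & 2 else "-"
        bits += "x" if value & 1 else "-"
    return prefix + bits


def ls_line(entry):
    """Format one FileStatus entry as: mode owner group length name."""
    return "%s %s %s %d %s" % (
        mode_string(entry), entry["owner"], entry["group"],
        entry["length"], entry["pathSuffix"])


entries = json.loads(SAMPLE)["FileStatuses"]["FileStatus"]
lines = [ls_line(entry) for entry in entries]
for line in lines:
    print(line)
```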

You can work with the file names in Python using pywebhdfs, like this:

from pprint import pprint
from pywebhdfs.webhdfs import PyWebHdfsClient

# Use your own host/port/user_name config
hdfs = PyWebHdfsClient(host='host', port='50070', user_name='hdfs')

# Use your preferred directory, without the leading "/"
data = hdfs.list_dir("dir/dir")

file_statuses = data["FileStatuses"]
pprint(file_statuses)   # Display the dict (pprint is a function, so it needs parentheses)

for item in file_statuses["FileStatus"]:
    print(item["pathSuffix"])   # Display the item filename

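Instead of printing each item, you can process the items however you need. The value of file_statuses is just a Python dict, so it can be used like any other dict, provided you use the correct keys.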
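As one illustration of such processing, the sketch below separates files from directories and sums the file sizes. The helper names split_entries and total_length are hypothetical, and the sample dict stands in for the file_statuses value from the code above with only the keys this sketch reads.

```python
def split_entries(file_statuses):
    """Return (file_names, directory_names) from a FileStatuses dict."""
    files, dirs = [], []
    for item in file_statuses["FileStatus"]:
        if item["type"] == "DIRECTORY":
            dirs.append(item["pathSuffix"])
        else:
            files.append(item["pathSuffix"])
    return files, dirs


def total_length(file_statuses):
    """Sum the length in bytes of every entry whose type is FILE."""
    return sum(item["length"]
               for item in file_statuses["FileStatus"]
               if item["type"] == "FILE")


# Hypothetical stand-in for data["FileStatuses"]; values are invented.
sample_statuses = {
    "FileStatus": [
        {"pathSuffix": "logs", "type": "DIRECTORY", "length": 0},
        {"pathSuffix": "a.csv", "type": "FILE", "length": 1200},
        {"pathSuffix": "b.csv", "type": "FILE", "length": 800},
    ]
}

files, dirs = split_entries(sample_statuses)
size = total_length(sample_statuses)
```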