Mount error when trying to access the Azure DBFS file system in Azure Databricks

Question

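I am able to establish a connection to my Databricks FileStore on DBFS and access the filestore.

Reading, writing, and transforming data with PySpark works, but when I try to use a local Python API such as pathlib or the os module, I cannot get past the first level of the DBFS file system.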

I can use a magic command:

%fs ls dbfs:/mnt/my_fs/... which works perfectly and lists all the child directories.

But if I run os.listdir('/dbfs/mnt/my_fs/'), it returns ['mount.err'] as the return value.

I have tested this on a new cluster and the result is the same.

I am using Python on Databricks Runtime version 6.1 with Apache Spark 2.4.4.

Can anyone advise?

Edit:

Connection script:

I used the Databricks CLI library to store my credentials, which are formatted according to the Databricks documentation:

def initialise_connection(secrets_func):
    # secrets_func returns the extra_configs dict holding the credentials
    configs = secrets_func()

    # Check if the mount exists
    bMountExists = False
    for item in dbutils.fs.ls("/mnt/"):
        if str(item.name) == r"WFM/":
            bMountExists = True

    # Drop the mount if it exists, to refresh credentials
    # (done once, after the loop, rather than on every listed item)
    if bMountExists:
        dbutils.fs.unmount("/mnt/WFM")
        bMountExists = False

    # Mount a drive
    if not bMountExists:
        dbutils.fs.mount(
            source="adl://test.azuredatalakestore.net/WFM",
            mount_point="/mnt/WFM",
            extra_configs=configs
        )
        print("Drive mounted")
    else:
        print("Drive already mounted")

Answer 1

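Updated answer: With Azure Data Lake Gen1 storage accounts, dbutils has access to the ADLS Gen1 tokens/access credentials, so listing files within the mount point works. Standard Python API calls do not have access to those credentials or the Spark conf. The first call you see only lists the folders and does not make any calls to the ADLS APIs.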
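To make the symptom easier to check from plain Python, here is a minimal sketch. It assumes the driver exposes DBFS through the /dbfs FUSE path, as in the question, and that a broken mount shows up as a directory containing only mount.err. The helper names dbfs_uri_to_local and listing_is_mount_error are hypothetical and not part of any Databricks API.

```python
import os


def dbfs_uri_to_local(uri):
    """Map a dbfs:/ URI to the path seen through the /dbfs FUSE mount."""
    prefix = "dbfs:"
    if not uri.startswith(prefix):
        raise ValueError(f"not a dbfs URI: {uri!r}")
    return "/dbfs/" + uri[len(prefix):].lstrip("/")


def listing_is_mount_error(local_path):
    """True when the directory holds only the 'mount.err' marker file."""
    return os.listdir(local_path) == ["mount.err"]
```

For example, listing_is_mount_error(dbfs_uri_to_local("dbfs:/mnt/WFM")) could be called at the top of a notebook, so the failure is reported before any pathlib or os code runs against the mount.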

I tested this on Databricks Runtime version 6.1 (includes Apache Spark 2.4.4, Scala 2.11).

The commands ran without any error messages.

Update: output for the inner folders.

Hope this helps. Please try it and let us know.

Answer 2

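We ran into the same problem when connecting to an Azure Generation2 storage account (without hierarchical namespace).

The error seems to occur when switching the Databricks Runtime Environment from 5.5 to 6.x. However, we have not been able to pinpoint the exact cause. We assume some functionality may have been deprecated.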

Answer 3

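We ran into this issue when the same container was mounted to two different paths in the workspace. Unmounting everything and remounting resolved it for us. We were using Databricks version 6.2 (Spark 2.4.4, Scala 2.11). Our blob storage container configuration:

Performance/Access tier: Standard/Hot
Replication: Read-access geo-redundant storage (RA-GRS)
Account kind: StorageV2 (general purpose v2)

Notebook script to unmount all mounts in /mnt: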

# Iterate through all mounts and unmount 
print('Unmounting all mounts beginning with /mnt/')
dbutils.fs.mounts()
for mount in dbutils.fs.mounts():
  if mount.mountPoint.startswith('/mnt/'):
    dbutils.fs.unmount(mount.mountPoint)

# Re-list all mount points
print('Re-listing all mounts')
dbutils.fs.mounts()
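Before unmounting everything, it may help to confirm which sources are actually mounted more than once. The following is a rough sketch: find_duplicate_sources is a hypothetical helper, and it assumes each entry returned by dbutils.fs.mounts() exposes mountPoint and source attributes.

```python
from collections import defaultdict


def find_duplicate_sources(mounts):
    """Group mount points by source; keep sources mounted more than once."""
    by_source = defaultdict(list)
    for mount in mounts:
        by_source[mount.source].append(mount.mountPoint)
    return {src: points for src, points in by_source.items() if len(points) > 1}


# In a notebook, where dbutils is available (assumption):
# print(find_duplicate_sources(dbutils.fs.mounts()))
```

An empty result would suggest that duplicate mounts are not the cause in your workspace.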

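Minimal job to test on an automated job cluster

This assumes you have a separate process that creates the mounts. Create a job definition (job.json) that runs a Python script on an automated cluster: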

{
  "name": "Minimal Job",
  "new_cluster": {
    "spark_version": "6.2.x-scala2.11",
    "spark_conf": {},
    "node_type_id": "Standard_F8s",
    "driver_node_type_id": "Standard_F8s",
    "num_workers": 2,
    "enable_elastic_disk": true,
    "spark_env_vars": {
      "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
    }
  },
  "timeout_seconds": 14400,
  "max_retries": 0,
  "spark_python_task": {
    "python_file": "dbfs:/minimal/job.py"
  }
}

Python file (job.py) that prints out the mounts:

import os

path_mounts = '/dbfs/mnt/'
print(f"Listing contents of {path_mounts}:")
print(os.listdir(path_mounts))

path_mount = path_mounts + 'YOURCONTAINERNAME'
print(f"Listing contents of {path_mount }:")
print(os.listdir(path_mount))

Run the following databricks CLI commands to run the job. Check the output in the Spark driver logs and confirm that mount.err is not present.

databricks fs mkdirs dbfs:/minimal
databricks fs cp job.py dbfs:/minimal/job.py --overwrite
databricks jobs create --json-file job.json
databricks jobs run-now --job-id <JOBID FROM LAST COMMAND>


python

azure

databricks

azure-databricks
