script to get the file last modified date and file name pyspark

I have a mount point location pointing to blob storage where we have multiple files. We need to find the last modified date and the file name of each file. I am using the script below. The file list looks like this:

/mnt/schema_id=na/184000-9.jsonl
/mnt/schema_id=na/185000-0.jsonl
/mnt/schema_id=na/185000-22.jsonl
/mnt/schema_id=na/185000-25.jsonl
import os
import time

# Path to the file/directory
path = "/mnt/schema_id=na"

ti_c = os.path.getctime(path)
ti_m = os.path.getmtime(path)

c_ti = time.ctime(ti_c)
m_ti = time.ctime(ti_m)

print(f"The file located at the path {path} was created at {c_ti} and was last modified at {m_ti}")

Here is one way to achieve it:

import os
import time

# Path to the file/directory (note the /dbfs prefix)
path = "/dbfs/mnt/schema_id=na"

for file_item in os.listdir(path):
    file_path = os.path.join(path, file_item)
    ti_c = os.path.getctime(file_path)
    ti_m = os.path.getmtime(file_path)

    c_ti = time.ctime(ti_c)
    m_ti = time.ctime(ti_m)

    print(f"The file {file_item} located at the path {path} was created at {c_ti} and was last modified at {m_ti}")
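If you need the results as data rather than printed text, the same loop can collect (file name, last modified) pairs, for example sorted newest-first. A minimal sketch that works on any local directory; the helper name is made up for illustration:

```python
import os
import time

def list_files_by_mtime(path):
    """Return (file_name, last_modified) pairs, newest first."""
    entries = []
    for file_item in os.listdir(path):
        file_path = os.path.join(path, file_item)
        if os.path.isfile(file_path):
            entries.append((file_item, os.path.getmtime(file_path)))
    # Sort by modification timestamp, most recently modified first
    entries.sort(key=lambda e: e[1], reverse=True)
    return [(name, time.ctime(ts)) for name, ts in entries]
```

On Databricks you would call it with the `/dbfs`-prefixed path, e.g. `list_files_by_mtime("/dbfs/mnt/schema_id=na")`.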

If you're using OS-level commands to get the file information, you can't access that exact location directly - on Databricks it lives on the Databricks File System (DBFS).

To reach it at the Python level, you need to prepend /dbfs to the path, so it becomes:

...
path = "/dbfs/mnt/schema_id=na"
for file_item in os.listdir(path):
    file_path = os.path.join(path, file_item)
    ti_c = os.path.getctime(file_path)
    ...

Note that os.path.getctime needs the /dbfs-prefixed path; if you later need the path in DBFS form again (for example to pass it to Spark), strip the /dbfs prefix with file_path[5:].
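To keep the prefix handling in one place, the conversion can be wrapped in two tiny helpers. These function names are made up for illustration:

```python
LOCAL_PREFIX = "/dbfs"

def to_local_path(dbfs_path):
    """Convert a DBFS path like /mnt/... into the /dbfs/mnt/... form
    that Python's os-level functions can read on Databricks."""
    if dbfs_path.startswith(LOCAL_PREFIX):
        return dbfs_path
    return LOCAL_PREFIX + dbfs_path

def to_dbfs_path(local_path):
    """Strip the /dbfs prefix to get the DBFS-style path back,
    e.g. for passing to Spark readers."""
    if local_path.startswith(LOCAL_PREFIX):
        return local_path[len(LOCAL_PREFIX):]
    return local_path
```

This avoids hard-coding slice offsets like [5:] in several places, and both helpers are no-ops when the path is already in the desired form.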