Script to get a file's last modified date and file name in PySpark
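I have a mount point that points to blob storage containing multiple files. We need to find the last modified date and the file name of each file. I am using the script below.
The list of files is as follows: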
/mnt/schema_id=na/184000-9.jsonl
/mnt/schema_id=na/185000-0.jsonl
/mnt/schema_id=na/185000-22.jsonl
/mnt/schema_id=na/185000-25.jsonl
import os
import time
# Path to the file/directory
path = "/mnt/schema_id=na"
ti_c = os.path.getctime(path)
ti_m = os.path.getmtime(path)
c_ti = time.ctime(ti_c)
m_ti = time.ctime(ti_m)
print(f"The file located at the path {path} was created at {c_ti} and was last modified at {m_ti}")
Here is one way to do it:
import os
import time
# Path to the file/directory
path = "/dbfs/mnt/schema_id=na"
for file_item in os.listdir(path):
    file_path = os.path.join(path, file_item)
    ti_c = os.path.getctime(file_path)
    ti_m = os.path.getmtime(file_path)
    c_ti = time.ctime(ti_c)
    m_ti = time.ctime(ti_m)
    print(f"The file {file_item} located at the path {path} was created at {c_ti} and was last modified at {m_ti}")
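The loop above can also be wrapped in a small function that returns the names and timestamps instead of printing them. This is only a sketch under a few assumptions: the name list_files_with_mtime is made up for illustration, subdirectories are skipped, and timestamps are reported in UTC. It returns only the modification time, because on Linux os.path.getctime gives the time of the last metadata change rather than a true creation time.

```python
import os
from datetime import datetime, timezone


def list_files_with_mtime(directory):
    """Return (file name, last-modified datetime in UTC) pairs, newest first.

    `directory` must be a path the local OS can see; on Databricks that
    means the /dbfs-prefixed form of the mount path.
    """
    results = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.is_file():
                continue  # skip subdirectories and other non-file entries
            modified = datetime.fromtimestamp(entry.stat().st_mtime, tz=timezone.utc)
            results.append((entry.name, modified))
    # Sort by modification time, most recent first.
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results
```

With the mount from the question, the call would presumably be list_files_with_mtime("/dbfs/mnt/schema_id=na"); the first element of the result is then the most recently modified file.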
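If you use operating-system-level commands to get file information, you cannot access that exact location directly: on Databricks, it lives on the Databricks File System (DBFS).
To reach it from the Python layer, you need to prepend /dbfs to the path, so it becomes: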
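...
path = "/dbfs/mnt/schema_id=na"
for file_item in os.listdir(path):
    file_path = os.path.join(path, file_item)
    dbfs_path = file_path[5:]  # "/mnt/schema_id=na/<file name>", without the /dbfs prefix
    ti_c = os.path.getctime(file_path)
...
Note the [5:] - it strips the /dbfs prefix from the path to make it compatible with DBFS. (A [:5] slice would do the opposite and keep only "/dbfs".) The os functions themselves still need the /dbfs-prefixed file_path.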
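Adding or stripping the /dbfs prefix by hand is easy to get wrong: a [:5] slice keeps only the prefix, whereas [5:] drops it. As a minimal sketch, the two conversions can be put into small helpers. The names to_dbfs_path and to_local_path are hypothetical, and the code assumes the local mount of DBFS is exactly /dbfs, as described above.

```python
LOCAL_PREFIX = "/dbfs"  # assumed local mount point of DBFS on the driver


def _has_local_prefix(path):
    # True for "/dbfs" itself or anything under "/dbfs/", but not "/dbfsx".
    return path == LOCAL_PREFIX or path.startswith(LOCAL_PREFIX + "/")


def to_dbfs_path(local_path):
    """Strip the /dbfs prefix: '/dbfs/mnt/x' becomes '/mnt/x'."""
    if _has_local_prefix(local_path):
        return local_path[len(LOCAL_PREFIX):] or "/"
    return local_path  # already in DBFS form


def to_local_path(dbfs_path):
    """Prepend the /dbfs prefix: '/mnt/x' becomes '/dbfs/mnt/x'."""
    if _has_local_prefix(dbfs_path):
        return dbfs_path  # already in local form
    return LOCAL_PREFIX + "/" + dbfs_path.lstrip("/")
```

The local form is the one to pass to os.listdir and os.path.getmtime; the DBFS form is the one shown in the file list in the question.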