Parsing PDF metadata date does not work for all PDFs
I am trying to get the modification date of several PDFs using pdfminer:
import os
import re
from datetime import datetime

from pdfminer3.pdfparser import PDFParser
from pdfminer3.pdfdocument import PDFDocument

# This function converts the date string to a datetime object
def get_pdf_date(pd):
    dtformat = "%Y%m%d%H%M%S"
    clean = pd.decode("utf-8").replace("D:", "").split('+')[0]
    return datetime.strptime(re.sub('[^0-9]', '', clean), dtformat)

# Raw string so the backslashes in the Windows path are not treated as escapes
path = r"C:\Users\asus\Desktop\storage"
for file in os.listdir(path):
    try:
        fp = open(os.path.join(path, file), "rb")
        parser = PDFParser(fp)
        doc = PDFDocument(parser)
        pdf_creation_date = doc.info[0]["CreationDate"]
        print(str(pdf_creation_date) + ", " + str(get_pdf_date(pdf_creation_date)))
    except Exception as e:
        print(str(e) + " => " + str(pdf_creation_date))
This is the output I get:
b"D:20151004081456+01'00'", 2015-10-04 08:14:56
b'D:20161029124239', 2016-10-29 12:42:39
b"D:20160727173724+05'30'", 2016-07-27 17:37:24
b"D:20170526150059+05'30'", 2017-05-26 15:00:59
b'D:20190218122459', 2019-02-18 12:24:59
unconverted data remains: 0600 => b"D:20151017020552-06'00'"
b"D:20180302120823+00'00'", 2018-03-02 12:08:23
b"D:20150317171945+05'30'", 2015-03-17 17:19:45
b"D:20140405150714+01'00'", 2014-04-05 15:07:14
b'D:20190313161243Z', 2019-03-13 16:12:43
b'D:20160523204913', 2016-05-23 20:49:13
b"D:20150716000009+05'30'", 2015-07-16 00:00:09
b"D:20150923145114+05'30'", 2015-09-23 14:51:14
b"D:20150703193510+05'30'", 2015-07-03 19:35:10
b"D:20170907220317+16'33'", 2017-09-07 22:03:17
unconverted data remains: 1200 => b"D:20160407192544-12'00'"
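As you can see, the parsing function I am using does not always work, apparently because each PDF has its own date syntax. However, I noticed that Foxit Reader always reads the metadata correctly.
So I would like to know how to implement something like that.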
The dates that fail have a minus sign in the time zone offset:
D:20160407192544-12'00'
This line in your code only expects a plus sign (or, implicitly, no time zone offset):
clean = pd.decode("utf-8").replace("D:", "").split('+')[0]
Your code needs to handle both positive and negative time zone offsets.
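One possible way to do that is sketched below. It is only an illustration, not the method Foxit Reader or pdfminer uses: it assumes the values follow the PDF date layout D:YYYYMMDDHHmmSSOHH'mm', where everything after the year is optional and O is "+", "-" or "Z". The names parse_pdf_date and _PDF_DATE are hypothetical, and only the standard library is used.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical pattern for the assumed layout D:YYYYMMDDHHmmSSOHH'mm'.
# Every field after the year is optional; the offset sign may be +, - or Z.
_PDF_DATE = re.compile(
    r"(?:D:)?"
    r"(?P<year>\d{4})"
    r"(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?:(?P<sign>[+\-Z])(?:(?P<tz_hour>\d{2})'?(?P<tz_minute>\d{2})?'?)?)?"
)


def parse_pdf_date(raw):
    """Convert a PDF date (bytes or str) to a datetime.

    Returns a timezone-aware datetime when an offset or Z is present,
    and a naive datetime otherwise.
    """
    text = raw.decode("ascii", errors="replace") if isinstance(raw, bytes) else raw
    match = _PDF_DATE.match(text.strip())
    if match is None:
        raise ValueError("not a recognizable PDF date: %r" % (raw,))
    g = match.groupdict()

    # Missing date parts default to 1, missing time parts to 0.
    naive = datetime(
        int(g["year"]),
        int(g["month"] or 1),
        int(g["day"] or 1),
        int(g["hour"] or 0),
        int(g["minute"] or 0),
        int(g["second"] or 0),
    )

    sign = g["sign"]
    if sign is None:
        return naive  # no time zone information in the string
    if sign == "Z":
        return naive.replace(tzinfo=timezone.utc)

    offset = timedelta(hours=int(g["tz_hour"] or 0), minutes=int(g["tz_minute"] or 0))
    if sign == "-":
        offset = -offset
    return naive.replace(tzinfo=timezone(offset))


if __name__ == "__main__":
    # One of the values that failed in the question
    print(parse_pdf_date(b"D:20160407192544-12'00'"))  # 2016-04-07 19:25:44-12:00
```

Unlike splitting on '+', this keeps the offset instead of discarding it, so dates written in different time zones can be compared correctly. If you only need the local timestamp, call .replace(tzinfo=None) on the result.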