Extract source metadata from downloaded file
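I downloaded a bunch of PDF files. Now I want to extract the download URL from each file's metadata. How do I do this programmatically? I would prefer a solution in R, and I am working on macOS Mojave.
If you want to reproduce this you can [use this file].
I tried searching Ask Different for a way to emulate choosing "Get Info" from the Terminal.app command line.
I found a suggestion to use the mdls command, and I got this from an R system() call: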
system("mdls -name kMDItemWhereFroms ~/0.-miljoenennota.pdf")
#kMDItemWhereFroms = (
# "https://www.rijksoverheid.nl/binaries/rijksoverheid/documenten/begrotingen/2016/09/20/miljoenennota-2017/0.-miljoenennota.pdf",
# ""
#)
To get the multi-line result into R (rather than just having it appear at the console), add the intern=TRUE argument to the system() call:
> res <- system("mdls -name kMDItemWhereFroms ~/0.-miljoenennota.pdf", intern=TRUE)
> res
[1] "kMDItemWhereFroms = ("
[2] " \"https://www.rijksoverheid.nl/binaries/rijksoverheid/documenten/begrotingen/2016/09/20/miljoenennota-2017/0.-miljoenennota.pdf\","
[3] " \"\""
[4] ")"
> res[2]
[1] " \"https://www.rijksoverheid.nl/binaries/rijksoverheid/documenten/begrotingen/2016/09/20/miljoenennota-2017/0.-miljoenennota.pdf\","
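The captured lines still carry the indentation, quotes, and trailing comma that mdls prints. Here is a minimal sketch of stripping them with base R regular expressions, assuming the output always has the shape shown above. The helper name parse_where_froms is made up for illustration, and res is retyped by hand so the snippet runs without calling mdls:

```r
# Hypothetical helper: pull the quoted values out of captured mdls output.
parse_where_froms <- function(lines) {
  # Keep the first double-quoted chunk on each line that has one.
  hits <- regmatches(lines, regexpr('"[^"]*"', lines))
  # Strip the surrounding quotes.
  vals <- gsub('^"|"$', "", hits)
  # Drop the empty entries mdls lists alongside the URL.
  vals[nzchar(vals)]
}

# Same content as the `res` vector captured with intern=TRUE above.
res <- c(
  "kMDItemWhereFroms = (",
  "    \"https://www.rijksoverheid.nl/binaries/rijksoverheid/documenten/begrotingen/2016/09/20/miljoenennota-2017/0.-miljoenennota.pdf\",",
  "    \"\"",
  ")"
)

url <- parse_where_froms(res)
print(url)
```

Lines without any quoted text produce no match, so output for a file that lacks the attribute should come back as an empty character vector rather than an error.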
To get all the attributes:
system("mdls ~/0.-miljoenennota.pdf")
#-----------
_kMDItemOwnerUserID = 501
kMDItemAuthors = (
"Tweede Kamer der Staten-Generaal"
)
kMDItemContentCreationDate = 2018-10-08 23:45:35 +0000
kMDItemContentModificationDate = 2018-10-08 23:45:46 +0000
kMDItemContentType = "com.adobe.pdf"
kMDItemContentTypeTree = (
"com.adobe.pdf",
"public.data",
"public.item",
"public.composite-content",
"public.content"
)
kMDItemCreator = "XPP"
kMDItemDateAdded = 2018-10-08 23:45:46 +0000
kMDItemDisplayName = "0.-miljoenennota.pdf"
kMDItemEncodingApplications = (
"Acrobat Distiller Server 8.1.0 (Pentium Linux, Built: 2007-09-07)"
)
kMDItemFSContentChangeDate = 2018-10-08 23:45:46 +0000
kMDItemFSCreationDate = 2018-10-08 23:45:35 +0000
kMDItemFSCreatorCode = ""
kMDItemFSFinderFlags = 0
kMDItemFSHasCustomIcon = (null)
kMDItemFSInvisible = 0
kMDItemFSIsExtensionHidden = 0
kMDItemFSIsStationery = (null)
kMDItemFSLabel = 0
kMDItemFSName = "0.-miljoenennota.pdf"
kMDItemFSNodeCount = (null)
kMDItemFSOwnerGroupID = 20
kMDItemFSOwnerUserID = 501
kMDItemFSSize = 4004668
kMDItemFSTypeCode = ""
kMDItemKind = "Portable Document Format (PDF)"
kMDItemLogicalSize = 4004668
kMDItemNumberOfPages = 196
kMDItemPageHeight = 841.89
kMDItemPageWidth = 595.276
kMDItemPhysicalSize = 4005888
kMDItemSecurityMethod = "None"
kMDItemVersion = "1.6"
kMDItemWhereFroms = (
"https://www.rijksoverheid.nl/binaries/rijksoverheid/documenten/begrotingen/2016/09/20/miljoenennota-2017/0.-miljoenennota.pdf",
""
)
I was also able to get a different definition of "metadata":
install.packages("tabulizer", dependencies=TRUE)
tabulizer::extract_metadata("~/0.-miljoenennota.pdf")
#---------
$pages
[1] 196
$title
NULL
$author
[1] "Tweede Kamer der Staten-Generaal"
$subject
[1] ""
$keywords
[1] ""
$creator
[1] "XPP"
$producer
[1] "Acrobat Distiller Server 8.1.0 (Pentium Linux, Built: 2007-09-07)"
$created
[1] "Thu Sep 15 05:11:50 PDT 2016"
$modified
[1] "Thu Sep 15 05:34:06 PDT 2016"
$trapped
NULL
While you could have avoided this need by downloading the PDFs programmatically with R, we can use the xattrs package to get the data you are looking for:
library(xattrs) # https://gitlab.com/hrbrmstr/xattrs (not on CRAN)
Let's see which extended attributes this file has:
xattrs::list_xattrs("~/Downloads/0.-miljoenennota.pdf")
## [1] "com.apple.metadata:kMDItemWhereFroms"
## [2] "com.apple.quarantine"
com.apple.metadata:kMDItemWhereFroms looks like a good target:
xattrs::get_xattr(
path = "~/Downloads/forso/0.-miljoenennota.pdf",
name = "com.apple.metadata:kMDItemWhereFroms"
) -> from_where
from_where
## [1] "bplist00\xa2\001\002_\020}https://www.rijksoverheid.nl/binaries/rijksoverheid/documenten/begrotingen/2016/09/20/miljoenennota-2017/0.-miljoenennota.pdfP\b\v\x8b"
But it is in binary plist format (yay Apple #sigh). Since that is "a thing", the xattrs package has a read_bplist() function, but we have to use get_xattr_raw() to be able to use it:
xattrs::read_bplist(
xattrs::get_xattr_raw(
path = "~/Downloads/forso/0.-miljoenennota.pdf",
name = "com.apple.metadata:kMDItemWhereFroms"
)
) -> from_where
str(from_where)
## List of 1
## $ plist:List of 1
## ..$ array:List of 2
## .. ..$ string:List of 1
## .. .. ..$ : chr "https://www.rijksoverheid.nl/binaries/rijksoverheid/documenten/begrotingen/2016/09/20/miljoenennota-2017/0.-miljoenennota.pdf"
## .. ..$ string: list()
## ..- attr(*, "version")= chr "1.0"
The ugly nested list is the fault of the truly dumb binary plist file format, but the source URL is in there.
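If only the URL strings matter, the nesting can be flattened with base R's unlist(). The sketch below rebuilds the list by hand from the str() output above, so it runs without the xattrs package. The hand-built from_where object is an approximation of what read_bplist() returns, not its guaranteed structure:

```r
# Stand-in for the read_bplist() result, retyped from the str() output.
from_where <- list(
  plist = structure(
    list(
      array = list(
        string = list("https://www.rijksoverheid.nl/binaries/rijksoverheid/documenten/begrotingen/2016/09/20/miljoenennota-2017/0.-miljoenennota.pdf"),
        string = list()
      )
    ),
    version = "1.0"
  )
)

# Collapse every nested string into a plain character vector;
# empty list() entries simply vanish.
urls <- unlist(from_where$plist$array, use.names = FALSE)
print(urls)
```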
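We can get this for all the files by using lapply (I threw a bunch of random, interactively downloaded PDFs into a directory for this). There is also an example of this in this blog post, but it uses reticulate and a Python package to read the binary plist data rather than the built-in package function (which is a wrapper around the macOS plutil utility or the Linux plistutil utility; Windows users can switch to a real operating system if they want to use that function).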
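# Match only files whose names end in .pdf (the dot must be escaped twice in R)
fils <- list.files("~/Downloads/forso", pattern = "\\.pdf$", full.names = TRUE)
do.call(
  rbind.data.frame,
  lapply(fils, function(.x) {
    # Read the raw extended attribute and decode the binary plist
    xattrs::read_bplist(
      xattrs::get_xattr_raw(
        path = .x,
        name = "com.apple.metadata:kMDItemWhereFroms"
      )
    ) -> tmp
    # Keep the first source URL, or NA when the plist holds none
    from_where <- if (length(tmp$plist$array$string) > 0) {
      tmp$plist$array$string[[1]]
    } else {
      NA_character_
    }
    # One row per file
    data.frame(
      fil = basename(.x),
      url = from_where,
      stringsAsFactors = FALSE
    )
  })
) -> files_with_meta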
str(files_with_meta)
## 'data.frame': 9 obs. of 2 variables:
## $ fil: chr "0.-miljoenennota.pdf" "19180242-D02E-47AC-BDB3-73C22D6E1FDB.pdf" "Codebook.pdf" "Elementary-Lunch-Menu.pdf" ...
## $ url: chr "https://www.rijksoverheid.nl/binaries/rijksoverheid/documenten/begrotingen/2016/09/20/miljoenennota-2017/0.-miljoenennota.pdf" "http://eprint.ncl.ac.uk/file_store/production/230123/19180242-D02E-47AC-BDB3-73C22D6E1FDB.pdf" "http://apps.start.umd.edu/gtd/downloads/dataset/Codebook.pdf" "http://www.msad60.org/wp-content/uploads/2017/01/Elementary-February-Lunch-Menu.pdf" ...
NOTE: IRL you should likely do more bulletproofing in the example lapply.
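One way that bulletproofing might look is sketched below. It assumes a failed read (for example, a file that lacks the attribute) surfaces as an ordinary R error, which has not been verified against xattrs. The wrapper name safe_where_from and its reader argument are hypothetical. The reader is passed in so the logic can be exercised with stand-in functions instead of real files:

```r
# Hypothetical wrapper: `reader` is any function that takes a path and
# returns a parsed plist list, or throws an error.
safe_where_from <- function(path, reader) {
  # Turn any read or parse failure into NULL instead of aborting the loop.
  tmp <- tryCatch(reader(path), error = function(e) NULL)
  # Flatten whatever strings are present; NULL stays NULL.
  urls <- unlist(tmp$plist$array, use.names = FALSE)
  # Ignore missing and empty entries.
  urls <- urls[!is.na(urls) & nzchar(urls)]
  if (length(urls) > 0) urls[[1]] else NA_character_
}

# Stand-in readers used only for this demonstration.
ok_reader <- function(p) {
  list(plist = list(array = list(
    string = list("http://example.com/a.pdf"),
    string = list()
  )))
}
bad_reader <- function(p) stop("attribute not found")

good <- safe_where_from("a.pdf", ok_reader)
missing <- safe_where_from("b.pdf", bad_reader)
print(c(good, missing))
```

With the real package, the reader would presumably be something like function(p) xattrs::read_bplist(xattrs::get_xattr_raw(path = p, name = "com.apple.metadata:kMDItemWhereFroms")), using the same calls as the example above.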