Match file names that are close, but not exact

Question

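I have multiple lists whose entries have similar names except for the extension. I was able to index them with simple bracket notation: L1[1] and L2[2] would be correct matches. However, I have a lot of files to go through, and in some of them the entries do not line up index for index.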

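In my example, one type is missing some files. In my first real-world case I have 122 .json files and 119 .description files, which throws off the indexing approach I was using. In this situation, how can I match the correct list elements to each other? I tried a few different options using string matching and string splitting, but had no luck.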

In case it matters: yes, this metadata was extracted with youtube-dl, but I am the author of the videos.

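The end goal is to have two variables, VTT and DESC, that I can use later in my R script. For example, VTT would be L1[2] and DESC would be L2[index of the closely matching title], where the title is the file name without its extension; here that is L2[3].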

Both lists come from list.files(). Later in the program, however, I do not use full.names = TRUE and work with the bare file names only.

L1 <- c("c:/ytdl//CradleToGraveR/Absolute Beginners Guide to Statistical Programming/01 - Statistical Programming with R - Estimating f (Notation)/Statistical Programming with R - Estimating f (Notation).mp4.en.txt",
        "c:/ytdl//CradleToGraveR/Absolute Beginners Guide to Statistical Programming/02 - Statistical Programming - Expected Value/Statistical Programming - Expected Value.mp4.en.txt",
        "c:/ytdl//CradleToGraveR/Absolute Beginners Guide to Statistical Programming/03 - Linear Regression with R 01/Linear Regression with R 01.mp4.en.txt"
)

L2 <- c("c:/ytdl//CradleToGraveR/Absolute Beginners Guide to Statistical Programming/01 - Statistical Programming with R - Estimating f (Notation)/Statistical Programming with R - Estimating f (Notation).mp4.info.json",
        "c:/ytdl//CradleToGraveR/Absolute Beginners Guide to Statistical Programming/03 - Linear Regression with R 01/Linear Regression with R 01.mp4.info.json",
        "c:/ytdl//CradleToGraveR/Absolute Beginners Guide to Statistical Programming/02 - Statistical Programming - Expected Value/Statistical Programming - Expected Value.mp4.info.json"
)

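Overall, maybe my approach is wrong. My next idea is to put the lists into data.frames and strip the extensions, then keep only the part that follows the directory path, and finally join or merge the two data.frames. I feel like I am making this more complicated than it should be.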

Suggestions?

Answer 1

I think it is best to keep only the part of each string that matches exactly between the two lists and compare on that.

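For the example you shared, it works if we keep only the file name without the full path, strip everything from the first "." onward, and compare the results.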

inds <- match(sub('\\..*', '', basename(L1)), sub('\\..*', '', basename(L2)))
inds
#[1] 1 3 2
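The question mentions 122 .json files against 119 .description files, so some entries will have no counterpart. Here is a minimal sketch of how match() behaves in that case. The file names and the variables a, b, keys_a and keys_b are made up for illustration and are not from the question. Unmatched entries come back as NA, which can be used to filter.

```r
# Hypothetical file names: "beta" has no .description counterpart.
a <- c("alpha.mp4.info.json", "beta.mp4.info.json", "gamma.mp4.info.json")
b <- c("gamma.mp4.description", "alpha.mp4.description")

# Strip everything from the first "." onward to get a comparable key.
keys_a <- sub("\\..*", "", basename(a))
keys_b <- sub("\\..*", "", basename(b))

inds <- match(keys_a, keys_b)   # NA where no key in b matches
has_match <- !is.na(inds)

# Keep only the pairs that exist in both vectors.
pairs <- data.frame(a = a[has_match], b = b[inds[has_match]])

# Files in a with no counterpart in b.
unmatched <- a[!has_match]
```

One caveat with this pattern: a title that itself contains a "." would be cut at that first dot, so a more specific pattern for the known extensions may be safer for such names.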

You can then create a data frame with the two sets of file names in the correct order:

data.frame(L1 = L1, L2 = L2[inds])
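The data.frame-and-merge idea from the question should also work, and the choice is mostly a matter of taste. A rough sketch under assumptions: vtt_files and json_files are short hypothetical stand-ins for L1 and L2, and key_of is an illustrative helper name, not an existing function.

```r
# Hypothetical inputs standing in for L1 and L2; "Extra" has no partner.
vtt_files  <- c("dir1/Intro.mp4.en.txt", "dir2/Regression.mp4.en.txt",
                "dir3/Extra.mp4.en.txt")
json_files <- c("dir2/Regression.mp4.info.json", "dir1/Intro.mp4.info.json")

# Illustrative helper: file name without path, cut at the first ".".
key_of <- function(x) sub("\\..*", "", basename(x))

df1 <- data.frame(key = key_of(vtt_files), vtt = vtt_files,
                  stringsAsFactors = FALSE)
df2 <- data.frame(key = key_of(json_files), json = json_files,
                  stringsAsFactors = FALSE)

# all.x = TRUE keeps rows of df1 that have no partner; their json is NA.
joined <- merge(df1, df2, by = "key", all.x = TRUE)

# Pick one pair, in the spirit of the VTT / DESC variables from the question.
VTT  <- joined$vtt[joined$key == "Regression"]
DESC <- joined$json[joined$key == "Regression"]
```

By default merge() sorts the result by the key column, so the row order will generally differ from the order of the input vectors.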

Tags: string, filenames, r