Parsing regular expressions for sub and gsub in R

Question

I can't make sense of the regular expressions in the following lines of code.

author = "10_1 A Kumar; Ahmed Hemani ; Johnny &Ouml;berg<"

# after some experiment, it looks like this line captures whatever is in
# front of the underscore.
authodid =  sub("_.*","",author)

# this line extracts the number after the underscore, but I don't know 
# how this is achieved
paperno <- sub(".*_(\\w*)\\s.*", "\\1", author)

# this line extracts the string after the numbers
# I also have no idea how this is achieved through the code
coauthor <- gsub("<","",sub("^.*?\\s","", author))

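I read online that the first argument is the pattern, the second is the replacement, and the third is the object to operate on. I also read some posts on SO and learned that \w means a word and \s is a space.

However, some things are still unclear. If \w means a word, does it mean the next word? If not, how should I interpret it? I understand that ^ matches the start of the string, but what about the period after the ^?

More importantly, how should I interpret _.*? What about .*_ and ^.*?\s? How do I read them?

Thanks!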

Answer 1

Well, that's quite a few questions. First things first.

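sub("_.*","",author) looks for _ and everything after it: the . matches any single character and the * repeats it zero or more times. So in your case _.* corresponds to _1 A Kumar; Ahmed Hemani ; Johnny Öberg<. The function sub replaces that match with "" (so, in effect, it deletes it), and you end up with 10.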
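To see which part of the string a pattern matches before it is replaced, the matched text can be pulled out with base R's regmatches() and regexpr(). This is only a small sketch; the variable names matched and authorid are chosen here for illustration, and the input is the author string from the question.

```r
author <- "10_1 A Kumar; Ahmed Hemani ; Johnny &Ouml;berg<"

# regexpr() locates the first match; regmatches() returns the matched text
matched <- regmatches(author, regexpr("_.*", author))
print(matched)    # "_1 A Kumar; Ahmed Hemani ; Johnny &Ouml;berg<"

# sub() replaces exactly that matched text with the empty string
authorid <- sub("_.*", "", author)
print(authorid)   # "10"
```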

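sub(".*_(\\w*)\\s.*", "\\1", author) is trickier (for no good reason). Strictly speaking it does not extract anything: it replaces. The pattern ".*_(\\w*)\\s.*" matches the entire string: .*_ corresponds to 10_; (\w*) is a run of word characters (letters, digits or underscore) and corresponds to 1; and \s.* means a space and everything after it (so, the rest of the string). Because the whole string is matched, the whole string is replaced by whatever you put in the second argument. If you change the code to sub(".*_(\\w*)\\s.*", "222", author), the result is 222. With "\\1" as the replacement, the result is 1, because \1 is a backreference to whatever the first pair of parentheses captured.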
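The behaviour described above can be checked directly. The following is a minimal sketch using only base R; the names pattern, whole_match, paperno and constant are introduced here for illustration.

```r
author <- "10_1 A Kumar; Ahmed Hemani ; Johnny &Ouml;berg<"
pattern <- ".*_(\\w*)\\s.*"

# The pattern spans the entire string, so the matched text equals the input
whole_match <- regmatches(author, regexpr(pattern, author))
print(identical(whole_match, author))   # TRUE

# "\\1" inserts the text captured by the first pair of parentheses
paperno <- sub(pattern, "\\1", author)
print(paperno)                          # "1"

# Any other replacement simply takes the place of the whole string
constant <- sub(pattern, "222", author)
print(constant)                         # "222"
```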

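gsub("<","",sub("^.*?\\s","", author)) contains two function calls. The inner one is sub("^.*?\\s","", author). It looks for everything from the start of the string up to and including the first space; the ? after .* makes the match lazy, so it stops at the first space instead of the last one. So ^.*?\s stands for 10_1 plus the space that follows it, and sub deletes it. You end up with A Kumar; Ahmed Hemani ; Johnny Öberg<. The outer call, gsub, then removes every "<" in the string.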
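The effect of the ? in .*? is easiest to see by comparing it with the greedy version. This is a small sketch in base R; the names lazy, greedy and coauthor are illustrative, and the input is the author string from the question.

```r
author <- "10_1 A Kumar; Ahmed Hemani ; Johnny &Ouml;berg<"

# Lazy: .*? stops at the first whitespace character
lazy <- sub("^.*?\\s", "", author)
print(lazy)      # "A Kumar; Ahmed Hemani ; Johnny &Ouml;berg<"

# Greedy: .* runs on to the last whitespace character
greedy <- sub("^.*\\s", "", author)
print(greedy)    # "&Ouml;berg<"

# gsub() then removes every "<" from the lazy result
coauthor <- gsub("<", "", lazy)
print(coauthor)  # "A Kumar; Ahmed Hemani ; Johnny &Ouml;berg"
```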

Hope this helps.

