R sub with perl - starts search backwards?

Question

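I have a string a as shown below. I need to extract the part of the string between the first // and the first subsequent /. I was using sub with perl = F, but it is about 4 times slower than with perl = T. So I tried perl = T and found that the search starts from the END of the string??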

    a = "https://moo.com/meh/woof//A.ds.serving/hgtht//ghhg/tjtke"
    print(gsub(".*//(.*?)/.*","\\1",a))

    "moo.com"

    print(gsub(".*//(.*?)/.*","\\1",a,perl=T))

    "ghhg"

moo.com is what I need. I am surprised to see this. Is it documented somewhere? How can I rewrite it with perl? I have 20 million rows to process, so speed matters. Thanks!

Edit: not every string starts with http

Answer 1

You can try .*?//(.*?)/.* so that the first .* is lazy as well. The // will then match the first instance of //:

    gsub(".*?//(.*?)/.*","\\1",a,perl=T)
    # [1] "moo.com"
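To see why the lazy quantifier changes the result, here is a minimal sketch that runs both patterns under perl = TRUE. The first input is the string from the question; the second is a made-up string without the http prefix, added only to illustrate the case mentioned in the question's edit. A greedy .* first consumes the whole string and then backtracks, so the // in the pattern ends up on the last //. A lazy .*? grows from the left, so it stops at the first one. sub is used instead of gsub because the pattern matches the whole string once.

```r
# Sample inputs: the first is from the question; the second is a
# hypothetical string without a scheme, added only for illustration.
x <- c("https://moo.com/meh/woof//A.ds.serving/hgtht//ghhg/tjtke",
       "//example.org/path//other/part")

# Greedy leading .* backtracks from the end of the string,
# so the literal // matches the LAST "//".
greedy <- sub(".*//(.*?)/.*", "\\1", x, perl = TRUE)

# Lazy leading .*? expands from the start of the string,
# so the literal // matches the FIRST "//".
lazy <- sub(".*?//(.*?)/.*", "\\1", x, perl = TRUE)

print(greedy)
# [1] "ghhg"  "other"
print(lazy)
# [1] "moo.com"     "example.org"
```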

And ?gsub says:

The standard regular-expression code has been reported to be very slow when applied to extremely long character strings (tens of thousands of characters or more): the code used when perl = TRUE seems much faster and more reliable for such usages.

The standard version of gsub does not substitute correctly repeated word-boundaries (e.g. pattern = "\b"). Use perl = TRUE for such matches.
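Since speed matters for 20 million rows, one option that may be worth timing on a sample of the data is to extract the match directly instead of substituting over the whole string. This is a different technique from the substitution above, shown only as a sketch: it uses regexpr with regmatches and PCRE lookarounds, and whether it is actually faster is an assumption to check with system.time, not a measured result.

```r
a <- "https://moo.com/meh/woof//A.ds.serving/hgtht//ghhg/tjtke"

# (?<=//) is a lookbehind: the match must be preceded by "//".
# [^/]+   takes one or more characters that are not "/".
# (?=/)   is a lookahead: the match must be followed by "/".
# regexpr() reports only the first match in each string.
m <- regexpr("(?<=//)[^/]+(?=/)", a, perl = TRUE)

# regmatches() silently drops elements that have no match, so on a
# full vector the result can be shorter than the input.
host <- regmatches(a, m)
print(host)
# [1] "moo.com"
```

If the result has to stay aligned with the input rows, the sub call keeps one output per input, returning a string unchanged when the pattern does not match.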


regex

r

gsub