Extracting the main URL address

Question

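I have a list of URLs and I want to extract the main URL (the host part) from each one, to see how many times each is used. As you can imagine, the URLs contain many different symbols. I wrote the following code to extract the main URL: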

library(stringr)
library(rebus)

# Step 2: creating a pattern for URL extraction
pat<- "//" %R% capture(one_or_more(char_class(WRD,DOT)))

#step 3: Creating a new variable from URL column of df
#(it should be atomic vector)
URL_var<-df[["URLs"]]  

#step 4: using rebus to extract main URL
URL_extract<-str_match(URL_var,pattern = pat)

#step 5: changing large vector to dataframe and changing column name:
URL_data<-data.frame(URL_extract[,2])
names(URL_data)[names(URL_data) == "URL_extract...2."] <- "Main_URL"

For most cases the result of this code is acceptable. For example, for //www.google.com it returns www.google.com, and for a website like http://image.google.com/steve it returns image.google.com. However, there are many cases where the code can't recognize the pattern and fails to find the URL. For example, for a URL such as http://my-listing.ca/CommercialDrive.html the code returns my, which is definitely not acceptable. As another example, for a website like http://www.real-data.ca/clients/ur/ it only returns www.real. It seems my code has trouble handling the - character.

Do you have any suggestions for improving this code? Or is there a package that could help me extract URLs faster and better?

Thanks

Answer 1

I think you can simply use

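library(stringr)
# URL column as an atomic character vector
URL_var <- df[["URLs"]]
# Take everything after "//" up to the first whitespace, "/" or ":";
# the backslash must be doubled inside an R string literal
URL_data <- data.frame(Main_URL = str_extract(URL_var, "(?<=//)[^\\s/:]+"))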

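Here, stringr::str_extract searches for the first match in the input and returns the matched substring. Unlike stringr::str_match, it cannot return submatches (capture groups), so a lookbehind, (?<=...), is used in the regex pattern: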

(?<=//)[^\s/:]+

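It means:

(?<=//) - matches a position in the string that is immediately preceded by //
[^\s/:]+ - one or more (+) occurrences of any character other than whitespace, / and :. The colon ensures that a port number is not included in the match, / ensures that the match stops before the first /, and \s (whitespace) ensures that the match stops before the first whitespace.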
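
As a quick check, here is a minimal sketch that applies the pattern to the URLs mentioned in the question and then counts how often each main URL occurs. The urls vector is a stand-in for df[["URLs"]], and its last entry (with a port number) is made up for illustration. Counting with table() is just one possible approach.

```r
library(stringr)

# Hypothetical sample input: the first four URLs come from the question,
# the last one is invented to show that the port number is excluded.
urls <- c(
  "//www.google.com",
  "http://image.google.com/steve",
  "http://my-listing.ca/CommercialDrive.html",
  "http://www.real-data.ca/clients/ur/",
  "http://www.real-data.ca:8080/other"
)

# Extract everything after "//" up to the first whitespace, "/" or ":".
main_url <- str_extract(urls, "(?<=//)[^\\s/:]+")

# Count how often each main URL occurs, most frequent first.
url_counts <- sort(table(main_url), decreasing = TRUE)
print(url_counts)
```

Because the character class excludes only whitespace, / and :, hyphenated hosts such as my-listing.ca are kept whole.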

url

r

stringr