R sub: removing line terminators
I'm trying to retrieve information such as depth and width from a piece of code I scraped, but I'm running into problems when I run it.
obtain_url <- html(# Some url)
test <- obtain_url %>% html_node("#specifications") %>% html_text()
edit(test)
Dimensions:\n \n Width (in.):\n 30\n \n \n \n Depth (in.):\n 24.25\n \n \n \n Width:\n 30 inches\n \n \n \n Weight (lbs.):\n 320\n \n \n \n Height (in.):\n 50.5\n
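dn <- sub(".*Width (in.):\n(.*)\n .*", "\\1", test) # My attempt at retrieving width info
My attempt simply spits out the same text. All the information I'm interested in always appears in the same pattern: Info:\n #36 blank spaces# Information\n. Sometimes it is a number, sometimes it is just plain text. If someone could help me retrieve, for example, the numeric values for width and depth, I could apply that to everything else.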
text <- "Dimensions:\n \n Width (in.):\n 30\n \n \n \n Depth (in.):\n 24.25\n \n \n \n Width:\n 30 inches\n \n \n \n Weight (lbs.):\n 320\n \n \n \n Height (in.):\n 50.5\n "
no_spaces <- gsub("\n|\\s", "", text) # strip newlines and all other whitespace
width <- as.numeric(sub(".+Width\\(in\\.\\):(\\d+\\.?\\d*).*", "\\1", no_spaces)) # 30
depth <- as.numeric(sub(".+Depth\\(in\\.\\):(\\d+\\.?\\d*).*", "\\1", no_spaces)) # 24.25
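The regex is a bit fiddly because you have to escape the parentheses, the periods in the abbreviations, the optional decimal point, and so on. But it seems to work. HTH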
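To make the escaping point concrete, here is a minimal base R sketch. It shows that an unescaped "(in.)" is treated as a regex group and so never matches the literal parentheses, which is why sub() returns its input unchanged. It also shows one way to avoid escaping altogether by locating the label with fixed = TRUE. The helper name extract_number is hypothetical and not part of the answer above; it simply takes the first number that follows the label.

```r
text <- "Dimensions:\n \n Width (in.):\n 30\n \n \n \n Depth (in.):\n 24.25\n \n \n \n Width:\n 30 inches\n \n \n \n Weight (lbs.):\n 320\n \n \n \n Height (in.):\n 50.5\n "

# Unescaped, "(in.)" is a capture group, so the literal "(" is never matched.
grepl("Width (in.):", text)                # FALSE
# Escaping the parentheses and the period makes them literal.
grepl("Width \\(in\\.\\):", text)          # TRUE
# fixed = TRUE treats the whole pattern as a plain string.
grepl("Width (in.):", text, fixed = TRUE)  # TRUE

# Hypothetical helper: find the label literally, then grab the first number after it.
extract_number <- function(text, label) {
  pos <- regexpr(label, text, fixed = TRUE)
  if (pos == -1) return(NA_real_)          # label not present
  start <- as.integer(pos) + attr(pos, "match.length")
  rest <- substring(text, start)           # everything after the label
  num <- regmatches(rest, regexpr("[0-9]+\\.?[0-9]*", rest))
  if (length(num) == 0) NA_real_ else as.numeric(num)
}

extract_number(text, "Width (in.):")  # 30
extract_number(text, "Depth (in.):")  # 24.25
```

Because the helper takes whatever number comes next, a label with no value of its own (such as "Dimensions:") would pick up the following entry's number.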
I would try strsplit for this.
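clean <- function(x) {
  s <- strsplit(x, '\n')                 # one element per line
  s2 <- gsub('\\s{2,}', '', s[[1]])      # drop runs of two or more whitespace characters
  indx <- grep(':', s2)                  # label lines contain a colon
  paste(s2[indx], s2[indx+1])            # pair each label with the line after it
}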
clean(x)
[1] "Dimensions: " "Width (in.): 30" "Depth (in.): 24.25"
[4] "Width: 30 inches" "Weight (lbs.): 320" "Height (in.): 50.5"
If you don't need the text, try this:
clean2 <- function(x, measure) {
  s <- strsplit(x, '\n')
  s2 <- gsub('\\s{2,}', '', s[[1]])
  indx <- grep(':', s2)
  res <- s2[indx+1]
  # keep only digits and the decimal point; empty strings become NA
  num <- as.numeric(gsub('[^0-9.]', '', res, perl=TRUE))
  num
}
clean2(x)
[1] NA 30.00 24.25 30.00 320.00 50.50
Or, better in my opinion:
clean3 <- function(x, measure) {
  s <- strsplit(x, '\n')
  s2 <- gsub('\\s{2,}', '', s[[1]])
  indx <- grep(':', s2)
  res <- s2[indx+1]
  num <- as.numeric(gsub('[^0-9.]', '', res, perl=TRUE))
  # one row per label, with its numeric value alongside
  df <- data.frame(Measure=s2[indx], Value=num)
  df
}
# clean3(x)
#          Measure  Value
# 1    Dimensions:     NA
# 2   Width (in.):  30.00
# 3   Depth (in.):  24.25
# 4         Width:  30.00
# 5 Weight (lbs.): 320.00
# 6  Height (in.):  50.50
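Since the question mentions that a value is sometimes a number and sometimes plain text, a variant of the same strsplit idea can keep the raw strings and let you look one up by label. This is only a sketch under the assumption that every label line ends with a colon and its value sits on the very next line; spec_values is a hypothetical name, not a function from the answers above.

```r
# Hypothetical helper: return the values as a character vector named by label.
spec_values <- function(x) {
  lines <- trimws(strsplit(x, "\n", fixed = TRUE)[[1]])  # split and trim each line
  indx <- grep(":$", lines)                              # label lines end with a colon
  values <- lines[indx + 1]                              # value is on the following line
  setNames(values, sub(":$", "", lines[indx]))           # name by label, colon removed
}

x <- "Dimensions:\n \n Width (in.):\n 30\n \n \n \n Depth (in.):\n 24.25\n \n \n \n Width:\n 30 inches\n \n \n \n Weight (lbs.):\n 320\n \n \n \n Height (in.):\n 50.5\n "

specs <- spec_values(x)
specs[["Width"]]                     # "30 inches"
as.numeric(specs[["Width (in.)"]])   # 30
as.numeric(specs[["Depth (in.)"]])   # 24.25
```

Keeping the values as strings leaves the choice of numeric conversion to the caller, so text entries such as "30 inches" are not silently reduced to a number.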