R sub: removing line terminators

Question

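I am trying to retrieve information such as depth and width from a chunk of a page I scraped, but I am running into trouble when I run my code.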

library(rvest)

obtain_url <- read_html(some_url)  # some_url: the page address (not shown); older rvest versions used html()
test <-  obtain_url %>% html_node("#specifications") %>% html_text()
edit(test)


Dimensions:\n                            \n                                    Width (in.):\n                                    30\n                                \n                                \n                                \n                                    Depth (in.):\n                                    24.25\n                                \n                                \n                                \n                                    Width:\n                                    30 inches\n                                \n                                \n                                \n                                    Weight (lbs.):\n                                    320\n                                \n                                \n                                \n                                    Height (in.):\n                                    50.5\n 

dn <- sub(".*Width (in.):\n(.*)\n .*", "\\1", test) # My attempt at retrieving width info

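My attempt simply spits out the same text. All the information I am interested in always appears in the same pattern: Info:\n #36 blank spaces# Information\n. Sometimes it is a number, sometimes it is just plain text. If someone could help me retrieve, for example, the numeric values for width and depth, I could apply the same approach to everything else.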

Answer 1

text <- "Dimensions:\n                            \n                                    Width (in.):\n                                    30\n                                \n                                \n                                \n                                    Depth (in.):\n                                    24.25\n                                \n                                \n                                \n                                    Width:\n                                    30 inches\n                                \n                                \n                                \n                                    Weight (lbs.):\n                                    320\n                                \n                                \n                                \n                                    Height (in.):\n                                    50.5\n "

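no_spaces <- gsub("\n|\\s", "", text)  # strip newlines and all other whitespace

# Backslashes are doubled inside R strings; \\d* keeps every decimal digit
width <- as.numeric(sub(".+Width\\(in\\.\\):(\\d+\\.?\\d*).*", "\\1", no_spaces)) # 30
depth <- as.numeric(sub(".+Depth\\(in\\.\\):(\\d+\\.?\\d*).*", "\\1", no_spaces)) # 24.25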

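The regular expression is a bit fiddly because you have to escape the parentheses, the period in the abbreviation, the optional decimal point, and so on. But it seems to work. HTH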
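The escaping can also be sidestepped by matching the label literally instead of as a regex. The following is only a sketch of that alternative, not the method used above: `extract_value` is a hypothetical helper name, the sample string is a shortened stand-in for the scraped text, and it assumes the value sits on the first non-blank line after its label.

```r
# Shortened stand-in for the scraped string (same label / newline / indent / value layout).
text <- "Dimensions:\n    \n        Width (in.):\n        30\n    \n        Depth (in.):\n        24.25\n    \n        Width:\n        30 inches\n "

# Hypothetical helper: return the first non-blank line after a literal label.
extract_value <- function(x, label) {
  # fixed = TRUE treats the label as plain text, so "(", ")" and "." need no escaping.
  pos <- regexpr(label, x, fixed = TRUE)
  if (pos[1] == -1) return(NA_character_)  # label not present
  # Keep everything after the end of the label.
  rest <- substring(x, pos[1] + attr(pos, "match.length"))
  # Split into lines, trim the indentation, and take the first non-empty line.
  lines <- trimws(strsplit(rest, "\n", fixed = TRUE)[[1]])
  lines[nzchar(lines)][1]
}

extract_value(text, "Width (in.):")              # "30"
as.numeric(extract_value(text, "Depth (in.):"))  # 24.25
extract_value(text, "Width:")                    # "30 inches"
```

Because the value comes back as a character string, it works for both numeric and plain-text entries; wrap it in as.numeric() only where a number is expected.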

Answer 2

I would try strsplit for this.

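# x: the scraped string from the question
clean <- function(x) {
  s <- strsplit(x, '\n')                # one element per line
  s2 <- gsub('\\s{2,}', '', s[[1]])     # drop the runs of indentation
  indx <- grep(':', s2)                 # label lines contain a colon
  paste(s2[indx], s2[indx+1])           # label followed by the next line (its value)
}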

clean(x)
[1] "Dimensions: "       "Width (in.): 30"    "Depth (in.): 24.25"
[4] "Width: 30 inches"   "Weight (lbs.): 320" "Height (in.): 50.5"

If you don't need the text, try this:

clean2 <- function(x) {
  s <- strsplit(x, '\n')
  s2 <- gsub('\\s{2,}', '', s[[1]])
  indx <- grep(':', s2)
  res <- s2[indx+1]
  # keep only digits and the decimal point, then convert
  num <- as.numeric(gsub('[^0-9.]', '', res, perl=TRUE))
  num
}

clean2(x)
[1]     NA  30.00  24.25  30.00 320.00  50.50

Or, better in my opinion:

clean3 <- function(x) {
  s <- strsplit(x, '\n')
  s2 <- gsub('\\s{2,}', '', s[[1]])
  indx <- grep(':', s2)
  res <- s2[indx+1]
  num <- as.numeric(gsub('[^0-9.]', '', res, perl=TRUE))
  # pair each label with its numeric value
  df <- data.frame(Measure=s2[indx], Value=num)
  df
}

# clean3(x)
#          Measure  Value
# 1    Dimensions:     NA
# 2   Width (in.):  30.00
# 3   Depth (in.):  24.25
# 4         Width:  30.00
# 5 Weight (lbs.): 320.00
# 6  Height (in.):  50.50
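To pull out a single measurement such as width or depth by name, the same split-on-newline idea can feed a named vector. This is a rough sketch under assumptions: `as_named_values` is a hypothetical helper, the sample string is a shortened stand-in for the scraped text, and labels are assumed to be the lines ending in a colon.

```r
# Shortened stand-in for the scraped string.
x <- "Dimensions:\n    \n        Width (in.):\n        30\n    \n        Depth (in.):\n        24.25\n    \n        Width:\n        30 inches\n "

# Hypothetical helper: named numeric vector of label -> value.
as_named_values <- function(x) {
  s <- trimws(strsplit(x, "\n", fixed = TRUE)[[1]])
  s <- s[nzchar(s)]                        # drop blank lines
  is_label <- grepl(":$", s)               # assumption: labels end with a colon
  idx <- which(is_label)
  nxt <- s[idx + 1]                        # line after each label (NA past the end)
  nxt[is_label[idx + 1] %in% TRUE] <- NA   # a label followed by another label has no value
  values <- as.numeric(gsub("[^0-9.]", "", nxt))
  names(values) <- sub(":$", "", s[idx])   # use the label, minus the colon, as the name
  values
}

vals <- as_named_values(x)
vals[["Width (in.)"]]  # 30
vals[["Depth (in.)"]]  # 24.25
```

Section headers such as "Dimensions" come back as NA, and any units are stripped from the value, so "30 inches" becomes 30.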


substring

r

web-scraping