How do I clean and organize this list of scraped data?

Question

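My problem is that, with a lot of help from this community, I have managed to scrape most of the data I want; however, I have not managed to organize it in any meaningful way. The links I use in source are a sample of the many links I have for this project and are representative of all of them.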

library(rvest)
library(tidyverse)

#source links
source<-c("http://www.ufcstats.com/fighter-details/f2688492b9a525a3","http://www.ufcstats.com/fighter-details/f1fac969a1d70b08")

fp_e<-map(source, function(career_data){
  read_html(career_data)%>%
    html_nodes("div ul li")%>%
    html_text()%>%
    #cleans up the data a bit
    str_replace_all(., "\\n\\s+\\n\\s+", "") %>%
    as.data.frame(.)
})

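What I want to do with this list is turn it into a usable data frame. My first idea was to transpose() it after as.data.frame(); however, all that did was put everything in a single row. I also cannot index the data frame, which leads me to believe it is not set up the way I think it is. I would like to be more specific here, but honestly I am quite confused at this point.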

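Searching around, I found this, and neilfws's answer gave me the idea of building a data frame and inserting the data into it; however, I don't even know where to begin. I am also not sure whether that is necessary when the data is already laid out in a format I like.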

This is the first real R application I have attempted, and I am really lost on how to organize this data. Thank you for your help and advice!

Answer 1

You can do some data cleaning with the tidyverse packages:

library(tidyverse)
library(rvest)

map(source, function(career_data){
  read_html(career_data) %>%
    html_nodes("div ul li")%>%
    html_text() %>%
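    # strip leading/trailing whitespace and newlines (backslashes must be doubled in R strings)
    trimws(whitespace = '[\\s\\n]') %>%
    tibble(data = .) %>%
    # split "Property: Value" on the colon; items without a colon get NA in Value
    separate(data, c('Property', 'Value'), sep = ':') %>%
    na.omit() %>%
    mutate(Value = trimws(Value, whitespace = '[\\n\\s]'))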
})

This returns:

#[[1]]
# A tibble: 13 x 2
#   Property  Value         
#   <chr>     <chr>         
# 1 Height    "5' 6\""      
# 2 Weight    "135 lbs."    
# 3 Reach     "68\""        
# 4 STANCE    "Orthodox"    
# 5 DOB       "Oct 16, 1981"
# 6 SLpM      "3.70"        
# 7 Str. Acc. "39%"         
# 8 SApM      "2.70"        
# 9 Str. Def  "66%"         
#10 TD Avg.   "2.28"        
#11 TD Acc.   "31%"         
#12 TD Def.   "65%"         
#13 Sub. Avg. "0.3"         

#[[2]]
# A tibble: 13 x 2
#   Property  Value         
#   <chr>     <chr>         
# 1 Height    "6' 2\""      
# 2 Weight    "170 lbs."    
# 3 Reach     "74\""        
# 4 STANCE    "Southpaw"    
# 5 DOB       "Aug 25, 1991"
# 6 SLpM      "2.53"        
# 7 Str. Acc. "47%"         
# 8 SApM      "2.05"        
# 9 Str. Def  "55%"         
#10 TD Avg.   "1.39"        
#11 TD Acc.   "31%"         
#12 TD Def.   "70%"         
#13 Sub. Avg. "0.4"
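
If a single data frame is more convenient than a list of tibbles, one possible next step is to stack the list and reshape it so that each fighter becomes one row. The sketch below is only an illustration: `fighters`, `long`, and `wide` are hypothetical names, the input is hard-coded from a few of the values shown above so that it runs without network access, and the `fighter` column is simply the position in the list rather than anything scraped from the page.

```r
library(tidyverse)

# Stand-in for the list returned by map() above (assumption: same
# Property/Value layout, using only a subset of the rows shown in the output).
fighters <- list(
  tibble(Property = c("Height", "Weight", "SLpM", "Str. Acc."),
         Value    = c("5' 6\"", "135 lbs.", "3.70", "39%")),
  tibble(Property = c("Height", "Weight", "SLpM", "Str. Acc."),
         Value    = c("6' 2\"", "170 lbs.", "2.53", "47%"))
)

# Stack the tibbles; .id records which list element each row came from
long <- bind_rows(fighters, .id = "fighter")

# Reshape to one row per fighter and one column per property
wide <- pivot_wider(long, names_from = Property, values_from = Value)

# Convert some of the text columns to numbers;
# parse_number() ignores the trailing "lbs." and "%"
wide <- wide %>%
  mutate(
    Weight      = parse_number(Weight),
    SLpM        = parse_number(SLpM),
    `Str. Acc.` = parse_number(`Str. Acc.`) / 100
  )

wide
```

With the real data, `fighters` would be the result of the `map()` call, and the remaining columns (for example Height, Reach, and DOB) could be converted in the same `mutate()` step as needed.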

r

web-scraping

data-structures

rvest