In R, scrape messy data and organize it into a data frame
We are trying to collect general information about college basketball coaches. Here are two example pages I am trying to scrape:
- https://gobonnies.sbu.edu/sports/m-baskbl/coaches/index
- https://uofoathletics.com/sports/wbkb/coaches/index
Our ideal output is:
data.frame(
name = c('Mark Schmidt', 'Sean Neal', 'Matt Pappano', 'Steve Curran', 'Tray Woodall', NA, 'Dominique Broadus'),
title = c("Head Men's Basketball Coach", "Assistant Men's Basketball Coach", "Director Of Basketball Operations", "Associate Head Coach, Men's Basketball", "Assistant Men's Basketball Coach", "Head Women's Basketball Coach", "Assistant Women's Basketball Coach"),
email = c(NA, 'sneal@sbu.edu', 'mpappano@sbu.edu', 'scurran@sbu.edu', 'twoodall@sbu.edu', NA, 'dbroadus@ozarks.edu'),
phone = c('716-375-2207', '716-375-2257', '716-375-2218', '716-375-2258', '716-375-2259', '479-979-1325', '479-979-1325'),
stringsAsFactors = FALSE
)
               name                                  title               email        phone
1      Mark Schmidt            Head Men's Basketball Coach                <NA> 716-375-2207
2         Sean Neal       Assistant Men's Basketball Coach       sneal@sbu.edu 716-375-2257
3      Matt Pappano      Director Of Basketball Operations    mpappano@sbu.edu 716-375-2218
4      Steve Curran Associate Head Coach, Men's Basketball     scurran@sbu.edu 716-375-2258
5      Tray Woodall       Assistant Men's Basketball Coach    twoodall@sbu.edu 716-375-2259
6              <NA>          Head Women's Basketball Coach                <NA> 479-979-1325
7 Dominique Broadus     Assistant Women's Basketball Coach dbroadus@ozarks.edu 479-979-1325
This is giving us trouble for a few reasons:
- On both pages, the data is not stored in a table but in a separate div for each person.
- Some data is missing: 2 email addresses and one name.
Here is what we have so far:
library(rvest) # also provides the %>% pipe
# go to pages, grab person bios
page1 <- 'https://gobonnies.sbu.edu/sports/m-baskbl/coaches/index' %>% read_html()
page1_bios <- page1 %>% html_nodes('div.coach-bios .coach-bio .info')
page2 <- 'https://uofoathletics.com/sports/wbkb/coaches/index' %>% read_html()
page2_bios <- page2 %>% html_nodes('div.coach-bios .coach-bio .info')
# turn bios into 1-column dataframes (not really what we need)
page1_list <- lapply(page1_bios, function(x) paste(x %>% html_children() %>% html_text(), collapse = " "))
page1_bios_df <- unlist(page1_list) %>% as.data.frame()
page2_list <- lapply(page2_bios, function(x) paste(x %>% html_children() %>% html_text(), collapse = " "))
page2_bios_df <- unlist(page2_list) %>% as.data.frame()
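We are not very close, and in fact we are not even sure this is possible. I think we need to get the data into a data frame first, even if the column names are wrong, and then inspect the contents of each column (for example, look for an @ sign for emails, digits for phone numbers, the word "coach" for titles, and so on) to try to name the columns correctly.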
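The content-based labelling idea described above could be sketched in base R roughly as follows. The function name `guess_field`, the sample strings (taken from the desired output), and the patterns are assumptions for illustration only; in particular the title pattern is a loose heuristic that would need extending for other job titles.

```r
# Hypothetical helper: guess which field a scraped string belongs to,
# checking the most distinctive patterns first.
guess_field <- function(x) {
  ifelse(grepl("@", x, fixed = TRUE), "email",
    ifelse(grepl("[0-9]{3}-[0-9]{3}-[0-9]{4}", x), "phone",
      ifelse(grepl("coach|director", x, ignore.case = TRUE), "title",
        "name")))
}

# Sample pieces of one bio (values from the desired output above)
pieces <- c("Sean Neal",
            "Assistant Men's Basketball Coach",
            "sneal@sbu.edu",
            "Phone: 716-375-2257")

fields <- guess_field(pieces)

# Name each piece by its guessed field
labelled <- setNames(pieces, fields)
labelled
```

Anything that matches none of the patterns falls through to "name", so a bio with a missing name would simply produce no "name" entry.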
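One option to achieve your desired result could look like the following. Basically, my approach is to extract the desired pieces of information one at a time using specific CSS selectors: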
library(rvest)
library(magrittr)
# go to pages, grab person bios
page1 <- 'https://gobonnies.sbu.edu/sports/m-baskbl/coaches/index' %>% read_html()
page1_bios <- page1 %>% html_nodes('div.coach-bios .coach-bio .info')
page2 <- 'https://uofoathletics.com/sports/wbkb/coaches/index' %>% read_html()
page2_bios <- page2 %>% html_nodes('div.coach-bios .coach-bio .info')
get_bios <- function(x) {
  # x is a single bio node; an element that is not found yields NA
  data.frame(
    name = x %>% html_node("span.name") %>% html_text(),
    title = x %>% html_node("p:nth-of-type(2)") %>% html_text(),
    email = x %>% html_node("p.email a") %>% html_attr("href"),
    phone = x %>% html_node("p:last-of-type") %>% html_text()
  )
}
# build a one-row data frame per bio, then stack them
page1_list <- lapply(page1_bios, get_bios)
page2_list <- lapply(page2_bios, get_bios)
bios_df <- do.call("rbind", c(page1_list, page2_list))
# strip the "mailto:" and "Phone: " prefixes (backslashes are doubled in R strings)
bios_df$email <- gsub("^mailto:(.*)$", "\\1", bios_df$email)
bios_df$phone <- gsub("^Phone:\\s(.*)$", "\\1", bios_df$phone)
bios_df
#>                name                                  title               email
#> 1      Mark Schmidt            Head Men's Basketball Coach                <NA>
#> 2      Steve Curran Associate Head Coach, Men's Basketball     scurran@sbu.edu
#> 3         Sean Neal       Assistant Men's Basketball Coach       sneal@sbu.edu
#> 4      Tray Woodall       Assistant Men's Basketball Coach    twoodall@sbu.edu
#> 5      Matt Pappano      Director Of Basketball Operations    mpappano@sbu.edu
#> 6                            Head Women's Basketball Coach                <NA>
#> 7 Dominique Broadus     Assistant Women's Basketball Coach dbroadus@ozarks.edu
#>          phone
#> 1 716-375-2207
#> 2 716-375-2258
#> 3 716-375-2257
#> 4 716-375-2259
#> 5 716-375-2218
#> 6 479-979-1325
#> 7 479-979-1325
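To see why extracting fields per bio tolerates missing data, here is a small self-contained sketch that runs on an inline HTML snippet rather than the live pages. The markup is made up for illustration: it only mimics the class names used in the selectors above, and the names, emails and phone numbers are placeholders, so treat it as an assumption about the page structure rather than a copy of it.

```r
library(rvest)

# Hypothetical markup: two bios, the second one has no email paragraph
html <- '<div class="coach-bios">
  <div class="coach-bio"><div class="info">
    <span class="name">Coach A</span>
    <p>Head Coach</p>
    <p class="email"><a href="mailto:a@example.edu">a@example.edu</a></p>
    <p>Phone: 555-0100</p>
  </div></div>
  <div class="coach-bio"><div class="info">
    <span class="name">Coach B</span>
    <p>Assistant Coach</p>
    <p>Phone: 555-0101</p>
  </div></div>
</div>'

bios <- read_html(html) %>% html_nodes("div.coach-bios .coach-bio .info")

# html_node() gives exactly one result per bio, and NA where nothing matches,
# so the vectors stay aligned even when a field is absent
name  <- bios %>% html_node("span.name") %>% html_text()
email <- bios %>% html_node("p.email a") %>% html_attr("href")
phone <- bios %>% html_node("p:last-of-type") %>% html_text()

demo_df <- data.frame(
  name  = name,
  email = sub("^mailto:", "", email),   # NA stays NA
  phone = sub("^Phone:\\s*", "", phone)
)
demo_df
```

By contrast, calling `html_nodes("p.email a")` on the whole page would return only the emails that exist, and the resulting vector could no longer be matched to the bios by position.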
I could only open the first URL, so my solution is as follows:
library(rvest)
page1 <- 'https://gobonnies.sbu.edu/sports/m-baskbl/coaches/index' %>% read_html()
name <- page1 %>%
  html_nodes(css = "div.coach-bios-wrapper.clearfix span.name") %>%
  html_text()
title <- page1 %>%
  html_nodes(css = "div.coach-bios-wrapper.clearfix > div > div > div > div > div > p:nth-child(2)") %>%
  html_text()
# email: pull the line containing "@" out of each bio's full text
email <- page1 %>%
  html_nodes(css = "div.coach-bios-wrapper.clearfix > div > div > div > div > div") %>%
  html_text() %>%
  gsub(".*\n(.*@.*)\nPhone.*", "\\1", .)
# bios without an "@" are left unchanged by gsub(), so mark them as missing
email[grep("@", email, invert = TRUE)] <- NA
# phone: keep whatever follows "Phone: " on its line
phone <- page1 %>%
  html_nodes(css = "div.coach-bios-wrapper.clearfix > div > div > div > div > div") %>%
  html_text() %>%
  gsub(".*\nPhone: (.*)\n.*", "\\1", .)
df <- data.frame(name, title, email, phone)
# df$email[which(!grepl("@",df$email))] <- NA
df
#>           name                                  title            email
#> 1 Mark Schmidt            Head Men's Basketball Coach             <NA>
#> 2 Steve Curran Associate Head Coach, Men's Basketball  scurran@sbu.edu
#> 3    Sean Neal       Assistant Men's Basketball Coach    sneal@sbu.edu
#> 4 Tray Woodall       Assistant Men's Basketball Coach twoodall@sbu.edu
#> 5 Matt Pappano      Director Of Basketball Operations mpappano@sbu.edu
#>          phone
#> 1 716-375-2207
#> 2 716-375-2258
#> 3 716-375-2257
#> 4 716-375-2259
#> 5 716-375-2218
Created on 2021-07-17 by the reprex package (v2.0.0)
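A possible variation on the text-based approach in this answer is to extract the first match of a pattern and return NA when there is none, which avoids the separate clean-up step for bios without an email. This is only a sketch in base R: the helper name `extract_first`, the layout of the sample text, and the two patterns are assumptions for illustration, not taken from the pages.

```r
# Hypothetical helper: first match of `pattern` in each string, NA if none
extract_first <- function(text, pattern) {
  m <- regexpr(pattern, text, perl = TRUE)
  out <- rep(NA_character_, length(text))
  out[m != -1] <- regmatches(text, m)  # regmatches() returns matches only
  out
}

# Assumed shape of the flattened bio text (values from the question)
blobs <- c(
  "Mark Schmidt\nHead Men's Basketball Coach\nPhone: 716-375-2207",
  "Sean Neal\nAssistant Men's Basketball Coach\nsneal@sbu.edu\nPhone: 716-375-2257"
)

email <- extract_first(blobs, "[[:alnum:]._-]+@[[:alnum:].-]+")
phone <- extract_first(blobs, "[0-9]{3}-[0-9]{3}-[0-9]{4}")

data.frame(email, phone)
```

Because each field is matched on its own pattern, the result does not depend on the order of the lines within a bio.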