How do I use rvest to sort text into different columns?
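I am using rvest to (try to) scrape author affiliation data from a database of academic publications called RePEc. I have the authors' short IDs, which I use to build the URLs I scrape. However, every time I try, I get a 404 error: Error in open.connection(x, "rb") : HTTP error 404
It must be a problem with how I am using sapply, because it works fine when I test it with an individual ID. Here is the code I am using: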
df$author_reg <- c("paa6","paa2","paa1", "paa8", "pve266", "pya500")
df$websites <- paste0("https://ideas.repec.org/e/", df$author_reg, ".html")
df$affiliation <- sapply(df$websites, function(x) try(x %>% read_html %>% html_nodes("#affiliation h3") %>% html_text()))
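I actually need to do this for six columns of authors, and I would like to skip the NA values, so if anyone also knows how to do that, I would be very grateful (but it is no big deal if not). Thanks in advance for your help!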
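Edit: I just found out that the error is in the URL pattern. Sometimes it should be df$websites <- paste0("https://ideas.repec.org/e/", df$author_reg, ".html")
and sometimes it should be df$websites <- paste0("https://ideas.repec.org/f/", df$author_reg, ".html")
Does anyone know how to get R to try both and give me the one that works?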
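You can build both links and wrap the request for each one in try. I am assuming only one of them leads to a valid page. Otherwise the code can always be edited to accept every link that works: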
library(rvest)
library(purrr)
df = data.frame(id=1:6)
df$author_reg <- c("paa6","paa2","paa1", "paa8", "pve266", "pya500")
http1 <- "https://ideas.repec.org/e/"
http2 <- "https://ideas.repec.org/f/"
df$affiliation <- sapply(df$author_reg, function(x){
  links = c(paste0(http1, x, ".html"), paste0(http2, x, ".html"))
  # here we try both links and store the results under attempts
  attempts = links %>% map(function(i){
    try(read_html(i) %>% html_nodes("#affiliation h3") %>% html_text())
  })
  # the good ones will have "character" class, the failed ones "try-error"
  gdlink = which(sapply(attempts, class) != "try-error")
  if(length(gdlink) > 0){
    return(attempts[[gdlink[1]]])
  }
  else{
    return("True 404 error")
  }
})
Looking at the result:
df
id author_reg
1 1 paa6
2 2 paa2
3 3 paa1
4 4 paa8
5 5 pve266
6 6 pya500
affiliation
1 Statistisk SentralbyråGovernment of Norway
2 Department of EconomicsCollege of BusinessUniversity of Wyoming
3 (80%) Institutt for ØkonomiUniversitetet i Bergen, (20%) Gruppe for trygdeøkonomiInstitutt for ØkonomiUniversitetet i Bergen
4 Centraal Planbureau (CPB)Government of the Netherlands
5 Department of FinanceRotterdam School of Management (RSM Erasmus University)Erasmus Universiteit Rotterdam
6 Business SchoolSwinburne University of Technology
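For the other part of the question (several author columns, skipping NA), here is a minimal sketch of one way it could be done, not a tested solution against the live site. The helper names first_success, lookup_affiliation and fetch_affiliation and the column names author1 and author2 are made up for illustration. The fetcher is passed in as an argument, so the demo below runs against a stand-in function (fake_fetch, with invented return values) instead of hitting RePEc. Multiple affiliations on one page are collapsed into a single string, which is an assumption about the shape you want.

```r
# Hypothetical helper: return the result for the first URL that `fetch`
# can read without an error, or NA if every URL fails.
first_success <- function(urls, fetch) {
  for (u in urls) {
    res <- tryCatch(fetch(u), error = function(e) NULL)
    if (!is.null(res)) return(res)
  }
  NA_character_
}

# Hypothetical helper: look up one author ID under both the /e/ and /f/
# paths from the question. Missing IDs are skipped and come back as NA.
lookup_affiliation <- function(id, fetch) {
  if (is.na(id)) return(NA_character_)
  urls <- paste0("https://ideas.repec.org/", c("e", "f"), "/", id, ".html")
  res <- first_success(urls, fetch)
  if (length(res) == 0 || all(is.na(res))) return(NA_character_)
  # A page can list several affiliations; collapse them into one string
  paste(res, collapse = "; ")
}

# The real fetcher, using the same selector as above (needs rvest installed).
# It is only defined here, not called, so the demo stays offline.
fetch_affiliation <- function(url) {
  page <- rvest::read_html(url)
  rvest::html_text(rvest::html_nodes(page, "#affiliation h3"))
}

# Stand-in fetcher with invented data, so the logic can be checked offline
fake_fetch <- function(url) {
  if (grepl("/e/paa6.html", url, fixed = TRUE)) return("Stats Office")
  if (grepl("/f/paa2.html", url, fixed = TRUE)) return(c("Dept A", "Univ B"))
  stop("HTTP error 404")
}

df <- data.frame(author1 = c("paa6", NA),
                 author2 = c("paa2", "zzz"),
                 stringsAsFactors = FALSE)

# Loop over the (hypothetical) author columns, adding one result column each
author_cols <- c("author1", "author2")
for (col in author_cols) {
  df[[paste0(col, "_affil")]] <- vapply(
    df[[col]], lookup_affiliation, character(1),
    fetch = fake_fetch, USE.NAMES = FALSE
  )
}
```

To run it for real, pass fetch = fetch_affiliation instead of fake_fetch. Because every lookup returns exactly one string (or NA), each new column is a plain character vector rather than a list column.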