rvest html_nodes() returns empty character
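I am trying to scrape the website (https://genelab-data.ndc.nasa.gov/genelab/projects?page=1&paginate_by=281). Specifically, I am trying to scrape all 281 "release dates" (the first being "30-Oct-2006").
To do this, I used the R package rvest and the SelectorGadget Chrome extension. I am using macOS version 10.15.6.
I tried the following code: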
library(rvest)
library(httr)
library(xml2)
library(dplyr)
link = "https://genelab-data.ndc.nasa.gov/genelab/projects?page=1&paginate_by=281"
page = read_html(link)
year = page %>% html_nodes("td:nth-child(4) ul") %>% html_text()
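However, this returns 'character(0)'.
I used the selector td:nth-child(4) ul because that is what SelectorGadget highlighted for each of the 281 release dates. I also tried "View page source", but could not find these years in the page source.
I understand that rvest does not always work, depending on the type of website. In this case, what would be a possible workaround? Thank you.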
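This site loads its data from an API call to https://genelab-data.ndc.nasa.gov/genelab/data/study/all, which returns JSON. The table is presumably filled in by JavaScript after the page loads, so the HTML that read_html() downloads does not contain the dates and the selector matches nothing. You can use httr to fetch the data and parse the JSON: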
library(httr)
url <- "https://genelab-data.ndc.nasa.gov/genelab/data/study/all"
output <- content(GET(url), as = "parsed", type = "application/json")
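# Sort the studies by glds_id (ascending)
output <- output[order(sapply(output, `[[`, i = "glds_id"))]
# Build one row per study, walking the sorted list in reverse (highest glds_id first)
result <- list()
index <- 1
for (study in rev(output)) {
  result[[index]] <- study$metadata
  result[[index]]$accession <- study$accession
  result[[index]]$legacy_accession <- study$legacy_accession
  index <- index + 1
}
# Bind the per-study lists row-wise into a single matrix
df <- do.call(rbind, result)
# Widen the console so long rows are not wrapped when printed
options(width = 1200)
print(df)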
Sample output (not all columns shown):
accession legacy_accession public_release_date title
[1,] "GLDS329" "GLDS-329" "30-Oct-2006" "Transcription profiling of atm mutant, adm mutant and wild type whole plants and roots of Arabidops" [truncated]
[2,] "GLDS322" "GLDS-322" "27-Aug-2020" "Comparative RNA-Seq transcriptome analyses reveal dynamic time dependent effects of 56Fe, 16O, and " [truncated]
[3,] "GLDS320" "GLDS-320" "18-Sep-2014" "Gamma radiation and HZE treatment of seedlings in Arabidopsis"
[4,] "GLDS319" "GLDS-319" "18-Jul-2018" "Muscle atrophy, osteoporosis prevention in hibernating mammals"
[5,] "GLDS318" "GLDS-318" "01-Dec-2019" "RNA seq of tumors derived from irradiated versus sham hosts transplanted with Trp53 null mammary ti" [truncated]
[6,] "GLDS317" "GLDS-317" "19-Dec-2017" "Galactic cosmic radiation induces stable epigenome alterations relevant to human lung cancer"
[7,] "GLDS311" "GLDS-311" "31-Jul-2020" "Part two: ISS Enterobacteriales"
[8,] "GLDS309" "GLDS-309" "12-Aug-2020" "Comparative Genomic Analysis of Klebsiella Exposed to Various Space Conditions at the International" [truncated]
[9,] "GLDS308" "GLDS-308" "07-Aug-2020" "Differential expression profiles of long non-coding RNAs during the mouse pronucleus stage under no" [truncated]
[10,] "GLDS305" "GLDS-305" "27-Aug-2020" "Transcriptomic responses of Serratia liquefaciens cells grown under simulated Martian conditions of" [truncated]
[11,] "GLDS304" "GLDS-304" "28-Aug-2020" "Global gene expression in response to X rays in mice deficient in Parp1"
[12,] "GLDS303" "GLDS-303" "15-Jun-2020" "ISS Bacillus Genomes"
[13,] "GLDS302" "GLDS-302" "31-May-2020" "ISS Enterobacteriales Genomes"
[14,] "GLDS301" "GLDS-301" "30-Apr-2020" "Eruca sativa Rocket Science RNA-seq"
[15,] "GLDS298" "GLDS-298" "09-May-2020" "Draft Genome Sequences of Sphingomonas sp. Isolated from the International Space Station Genome seq" [truncated]
...........................................................................
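Since the question asked for the release dates (and years) specifically, here is a minimal sketch of turning the "dd-Mon-yyyy" strings into Date values. It is not part of the original answer. The three sample strings are copied from the output above; the suggestion that the real values could be pulled with unlist(df[, "public_release_date"]) is an assumption based on the column header shown there and on every study having that field.

```r
# Sample values copied from the output above. With the real data these would
# (by assumption) come from: unlist(df[, "public_release_date"])
release_dates <- c("30-Oct-2006", "27-Aug-2020", "18-Sep-2014")

# Split each "dd-Mon-yyyy" string into day, month abbreviation, and year
parts <- do.call(rbind, strsplit(release_dates, "-", fixed = TRUE))

# month.abb is R's built-in vector of English month abbreviations, so the
# lookup does not depend on the system locale (unlike as.Date() with "%b")
month_num <- match(parts[, 2], month.abb)

# Reassemble as ISO "yyyy-mm-dd" and convert to Date
dates <- as.Date(sprintf("%s-%02d-%s", parts[, 3], month_num, parts[, 1]))

# Extract the year as an integer
years <- as.integer(format(dates, "%Y"))

print(dates)
print(years)
```

If the machine runs in an English locale, as.Date(release_dates, format = "%d-%b-%Y") should give the same dates in one call; the month.abb lookup is used here only to avoid depending on the locale.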