html_nodes returns an empty list
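I am scraping the number of newspaper articles that contain certain words. For example, the word "Republican" in CA in 1929, from this website:
url <- "https://www.newspapers.com/search/#query=republican&dr_year=1929-1929&p_place=CA"
I want to copy the number of hits (23490 in this example), and I am using this code:
hits <- url %>%
  read_html() %>%
  html_nodes('.total-hits') %>%
  html_text()
But html_text() returns an empty list. I would appreciate any help. Thanks!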
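Here is one way. Looking at the page source, it seems you want to target td. Then do some string manipulation and build the output. I have left the first 10 rows below.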
library(rvest)
library(dplyr)

read_html("https://go.newspapers.com/results.php?query=republican&dr_year=1929-1929&p_place=CA") %>%
  html_nodes("td") %>%                        # every table cell: state, count, state, count, ...
  html_text() %>%
  gsub(pattern = "\n", replacement = "") %>%  # drop embedded newlines
  matrix(ncol = 2, byrow = TRUE) %>%          # two cells per row
  as.data.frame() %>%
  rename(state = V1, count = V2)
          state  count
1    California 23,490
2  Pennsylvania 51,697
3      New York 35,428
4       Indiana 23,199
5    New Jersey 22,787
6      Missouri 20,650
7          Ohio 15,270
8      Illinois 14,920
9          Iowa 14,676
10    Wisconsin 13,821
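Another way is below. Here I specified more precisely where to get the text from. There are two targets, so I used map_dfc(). That way, I create a data frame directly. Then I did similar cleanup. This time, I converted the character values to numbers.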
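library(rvest)
library(purrr)
library(dplyr)

# Read the page once and reuse it for both selectors.
page <- read_html("https://go.newspapers.com/results.php?query=republican&dr_year=1929-1929&p_place=CA")

# One column per selector: td.tn becomes state, td.tar becomes count.
map_dfc(.x = c("td.tn", "td.tar"),
        .f = function(x) {
          page %>%
            html_nodes(x) %>%
            html_text()
        }) %>%
  rename(state = `...1`, count = `...2`) %>%
  mutate(state = gsub(x = state, pattern = "\n", replacement = ""),
         # gsub() removes every comma, so counts of a million or more also convert
         count = as.numeric(gsub(x = count, pattern = ",", replacement = "")))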
   state        count
   <chr>        <dbl>
 1 California   23490
 2 Pennsylvania 51697
 3 New York     35428
 4 Indiana      23199
 5 New Jersey   22787
 6 Missouri     20650
 7 Ohio         15270
 8 Illinois     14920
 9 Iowa         14676
10 Wisconsin    13821
The problem is that you are scraping the wrong URL. Change it to https://go.newspapers.com/results.php?query=republican&dr_year=1929-1929&p_place=CA and change html_nodes to html_node, and then your code will work.
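To make the html_node / html_nodes difference concrete, here is a minimal offline sketch. The inline HTML is a hypothetical stand-in for the results page (the real markup may well differ), and the variable names are only for illustration. A likely reason the original URL returned nothing is that everything after the # is a fragment, which is never sent to the server, so read_html() would get a page without the search results in it; that part is my assumption, not something I verified against the site.

```r
library(rvest)

# Hypothetical stand-in for the results page; the real markup may differ.
page <- read_html('<html><body>
  <span class="total-hits">23,490</span>
  <table>
    <tr><td class="tn">California</td><td class="tar">23,490</td></tr>
    <tr><td class="tn">Pennsylvania</td><td class="tar">51,697</td></tr>
  </table>
</body></html>')

# html_node() keeps only the first match, so html_text() gives a single string.
first_hit <- page %>% html_node(".total-hits") %>% html_text()

# html_nodes() keeps every match, so html_text() gives a character vector.
all_counts <- page %>% html_nodes("td.tar") %>% html_text()

# A selector that matches nothing gives a zero-length result,
# which is the symptom described in the question.
no_match <- page %>% html_nodes(".no-such-class") %>% html_text()

# Strip the thousands separators before converting to a number.
hits <- as.numeric(gsub(",", "", first_hit))
```

If hits comes back as a zero-length vector on the real page, the selector did not match, so it is worth checking the HTML that read_html() actually downloaded rather than what the browser shows.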